Apache Hive Vectorization

Apache Hive Vectorization was introduced newly in Apache Hive to improve query performance. By default, the Apache Hive query execution engine processes one row of a table at a time. The one row of data goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, data rows are batched together and represented as a set of column vectors. The basic idea of vectorized query execution is to process a batch of rows as an array of column vectors.

When query vectorization is enabled, the query engine processes vectors of columns, which greatly improves CPU utilization for typical query operations like scans, filters, aggregates, and joins.

Apache Hive Vectorization - How to Enable

Apache Hive Vectorization is enabled or disabled for all file formats by setting the hive.vectorized.execution.enabled property to true or false and making sure that no value is set for the hive.vectorized.input.format.excludes property. To ensure that query vectorization is used for the Parquet file format, you must make sure that the hive.vectorized.input.format.excludes property is not set to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.

Hive query vectorization is enabled by default in CDH 6 and CDH 5,however, in CDH 5 vectorized query execution in Hive is only possible on ORC-formatted tables, which Cloudera recommends you do not use for overall compatibility with the CDH platform. Instead, Cloudera recommends that you use tables in the Parquet format because all CDH components support this format and it can be consumed by all CDH components.

Verifying a Query is Vectorized

To verify that a query is vectorized, use theEXPLAIN VECTORIZATIONstatement. This statement returns a query plan that shows how the Hive query execution engine processes your query and whether vectorization is being triggered.

Example of Verifying that Query Vectorization is Triggered for Your Query

This example uses the Hive tablep_clients, which uses the Parquet format and contains the following columns and data types:

To get the query execution plan for a query, enter the following commands in a Beeline session:
EXPLAIN VECTORIZATION SELECT COUNT(*) FROM p_clients WHERE ipoyear = 2009;

This command returns the following query execution plan:

Vectorization is explained in several parts of this query plan:

The PLAN VECTORIZATION section shows a high-level view of the vectorization status for the query. The enabled flag set to true means that vectorization is turned on and the enabledConditionsMet flag shows that it is enabled because the hive.vectorized.execution.enabled property is set to true. If vectorization is not enabled, the enabledConditionsNotMet flag shows why. Then in the STAGE PLANS section, the output shows the vectorization status for each task of query execution. For example, there might be multiple map and reduce tasks for a query and it is possible that only a subset of these tasks are vectorized. In the above example, the Stage-1 sub-section shows there is only one map task and one reduce task. The Execution mode sub-section of the map task shows whether the task is vectorized. In this case, vectorized displays, which means that the vectorizer was able to successfully validate and vectorize all of the operators for this map task. The Map Vectorization sub-section shows more details of map task vectorization. Specifically, the configurations that affect the map side vectorization are shown along with whether these configurations are enabled. If the configurations are enabled, they are listed for enabledConditionsMet. If the configurations are not enabled, they are listed for enabledConditionsNotMet as explained in the above PLAN VECTORIZATION section. In this example, it shows that the map side of query execution is enabled because the hive.vectorized.use.vectorized.input.format property is set to true. This section also contains details about input file format and adaptor settings used in the map side of query execution. The Reduce Vectorization sub-section shows that the reduce side of query execution was not vectorized because the hive.vectorized.execution.reduce.enabled property is set to false. This sub-section also shows that the execution engine is not set to Tez or Spark, which are needed for reduce side vectorization. In this particular example, to enable reduce side vectorization, the execution engine should be set to Spark and the hive.vectorized.execution.reduce.enabled property should be set to true. By using the EXPLAIN VECTORIZATION statement with your queries, you can find out before you deploy them whether vectorization will be triggered and what properties you must set to enable it.