Transpose & Pivot In Hive Query
Apache Hive does not have direct standard UDF for transposing rows into columns. Transpose & Pivot in Hive Query can be achieved using multi-stage process. You can use collect_list() or collect_set() function and merge the multiple rows into columns and then get the result.
collect_list() and collect_set() are part ofBuilt-in Aggregate Functions (UDAF). collect_list(col_name) returns a set of objects with duplicate elements eliminated. collect_set(col_name) returns a list of objects with duplicates. collect_set() was added from Hive 0.13.0. Refer the Apache Hive Language manual for details.
Assume we have the following data set
Create table table1 (`product_id` string,`category_code` string, `inventory_status` string,`quantity` int); INSERT INTO table1 values('prod-1','mobile','Available',10); INSERT INTO table1 values('prod-1','mobile','In-Transit',20); INSERT INTO table1 values('prod-2','laptop','Available',10); INSERT INTO table1 values('prod-2','laptop','Warehouse',20);
and it looks like this
Now if we have to transpose this column in such a way that all inventory status comes as column and inventory value also comes as the column.
Transpose & Pivot In Hive Query - collect_set()
Step 1: Use the case statement to build the column
select product_id, category_code, CASE WHEN inventory_status='Available' THEN inventory_status END AS condition_1, CASE WHEN inventory_status='Available' THEN quantity END AS condition_2, CASE WHEN inventory_status='In-Transit' THEN inventory_status END AS condition_3, CASE WHEN inventory_status='In-Transit' THEN quantity END AS condition_4 from table1
This will produce a result like this
We can see that there are null entries and each product is having multiple records. This is expected as we have not grouped this on the product yet and for each category.
Step 2: Now you can apply the collect_list() or collect_set() based on your need and apply group by on the column against which you want to collect the list which is available in row format
select product_id, category_code , collect_list[condition_1](0) as stock, collect_list[condition_2](0) as stock_qty, collect_list[condition_3](0) as shipping, collect_list[condition_4](0) as shipping_qty from ( select product_id, category_code, CASE WHEN inventory_status='Available' THEN inventory_status END AS condition_1, CASE WHEN inventory_status='Available' THEN quantity END AS condition_2, CASE WHEN inventory_status='In-Transit' THEN inventory_status END AS condition_3, CASE WHEN inventory_status='In-Transit' THEN quantity END AS condition_4 from table1 ) dummy_tble2 group by product_id, category_code
Finally, you will get the result
Transpose & Pivot In Hive Query - collect_list() -Performance Impact
collect_list() uses ArrayList, so the data will be kept in the same order they were added, to do that, the userneeds to use SORT BY clause in a subquery, don’t use ORDER BY, it will cause your query to execute in a non-distributed way.