Transpose & Pivot In Hive Query

Apache Hive does not have direct standard UDF for transposing rows into columns. Transpose & Pivot in Hive Query can be achieved using multi-stage process.  You can use collect_list() or collect_set() function and merge the multiple rows into columns and then get the result.

collect_list() and collect_set() are part of  Built-in Aggregate Functions (UDAF).  collect_list(col_name) returns a set of objects with duplicate elements eliminated. collect_set(col_name) returns a list of objects with duplicates. collect_set() was added from Hive 0.13.0.  Refer the Apache Hive Language manual for details.

Assume we have the following data set

Create table table1 (`product_id` string,`category_code` string, `inventory_status` string,`quantity` int);

INSERT INTO table1 values('prod-1','mobile','Available',10);
INSERT INTO table1 values('prod-1','mobile','In-Transit',20);
INSERT INTO table1 values('prod-2','laptop','Available',10);
INSERT INTO table1 values('prod-2','laptop','Warehouse',20);

and it looks like this

Now if we have to transpose this column in such a way that all inventory status comes as column and inventory value also comes as the column.

Transpose & Pivot In Hive Query – collect_set()

Step 1: Use the case statement to build the column

select
product_id,
category_code,
CASE WHEN inventory_status='Available' THEN inventory_status END AS condition_1,
CASE WHEN inventory_status='Available' THEN quantity END AS condition_2,
CASE WHEN inventory_status='In-Transit' THEN inventory_status END AS condition_3,
CASE WHEN inventory_status='In-Transit' THEN quantity END AS condition_4
from 
table1

This will produce a result like this

We can see that there are null entries and each product is having multiple records. This is expected as we have not grouped this on the product yet and for each category.

Step 2: Now you can apply the collect_list() or collect_set() based on your need and apply group by on the column against which you want to collect the list which is available in row format

select  
product_id,  
category_code , 
collect_list(condition_1)[0] as stock,
collect_list(condition_2)[0] as stock_qty,
collect_list(condition_3)[0] as shipping,
collect_list(condition_4)[0] as shipping_qty
from 
(
select
product_id,
category_code,
CASE WHEN inventory_status='Available' THEN inventory_status END AS condition_1,
CASE WHEN inventory_status='Available' THEN quantity END AS condition_2,
CASE WHEN inventory_status='In-Transit' THEN inventory_status END AS condition_3,
CASE WHEN inventory_status='In-Transit' THEN quantity END AS condition_4
from 
table1
) dummy_tble2
group by product_id, category_code

Finally, you will get the result

Transpose & Pivot In Hive Query – collect_list() – Performance Impact

collect_list() uses ArrayList, so the data will be kept in the same order they were added, to do that, the user needs to use SORT BY clause in a subquery, don’t use ORDER BY, it will cause your query to execute in a non-distributed way.

 

Comments are closed.