Apache Hive Analytical Functions

Apache Hive Analytical Functionsavailable since Hive 0.11.0, are a special group of functions that scan the multiple input rows to compute each output value. Apache Hive Analytical Functions are usually usedwith OVER, PARTITION BY, ORDER BY, and the windowing specification. Different from theregular aggregate functions used with the GROUP BY clause that is limited to one resultvalue per group, analytic functions operate on windows where the input rows are orderedand grouped using flexible conditions expressed through an OVER PARTITION clause.Though analytic functions give aggregate results, they do not group the result set. They return the group value multiple times with each record. The analytic functions offer greatflexibility and functionalities than the regular GROUP BY clause and make specialaggregations in Hive easier and powerful. The syntax for the analyze function is asfollows:

Function (arg1,..., argn)
OVER ([PARTITION BY ...]
[ORDER BY ....]
[window_clause])

The Function (arg1,…, argn) can be any function in the following list with examples:

Standard aggregations: This can be either COUNT(), SUM(), MIN(), MAX(), or AVG().
RANK: It ranks items in a group, such as finding the top N rows for specific conditions.
DENSE_RANK: It is similar to RANK, but leaves no gaps in the ranking sequence whenthere are ties. For example, if we rank a match using DENSE_RANK and had two playerstie for second place, we would see that the two players were in second place and thatthe next person is ranked as third. However, the RANK function would also rank twopeople in second place, but the next person would be in fourth place.
ROW_NUMBER: It assigns a unique sequence number starting from 1 to each rowaccording to the partition and order specification.
CUME_DIST: It computes the number of rows whose value is smaller or equal to thevalue of the total number of rows divided by the current row.
PERCENT_RANK: It is similar to CUME_DIST, but it uses rank values rather than rowcounts in its numerator as total number of rows - 1 divided by current rank - 1. Therefore, it returns the percent rank of a value relative to a group of values.
NTILE: It divides an ordered dataset into number of buckets and assigns anappropriate bucket number to each row. It can be used to divide rows into equal setsand assign a number to each row.
LEAD: The LEAD function, lead(value_expr[,offset[,default]]), is used to returndata from the next row. The number (value_expr) of rows to lead can optionally bespecified. If the number of rows (offset) to lead is not specified, the lead is one rowby default. It returns [,default] or null when the default is not specified and the lead for the current row extends beyond the end of the window.
LAG: The LAG function, lag(value_expr[,offset[,default]]), is used to accessdata from a previous row. The number (value_expr) of rows to lag can optionally bespecified. If the number of rows (offset) to lag is not specified, the lag is one row bydefault. It returns [,default] or null when the default is not specified and the lag forthe current row extends beyond the end of the window.
FIRST_VALUE: It returns the first result from an ordered set.
LAST_VALUE: It returns the last result from an ordered set. For LAST_VALUE, using thedefault windowing clause, the result can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which in this example means the current row will always be the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expected (see the last_valuecolumn in the following examples).

The [PARTITION BY …] statement is similar to the GROUP BY clause. It divides therows into groups containing identical values in one or more partitions by columns. Theselogical groups are known as partitions, which is not the same term used for partitiontables. Omitting the PARTITION BY statement applies the analytic operation to all the rowsin the table.

The [ORDER BY ….] clause is like the ORDER BY expr [ASC

DESC] clause. The ORDERBY clause is the same as the regular ORDER BY clause. It makes sure the rows produced bythe PARTITION BY clause are ordered by specifications, such as ascending or descendingorder.

Standards Drive Learning. All skill of arithmetic is teachable precisely because all the students and all the teachers know that they are adding and subtracting in "base ten". Shared standards make skill transferable. Standards make comparison possible. - Marcus Buckingham (First Break All The Rules)

14 Apr 2020

Apache Hive

« PySpark Tutorial Apache Hive Best Practice »

Topper Tips

Apache Hive Analytical Functions