Apache Hive Analytical Functions available since Hive 0.11.0, are a special group of functions that scan
the multiple input rows to compute each output value. Apache Hive Analytical Functions are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification. Different from the regular aggregate functions used with the GROUP BY clause that is limited to one result value per group, analytic functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER PARTITION clause. Though analytic functions give aggregate results, they do not group the result set. They
return the group value multiple times with each record. The analytic functions offer great flexibility and functionalities than the regular GROUP BY clause and make special aggregations in Hive easier and powerful. The syntax for the analyze function is as follows:
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_clause>])
The Function (arg1,…, argn) can be any function in the following list with examples:
- Standard aggregations: This can be either COUNT(), SUM(), MIN(), MAX(), or AVG().
- RANK: It ranks items in a group, such as finding the top N rows for specific conditions.
- DENSE_RANK: It is similar to RANK, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using DENSE_RANK and had two players tie for second place, we would see that the two players were in second place and that the next person is ranked as third. However, the RANK function would also rank two people in second place, but the next person would be in fourth place.
- ROW_NUMBER: It assigns a unique sequence number starting from 1 to each row according to the partition and order specification.
- CUME_DIST: It computes the number of rows whose value is smaller or equal to the value of the total number of rows divided by the current row.
- PERCENT_RANK: It is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator as total number of rows – 1 divided by current rank – 1. Therefore, it returns the percent rank of a value relative to a group of values.
- NTILE: It divides an ordered dataset into number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.
- LEAD: The LEAD function, lead(value_expr[,offset[,default]]), is used to return data from the next row. The number (value_expr) of rows to lead can optionally be specified. If the number of rows (offset) to lead is not specified, the lead is one row by default. It returns [,default] or null when the default is not specified and the
lead for the current row extends beyond the end of the window.
- LAG: The LAG function, lag(value_expr[,offset[,default]]), is used to access data from a previous row. The number (value_expr) of rows to lag can optionally be specified. If the number of rows (offset) to lag is not specified, the lag is one row by default. It returns [,default] or null when the default is not specified and the lag for the current row extends beyond the end of the window.
- FIRST_VALUE: It returns the first result from an ordered set.
- LAST_VALUE: It returns the last result from an ordered set. For LAST_VALUE, using the default windowing clause, the result can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which in this example means the current row will always be the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expected (see the last_value column in the following examples).
The [PARTITION BY <…>] statement is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partitions by columns. These logical groups are known as partitions, which is not the same term used for partition tables. Omitting the PARTITION BY statement applies the analytic operation to all the rows in the table.
The [ORDER BY <….>] clause is like the ORDER BY expr [ASC|DESC] clause. The ORDER BY clause is the same as the regular ORDER BY clause. It makes sure the rows produced by the PARTITION BY clause are ordered by specifications, such as ascending or descending order.
Standards Drive Learning. All skill of arithmetic is teachable precisely because all the students and all the teachers know that they are adding and subtracting in “base ten”. Shared standards make skill transferable. Standards make comparison possible. – Marcus Buckingham (First Break All The Rules)