Apache Spark

Apache Spark is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open-source, distributed, Java-based computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for …

Transpose & Pivot In Hive Query

Apache Hive does not have a direct standard UDF for transposing rows into columns. Transpose and pivot in a Hive query can be achieved with a multi-stage process: use the collect_list() or collect_set() function to merge the multiple rows for each key into one collection, then read the column values out of that collection.

collect_list() and collect_set() are part of the Built-in Aggregate Functions (UDAF). collect_list(col_name) returns a list of objects …
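The multi-stage idea can be sketched in plain Python rather than HiveQL. This is only an illustration of the semantics, not Hive's implementation: the sample rows, the column names, and the helper functions `collect_list`, `collect_set`, and `pivot` are all hypothetical stand-ins. Note that in Hive itself the order of elements produced by collect_list() is not guaranteed.

```python
from collections import defaultdict

# Hypothetical input rows: (student, subject, score)
rows = [
    ("alice", "math", 90),
    ("alice", "physics", 85),
    ("bob", "math", 70),
    ("bob", "physics", 75),
    ("bob", "physics", 75),  # duplicate row, to show list vs set behavior
]

def collect_list(rows, key_idx, val_idx):
    """Toy stand-in: group by key, keep every value (duplicates included)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_idx]].append(row[val_idx])
    return dict(groups)

def collect_set(rows, key_idx, val_idx):
    """Toy stand-in: group by key, keep only distinct values."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_idx]].add(row[val_idx])
    return dict(groups)

def pivot(rows, columns):
    """Stage 1: merge the rows of each student into a subject -> score map.
    Stage 2: emit one output row per student, one column per subject."""
    merged = defaultdict(dict)
    for student, subject, score in rows:
        merged[student][subject] = score
    return [
        (student, *[scores.get(col) for col in columns])
        for student, scores in sorted(merged.items())
    ]
```

Calling `pivot(rows, ["math", "physics"])` yields one row per student with the scores spread across columns, which is the shape a pivot query is after.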

Apache Hive Analytical Functions

Apache Hive Analytical Functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows to compute each output value. They are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification. Unlike the regular aggregate functions used with the GROUP BY clause, which are limited to one …
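The contrast between a GROUP BY aggregate and a windowed function can be sketched in plain Python. This is a toy model, not Hive code: the sample rows and the helpers `group_by_sum` and `running_sum_over` are hypothetical, and the windowed version mimics a running SUM over a PARTITION BY department, ORDER BY salary window with a frame of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input rows: (dept, employee, salary)
rows = [
    ("sales", "ann", 300),
    ("sales", "bo", 100),
    ("eng", "cy", 200),
    ("eng", "di", 400),
    ("sales", "ed", 200),
]

def group_by_sum(rows):
    """GROUP BY style: the input rows of each dept collapse into one value."""
    totals = {}
    for dept, _, salary in rows:
        totals[dept] = totals.get(dept, 0) + salary
    return totals

def running_sum_over(rows):
    """Window style: every input row is kept and gains a running total
    computed over the rows of its own partition, in salary order."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for dept, partition in groupby(ordered, key=itemgetter(0)):
        total = 0
        for _, name, salary in partition:
            total += salary
            out.append((dept, name, salary, total))
    return out
```

`group_by_sum` returns one entry per department, while `running_sum_over` returns as many rows as it was given, each annotated with a value derived from its neighbors in the same partition.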

Apache Hive Vectorization

Vectorization was introduced in Apache Hive to improve query performance. By default, the Apache Hive query execution engine processes one row of a table at a time. Each row passes through all the operators in the query before the next row is processed, which uses the CPU very inefficiently. In vectorized query execution, …
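The difference between the two execution styles can be sketched in plain Python. This is a toy model under stated assumptions, not Hive's engine: the filter and projection "operators", the sample rows, and the `batch_size` value are all hypothetical, and the `calls` counter simply stands in for per-invocation operator overhead.

```python
# Hypothetical input rows: (price, quantity)
rows = [(2, 3), (5, 0), (1, 2), (4, 1), (3, 0), (10, 1)]

def row_at_a_time(rows):
    """Each row walks through every operator before the next row starts."""
    calls = 0
    out = []
    for price, qty in rows:
        calls += 1              # filter operator invoked for one row
        if qty > 0:
            calls += 1          # projection operator invoked for one row
            out.append(price * qty)
    return out, calls

def vectorized(rows, batch_size=4):
    """Each operator is invoked once per batch and loops over column arrays."""
    calls = 0
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        prices = [r[0] for r in batch]   # column vector of prices
        qtys = [r[1] for r in batch]     # column vector of quantities
        calls += 1              # filter operator invoked for the whole batch
        selected = [i for i, q in enumerate(qtys) if q > 0]
        calls += 1              # projection operator invoked for the whole batch
        out.extend(prices[i] * qtys[i] for i in selected)
    return out, calls
```

Both functions return the same results; only the number of operator invocations differs, and the gap widens as the batch size grows relative to the per-row cost.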

Apache Hive Cheat Sheet

The Apache Hive Cheat Sheet is a summary of functions and syntax, meant as a reference for big data engineers and developers. It is divided into five parts.

Apache Hive Cheat Sheet – Query Syntax

Apache Hive Cheat Sheet – Metadata

Apache Hive Cheat Sheet – Query Compatibility

Apache Hive Cheat Sheet – Command Line

Apache Hive Cheat Sheet – Shell & CLI


Apache Hive CLI vs Beeline

Apache Hive development has shifted from the original Hive server (HiveServer1) to the new server (HiveServer2), so users and developers need to move to the new access tool. However, there’s more to this process than simply switching the executable name from “hive” to “beeline”. The Hive CLI was a heavyweight command-line tool that accepted commands and ran …

Immutability and RDD Interface in Spark

Immutability and the RDD interface are key concepts in Spark and must be understood in detail. Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD’s dependencies and information about data locality that are needed for the execution engine to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on …
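A toy model in plain Python can make the immutability idea concrete. This is not Spark's API: `ToyRDD` and its methods are hypothetical names used only for illustration, and the class tracks a single parent as its only dependency and ignores data locality entirely. The point it shows is that a transformation leaves the original object untouched and returns a new object that records what it depends on.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, Optional

@dataclass(frozen=True)
class ToyRDD:
    """Immutable toy dataset: either holds source data or a parent plus a function."""
    source: tuple = ()
    parent: Optional["ToyRDD"] = None
    fn: Optional[Callable] = None

    def map(self, f):
        # Returns a NEW ToyRDD; self is not modified.
        return ToyRDD(parent=self, fn=lambda it: (f(x) for x in it))

    def filter(self, pred):
        # Returns a NEW ToyRDD; self is not modified.
        return ToyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)))

    def dependencies(self):
        """The datasets this one is computed from."""
        return [] if self.parent is None else [self.parent]

    def compute(self):
        """Lazily evaluate by walking back through the dependencies."""
        if self.parent is None:
            return iter(self.source)
        return self.fn(self.parent.compute())

    def collect(self):
        return list(self.compute())

base = ToyRDD(source=(1, 2, 3, 4))
doubled = base.map(lambda x: x * 2)
multiples_of_four = doubled.filter(lambda x: x % 4 == 0)
```

After these calls, `base` still yields its original elements, `doubled` lists `base` as its dependency, and attempting to assign to a field of any of them raises `FrozenInstanceError`.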

Spark In-Memory Persistence and Memory Management

Spark In-Memory Persistence and Memory Management must be understood by engineering teams. Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark’s use of in-memory persistence. Rather than writing to disk between passes through the data, Spark has the option of keeping the data loaded in memory on the executors. That way, …
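Why persistence helps repeated computations can be sketched with a toy model in plain Python. This is not Spark code: `ToyDataset`, its `persist` method, and the `loads` counter are hypothetical, and the expensive "load" is just a function call standing in for rereading or recomputing the data on every pass.

```python
class ToyDataset:
    """Toy dataset that is reloaded on every pass unless persisted."""

    def __init__(self, load):
        self._load = load        # hypothetical expensive loader
        self._persist = False
        self._cached = None
        self.loads = 0           # how many times the loader actually ran

    def persist(self):
        # Only marks the dataset; the cache is filled on first use.
        self._persist = True
        return self

    def rows(self):
        if self._cached is not None:
            return self._cached  # served from memory, no reload
        self.loads += 1
        data = list(self._load())
        if self._persist:
            self._cached = data
        return data

def iterate(dataset, passes):
    """Repeated computation: sum the data once per pass."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.rows())
    return total

plain = ToyDataset(lambda: range(5))
cached = ToyDataset(lambda: range(5)).persist()
```

Running `iterate` with three passes gives the same total for both datasets, but `plain.loads` ends at 3 while `cached.loads` ends at 1, which is the saving that grows with the number of passes.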