PySpark on Windows

PySpark can be installed on Windows in two different ways. Although Spark is a distributed compute engine, it also works standalone. Most developers who are familiar with Jupyter Notebook prefer to keep using it, so it has to be integrated with PySpark.

PySpark Jupyter Notebook

There are other sets of Python developers who prefer to … read the rest

PySpark Tutorial

In this PySpark Tutorial, we will understand why PySpark is becoming popular among data engineers and data scientists. This tutorial will also highlight the key limitations of PySpark compared with Spark written in Scala (PySpark vs Spark Scala). PySpark is the Python API for Spark and helps Python developers collaborate with Apache Spark using … read the rest

Data Lineage Vs Data Provenance

Data lineage and data provenance are not the same thing. Many data engineers and architects use the terms interchangeably, but they are two different concepts, each with its own meaning.

What is Data Provenance?

Data Provenance (or Data Provenance Document) captures the inputs, entities, systems and processes that influence the data of interest. This in effect provides a historical record of data … read the rest

What is data lineage

What is data lineage, and why is it important? Data lineage is the record of the data's origins and the transformations it goes through over time. It can also be described as the life cycle and end-to-end flow of the data. This life cycle includes the origin of the data, how it moves from source to destination (or one point to … read the rest
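As a rough illustration of "origin plus transformations over time", lineage can be pictured as a small record that is appended to whenever the data moves or changes. The sketch below is only that: `LineageRecord`, its fields, and the dataset names are hypothetical and are not taken from any lineage tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical record of where a dataset came from and what changed it."""
    dataset: str
    origin: str
    steps: List[str] = field(default_factory=list)

    def add_step(self, description: str) -> None:
        # Append one movement or transformation, in the order it happened.
        self.steps.append(description)

    def flow(self) -> str:
        # Render the end-to-end flow from the origin to the current dataset.
        return " -> ".join([self.origin, *self.steps, self.dataset])


# Illustrative names only.
record = LineageRecord(dataset="sales_report", origin="orders.csv")
record.add_step("load into staging table")
record.add_step("filter cancelled orders")
record.add_step("aggregate by month")
print(record.flow())
```

Reading the record from left to right gives the source-to-destination path that the excerpt describes.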

What is Apache NiFi

Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems (file systems, RDBMS, APIs, etc., in and out). It is based on the “NiagaraFiles” software previously developed by the NSA (National Security Agency), which is also the source of a part of its present … read the rest

Transpose & Pivot In Hive Query

Apache Hive does not have a direct, standard UDF for transposing rows into columns. Transpose and pivot in a Hive query can be achieved with a multi-stage process: use the collect_list() or collect_set() function to merge the multiple rows into columns, and then get the result.

collect_list() and collect_set() are part of the Built-in Aggregate Functions (UDAF). collect_list(col_name) returns a list of objects … read the rest
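The multi-stage idea can be sketched outside Hive. The plain-Python emulation below is not Hive code: the function names mirror collect_list() and collect_set() only for readability, `pivot` and the sample rows are hypothetical, and keeping first-seen order in the set variant is a simplification of this sketch. Stage one collects the values per key into an array, and stage two spreads that array over fixed columns.

```python
from collections import defaultdict

# Hypothetical input: one row per (student, subject) pair.
rows = [
    ("alice", "math"),
    ("alice", "physics"),
    ("alice", "math"),      # duplicate on purpose
    ("bob", "chemistry"),
]


def collect_list(rows):
    """Stage 1a: group by key and keep every value, duplicates included."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return dict(grouped)


def collect_set(rows):
    """Stage 1b: group by key and drop duplicate values."""
    grouped = defaultdict(list)
    for key, value in rows:
        if value not in grouped[key]:
            grouped[key].append(value)
    return dict(grouped)


def pivot(grouped, width):
    """Stage 2: spread each collected array over a fixed number of columns."""
    return {
        key: tuple(values[i] if i < len(values) else None for i in range(width))
        for key, values in grouped.items()
    }


print(collect_list(rows))
print(pivot(collect_set(rows), width=2))
```

The difference between the two collectors shows up for "alice": the list variant keeps "math" twice, while the set variant keeps it once.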

Apache Hive Analytical Functions

Apache Hive Analytical Functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Apache Hive Analytical Functions are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification. Different from the regular aggregate functions used with the GROUP BY clause that is limited to one … read the rest
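The contrast between a GROUP BY aggregate and an analytical function can be sketched in plain Python. This is an emulation under stated assumptions, not Hive code: the rows, the function names, and the choice of a running sum are all hypothetical, and the salaries are distinct so ties in the ordering do not arise.

```python
from collections import defaultdict

# Hypothetical rows: (department, employee, salary).
rows = [
    ("sales", "ann", 300),
    ("sales", "bob", 200),
    ("ops", "cat", 400),
    ("sales", "dan", 100),
]


def group_by_sum(rows):
    """Like SUM(salary) ... GROUP BY department: one value per group."""
    totals = defaultdict(int)
    for dept, _, salary in rows:
        totals[dept] += salary
    return dict(totals)


def running_sum_over(rows):
    """Like SUM(salary) OVER (PARTITION BY department ORDER BY salary):
    one value per input row, computed from the rows in its partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    result = []
    for dept in sorted(partitions):
        running = 0
        for _, name, salary in sorted(partitions[dept], key=lambda r: r[2]):
            running += salary
            result.append((dept, name, salary, running))
    return result


print(group_by_sum(rows))
for row in running_sum_over(rows):
    print(row)
```

The grouped version collapses four input rows into two results, while the windowed version returns four rows, each carrying a value computed across its partition.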

Apache Hive Vectorization

Vectorization was introduced in Apache Hive to improve query performance. By default, the Apache Hive query execution engine processes one row of a table at a time. Each row goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, … read the rest
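The two execution styles can be contrasted with a small plain-Python sketch. It only shows the shape of the control flow, not the CPU benefit, which comes from tight loops over primitive column arrays inside the engine. The query in the comment, the function names, and the data are hypothetical, and the default batch size of 1024 rows is an assumption of this sketch.

```python
# Hypothetical query being emulated: SELECT price * qty FROM t WHERE qty > 1


def row_at_a_time(rows):
    """Each row passes through every operator before the next row starts."""
    out = []
    for price, qty in rows:
        if qty > 1:                  # filter operator
            out.append(price * qty)  # projection operator
    return out


def vectorized(prices, qtys, batch_size=1024):
    """Each operator runs over a whole batch of column values at once."""
    out = []
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = qtys[start:start + batch_size]
        keep = [i for i, v in enumerate(q) if v > 1]  # filter over the batch
        out.extend(p[i] * q[i] for i in keep)         # projection over the batch
    return out


rows = [(10, 1), (20, 2), (30, 3)]
prices = [price for price, _ in rows]
qtys = [qty for _, qty in rows]
print(row_at_a_time(rows))
print(vectorized(prices, qtys, batch_size=2))
```

Both functions return the same values. The difference is that the second one stores data column-wise and hands each operator a batch instead of a single row.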