PySpark has become a popular choice for many data engineers and is now mainstream technology for big data and machine learning projects. PySpark is not a programming language; it is a wrapper (or abstraction) that helps Python developers write Spark code. Apache Spark is developed in Scala (a functional programming language), and to make Apache Spark more accessible to the … read the rest

PySpark on Windows

PySpark can be installed on Windows in two different ways. Although Spark is a distributed compute engine, it also runs standalone on a single machine. Most developers who are familiar with Jupyter notebooks prefer to keep working in one, so the notebook has to be integrated with PySpark.
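One common way to wire the two together is through environment variables that PySpark reads at launch. The sketch below shows the idea; the paths are hypothetical and must be adjusted to wherever Spark is unpacked on your machine.

```python
import os

# Hypothetical install locations -- adjust for your machine.
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.5.1-bin-hadoop3"
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # folder containing winutils.exe on Windows

# Tell the pyspark launcher to start the driver inside Jupyter Notebook
# instead of the plain Python REPL.
os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
```

With these set, running `pyspark` from a terminal opens a Jupyter Notebook that already has a `SparkContext` available.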

PySpark Jupyter Notebook

There are other Python developers who prefer to … read the rest

PySpark Tutorial

In this PySpark Tutorial, we will understand why PySpark is becoming popular among data engineers and data scientists. This PySpark Tutorial will also highlight the key limitations of PySpark compared with Spark written in Scala (PySpark vs Spark Scala). PySpark is actually a Python API for Spark and helps the Python developer community collaborate with Apache Spark using … read the rest

Data Lineage Vs Data Provenance

Data Lineage and Data Provenance are not the same thing. Many data engineers and architects use the terms interchangeably, but they are two different concepts, each with its own meaning.

What is Data Provenance?

Data Provenance (or a Data Provenance document) captures the inputs, entities, systems, and processes that influence the data of interest. In effect, this provides a historical record of the data … read the rest
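A provenance record can be as simple as structured metadata attached to a dataset. The record below is entirely hypothetical (names, jobs, and timestamps are illustrative), but it shows the kinds of fields a provenance document typically captures.

```python
# Hypothetical provenance record for one derived dataset.
provenance = {
    "entity": "monthly_sales.parquet",            # the data of interest
    "inputs": ["raw_sales.csv", "fx_rates.csv"],  # upstream source data
    "system": "spark-prod-cluster",               # where it was produced
    "process": "aggregate_sales_job v2.1",        # what produced it
    "generated_at": "2024-04-01T02:00:00Z",       # when it was produced
}
```

Given records like this for every dataset, you can reconstruct the full history of how any piece of data came to be.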

What is Apache NiFi?

Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems (file systems, RDBMSs, APIs, and so on). It is based on the “NiagaraFiles” software previously developed by the NSA (National Security Agency), which is also the source of part of its present … read the rest

Apache Spark

Apache Spark is an exciting technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for … read the rest