PySpark has become a popular choice for many data engineers and a mainstream technology for big data and machine learning projects. PySpark is not a programming language. It is a wrapper (or abstraction) that helps Python developers write Spark code. Apache Spark is developed in Scala (a functional programming language) and to make Apache Spark more accessible to the … read the rest
PySpark can be installed on Windows in two different ways. Although Spark is a distributed compute engine, it also works in standalone mode. Most developers who are familiar with Jupyter Notebook prefer to keep using it, so it has to be integrated with PySpark.
Other Python developers prefer to … read the rest
PySpark Interactive Shell on Windows: Prerequisites
So the first thing when we talk about Spark is to make sure that … read the rest
In this PySpark tutorial, we will look at why PySpark is becoming popular among data engineers and data scientists. The tutorial also highlights the key limitations of PySpark compared with Spark written in Scala (PySpark vs Spark Scala). PySpark is a Python API for Spark and helps the Python developer community collaborate with Apache Spark using … read the rest
Data lineage and data provenance are not the same thing. Many data engineers and architects use the terms interchangeably, but they are two different concepts, each with its own meaning.
What is Data Provenance?
Data Provenance (or a Data Provenance Document) captures the inputs, entities, systems and processes that influence the data of interest. This in effect provides a historical record of data … read the rest
What is data lineage and why is it important? Data lineage describes where data originates and the transformations it goes through over time. It can also be expressed as the life cycle and end-to-end flow of the data. This life cycle includes the origin of the data, how it moves from source to destination (or one point to … read the rest
How to perform a minus operation on a date or timestamp type.
Assume that you have the following data set and you would like to perform a minus/plus operation on the date/timestamp field.
id | cr_date
1 | 2017-03-17 11:12:00
2 | 2017-03-17 15:10:00
You first convert the field to a Unix timestamp, then apply the minus or plus operation, and then … read the rest
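The idea of converting to a Unix timestamp before adding or subtracting can be sketched in plain Python. This is only an illustration under stated assumptions, not the article's own method: the excerpt is cut off before its final step, so converting the result back to a formatted string is an assumption, as are the helper name `shift_timestamp`, the choice of UTC, and the one-hour offset.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def shift_timestamp(value: str, seconds: int) -> str:
    """Shift a 'YYYY-MM-DD HH:MM:SS' string by a number of seconds (hypothetical helper)."""
    # Assumption: treat the value as UTC so the result does not depend on the local time zone.
    parsed = datetime.strptime(value, FMT).replace(tzinfo=timezone.utc)
    # Step 1: convert the field to a Unix timestamp (seconds since the epoch).
    unix_ts = int(parsed.timestamp())
    # Step 2: the minus or plus operation is now plain integer arithmetic.
    shifted = unix_ts + seconds
    # Step 3 (assumed): convert the number back to the original string format.
    return datetime.fromtimestamp(shifted, tz=timezone.utc).strftime(FMT)


# The sample rows from the data set above.
rows = [(1, "2017-03-17 11:12:00"), (2, "2017-03-17 15:10:00")]

# Subtract one hour (3600 seconds) from every cr_date.
minus_one_hour = [(row_id, shift_timestamp(cr_date, -3600)) for row_id, cr_date in rows]
print(minus_one_hour)
```

Passing a positive number of seconds performs the plus operation instead, for example `shift_timestamp("2017-03-17 11:12:00", 86400)` to move forward one day.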
Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data into and out of software systems (file systems, RDBMSs, APIs, etc.). It is based on the “NiagaraFiles” software previously developed by the NSA (National Security Agency), which is also the source of a part of its present … read the rest
You will often get access to a Unix or Linux box via a terminal. Before you start using the terminal, you may want to know the Unix flavor. This article will help you with
Linux is an open source operating system and is free unless you use an enterprise distribution. There are many variants … read the rest
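One way to check the flavor from a script is sketched below. It assumes the machine provides an `/etc/os-release` file, which most modern Linux distributions do but older Unix systems may not, and it falls back to the kernel name and release otherwise. The function names `parse_os_release` and `describe_system` are hypothetical and do not come from the article.

```python
import platform
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse KEY=VALUE lines in the os-release format into a dictionary."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        info[key] = value.strip().strip('"').strip("'")
    return info


def describe_system() -> str:
    """Return a short description of the distribution and kernel."""
    kernel = f"{platform.system()} {platform.release()}"
    os_release = Path("/etc/os-release")  # assumption: present on this machine
    if os_release.is_file():
        info = parse_os_release(os_release.read_text(encoding="utf-8"))
        return f"{info.get('PRETTY_NAME', 'unknown distribution')} ({kernel})"
    # Fallback for systems without /etc/os-release: kernel information only.
    return kernel


print(describe_system())
```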
Apache Spark is an exciting technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for … read the rest