PySpark has become a popular choice for many data engineers and a mainstream technology for big data and machine learning projects. PySpark is not a programming language. It is a wrapper (or abstraction) that helps Python developers write Spark code. Apache Spark is developed in Scala (a functional programming language) and to make Apache Spark more accessible to the … read the rest
PySpark can be installed on Windows in two different ways. Although Spark is a distributed compute engine, it also works in standalone mode. Most developers who are familiar with Jupyter Notebook prefer to keep using it, so it has to be integrated with PySpark.
Other Python developers prefer to … read the rest
PySpark Interactive Shell on Windows Prerequisites
The first thing to do when working with Spark is to make sure that … read the rest
In this PySpark tutorial, we will understand why PySpark is becoming popular among data engineers and data scientists. This tutorial will also highlight the key limitations of PySpark compared with Spark written in Scala (PySpark vs Spark Scala). PySpark is a Python API for Spark that helps the Python developer community work with Apache Spark using … read the rest
Data lineage and data provenance are not the same thing. Many data engineers and architects use the terms interchangeably, but they are two different concepts, each with its own meaning.
What is Data Provenance?
Data provenance (or a data provenance document) captures the inputs, entities, systems and processes that influence the data of interest. This in effect provides a historical record of data … read the rest
What is data lineage and why is it important? Data lineage is the record of where data originates and the transformations it goes through over time. It can also be expressed as the life cycle and end-to-end flow of the data. This life cycle includes the origin of the data, how it moves from source to destination (or one point to … read the rest
How to perform a minus operation on a date or timestamp type.
Assume that you have the following data set and you would like to perform a minus/plus operation on the date/timestamp field.
id | cr_date
1 | 2017-03-17 11:12:00
2 | 2017-03-17 15:10:00
You first convert the field to a Unix timestamp, then apply the minus or plus operation, and then … read the rest
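The approach described above (convert to a Unix timestamp, then add or subtract) can be sketched in plain Python. This is only an illustration under stated assumptions: the excerpt does not say which engine the original post uses, the helper names (to_unix, from_unix, shift) are hypothetical, and the values are treated as UTC so the arithmetic is deterministic.

```python
from datetime import datetime, timezone

# Format of the cr_date column in the sample data set.
FMT = "%Y-%m-%d %H:%M:%S"


def to_unix(ts: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (assumed UTC) into Unix seconds."""
    parsed = datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_unix(seconds: int) -> str:
    """Format Unix seconds back into a 'YYYY-MM-DD HH:MM:SS' string (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FMT)


def shift(ts: str, seconds: int) -> str:
    """Add (positive) or subtract (negative) a number of seconds."""
    return from_unix(to_unix(ts) + seconds)


# The two sample rows from the data set above.
rows = [(1, "2017-03-17 11:12:00"), (2, "2017-03-17 15:10:00")]

# Subtract one hour (3600 seconds) from every cr_date value.
shifted = [(row_id, shift(cr_date, -3600)) for row_id, cr_date in rows]

if __name__ == "__main__":
    for row_id, cr_date in shifted:
        print(row_id, "|", cr_date)
```

Working in whole seconds keeps the plus and minus operations symmetric: the same shift function handles both, and only the sign of the offset changes.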
Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems (file systems, RDBMSs, APIs, etc., in and out). It is based on the “NiagaraFiles” software previously developed by the NSA (National Security Agency), which is also the source of a part of its present … read the rest
Many times you get access to a Unix or Linux box via a terminal. Before you start using the terminal, you may want to know the Unix flavor. This article will help you with
Linux is an open source operating system and is free unless you use an enterprise distribution. There are many variants … read the rest
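As a rough illustration of identifying the flavor of a box, the sketch below reads the kernel details from Python's platform module and, if present, the distribution name from /etc/os-release. The function names are hypothetical, and the assumption that the machine has an /etc/os-release file holds for most modern Linux distributions but not for every Unix variant.

```python
import platform
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines (os-release style) into a dictionary."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip("\"'")
    return info


def describe_system(os_release_path: str = "/etc/os-release") -> dict:
    """Collect kernel details and, when available, the distribution name."""
    summary = {
        "kernel": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    path = Path(os_release_path)
    if path.is_file():
        fields = parse_os_release(path.read_text())
        summary["distro"] = fields.get("PRETTY_NAME", "unknown")
    return summary


if __name__ == "__main__":
    for key, value in describe_system().items():
        print(f"{key}: {value}")
```

On a machine without /etc/os-release the result simply omits the distro entry and reports only the kernel name, release, and architecture.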
One of the most well-known differences between managing UNIX-like systems and Windows systems is the Windows Registry. Chef has resources for creating, modifying, and deleting Windows Registry keys. Beware that these operations are nonreversible (there is no implicit backup of values, so it may be worth preparing a backup before modifying values), and that they can potentially … read the rest