Apache Spark

Apache Spark is an exciting technology that is rapidly superseding Hadoops MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop inthat its a distributed, general-purpose computing platform. But Sparks unique design, which allows for keeping large amounts of data in memory, offers tremendous performance improvements. Apache Spark programs can be 100 times faster than their
MapReduce counterparts.

By supporting Python, Java, Scala, and, most recently, R, Spark is open to a wide range of users: to the science community that traditionally favors Python and R, to the still-widespread Java community, and to people using the increasingly popular Scala, which offers functional programming on the Java virtual machine (JVM).

Spark combines MapReduce-like capabilities for batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning, all in a single framework. This makes it a one-stop shopfor most of your big data-crunching needs. Its no wonder, then, that Spark is one of the busiest and fastest-growing Apache Software Foundation projects today

Apache Spark & Hadoop Framework

The Hadoop framework, with its HDFS as storage and MapReduce data-processing engine, was the first that brought distributed computing to the masses. Apache Hadoop solved the three main problems facing any distributed data-processing endeavor:
1) ParallelizationHow to perform subsets of the computation simultaneously
2) DistributionHow to distribute the data
3) Fault toleranceHow to handle component failure

On top of that, Hadoop clusters are often made of commodity hardware, which makes Apache Hadoop easy to set up. Thats why the last decade saw its wide adoption.

Hadoop's MapReduce computing has shortcomings. MapReduce job results need to be stored in HDFS before they can be used by another job. For this reason, MapReduce is inherently bad with iterative algorithms.

What Apache Spark Bring to Table?

Apache Sparks core concept is an in-memory execution model that enables caching job data in memory instead of fetching it from disk every time, as MapReduce does. This can speed the spark job execution up to 100 times by bringing parallel computing, compared to the same jobs in Hadoop Map-Reduce; it has the biggest effect on iterative algorithms such as machine learning, graph algorithms, and other types of workloads that need to reuse data.

Apache Spark - Ease of Use

Spark supports the Scala, Java, Python, and R programming languages, so its accessible to a much wider audience. Spark shell (read-eval-print loop [REPL]) offers an interactive console that can be used for experimentation and idea testing. By contrast, the following is all it takes for the same Spark program written in Scala:

val spark = SparkSession.builder().appName("Spark wordcount")
val file = spark.sparkContext.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1)).countByKey()
counts.saveAsTextFile("hdfs://...")

Additional PySpark Resource & Reading Material

PySpark Frequentl Asked Question

Refer our PySpark FAQ space where important queries and informations are clarified. It also links to important PySpark Tutorial apges with-in site.

PySpark Examples Code

Find our GitHub Repository which list PySpark Example with code snippet

PySpark/Spark Related Interesting Blogs

Here are the list of informative blogs and related articles, which you might find interesting

31 Dec 2018

Apache Spark

« Immutability And Rdd Interface In Spark Chef And Microsoft Windows »

Topper Tips