Apache Spark

Apache Spark is an exciting technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open-source, distributed computation framework written in Java, consisting of the Hadoop Distributed File System (HDFS) and its execution engine, MapReduce. Spark is similar to Hadoop in that it is a distributed, general-purpose computing platform. But Spark’s unique design, which allows for … read the rest

Immutability and RDD Interface in Spark

Immutability and the RDD interface are key Spark concepts that must be understood in detail. Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD’s dependencies and information about data locality, which the execution engine needs in order to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on … read the rest
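The idea that a transformation never mutates its source, but instead produces a new RDD that records its parent as a dependency, can be sketched in plain Python. This is an illustrative toy, not Spark's actual implementation — the class, method names, and lazy-list evaluation are simplifications for the sake of the example:

```python
# Illustrative sketch (NOT Spark's real RDD class): transformations return
# new objects that record their parent dependency; nothing is mutated.
class MiniRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only the root RDD holds source data
        self._parent = parent  # dependency: the RDD this one was derived from
        self._fn = fn          # transformation applied to the parent's output

    def map(self, fn):
        # Transformation: returns a *new* MiniRDD; self is left untouched.
        return MiniRDD(parent=self, fn=lambda xs: [fn(x) for x in xs])

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Action: walk the dependency chain back to the root, then compute.
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

base = MiniRDD(data=[1, 2, 3, 4])
doubled = base.map(lambda x: x * 2)      # a new RDD; base is unchanged
evens = doubled.filter(lambda x: x > 4)  # another new RDD
print(base.collect())   # [1, 2, 3, 4] -- the original is still intact
print(evens.collect())  # [6, 8]
```

Because each object knows only its parent and its transformation, the chain of `MiniRDD`s forms a small lineage graph — the same shape of information Spark's execution engine uses to compute (or recompute) a real RDD.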

Spark In-Memory Persistence and Memory Management

Spark’s in-memory persistence and memory management are essential for engineering teams to understand. Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this advantage comes from Spark’s use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark can keep the data loaded in memory on the executors. That way, … read the rest
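The cost of recomputing versus persisting can be shown with a small plain-Python sketch — again a conceptual analogy, not Spark's `persist()`/`cache()` API. A lazy dataset goes back to its source on every pass unless it has been materialized in memory:

```python
# Conceptual sketch (not Spark's API): a lazy dataset that recomputes from
# source on every pass unless it has been persisted in memory.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute   # expensive recomputation (think: rereading from disk)
        self._cached = None       # in-memory copy after persist()
        self.compute_count = 0    # how many times we went back to the source

    def persist(self):
        self._cached = self._compute()  # materialize once and keep it in memory
        return self

    def values(self):
        if self._cached is not None:
            return self._cached   # served straight from memory
        self.compute_count += 1
        return self._compute()    # recomputed from source on every call

ds = LazyDataset(lambda: [x * x for x in range(5)])
for _ in range(3):
    ds.values()          # three passes, three recomputations
print(ds.compute_count)  # 3
ds.persist()
for _ in range(3):
    ds.values()          # served from memory; no further recomputation
print(ds.compute_count)  # still 3
```

In real Spark the same trade-off appears in iterative workloads (e.g. machine-learning algorithms that scan the same dataset many times): without persistence each iteration pays the full recomputation cost, with it only the first pass does.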

Spark Model of Parallel Computing

Spark’s model of parallel computing is built around the RDD, an important API that is part of the Spark Core library.

Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs—immutable, distributed collections of … read the rest
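The driver-side pattern — partition a dataset, farm each partition out to a worker, gather the results — can be mimicked in miniature with the Python standard library. This is a conceptual analogy only (threads in one process, where Spark uses executors across a cluster), and the function names are invented for the example:

```python
# Conceptual sketch, not Spark itself: the "driver" splits a dataset into
# partitions and dispatches each one to a worker, loosely mirroring how
# Spark parallelizes an operation across executors.
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Work done per partition; Spark would run this on an executor.
    return [x * x for x in partition]

def parallelize_and_map(data, num_partitions=4):
    # Driver-side logic: partition the data, dispatch, then gather results.
    size = max(1, len(data) // num_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(process_partition, partitions)
    # Flatten per-partition results back into one collection.
    return [x for part in results for x in part]

print(parallelize_and_map(list(range(8))))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The key point the sketch preserves is the division of labor: the driver never touches individual elements, it only decides how the data is partitioned and what operation each partition receives — in Spark, that operation is a transformation on an immutable RDD.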

How Apache Spark Works

Why should you care how Apache Spark works? To get the most out of Spark, it is important to understand some of the principles used in its design and, at a cursory level, how Spark programs are executed. This article introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered … read the rest