Apache Spark is an open source cluster computing framework for batch and stream processing. The framework originated at the AMPLab at UC Berkeley in 2009, became an Apache project in 2013, and was promoted to a top-level Apache project in 2014. It is currently supported by Databricks, which was founded by many of Spark's original creators.
At the heart of Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction representing an immutable collection of objects that can be partitioned across a computing cluster. Operations on RDDs can likewise be split across the cluster, leading to highly parallelizable processing. RDDs can be created from plain text files, SQL databases, NoSQL stores (such as Cassandra and Riak), Hadoop InputFormats, or even programmatically from in-memory collections. Much of the Spark Core API is built on the RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.