Get started with Apache Spark

Apache Spark is an open source cluster computing framework for batch and stream processing. The framework originated at UC Berkeley's AMPLab in 2009, entered the Apache Incubator in 2013, and was promoted to an Apache Top-Level Project in 2014. It is commercially supported by Databricks, a company founded by many of the original creators of Spark.

At the heart of Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be partitioned across a computing cluster. Operations on RDDs can therefore also be split across the cluster, leading to highly parallelizable processing. RDDs can be created from plain text files, SQL databases, NoSQL stores (such as Cassandra and Riak), Hadoop InputFormats, or programmatically from in-memory collections. Much of the Spark Core API is built on the RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
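To make the partitioning idea concrete, here is a minimal pure-Python sketch (not Spark itself) of how a dataset split into immutable partitions can be mapped and filtered independently on each partition, with the partial results then reduced into a final answer. The dataset and thread pool here are stand-ins for an RDD and a cluster:

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor  # stand-in for a cluster of workers

# A hypothetical dataset, split into immutable partitions (the RDD idea)
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def process(part):
    # map: square each element; filter: keep only even squares.
    # Each partition is processed independently, so this work parallelizes.
    return [x * x for x in part if (x * x) % 2 == 0]

with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(process, partitions))

# reduce: aggregate the partial sums from every partition
total = reduce(lambda a, b: a + b, (sum(p) for p in mapped))
print(total)  # 4 + (16 + 36) + 64 = 120
```

In Spark the equivalent pipeline would be expressed against the RDD API itself (e.g. `sc.parallelize(data).map(...).filter(...).reduce(...)`), with the framework handling the partitioning, scheduling, and fault tolerance that this sketch glosses over.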


from InfoWorld Big Data

