Apache Spark Machine Learning Blueprints
上QQ阅读APP看书,第一时间看更新

Spark overview and Spark advantages

In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.

After this section, readers will form a basic understanding of Apache Spark as well as a good understanding of some important machine learning benefits from utilizing Apache Spark.

Spark overview

Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark was started as a research project at the AMPLab of the University of California at Berkeley in 2009, and then in 2010 it became fully open sourced as it was donated to the Apache Software Foundation. Since then, Apache Spark has experienced exponential growth, and now Spark is the most active open source project in the big data field.

Spark's computing utilizes an in-memory distributed computational approach, which makes Spark computing among the fastest, especially for iterative computation. It can run up to 100 times faster than Hadoop MapReduce, according to many tests that have been performed.

Apache Spark has a unified platform, which consists of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All of these four libraries have Python, Java and Scala programming APIs.

Besides the above mentioned four built-in libraries, there are also tens of packages available for Apache Spark, provided by third parties, which can be used for handling data sources, machine learning, and other tasks.

Spark overview

Apache Spark has a 3 month circle for new releases, with Spark version 1.6.0 released on January 4 of 2016. Apache Spark release 1.3 had DataFrames API and ML Pipelines API included. Starting from Apache Spark release 1.4, the R interface (SparkR) is included as default.

Note

To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.

To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.

Spark advantages

Apache Spark has many advantages over MapReduce and other big data computing platforms. Among them, the distinguished two are that it is fast to run and fast to write.

Overall, Apache Spark has kept some of MapReduce's most important advantages like that of scalability and fault tolerance, but extended them greatly with new technologies.

In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher performance batch processing in Hadoop.

Apache Spark has in-memory processing capabilities, and uses a new data abstraction method, Resilient Distributed Dataset (RDD), which enables highly iterative computing and reactive applications. This also extended its fault tolerance capability.

At the same time, Apache Spark has made complex pipeline representation easy with only a few lines of code needed. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and also enable users to apply that insight in-time to drive outcomes.

As summarized by the Apache Spark team, Spark enables:

  • Iterative algorithms in Machine Learning
  • Interactive data mining and data processing
  • Hive-compatible data warehousing that can run 100x faster
  • Stream processing
  • Sensor data processing

To a practical data scientist working with the above, Apache Spark easily demonstrates its advantages when it is adopted for:

  • Parallel computing
  • Interactive analytics
  • Complex computation

Most users are satisfied with Apache Spark's advantages in speed and performance, but some also noted that Apache Spark is still in the process of maturing.

Note

http://www.svds.com/use-cases-for-apache-spark/ has some examples of materialized Spark benefits.