What is Apache Spark-Hadoop?

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.


Apache Spark has become the engine that enhances many of the capabilities of the ubiquitous Apache Hadoop environment. For Big Data, Apache Spark meets many requirements and runs natively on Apache Hadoop's YARN. By running Apache Spark in an Apache Hadoop environment, you gain all the security, governance, and scalability inherent to that platform. Apache Spark is also very well integrated with Apache Hive and gains access to all Apache Hadoop tables using integrated security.

The engine was developed at the University of California, Berkeley's AMPLab and was donated to the Apache Software Foundation in 2013.

Spark jobs can be written in Java, Scala, Python, R, and SQL. Spark provides out-of-the-box libraries for machine learning, graph processing, streaming, and SQL-like data processing. It also has an optimized engine that supports general execution graphs.

Apache Spark can process data up to about 100x faster than Hadoop MapReduce in memory and about 10x faster on disk. This is made possible by reducing the number of read and write operations to disk.
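To make the effect of fewer disk round trips concrete, here is a toy sketch in plain Python (it is not Spark code). It assumes a hypothetical iterative job that needs the same dataset on every pass. The names `run_from_disk` and `run_in_memory` are illustrative only, and the sketch counts file reads rather than measuring any real speedup.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def load(path, counter):
    """Read the dataset from disk and record that a disk read happened."""
    counter["reads"] += 1
    with open(path) as f:
        return [int(line) for line in f]


def run_from_disk(path, iterations):
    """Every iteration reloads its input from disk."""
    counter = {"reads": 0}
    total = 0
    for _ in range(iterations):
        total += sum(load(path, counter))
    return total, counter["reads"]


def run_in_memory(path, iterations):
    """Load once, keep the data in memory, and reuse it on every iteration."""
    counter = {"reads": 0}
    data = load(path, counter)  # the only disk read
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total, counter["reads"]


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = os.path.join(tmp, "numbers.txt")
    write_dataset(dataset_path, 1000)
    disk_total, disk_reads = run_from_disk(dataset_path, iterations=5)
    mem_total, mem_reads = run_in_memory(dataset_path, iterations=5)

print(f"disk reads when reloading: {disk_reads}, when kept in memory: {mem_reads}")
```

Both loops compute the same result, but the second touches the disk once instead of once per iteration, which is the kind of saving that matters most for iterative algorithms.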


Apache Spark is built on two main abstractions:

1. Resilient Distributed Dataset (RDD):

An RDD is an immutable (read-only), fundamental collection of elements that can be operated on across many machines at the same time (parallel processing).
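The idea of a read-only collection that is processed partition by partition can be sketched in plain Python. This is a toy stand-in, not Spark's RDD API: `ToyRDD` and its methods are hypothetical names, threads on one machine stand in for the machines of a cluster, and transformations run eagerly here for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A read-only, partitioned collection (a toy model, not Spark's RDD)."""

    def __init__(self, partitions):
        # Store partitions as tuples so they cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        """Split items into at most num_partitions contiguous chunks."""
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls(items[i:i + size] for i in range(0, len(items), size))

    def _transform(self, fn):
        """Apply fn to every partition in parallel and return a NEW ToyRDD."""
        with ThreadPoolExecutor() as pool:
            return ToyRDD(pool.map(fn, self._partitions))

    def map(self, f):
        return self._transform(lambda part: [f(x) for x in part])

    def filter(self, pred):
        return self._transform(lambda part: [x for x in part if pred(x)])

    def collect(self):
        """Gather all partitions back into a single list."""
        return [x for part in self._partitions for x in part]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(numbers.collect())           # the original dataset is unchanged
print(squares_of_evens.collect())  # a new dataset derived from it
```

Each transformation returns a new collection and leaves its input untouched, which is what makes it safe to work on the partitions independently.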

2. Directed Acyclic Graph (DAG):

The Directed Acyclic Graph is the scheduling layer of the Apache Spark architecture and implements stage-oriented scheduling.
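Stage-oriented scheduling can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions, not Spark's scheduler: the lineage is a linear chain (the simplest possible DAG), the `Op` type and `split_into_stages` function are hypothetical, and a new stage is assumed to begin wherever an operation must regroup data across partitions (a shuffle).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    needs_shuffle: bool  # True if the op must regroup data across partitions


def split_into_stages(ops):
    """Group consecutive ops into stages, starting a new stage at each shuffle."""
    stages = []
    current = []
    for op in ops:
        if op.needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


# A hypothetical word-count lineage.
lineage = [
    Op("read_lines", False),
    Op("split_words", False),
    Op("pair_with_one", False),
    Op("sum_by_word", True),    # grouping by key needs data from all partitions
    Op("sort_by_count", True),  # a global sort also regroups the data
]

stages = split_into_stages(lineage)
for i, stage in enumerate(stages):
    print(f"stage {i}: {[op.name for op in stage]}")
```

Operations inside one stage can be pipelined on each partition without moving data, while a stage boundary marks a point where the previous stage has to finish first.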

Advantages of Spark

• Spark provides a unified platform for batch processing, structured data handling, and streaming.
• Compared with Hadoop MapReduce, Spark code is much easier to write and use.
• The most significant feature of Spark is that it abstracts away the parallel programming aspect. Spark core hides the complexities of distributed storage, computation, and parallel programming.

Spark Libraries:

1. Spark SQL

Spark SQL provides a SQL-like interface for processing structured data.

2. Spark Streaming

Spark Streaming is suited for applications that deal with data flowing in real time, such as processing Twitter feeds.

3. Spark MLlib

MLlib, short for Machine Learning Library, is the machine learning library that Spark provides.

4. Spark GraphX

GraphX is the graph processing library that Spark provides.
