{"id":2205,"date":"2019-09-21T17:41:15","date_gmt":"2019-09-21T17:41:15","guid":{"rendered":"https:\/\/nub8.net\/?p=2205"},"modified":"2026-04-08T14:06:30","modified_gmt":"2026-04-08T14:06:30","slug":"what-is-apache-spark-hadoop","status":"publish","type":"post","link":"https:\/\/www.nub8.net\/es\/what-is-apache-spark-hadoop\/","title":{"rendered":"What is Apache Spark-Hadoop?"},"content":{"rendered":"<p>Apache Spark is a lightweight, fast cluster computing technology designed for high-speed computation. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-7526 aligncenter\" src=\"https:\/\/nub8.net\/wp-content\/uploads\/2019\/09\/hadoop-300x130.png\" alt=\"\" width=\"314\" height=\"136\" \/><\/p>\n<h2><strong>Features<\/strong><\/h2>\n<p>Apache Spark has become the engine that enhances many of the capabilities of the ubiquitous Apache Hadoop environment. For Big Data, Apache Spark meets many requirements and runs natively on Apache Hadoop\u2019s YARN. By running Apache Spark in an Apache Hadoop environment, you gain all the security, governance, and scalability inherent to that platform. Apache Spark is also very well integrated with Apache Hive and gains access to all Apache Hadoop tables using integrated security.<\/p>\n<p>The engine was developed at the University of California, Berkeley&#8217;s AMPLab and was donated to the Apache Software Foundation in 2013.<\/p>\n<p>Spark jobs can be written in Java, Scala, Python, R, and SQL. 
It provides out-of-the-box libraries for machine learning, graph processing, streaming, and SQL-like data processing. It also has an optimized engine that supports general execution graphs.<\/p>\n<p>With Apache Spark, data processing can run up to about 100x faster in memory and 10x faster on disk than with Hadoop MapReduce. This is made possible by reducing the number of read and write operations to disk.<\/p>\n<h2><strong>Architecture<\/strong><\/h2>\n<p>The Apache Spark architecture is built on two main abstractions:<\/p>\n<h2><strong>1. Resilient Distributed Dataset (RDD): <\/strong><\/h2>\n<p>An RDD is an immutable (read-only), fundamental collection of elements that can be operated on across many machines at the same time (parallel processing).<\/p>\n<h2><strong>2. Directed Acyclic Graph (DAG): <\/strong><\/h2>\n<p>The Directed Acyclic Graph is the scheduling layer of the Apache Spark architecture that implements stage-oriented scheduling.<\/p>\n<h2><strong>Advantages of Spark<\/strong><\/h2>\n<p>\u2022 Spark provides a unified platform for batch processing, structured data handling, and streaming.<br \/>\n\u2022 Compared with Hadoop MapReduce, Spark code is much easier to write and use.<br \/>\n\u2022 The most significant feature of Spark is that it abstracts the parallel programming aspect. Spark Core abstracts the complexities of distributed storage, computation, and parallel programming.<\/p>\n<h2><strong>Spark Libraries:<\/strong><\/h2>\n<h2><strong>1. Spark SQL<\/strong><\/h2>\n<p>Spark SQL provides a SQL-like interface for processing structured data.<\/p>\n<h2><strong>2. Spark Streaming<\/strong><\/h2>\n<p>Spark Streaming is suited to applications that deal with data flowing in real time, such as processing Twitter feeds.<\/p>\n<h2><strong>3. Spark MLlib<\/strong><\/h2>\n<p>MLlib is short for Machine Learning Library, the machine learning library that Spark provides.<\/p>\n<h2><strong>4. 
Spark GraphX<\/strong><\/h2>\n<p>GraphX is the library that Spark provides for graph processing.<\/p>\n<p><a href=\"https:\/\/nub8.net\/wp-content\/uploads\/2019\/09\/spark.apache.org\">Official website<\/a><\/p>\n<p><a href=\"https:\/\/www.nub8.net\/es\/apache-hadoop\/\"> About Apache Hadoop<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Apache Spark is a lightweight, fast cluster computing technology designed for high-speed computation. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":7526,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31],"tags":[],"class_list":["post-2205","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/posts\/2205","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/comments?post=2205"}],"version-history":[{"count":1,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/posts\/2205\/revisions"}],"predecessor-version":[{"id":11907,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/posts\/2205\/revisions\/11907"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.nub8.net\/es\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/media?parent=2205"}],"wp:term":[{"taxonomy":"category",
"embeddable":true,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/categories?post=2205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.nub8.net\/es\/wp-json\/wp\/v2\/tags?post=2205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}