Best Big Data Framework: Apache Spark vs Hadoop MapReduce

Spark and Hadoop are popular, well-known Apache projects in the Big Data world, and we decided to take a closer look at both of them. Before we start, however, it's worth mentioning that a direct comparison of Hadoop and Spark is difficult: they do many of the same things, but they do not overlap in some areas. There are business applications where Hadoop outperforms the newcomer Spark, but Spark has earned its place in the big data space because of its speed and ease of use.

Spark is a standalone solution in itself. However, it can also run in Hadoop clusters through YARN. Spark is basically a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem.

In fact, the key difference between Spark and MapReduce lies in the approach to processing: Spark can process data in memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing speed differs significantly: Spark may be up to 100 times faster. However, the volume of data each can handle also differs: Hadoop MapReduce is able to work with far larger data sets than Spark.

Spark vs Hadoop MapReduce: Performance

Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. It achieves this by reducing the number of read/write cycles to disk and by keeping intermediate data in memory.

MapReduce reads from and writes to disk between steps, which slows down processing.

Spark vs Hadoop MapReduce: Ease of Use

Spark is well known for its performance, but it is also fairly well known for its ease of use: its core abstraction, the RDD (Resilient Distributed Dataset), enables a user to process data using high-level operators. It also provides rich APIs in Java, Scala, Python, and R.

Hadoop MapReduce is written in Java, and MapReduce jobs are comparatively difficult to program. Although Pig makes this easier, it takes some effort to learn its syntax.
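To make the difference in programming style more concrete, here is a minimal pure-Python sketch of a word count. It does not use Hadoop or Spark at all. The function names (map_phase, shuffle_phase, reduce_phase, word_count_chained) are hypothetical and only imitate the two styles: explicit map, shuffle and reduce phases on one side, and a single chained expression built from high-level operators on the other.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a MapReduce mapper would."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, imitating the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a MapReduce reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count_explicit(lines):
    """Word count written as three separate, explicit phases."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_chained(lines):
    """The same job as one chained expression (illustrative only, not Spark)."""
    return dict(Counter(word.lower() for line in lines for word in line.split()))


if __name__ == "__main__":
    sample = ["Spark and Hadoop", "Hadoop MapReduce and Spark"]
    print(word_count_explicit(sample))
    print(word_count_chained(sample))
```

Both functions return the same counts. The point of the sketch is only that the explicit version makes the developer spell out every phase, while the chained version hides them behind higher-level operations.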

When it comes to installation and maintenance, Spark is not bound to Hadoop. However, both Spark and Hadoop MapReduce are included in distributions by Cloudera (CDH 5) and Hortonworks (HDP 2.2).

Spark vs Hadoop MapReduce: Data Processing

Hadoop MapReduce is an excellent batch processing engine. It follows sequential steps. MapReduce performs batch processing by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster and so on.

Spark also performs operations similar to Hadoop MapReduce, but it performs all the tasks in one step and in memory. It reads data from the cluster, performs its operation on the data, and then writes it back to the cluster. Additionally, Spark includes its own graph computation library, namely GraphX. GraphX enables users to view data as graphs and collections.
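The two processing models described above can be imitated with a small pure-Python sketch. It is a toy model under stated assumptions, not how either engine is implemented: the three steps (square, keep_even, total) and both runner functions are hypothetical, and local JSON files stand in for the cluster's storage. The first runner writes every intermediate result to disk and reads it back before the next step, and the second keeps everything in memory.

```python
import json
import os
import tempfile


def square(values):
    return [v * v for v in values]


def keep_even(values):
    return [v for v in values if v % 2 == 0]


def total(values):
    return [sum(values)]


# Hypothetical three-step job used by both runners.
STEPS = [square, keep_even, total]


def run_with_disk_between_steps(data, steps, workdir):
    """Write each step's output to a file, then read it back for the next step."""
    current = data
    files_written = 0
    for i, step in enumerate(steps):
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(current), f)
        files_written += 1
        with open(path) as f:
            current = json.load(f)
    return current, files_written


def run_in_memory(data, steps):
    """Pass each step's output directly to the next step without touching disk."""
    current = data
    for step in steps:
        current = step(current)
    return current


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_between_steps(data, STEPS, workdir))
    print(run_in_memory(data, STEPS))
```

Both runners produce the same result, but the first one pays for a write and a read at every step, which is the overhead the in-memory approach avoids.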

Spark vs Hadoop MapReduce: Cost

Both Hadoop MapReduce and Spark are Apache projects, and both are free, open source software products. However, there are costs associated with running either platform, both in personnel and in hardware.

MapReduce uses standard amounts of memory because its processing is disk-based. As a result, organizations running MapReduce have to purchase faster disks and a lot of disk space. MapReduce also needs more systems in order to distribute the disk I/O across them.

Spark, on the other hand, requires a lot of memory but can manage with a standard amount of disk running at standard speed. However, Spark systems cost more because of the large amounts of RAM required to run everything in memory.

Spark vs Hadoop MapReduce: Real-Time Analysis

Real-time data analysis means processing data generated by real-time event streams that arrive at rates of millions of events per second, Twitter data for instance. The strength of Spark lies in its ability to support data streaming along with distributed processing.

MapReduce falls short when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.
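Spark Streaming handles a live stream by cutting it into a sequence of small batches (micro-batches) and processing each one as it arrives. The sketch below imitates that idea in pure Python under simplifying assumptions: it does not use Spark, the function names (micro_batches, count_hashtags_per_batch) are hypothetical, and batches are cut by event count rather than by a time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of at most batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_hashtags_per_batch(events, batch_size):
    """Process the stream batch by batch, emitting one Counter per batch."""
    for batch in micro_batches(events, batch_size):
        yield Counter(batch)


if __name__ == "__main__":
    # A short list stands in for an unbounded stream of hashtags.
    stream = ["#spark", "#hadoop", "#spark", "#bigdata", "#spark"]
    for result in count_hashtags_per_batch(stream, batch_size=2):
        print(dict(result))
```

Because results are produced after every small batch rather than after the whole data set has been read, output becomes available while the stream is still running, which a single large batch job cannot offer.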

Spark vs Hadoop MapReduce: Machine Learning

Spark has MLlib, a built-in machine learning library, while Hadoop needs a third-party library to provide one. MLlib offers out-of-the-box algorithms that also run in memory, and they can be tuned and adjusted.

Conclusion

Although the existence of two Big Data frameworks is often seen as a battle for dominance of one over the other, it is important to understand that they are not competing with one another. In fact, they complement each other quite well. When making the final decision, businesses should consider each framework from the perspective of their particular needs.

Anna Kozik

Business Development Manager
