The Evolution of Hadoop

Ashwin Ramachandran | November 19, 2020

When Hadoop was initially released in 2006, its value proposition was revolutionary—store any type of data, structured or unstructured, in a single repository free of limiting schemas, and process that data at scale across a compute cluster built of cheap, commodity servers. Gone were the days of trying to scale up a legacy data warehouse on-premises built on expensive hardware. Processing more data was as simple as adding a node in the cluster. As the variety and velocity of data continued to proliferate, Hadoop provided a mechanism to leverage all of that data to answer pressing business questions.

We’ve come a long way since Hadoop first burst onto the scene. As we look at the cloud transformation organizations are embarking on, we at Precisely would like to trace how Hadoop has changed since then and where we see it going.

Early days

Hadoop’s initial form was quite simple: a resilient distributed filesystem, HDFS, tightly coupled with a batch compute model, MapReduce, to process the data stored in the distributed file system. Users would write MapReduce programs in Java to read, process, sort, aggregate, and manipulate data to derive key insights. While impressive, the ongoing challenge of finding developers comfortable writing Java MapReduce code, and the inherent complexity of doing so, led to the release of query engines like Hive and Impala. With these technologies, users familiar with SQL could leverage the power of Hadoop without the need to understand MapReduce code.
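
To make the MapReduce model concrete, here is a minimal single-process sketch of a word count, the customary first MapReduce program. It is written in plain Python rather than Java and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made purely for illustration. In a real cluster, the framework would run the map and reduce steps in parallel across nodes and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input standing in for lines of a file in HDFS.
lines = ["Hadoop stores data", "Hadoop processes data at scale"]
counts = run_job(lines)
print(counts)
```

Even this toy version shows why SQL engines were welcome: a query such as a `GROUP BY` with a `COUNT` expresses the same result without the user having to write the map and reduce steps by hand.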


Apache Spark joins the party

Hadoop took a significant step forward with the release of YARN in 2012 as an “operating system” of sorts for the platform. With YARN, MapReduce was no longer the only data processing paradigm available on Hadoop. This was a monumental change, as it signaled Hadoop’s shift from being a single product to an ecosystem with a variety of different tools in the stack.

As Hadoop was maturing, Apache Spark was being developed at UC Berkeley. Designed as a scalable compute framework for memory-intensive workloads, with no native storage, Spark was a natural fit within the Hadoop ecosystem. Paired with HDFS for data storage, Spark became a compute alternative to MapReduce for workloads within Hadoop. This allowed users to leverage Spark for machine-learning applications, accelerated ETL workloads, and stream processing with Spark Streaming. Clearly, Hadoop was growing to accommodate a wider variety of workloads.

Hadoop in the age of the cloud

This brings us to the cloud transformation of today. While there has been significant consolidation in the Hadoop vendor market over the past five years, there are still a variety of Hadoop offerings available to organizations. AWS, Azure, and GCP all have their own Hadoop-as-a-service offerings (EMR, HDInsight, and Dataproc, respectively).

On the other end of the spectrum, Cloudera offers the Cloudera Data Platform (CDP) across datacenter, private cloud, and public cloud. What’s especially remarkable about CDP and its architecture is that it is Hadoop reimagined in a cloud-native context. Gone are the classic Hadoop zoo animals, and in their place are business use cases and experiences tailored to those use cases. In the curated experiences, users can spin up data warehouses (based on Hive), machine learning experiences (based on Spark), and more. All of these experiences are powered by Kubernetes, offering the scalability, compute isolation, and ease of deployment that users expect in the cloud.

It’s clear that Hadoop and its definition have continued to evolve since the platform’s introduction nearly 15 years ago. What started as a purely on-premises offering built on HDFS and MapReduce is now entirely re-imagined within the cloud, with Kubernetes, cloud object storage, Spark, and more now in the ecosystem. Clearly, Hadoop has grown to meet the needs of the cloud opportunity, and it will be extremely exciting to see where it goes in the next 15 years.


