Testing the Waters: How to Get a Hadoop Data Lake Set Up Right
Researching how to leverage big data quickly yields results about data lakes. Essentially, a data lake is a storage repository that holds large quantities of data in its original, unaltered format. By nature, it accepts all forms of structured, semi-structured, and unstructured data. While a data lake can be built on a variety of different architectures, Hadoop is a popular option.
Hadoop has some distinct advantages, including the fact that it’s so widely used. Most of the analytical tools you’ll want to use with your data lake are Hadoop compatible. Together they make up the Hadoop ecosystem, a broad array of products like Hive, Spark, and Storm.
Your research will also yield no shortage of warnings about data swamps. This is the derogatory term for a data lake that’s so poorly designed it becomes impossible (or very nearly so) to retrieve the data with normal queries. The phenomenon is quite real, and the warnings are worth heeding. The only sure way to keep a data lake from devolving into a data swamp is to endow the data with rich metadata, but doing that well brings its own technical challenges.
Before diving into the details, it helps to take a broader view of how to make sure your Hadoop data lake is set up right the first time.
Define Use Cases for the Data
The best thing about a data lake is that you don’t have to specify the uses for the data ahead of time. The worst thing about a data lake is having to specify uses for the data ahead of time.
As is often the case with technology, the key selling point for a data lake can easily become its worst feature. One of the benefits usually touted about data lakes is that you don’t have to know exactly how the data will eventually be used. Well, there’s both truth and myth in this statement.
You do need to have some use cases in mind, because this helps you structure and implement the data lake so that later queries and analysis will work as intended. Use cases for a data lake don’t have to be as honed and refined as those for a traditional SQL-type database, but you do need to do your due diligence during discovery. What are the potential use cases for your big data initiative? Set up the data lake according to the use cases you identify.
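To make this concrete, one common approach is to let the use cases drive the directory layout of the lake, with separate zones for raw, cleansed, and curated data. Here’s a minimal sketch using the standard Hadoop FileSystem API; the zone names and paths (clickstream, campaign_attribution, and so on) are hypothetical examples rather than a prescribed layout.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LakeLayout {
    public static void main(String[] args) throws Exception {
        // Connect to HDFS using the cluster configuration found on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical zones derived from an identified use case (clickstream analytics)
        String[] zones = {
            "/lake/raw/clickstream/2024/06/01",            // landing zone for raw, unaltered data
            "/lake/cleansed/clickstream",                   // validated, de-duplicated data
            "/lake/curated/marketing/campaign_attribution"  // data shaped for a specific use case
        };

        for (String zone : zones) {
            fs.mkdirs(new Path(zone)); // creates the directory (and any parents) if missing
        }

        fs.close();
    }
}
```

A layout along these lines keeps the raw data untouched while giving analysts predictable places to look, which goes a long way toward keeping the lake from becoming a swamp.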
Data Management & Security
Historically, one of the main complaints about Hadoop has been the lack of security. While it’s true that Hadoop doesn’t ship with the most secure default settings, it actually has pretty good security features that the user can activate.
Before populating your data lake, take time to define and establish the data management strategy and security protocols. The Hadoop ecosystem provides additional tools that you can add on top of the basic framework for even greater security and manageability.
It’s also essential that you keep a good inventory of your data as it is loaded into the data lake. For example, when you add regulated or otherwise sensitive data (like intellectual property or trade secrets), you need to be sure the security of your Hadoop architecture is adjusted to accommodate the greater protection it requires.
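As a simple illustration of that adjustment, directories holding sensitive data can be locked down programmatically with the same FileSystem API. The sketch below uses a hypothetical /lake/raw/payments path and hypothetical account and group names; changing ownership generally requires administrative privileges, and many organizations also layer ecosystem tools such as Kerberos authentication or Apache Ranger on top of this.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LockDownSensitiveData {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical directory holding regulated payment data
        Path sensitive = new Path("/lake/raw/payments");

        // Restrict access to the owning user and group only (rwxr-x---)
        fs.setPermission(sensitive, new FsPermission((short) 0750));

        // Assign ownership to a dedicated service account and a compliance group
        // (hypothetical names; this call typically requires superuser rights)
        fs.setOwner(sensitive, "etl_svc", "compliance");

        fs.close();
    }
}
```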
Choose Your Weapons
Don’t be caught in a gunfight with a knife. Be sure you’re selecting weapons that work with your data and needs. For example, if you don’t need streaming data, there’s no need to include streaming tools.
At the core of Hadoop is HDFS, the distributed file storage layer, paired with MapReduce as its default processing engine, but not much more. To build a data lake with Hadoop, you’ll need to add more weapons. The Hadoop ecosystem is huge and growing continually. For instance, Apache Flink is extraordinarily promising, but it’s just beginning to dawn on the Hadoop horizon.
Most organizations in the process of building a brand new Hadoop data lake are better served by some of the old standbys. You can add and modify your infrastructure as you get accustomed to both Hadoop and utilizing a data lake (which is a bit different from querying the old SQL DB). Your options are many and growing, but here are some basic Hadoop weapons all newbies should consider.
- YARN – Hadoop’s resource manager, which allocates cluster resources and schedules jobs within the Hadoop environment.
- HBase – This is a columnar NoSQL database (NoSQL is incredibly popular within the Hadoop ecosystem. HBase is just one NoSQL DB option).
- Hive – This product layers a SQL-like query language (HiveQL) on top of data stored in HDFS. It’s ideal for those who are used to SQL (see the brief example after this list).
- Impala, HAWQ and Tez – Impala and HAWQ are massively parallel processing (MPP) SQL engines built for ultra-fast, interactive queries, while Tez is an execution framework that speeds up tools like Hive and Pig.
- Pig, Flume and Sqoop – Have you gotten used to these crazy names yet? We haven’t either. Flume and Sqoop are data ingestion tools (Flume for log and event data, Sqoop for moving data to and from relational databases), while Pig provides a scripting language for transforming data once it lands in the lake.
- Kafka – This is an extremely useful and powerful distributed messaging service.
- Storm – A distributed stream processing engine (Hint: you only need streaming products if you’re actually streaming data. Not all big data analytics operations even need streaming capabilities).
- Spark – This product is a speedy little general-purpose data processing engine.
Obviously, a comprehensive listing of the entire Hadoop ecosystem is beyond the scope of this article, but these are some of the most popular tools to consider.
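As noted in the Hive entry above, the appeal for SQL-minded teams is that Hive exposes files in HDFS through HiveQL and a standard JDBC interface. Here’s a minimal sketch of querying a Hive table over JDBC, assuming the hive-jdbc driver is on the classpath; the HiveServer2 hostname, credentials, and the clickstream table are hypothetical stand-ins for your own environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database
        String url = "jdbc:hive2://hive-server.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {
            // A familiar SQL-style query, even though the data lives in HDFS files
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```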
Now it’s time to begin populating your Hadoop data lake. Precisely offers a line of products that are immensely helpful in populating your Hadoop data lake, whether you’re offloading data from a mainframe or streaming data from other sources. We also have tools to help you get your data into AWS Marketplace or another cloud-based solution.
Quality, Quality, Quality
The age-old computing concept, “Garbage in, garbage out,” still applies today. Whether you’re populating a cute little DB behind your CMS or building an enormous Hadoop data lake, you have to have data quality policies and procedures in place. Otherwise, none of your analytics projects will be worth the time it takes to query the data.
Data lakes, by nature, accept data from a variety of disparate sources – your legacy software systems, your desktop and mobile users, and even IoT devices. There has to be some quality control in place to make sure the data lake doesn’t fill up with useless, redundant, or corrupted information. Establish workflows that assure the quality of the data as it is ingested into the Hadoop environment. Sounds like a lot, doesn’t it? Don’t worry. There are great tools to help.
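What might such a workflow look like in practice? One lightweight pattern is to validate and de-duplicate data as it moves from the raw landing zone into a cleansed zone. The Spark sketch below assumes hypothetical HDFS paths and JSON events that carry event_id and event_time fields; the required-field checks and de-duplication rules would naturally be tailored to your own data.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IngestQualityCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ingest-quality-check")
                .getOrCreate();

        // Hypothetical landing zone of newly arrived JSON events
        Dataset<Row> raw = spark.read().json("hdfs:///lake/raw/events/2024/06/01");

        // Drop records missing required fields, then remove duplicate events
        Dataset<Row> clean = raw
                .filter("event_id IS NOT NULL AND event_time IS NOT NULL")
                .dropDuplicates(new String[] {"event_id"});

        // Write the validated data into the cleansed zone as Parquet
        clean.write()
                .mode("overwrite")
                .parquet("hdfs:///lake/cleansed/events/2024/06/01");

        spark.stop();
    }
}
```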
To learn more, read our eBook, The Hidden Data Lake Issue That is Hiding in Plain Sight.