How to Clean Big Data for Trusted Insights
You’ve probably witnessed it, and maybe you’re doing it yourself. Many organizations are dumping as much data into a data lake as they can, trying to reach every data source in the enterprise and putting all of that data into the lake. We see it here at Precisely, with vast amounts of data from mainframes and other sources heading to the data lake for analytics and other use cases. But what if you can’t trust the data because it contains errors, duplicates (like duplicate customer records!), and generally just “dirty data”?
You need to clean it! Trillium DQ for Big Data does just that. To get more insight into the challenges the product tackles when cleaning big data, let’s walk through a real-world data quality example: creating a single view of a customer, or of any other entity, such as a supplier or product.
Parsing and standardization are first steps to clean big data
There is a series of data quality steps to take to clean big data and de-duplicate it into a single view. To create a single view of a customer or product, for instance, we need everything in a standard format to get the best match. Let’s talk through the first of these steps: parsing and standardization.
Let’s use a simple example of parsing and standardization.
100 St. Mary St.
As humans, we know this is an address, 100 Saint Mary Street, because we understand what each position in the string means. Did you know that postal address formats can vary from country to country? For example, in Argentina, the house number comes after the street name.
Now think about all the different formats for names, company names, addresses, product names, and other inventoried items such as books, toys, automobiles, computers, manufacturing parts, etc.
Now think about different languages.
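To make this concrete, here is a minimal Python sketch of the idea: split an address into components, then expand abbreviations based on where they appear. The function name and lookup tables are hypothetical toy examples, not Trillium’s rule sets, which carry country-specific knowledge for exactly this kind of context.

```python
# Toy lookup tables; a real data quality product ships country-specific rule sets.
TITLE_ABBREVIATIONS = {"st": "Saint"}    # "St." before a proper name usually means "Saint"
STREET_TYPES = {"st": "Street", "ave": "Avenue", "rd": "Road"}

def parse_and_standardize(address: str) -> dict:
    """Split a simple US-style address into components and expand abbreviations."""
    tokens = [t.strip(".").lower() for t in address.split()]
    house_number, *rest = tokens

    # The trailing token is the street type; everything in between is the street name.
    street_type = STREET_TYPES.get(rest[-1], rest[-1].title())
    name_tokens = rest[:-1]

    # A leading "St." in the name position is expanded to "Saint", not "Street".
    if name_tokens and name_tokens[0] in TITLE_ABBREVIATIONS:
        name_tokens[0] = TITLE_ABBREVIATIONS[name_tokens[0]]

    return {
        "house_number": house_number,
        "street_name": " ".join(t.title() for t in name_tokens),
        "street_type": street_type,
    }

print(parse_and_standardize("100 St. Mary St."))
# {'house_number': '100', 'street_name': 'Saint Mary', 'street_type': 'Street'}
```

Notice that the same token, “St.”, resolves differently depending on its position, which is why parsing has to happen before standardization.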
Next step: Data matching for the best single records
Once we have all this data in a common, standard format, we can match it. But even this can be complex. You can’t rely on customer IDs, for instance, to ensure de-duplicated data. Think about how many different ways customers are represented in each of the source systems that have been feeding the data lake; matching is a hard problem.
Think about a name like Josh Rogers (we’ll pick on our CEO). The name could appear in many different formats, or even be misspelled, across your source systems and now in the data lake:
J. Rogers
Josh Rodgers
Joseph Rogers
Marketing analysts with a new product to promote must make sure they’re targeting the right customer or prospect. If Josh lives in a small town in zip code 60451 (New Lenox, IL), he’s probably the only Josh Rogers on his street.
But if his zip code is 10023 (the Upper West Side of New York City), there might be more than one person with that name at that address (think about a name like Bob Smith!). Matching is a complex problem, especially at the data volumes found in a data lake.
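Here is a small Python sketch of the kind of threshold-based matching this implies, using the standard library’s SequenceMatcher for similarity. The thresholds, field names, and sample addresses are hypothetical, and a real matcher applies many more rules (phonetic comparisons, address-aware logic, and so on); the point is simply that a fuzzy name match alone isn’t enough, you also need corroborating fields like the standardized address.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_match(rec_a: dict, rec_b: dict,
                      name_threshold: float = 0.8,
                      address_threshold: float = 0.9) -> bool:
    """Treat two records as the same person only if BOTH name and address are close."""
    return (similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and similarity(rec_a["address"], rec_b["address"]) >= address_threshold)

master = {"name": "Josh Rogers", "address": "100 Saint Mary Street, New Lenox IL 60451"}
candidates = [
    {"name": "Josh Rodgers",  "address": "100 Saint Mary Street, New Lenox IL 60451"},
    {"name": "Joseph Rogers", "address": "200 W 70th Street, New York NY 10023"},
]

for candidate in candidates:
    print(candidate["name"], "->", is_probable_match(master, candidate))
# Josh Rodgers -> True    (misspelled name, but same standardized address)
# Joseph Rogers -> False  (similar name, different address)
```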
The last step is to commonize the matched records and survive the best fields from each to make up the single best record.
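A simple way to picture survivorship: for each field, pick the best non-empty value across the matched records. The rule below (“most recently updated non-empty value wins”) and the sample records are hypothetical illustrations, not Trillium’s configuration; real deployments combine rules such as source trust, completeness, and recency.

```python
from datetime import date

def survive(records):
    """Build one golden record by taking, per field, the most recently updated non-empty value."""
    fields = [k for k in records[0] if k != "last_updated"]
    golden = {}
    for field in fields:
        candidates = [r for r in records if r.get(field)]
        if candidates:
            golden[field] = max(candidates, key=lambda r: r["last_updated"])[field]
    return golden

matched_group = [
    {"name": "J. Rogers",   "email": "",                 "phone": "555-0100", "last_updated": date(2019, 3, 1)},
    {"name": "Josh Rogers", "email": "josh@example.com", "phone": "",         "last_updated": date(2021, 6, 15)},
]

print(survive(matched_group))
# {'name': 'Josh Rogers', 'email': 'josh@example.com', 'phone': '555-0100'}
```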
Now, let’s run this in big data
Creating the single best, enriched record is exactly what Trillium DQ for Big Data does. The product lets the user create and test the steps above locally, then leverage Precisely’s technology to execute them in big data frameworks such as Hadoop MapReduce or Spark. The user doesn’t need to know these frameworks, and the product is also future-proofed for new ones, which we all know are coming.
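For a sense of what happens under the hood, a hand-coded equivalent of just the standardize-and-dedup step in PySpark might look like the sketch below, reusing the parse_and_standardize function from earlier. The column names and data are hypothetical, and this is not Trillium’s generated code; the product builds and executes the equivalent processing for you, which is exactly the point.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

# Tiny stand-in for customer records already landed in the data lake.
df = spark.createDataFrame(
    [("Josh Rogers", "100 St. Mary St."), ("Josh Rogers", "100 Saint Mary Street")],
    ["name", "address"],
)

# Wrap the local standardization logic in a UDF so the same rules run across the cluster.
standardize_address = F.udf(lambda a: " ".join(parse_and_standardize(a).values()))

cleaned = (
    df.withColumn("std_address", standardize_address("address"))
      .dropDuplicates(["name", "std_address"])   # naive exact dedup after standardization
)
cleaned.show(truncate=False)                     # both input rows collapse to one record
```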
So, what makes Trillium DQ for Big Data different?
- The product has more matching capabilities than any other technology, ensuring you get that single view
- For postal addresses, we provide worldwide postal coverage and geocoding (latitude/longitude)
- Performance and scalability for execution in big data environments on large and growing volumes of data
For more information, read our eBook: Governing Volume: Ensuring Trust and Quality in Big Data