Data Integrity for AI: What’s Old is New Again
Artificial Intelligence (AI) is all the rage, and rightly so. By now most of us have experienced how GenAI and the large language models (LLMs) that fuel it are primed to transform the way we create, research, collaborate, engage, and much more.
Yet along with the AI hype and excitement come very appropriate sanity checks asking whether AI is ready for prime time. Can AI’s responses be trusted? Does the LLM capture all the relevant data and context required to deliver useful insights? Can it do so without bias? (Not to mention the stories about GenAI making up answers without any data to back them up!) Are we even allowed to use all the data, or are there copyright or privacy concerns?
These are all big questions about the accessibility, quality, and governance of the data being used by AI solutions today. But they are far from new challenges. This is history repeating itself, and many of the best practices defined over the past three decades can and should be applied to the new challenges AI introduces today.
Disclaimer: Throughout this post, I discuss a variety of complex technologies but avoid trying to explain how they work. Apologies in advance to anyone who cringes at my gross oversimplifications! The goal of this post is to show how data integrity best practices have been embraced time and time again, no matter the underlying technology.
In the beginning, there was a data warehouse
The data warehouse (DW) was an approach to data architecture and structured data management that really hit its stride in the early 1990s. The simple idea was: hey, how can we get more value from the transactional data in our operational systems spanning finance, sales, customer relationship management, and other siloed functions? There was no easy way to consolidate and analyze this data to manage the business more effectively.
The magic of the data warehouse was figuring out how to get data out of these transactional systems and reorganize it in a structured way optimized for analysis and reporting. The ETL (extract, transform, and load) technology market also boomed as the means of accessing and moving that data, with the necessary translations and mappings required to get the data out of source schemas and into the new DW target schema.
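To make that concrete, here is a minimal ETL sketch in Python. The source system, column names, and target schema are all hypothetical, and real ETL tools do this declaratively and at far greater scale, but the extract-transform-load pattern is the same.

```python
# Minimal ETL sketch: hypothetical source rows are extracted, mapped to a
# target data-warehouse schema, and loaded into a stand-in target table.

# Extract: rows as they might come out of a hypothetical order-entry system.
source_rows = [
    {"CUST_NO": "0042", "ORD_DT": "20240131", "AMT_CENTS": "15999"},
    {"CUST_NO": "0117", "ORD_DT": "20240201", "AMT_CENTS": "4250"},
]

# Transform: map source columns and encodings onto the target DW schema.
def transform(row):
    return {
        "customer_id": int(row["CUST_NO"]),  # strip leading zeros, make numeric
        "order_date": f'{row["ORD_DT"][:4]}-{row["ORD_DT"][4:6]}-{row["ORD_DT"][6:]}',  # YYYYMMDD -> ISO date
        "order_amount": int(row["AMT_CENTS"]) / 100,  # cents -> dollars
    }

# Load: append the reshaped rows to the target fact table (a list stands in
# for the warehouse table here).
fact_orders = [transform(r) for r in source_rows]
print(fact_orders)
```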
But simply moving the data wasn’t enough. Each source system had its own proprietary rules and standards around data capture and maintenance, so when trying to bring different versions of similar data together (customer, address, product, or financial data, for example), there was no clear way to reconcile the discrepancies.
This, of course, led to the adoption of data quality software as part of the data warehousing environment, with the goal of executing rules to profile, cleanse, standardize, reconcile, enrich, and monitor the data entering the DW to ensure it was fit for purpose.
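For a flavor of what those rules look like in practice, here is a tiny sketch using pandas on a few hypothetical customer records; the fields and rules are illustrative only, and dedicated data quality tools apply far richer, reusable rule sets.

```python
import pandas as pd

# Hypothetical customer records arriving from two source systems.
customers = pd.DataFrame({
    "name":  [" Acme Corp ", "ACME CORPORATION", "Globex", None],
    "state": ["ny", "NY", "California", "TX"],
})

# Profile: how complete is each column?
print(customers.isna().mean())  # share of missing values per column

# Cleanse and standardize: trim whitespace, normalize case, and map full
# state names to two-letter codes so records from different systems line up.
state_map = {"california": "CA"}
customers["name"] = customers["name"].str.strip().str.title()
customers["state"] = (
    customers["state"].str.strip().str.lower()
    .map(lambda s: state_map.get(s, s.upper() if s else s))
)

# Monitor: flag rows that still violate basic rules (e.g., missing name).
violations = customers[customers["name"].isna()]
print(violations)
```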
But as more data was loaded into these data warehouses from more systems, it became unwieldy and challenging for analysts, and the IT teams supporting them, to deliver on reporting requirements without a massive amount of preparation and support. Data marts soon evolved as a core part of the DW architecture to eliminate this noise: built-for-purpose analytic repositories meant to directly support more specific business users and reporting needs (e.g., financial reporting, customer analytics, supply chain management).
And then a wide variety of business intelligence (BI) tools popped up to provide last mile visibility with much easier end user access to insights housed in these DWs and data marts. But those end users weren’t always clear on which data they should use for which reports, as the data definitions were often unclear or conflicting. Business glossaries and early best practices for data governance and stewardship began to emerge.
This is of course an oversimplification of the data warehousing journey, but as data warehousing has moved to the cloud and business intelligence has evolved into powerful analytics and visualization platforms, the foundational best practices shared here still apply today.
Next, a slight detour from data when Search exploded
There was plenty of innovation and advancement in data aggregation and analytics throughout the 1990s and 2000s. But the Internet and search engines going mainstream enabled never-before-seen access to unstructured content, not just structured data. I didn’t want to skip this important information management milestone, but content classification and governance spawned so many new disciplines and technologies, and led down such a different path, that I’m not going to go there!
Then came Big Data and Hadoop!
The traditional data warehouse was chugging along nicely for a good two decades until, in the mid-to-late 2000s, enterprise data hit a brick wall. That wall was famously described by industry analyst Doug Laney as the 3 V’s (volume, velocity, and variety). Data volumes continued to grow exponentially: gigabytes quickly became terabytes, which quickly grew into petabytes. The demand for higher data velocity, meaning faster access to and analysis of data as it’s created and modified without waiting for slow, time-consuming bulk movement, became critical to business agility. And data sources continued to expand in variety, moving beyond mainframes and relational databases to semi-structured and unstructured sources spanning social feeds, device data, and much more, making it impossible to manage it all in the same old data warehouse architectures. DW costs were skyrocketing, and it was nearly impossible to keep up with the scaling requirements.
The big data boom was born, and Hadoop was its poster child. The promise of Hadoop was that organizations could securely upload and economically distribute massive batch files of any data across a cluster of computers. It was very promising as a way of managing data’s scale challenges, but data integrity once again became top of mind.
Just like in the data warehouse journey, the quality and consistency of the data flowing through Hadoop became a massive barrier to adoption. Deploying upstream data profiling, validation, and cleansing rules was required to ensure garbage wasn’t coming in, and suddenly organizations were discussing their plans for big data governance when they had yet to figure out how to implement little data governance.
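As one illustration, the sketch below applies a simple upstream validation rule with PySpark, which became a common way to process data on and around Hadoop clusters; the paths, columns, and rules are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("upstream-validation").getOrCreate()

# Hypothetical semi-structured event feed landing in the cluster.
events = spark.read.json("hdfs:///landing/events/*.json")

# Keep only records that pass basic profiling/validation rules, and
# quarantine the rest for stewardship review instead of letting garbage
# flow downstream.
valid = events.filter(col("event_id").isNotNull() & (col("amount") >= 0))
invalid = events.subtract(valid)

valid.write.mode("overwrite").parquet("hdfs:///curated/events/")
invalid.write.mode("overwrite").json("hdfs:///quarantine/events/")
```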
Which turned into data lakes and data lakehouses
Poor data quality turned Hadoop into a data swamp, and what sounds better than a data swamp? A data lake! Like the data warehouse, the data lake promised to be a single source-of-truth repository for all this raw data, but unlike the DW it captured not just structured data but raw semi-structured and unstructured data as well. That wide variety of data structures made it difficult for business end users to perform reporting and analytics, so just as data warehouses spawned data marts, data lakes spawned fit-for-purpose data lakehouses that delivered more structured formats to support self-service business reporting needs.
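As a rough sketch of that raw-to-structured flow, the example below lands a hypothetical semi-structured order record as-is (the "lake" side) and then flattens it into a tabular shape a reporting tool could query (the "lakehouse" side); real platforms do this with managed table formats and governed pipelines.

```python
import json
import pandas as pd

# Raw, semi-structured event as it might land in the data lake (stored as-is).
raw_event = {
    "order_id": 1001,
    "customer": {"id": 42, "segment": "enterprise"},
    "items": [
        {"sku": "A-100", "qty": 2, "price": 19.99},
        {"sku": "B-200", "qty": 1, "price": 5.00},
    ],
}
with open("landing_zone_order_1001.json", "w") as f:
    json.dump(raw_event, f)  # the "lake": keep the raw record untouched

# Lakehouse-style curation: flatten the raw record into a structured table
# that self-service reporting tools can query directly.
rows = [
    {
        "order_id": raw_event["order_id"],
        "customer_id": raw_event["customer"]["id"],
        "sku": item["sku"],
        "line_total": item["qty"] * item["price"],
    }
    for item in raw_event["items"]
]
order_lines = pd.DataFrame(rows)
print(order_lines)
```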
Once again, garbage in, garbage out became a reality. You should ensure the integrity of your data before you move it into any environment, and you should perform the necessary quality controls and governance before you share it with end users. It doesn’t matter what technology environment or architecture you’re leveraging; it’s a truism that can’t be ignored.
And now we have AI – and more specifically LLMs
AI has been, and will continue to be, a transformative technology. Organizations large and small are in the very early days of setting company guidelines and policies on how to safely and effectively leverage AI-fueled innovation. We’ll also likely see more governmental oversight and regulation emerge as questions persist about copyright and fair use: which information is considered public and can be used to train these models, and which is considered private or proprietary?
Policy and regulation aside, GenAI and adjacent AI innovations repeat the cycle of information consolidation and consumption that began 30 years (and about 1,300 words!) ago. Specifically:
- The LLM is the new data warehouse. It’s still all about consolidating vast amounts of disparate data and information into a central knowledge repository that can be the (aspirational) single source of truth for reporting, analytics, and ultimately high-confidence business decision-making.
- The SLM (small language model) is the new data mart. LLMs, especially public ones, often contain too much irrelevant information, aren’t trained to support specific goals, and put internal sensitive data at risk. Many organizations are looking to build either private LLMs or SLMs developed to support specific functional or organizational goals, which can safely consolidate both public and internal private data.
- The myriad prompt-based GenAI tools are the new BI and Search. BI tools were all about a query and a result from a structured data set. GenAI has already shattered what the average person has been trained to expect from more than 20 years of Internet search experience. Search takes a user input and returns a (mysteriously) prioritized list of thousands or more results that users scroll and click through until they find the answer that suits them. GenAI, by contrast, takes a plain-language prompt and promises to return a single, accurate answer (see the sketch after this list).
- Data management best practices haven’t changed. Data integration best practices are required to build and train the LLM or SLM with the necessary information and context. Data quality best practices are still required to ensure the data used is consistent, clean, enriched, and fit-for-purpose.
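To make the contrast in the third bullet concrete, the sketch below pairs a BI-style structured query with the kind of plain-language prompt a GenAI tool accepts. The table, question, and model call are illustrative only; the LLM call is shown commented out using the OpenAI Python SDK as just one example, and assumes an API key and model of your own.

```python
import sqlite3

# --- The BI-style interaction: a structured query against a structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 1200.0), ("West", 950.0), ("East", 300.0)])
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())

# --- The GenAI-style interaction: a plain-language prompt expecting a single
#     synthesized answer (shown with the OpenAI Python SDK as one example;
#     an API key and model of your own are assumed).
# from openai import OpenAI
# client = OpenAI()
# answer = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user",
#                "content": "Which region had the highest sales, and why might that be?"}],
# )
# print(answer.choices[0].message.content)
```

The structured query returns exactly what was asked of the data; the prompt asks the model to synthesize a single answer, which is precisely why the integrity of the underlying data matters so much.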
Data governance remains the most important and least mature reality. I’ve been evangelizing data governance best practices for almost three decades, well before data governance was common terminology or even a technology market. Massive amounts of market education and best practices have been shared by practitioners, thought leaders, consultants, industry analysts, vendors, and many more. Yet doing the hard work of making sure your data is properly categorized, defined, monitored, secured, and ultimately fit for its intended business use remains too low a priority.
AI is not going to fix or dismiss the need for proper data governance. Adoption of AI will magnify the need exponentially, and those that recognize and invest in building out these core data management disciplines today will be the ultimate winners in the race to derive the most value from AI tomorrow.
For a deeper dive, including practical tips and best practices, read the full eBook: Trusted AI 101: Tips for Getting Your Data AI-Ready.