Can AI solve IT’s eternal data problem?
Artificial intelligence and machine learning already deliver plenty of practical value to enterprises, from fraud detection to chatbots to predictive analytics. But the audacious creative writing skills of ChatGPT have raised expectations for AI/ML to new heights. IT leaders can’t help but wonder: Could AI/ML finally be ready to go beyond point solutions and address core enterprise problems?
Take the biggest, oldest, most confounding IT problem of all: Managing and integrating data across the enterprise. Today, that endeavor cries out for help from AI/ML technologies, as the volume, variety, variability, and distribution of data across on-prem and cloud platforms climb an endless exponential curve. As Stewart Bond, IDC’s VP of data integration and intelligence software, puts it: “You need machines to be able to help you to manage that.”
Can AI/ML really help impose order on data chaos? The answer is a qualified yes, but the industry consensus is that we’re just scratching the surface of what may one day be achievable. Integration software incumbents such as Informatica, IBM, and SnapLogic have added AI/ML capabilities to automate various tasks, and a flock of newer companies such as Tamr, Cinchy, and Monte Carlo put AI/ML at the core of their offerings. None come close to delivering AI/ML solutions that automate data management and integration processes end-to-end.
That simply isn’t possible. No product or service can reconcile every data anomaly without human intervention, let alone reform a muddled enterprise data architecture. What these new AI/ML-driven solutions can do today is reduce manual labor substantially across a variety of data wrangling and integration efforts, from data cataloging to building data pipelines to improving data quality.
Those can be noteworthy wins. But to have real, lasting impact, enterprises need a CDO-level (chief data officer) approach rather than the impulse to grab integration tools for one-off projects. Before they can prioritize which AI/ML solutions to apply where, they need a coherent, top-down view of their entire data estate—customer data, product data, transaction data, event data, and so on—and a complete understanding of the metadata defining those data types.
The scope of the enterprise data problem
Most enterprises today maintain a vast expanse of data stores, each one associated with its own applications and use cases—a proliferation that cloud computing has exacerbated, as business units quickly spin up cloud applications with their own data silos. Some of those data stores may be used for transactions or other operational activities, while others (mainly data warehouses) serve those engaged in analytics or business intelligence.
To further complicate matters, “every organization on the planet has more than two dozen data management tools,” says Noel Yuhanna, a VP and principal analyst at Forrester Research. “None of those tools talk to each other.” These tools handle everything from data cataloging to MDM (master data management) to data governance to data observability and more. Some vendors have infused their wares with AI/ML capabilities, while others have yet to do so.
At a basic level, the primary purpose of data integration is to map the schemas of various data sources so that different systems can share, sync, and/or enrich data. Enrichment, for example, is a must-have for developing a 360-degree view of customers. But seemingly simple tasks such as determining whether customers or companies with the same name are the same entity—and which details from which records are correct—require human intervention. Domain experts are often called upon to help establish rules to handle various exceptions.
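A minimal sketch (not any vendor's implementation) illustrates the point: a fuzzy name-similarity heuristic can flag likely duplicates, but hand-written exception rules and a human review queue are still needed for the cases the heuristic can't settle. The company names and thresholds below are hypothetical.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real MDM systems use far richer features."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Exception rules captured from domain experts (hypothetical examples).
KNOWN_DISTINCT = {("Acme Corp", "Acme Corp of Ohio")}      # same-looking names, different entities
KNOWN_SAME = {("IBM", "International Business Machines")}  # different names, same entity

def match_records(a: str, b: str) -> str:
    if (a, b) in KNOWN_SAME or (b, a) in KNOWN_SAME:
        return "same entity"
    if (a, b) in KNOWN_DISTINCT or (b, a) in KNOWN_DISTINCT:
        return "different entities"
    score = name_similarity(a, b)
    if score > 0.9:
        return "same entity"
    if score < 0.5:
        return "different entities"
    return "send to human review"   # the gray zone still needs a domain expert

print(match_records("Acme Corp", "ACME Corp."))          # same entity
print(match_records("Acme Corp", "Acme Corp of Ohio"))   # different entities, via exception rule
```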
Those rules are typically stored within a rules engine embedded in integration software. Michael Stonebraker, a pioneer of the relational database, is a founder of Tamr, which has developed an ML-driven MDM system. Stonebraker offers a real-world example to illustrate the limitations of rules-based systems: a major media company that created a “homebrew” MDM system that has been accumulating rules for 12 years.
“They’ve written 300,000 rules,” says Stonebraker. “If you ask somebody, how many rules can you grok, a typical number is 500. Push me hard and I’ll give you 1,000. Twist my arm and I’ll give you 2,000. But 50,000 or 100,000 rules is completely unmanageable. And the reason that there are so many rules is there are so many special cases.”
Anthony Deighton, Tamr’s chief product officer, claims that his MDM solution overcomes the brittleness of rules-based systems. “What’s nice about the machine learning based approach is when you add new sources, or more importantly, when the data shape itself changes, the system can adapt to those changes gracefully,” he says. As with most ML systems, however, ongoing training using large quantities of data is required, and human judgment is still needed to resolve discrepancies.
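To make the contrast concrete, here is a hedged, generic sketch of the ML approach: train a classifier on labeled record pairs, then let it score new pairs and retrain as sources change. This is an illustration only, not Tamr's actual model, features, or training regime.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(pair):
    """Pairwise features: name similarity plus whether postal codes agree."""
    (name_a, zip_a), (name_b, zip_b) = pair
    return [SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio(),
            1.0 if zip_a == zip_b else 0.0]

# Tiny hand-labeled training set (1 = same entity, 0 = different entities).
pairs = [(("Acme Corp", "44101"), ("ACME Corporation", "44101")),
         (("Acme Corp", "44101"), ("Acme Industries", "10001")),
         (("Globex LLC", "02139"), ("Globex", "02139")),
         (("Globex LLC", "02139"), ("Initech", "78701"))]
labels = [1, 0, 1, 0]

model = LogisticRegression().fit([features(p) for p in pairs], labels)

candidate = (("Acme Corp.", "44101"), ("ACME Corp", "44101"))
proba = model.predict_proba([features(candidate)])[0][1]
print(f"match probability: {proba:.2f}")  # retrain as new sources arrive or data shape changes
```

The point of the design is that when a new source shows up, you label a sample of pairs and retrain, rather than writing and maintaining thousands of special-case rules. Humans still adjudicate the low-confidence matches.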
AI/ML is not a magic bullet. But it can provide highly valuable automation, not only for MDM, but across many areas of data integration. To take full advantage, however, enterprises need to get their house in order.
Weaving AI/ML into the data fabric
“Data fabric” is the operative phrase used to describe the crazy quilt of useful data across the enterprise. Scoping out that fabric begins with knowing where the data is—and cataloging it. That task can be partially automated using the AI/ML capabilities of such solutions as Informatica’s AI/ML-infused CLAIRE engine or IBM’s Watson Knowledge Catalog. Other cataloging software vendors include Alation, BigID, Denodo, and OneTrust.
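The unglamorous first step is simply harvesting metadata about what tables and columns exist. The sketch below shows that raw harvesting step against a SQLite file; commercial catalogs such as CLAIRE or Watson Knowledge Catalog layer ML-driven classification, lineage, and business-term matching on top, which this example does not attempt.

```python
import sqlite3

def harvest_catalog(db_path: str) -> dict:
    """Return {table: [(column, declared type), ...]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [(col[1], col[2]) for col in cols]  # name and declared type
    conn.close()
    return catalog

# Hypothetical usage: harvest_catalog("crm.db") might yield
# {"customers": [("id", "INTEGER"), ("name", "TEXT"), ("email", "TEXT")]}
```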
Gartner research director Robert Thanaraj’s message to CDOs is that “you need to architect your fabric. You buy the necessary technology components, you build, and you orchestrate in accordance with your desired outcomes.” That fabric, he says, should be “metadata-driven,” woven from a compilation of all the salient information that surrounds enterprise data itself.
His advice for enterprises is to “invest in metadata discovery.” This includes “the patterns of people working with people in your organization, the patterns of people working with data, and the combinations of data they use. What combinations of data do they reject? And what patterns of where the data is stored, patterns of where the data is transmitted?”
Jitesh Ghai, the chief product officer of Informatica, says Informatica’s CLAIRE engine can help enterprises derive metadata insights and act upon them. “We apply AI/ML capabilities to deliver predictive data… by linking all of the dimensions of metadata together to give context.” Among other things, this predictive data intelligence can help automate the creation of data pipelines. “We auto generate mapping to the common elements from various source items and adhere it to the schema of the target system.”
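In spirit, auto-generated mapping means proposing source-to-target column matches for a human to accept or reject. The sketch below does this with simple name similarity; it is illustrative only and not CLAIRE's algorithm, and the column names are hypothetical.

```python
from difflib import SequenceMatcher

source_cols = ["cust_name", "cust_email", "postal_cd", "created_dt"]
target_schema = ["customer_name", "email_address", "postal_code", "created_date"]

def suggest_mappings(source, target, threshold=0.5):
    """Propose the most similar target column for each source column."""
    suggestions = []
    for s in source:
        best = max(target, key=lambda t: SequenceMatcher(None, s, t).ratio())
        score = SequenceMatcher(None, s, best).ratio()
        if score >= threshold:
            suggestions.append((s, best, round(score, 2)))
    return suggestions

for src, tgt, score in suggest_mappings(source_cols, target_schema):
    print(f"{src} -> {tgt}  (confidence {score})  [accept/reject?]")
```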
IDC’s Stewart Bond notes that the SnapLogic integration platform has similar pipeline functionality. “Because they’re cloud-based, they look at… all their other customers that have built up pipelines, and they can figure out what is the next best Snap: What’s the next best action you should take in this pipeline, based on what hundreds or thousands of other customers have done.”
Bond observes, however, that in both cases recommendations are being made by the system rather than the system acting independently. A human must accept or reject those recommendations. “There’s not a lot of automation happening there yet. I would say that even in the mapping, there’s still a lot of opportunity for more automation, more AI.”
Improving data quality
According to Bond, AI/ML is having its greatest impact on data quality. Forrester’s Yuhanna agrees: “AI/ML is really driving improved quality of data,” he says. That’s because ML can discover and learn from patterns in large volumes of data and recommend new rules or adjustments that humans lack the bandwidth to determine.
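A minimal sketch of that idea, assuming historical values are available: profile a column and propose a validity rule from its own distribution, rather than asking a steward to invent the threshold. Real products use far richer models; the column name and values here are made up.

```python
import statistics

historical_order_totals = [102.5, 98.0, 110.3, 95.7, 104.1, 99.9, 101.2, 97.8]

mean = statistics.mean(historical_order_totals)
stdev = statistics.stdev(historical_order_totals)
lower, upper = mean - 3 * stdev, mean + 3 * stdev

# Proposed rule is surfaced to a data steward for approval rather than auto-applied.
print(f"Suggested rule: order_total BETWEEN {lower:.2f} AND {upper:.2f}")

new_batch = [101.0, 99.4, 1045.0]   # the last value is likely a misplaced decimal point
violations = [v for v in new_batch if not lower <= v <= upper]
print("Flagged for review:", violations)
```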
High-quality data is essential for transaction and other operational systems that handle vital customer, employee, vendor, and product data. But it can also make life much easier for data scientists immersed in analytics.
It’s often said that data scientists spend 80 percent of their time cleaning and preparing data. Michael Stonebraker thinks that estimate is, if anything, too low: He cites a conversation he had with a data scientist who said she spends 90 percent of her time identifying the data sources she wants to analyze, integrating the results, and cleaning the data. She then spends 90 percent of the remaining 10 percent fixing cleaning errors, which leaves roughly 1 percent of her time for actual analysis. Any AI/ML data cataloging or data cleansing solution that can give her a chunk of that time back is a game changer.
Data quality is never a one-and-done exercise. The ever-changing nature of data and the many systems it passes through have given rise to a new category of solutions: data observability software. “What this category is doing is observing data as it’s flowing through data pipelines. And it’s identifying data quality issues,” says Bond. He calls out the startups Anomalo and Monte Carlo as two players who claim to be “using AI/ML to monitor the six dimensions of data quality”: accuracy, completeness, consistency, uniqueness, timeliness, and validity.
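The sketch below shows what measuring a few of those dimensions can look like for a single batch flowing through a pipeline; the observability vendors compute metrics like these continuously and compare them against learned baselines. The data frame and thresholds are hypothetical.

```python
import pandas as pd

batch = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example", "c@example.com"],
    "updated_at": pd.to_datetime(["2023-01-05", "2023-01-05", "2021-06-01", "2023-01-05"]),
})

metrics = {
    # completeness: share of non-null emails
    "email_completeness": 1 - batch["email"].isna().mean(),
    # uniqueness: share of customer_ids that are not duplicated
    "customer_id_uniqueness": 1 - batch["customer_id"].duplicated().mean(),
    # validity: share of emails matching a simple pattern
    "email_validity": batch["email"].dropna().str.match(r"^[^@]+@[^@]+\.[^@]+$").mean(),
    # timeliness: share of rows updated within a year of the freshest row
    "timeliness": (batch["updated_at"] > batch["updated_at"].max() - pd.Timedelta(days=365)).mean(),
}
print(metrics)  # alert when any metric drifts below its learned baseline
```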
If this sounds a little like the continuous testing essential to devops, that’s no coincidence. More and more companies are embracing dataops, where “you’re doing continuous testing of the dashboards, the ETL jobs, the things that make those pipelines run and analyze the data that’s in those pipelines,” says Bond. “But you also add statistical control to that.”
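The "statistical control" part can be as simple as treating pipeline metrics the way a factory treats a production line. A hedged sketch, assuming row counts per run are already being collected somewhere:

```python
import statistics

previous_run_row_counts = [10_120, 10_340, 9_980, 10_210, 10_405, 10_050]
todays_row_count = 6_430

mean = statistics.mean(previous_run_row_counts)
stdev = statistics.stdev(previous_run_row_counts)

# Flag runs that fall outside a three-sigma band around the historical baseline.
if abs(todays_row_count - mean) > 3 * stdev:
    print(f"ALERT: row count {todays_row_count} deviates from baseline "
          f"{mean:.0f} +/- {3 * stdev:.0f}")
```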
The hitch is that observing a data problem happens after the fact. You can’t prevent bad data from reaching users without bringing pipelines to a screeching halt. But as Bond says, when a dataops team member applies a correction and captures it, “then a machine can make that correction the next time that exception occurs.”
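In its simplest form, that capture-and-replay loop is just a lookup of known fixes applied in the pipeline, as in this hedged sketch (the helper names and sample values are hypothetical):

```python
captured_corrections = {}   # populated whenever a data steward fixes an exception

def record_correction(bad_value: str, fixed_value: str) -> None:
    """Capture a human-approved fix so the pipeline can reapply it automatically."""
    captured_corrections[bad_value] = fixed_value

def apply_known_corrections(record: dict) -> dict:
    """Replace any field value that matches a previously corrected exception."""
    return {key: captured_corrections.get(value, value) for key, value in record.items()}

record_correction("Untied States", "United States")   # a human fixes it once
print(apply_known_corrections({"name": "Acme Corp", "country": "Untied States"}))
# -> {'name': 'Acme Corp', 'country': 'United States'}
```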
More intelligence to come
Data management and integration software vendors will continue to add useful AI/ML functionality at a rapid clip—to automate data discovery, mapping, transformation, pipelining, governance, and so on. Bond notes, however, that we have a black box problem: “Every data vendor will say their technology is intelligent. Some of it is still smoke and mirrors. But there is some real AI/ML stuff happening deep within the core of these products.”
The need for that intelligence is clear. “If we’re going to provision data and we’re going to do it at petabyte scale across this heterogeneous, multicloud, fragmented environment, we need to apply AI to data management,” says Informatica’s Ghai. Ghai even has an eye toward OpenAI’s GPT-3 family of large language models. “For me, what’s most exciting is the ability to understand human text instruction,” he says.
No product, however, possesses the intelligence to rationalize data chaos—or clean up data unassisted. “A fully automated fabric is not going to be possible,” says Gartner’s Thanaraj. “There has to be a balance between what can be automated, what can be augmented, and what could be compensated still by humans in the loop.”
Stonebraker cites another limitation: the severe shortage of AI/ML talent. There’s no such thing as a turnkey AI/ML solution for data management and integration, so AI/ML expertise is necessary for proper implementation. “Left to their own devices, enterprise people make the same kinds of mistakes over and over again,” he says. “I think my biggest advice is if you’re not facile at this stuff, get a partner that knows what they’re doing.”
The flip side of that statement is that if your data architecture is basically sound, and you have the talent available to ensure you can deploy AI/ML solutions correctly, a substantial amount of tedium for data stewards, analysts, and scientists can be eliminated. As these solutions get smarter, those gains will only increase.