{"id":69,"date":"2022-06-24T08:30:00","date_gmt":"2022-06-24T08:30:00","guid":{"rendered":"https:\/\/baecke.io\/?p=69"},"modified":"2022-06-24T08:30:00","modified_gmt":"2022-06-24T08:30:00","slug":"data-foundation-problem-enterprise-ai-fails-before-production","status":"publish","type":"post","link":"https:\/\/baecke.io\/?p=69","title":{"rendered":"The Data Foundation Problem: Why Enterprise AI Initiatives Fail Before They Reach Production"},"content":{"rendered":"<h2>The Model Works. The Data Does Not.<\/h2>\n<p>The most expensive misconception in enterprise AI is that AI is primarily a model problem. It is not. The models work. The open-source ecosystem has made capable model architectures available to any team with a GPU and a weekend. The cloud AI services from the major hyperscalers have lowered the barrier to production-quality model deployment to the point where many use cases can be addressed with API calls rather than custom training.<\/p>\n<p>The constraint that stops enterprise AI programmes before they reach production is almost always the data. Specifically, it is the gap between the data that exists in the enterprise and the data that a production AI system requires: available at the quality, consistency, volume, and accessibility that make a model operationally useful rather than technically impressive in a controlled environment.<\/p>\n<p>This gap is not technical. The tools to address it exist. 
It is organisational: the result of years of data governance designed for reporting rather than for machine learning, data ownership distributed across business functions that protect their data as a competitive asset rather than share it as a collective capability, and data infrastructure built for historical analysis rather than for the real-time, high-availability pipelines that production AI requires.<\/p>\n<h2>The Four Components of the Data Foundation Problem<\/h2>\n<p>The data foundation problem in enterprise AI has four distinct components, each requiring a different intervention to address.<\/p>\n<p>The first is fragmented data ownership. In most large enterprises, the data required for a useful AI model is distributed across multiple systems, owned by multiple teams, and governed by multiple sets of access policies that were designed with regulatory compliance in mind rather than analytical accessibility. A model that predicts customer churn needs data from the CRM, the billing system, the support ticketing platform, and the usage telemetry system. Each of those systems is likely owned by a different team, governed by different access policies, and formatted according to different conventions. Assembling a training dataset requires a political negotiation as much as a technical one.<\/p>\n<p>The second is inconsistent data quality. Production AI models fail in characteristic ways when trained on poor-quality data: they learn the noise as well as the signal, they overfit to the artefacts of data entry practices rather than the underlying business patterns, and they perform well on historical evaluation data while failing on live production data where the quality issues are different from the ones in the training set. Most enterprise data has never been subjected to the quality standards that machine learning requires, because it was never designed to be used this way. 
The reporting use cases that most enterprise data was originally built for are much more tolerant of quality issues than training a model is.<\/p>\n<p>The third is governance models that treat data access as a compliance function. Enterprise data governance in most large organisations was built to manage regulatory risk: to ensure that sensitive data is not accessed by people or systems without the appropriate authorisation, and to provide audit trails that demonstrate compliance. These are legitimate and important objectives. The problem is that governance frameworks built primarily for compliance tend to make data access slow, restrictive, and difficult to scale. The data scientist who needs access to twelve months of customer transaction data for a model training exercise faces a governance process designed for the compliance officer who needs to prove that access to that data was controlled, not for the data scientist who needs that access in a week rather than a quarter.<\/p>\n<p>The fourth is data infrastructure built for reporting rather than machine learning. The data warehouse is designed for structured queries that retrieve specific slices of historical data for analysis. Machine learning has different requirements: large volumes of data delivered efficiently for batch training, real-time feature computation for inference in production, feature stores that make engineered features reproducible and reusable across models, and data versioning that enables model reproducibility and debugging. 
Most enterprise data infrastructure does not support these requirements natively, and retrofitting it is a significant engineering investment that AI programmes consistently underestimate.<\/p>\n<h2>The Data Readiness Assessment<\/h2>\n<p>Before committing AI project budgets, organisations should apply a data readiness assessment that addresses each of these components explicitly.<\/p>\n<p>On fragmentation: can the data required for this model be assembled from existing systems in a form that is complete enough to train a useful model? Who owns each required data source, and what is the governance process for obtaining access? Has the data required for the model ever been joined across systems before, or will this require new data integration work?<\/p>\n<p>On quality: what are the known quality issues in each data source, and how severe are they relative to the requirements of the model? Is there a data quality baseline, and if so, how was it established? Has anyone profiled the data for the specific quality dimensions that matter for this use case: completeness, consistency, accuracy, timeliness?<\/p>\n<p>On governance: what is the existing data access policy for each required data source, and is that policy compatible with the access patterns that training and serving a model require? Who has the authority to grant the access the model requires, and what is the realistic timeline for navigating that process?<\/p>\n<p>On infrastructure: does the organisation have a data pipeline capable of delivering the required data for model training and inference at the required volume and latency? If not, what is the scope of the infrastructure investment required, and is it included in the AI programme budget?<\/p>\n<p>The answers to these questions determine the true scope of the AI programme. 
In the majority of enterprise cases, the answers reveal that a meaningful fraction of the programme budget and timeline should be allocated to data infrastructure and governance before the model development work can begin in earnest.<\/p>\n<h2>The Investment That Actually Unlocks AI Value<\/h2>\n<p>The data foundation investment is not glamorous. It does not produce a demo. It does not generate the kind of executive enthusiasm that a model that correctly predicts customer behaviour creates. But it is the investment that determines whether the model that generates executive enthusiasm will ever be used by a production system or whether it will join the growing library of impressive proofs of concept that organisations have never deployed and cannot explain why.<\/p>\n<p>The investment has three components. Data access infrastructure: the pipelines, the feature stores, the data contracts between systems, and the real-time serving infrastructure that production AI requires. Data quality standards with enforcement: not just a policy that says data should be accurate, but automated quality checks in the pipelines that catch quality degradation before it reaches the training process. And data governance frameworks that enable controlled access at the pace that AI development requires: not a governance bypass, but a governance process designed for analytical use cases rather than exclusively for compliance ones.<\/p>\n<p>The organisations that make this investment before their AI programmes begin, rather than in response to production failures, build a foundation that compounds. Every model they train benefits from the same data infrastructure. Every data quality improvement they make improves every model that uses that data. The governance framework they build for one use case scales to twenty.<\/p>\n<p>The organisations that skip it fund the infrastructure investment eventually, under worse conditions, with a failing AI programme providing the forcing function. 
The data foundation problem does not go away because you chose not to address it before you started. It goes away when you build the foundation. The question is whether you build it as a planned investment or as an emergency remediation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The majority of enterprise AI initiatives fail not because the models don&#8217;t work but because the data required to train and operationalise them isn&#8217;t available in the required quality, consistency, or accessibility.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-69","post","type-post","status-publish","format-standard","hentry","category-business-value"],"_links":{"self":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=69"}],"version-history":[{"count":0,"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts\/69\/revisions"}],"wp:attachment":[{"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}