The Data Problem Stalling AI

AI efforts can fail to move out of the lab if organizations don’t carefully manage access to data throughout the development and production life cycle.

A large North American hospital was excited about the potential of an AI-enabled system that would improve patient care. But as it planned to move from concept to prototype, it discovered that the data required to build and operate the system was scattered across 20 legacy systems and that retrieving it would be prohibitively complex. The project had to be scrapped.

Advanced analytics and artificial intelligence promise to generate insights that will help organizations stay competitive. Their ability to do that is heavily dependent on the availability of good data, but sometimes organizations just don’t have the data to make AI work.

We recently studied how organizations move their AI initiatives out of lablike R&D settings and into production and the problems they encounter in doing so. The research is based on interviews with key AI leaders and informants in six North American companies of different sizes operating in different industries. A key finding is that, although many people focus on the accuracy and completeness of data to determine its quality (see “What Is Good Data?”), the degree to which it is accessible by machines — one of the dimensions of data quality — appears to be a bigger challenge in taking AI out of the lab and into the business. More important, we found that data accessibility is too often treated exclusively as an IT problem. In reality, our analysis reveals that it is a management problem aggravated by misconceptions about the nature and the role of data accessibility in AI.

Data accessibility is not about the properties of data itself; it is about having the required elements in place for machines to get the data. Although organizations are inundated with data, access to it remains a challenge that is exacerbated in the context of AI development and operations for two interrelated reasons. First, AI programs usually involve diverse groups of stakeholders with diverging interests regarding data accessibility. Second, a typical AI development life cycle tends to undermine the importance of data accessibility.

AI Stakeholders Differ on Data Accessibility

At the core of most data accessibility issues is the fact that AI initiatives involve vastly different groups of actors who have divergent interests, views, and influence on the nature and the role of data accessibility. For instance, business leaders typically engage at the beginning and end of the process — helping to define the use cases for AI and taking advantage of the final product — but they tend not to think about how the data is accessed. “Businesses always think they have [the data they need for AI],” said the vice president of product delivery at an AI consultancy. “They want to start fast, and then we open the hood,” he noted, laughing. “We get PDFs, we get Excel spreadsheets, and then we need to collect all of this and just [apply optical character recognition] and process it. It’s never easy.”
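
To make the chore concrete, here is a minimal sketch of that kind of ingestion. It is illustrative only: the file paths are hypothetical, and the libraries shown (pandas for spreadsheets, pdf2image and pytesseract for scanned PDFs, which require Poppler and Tesseract installed locally) are common open-source choices, not tools used by the companies we studied.

```python
# Illustrative sketch: turning ad hoc business files into machine-readable input.
# File paths are hypothetical; pdf2image needs Poppler and pytesseract needs Tesseract.
import pandas as pd
import pytesseract
from pdf2image import convert_from_path


def ingest_spreadsheet(path: str) -> pd.DataFrame:
    # Excel files at least arrive as structured tables.
    return pd.read_excel(path)


def ingest_scanned_pdf(path: str) -> str:
    # Scanned PDFs must be rasterized page by page and run through OCR.
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# Example (hypothetical files): the "processing" still has to parse whatever text
# and tables come out of these calls before any model can use them.
# contracts_text = ingest_scanned_pdf("contracts/2021_supplier_agreement.pdf")
# sales_table = ingest_spreadsheet("exports/quarterly_sales.xlsx")
```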

Meanwhile, data scientists who develop, test, and maintain models, and scientific advisers who may work with them, are primarily focused on obtaining the data required for model development. Like business stakeholders, their interest in data accessibility is low.

Data engineers, who build the infrastructure required to generate the data used in data scientists’ models, are moderately concerned with data accessibility. But they typically assume — sometimes incorrectly — that data extracted from operational systems for prototype development is readily accessible for production use as well.

Data accessibility is a bigger issue for software engineers, who are responsible for packaging the AI into a product or service that must be able to source data in a production environment. And while members of the IT function are rarely considered key players in AI initiatives, they support the technological infrastructure required by AI (including data). Their work helps enforce compliance with security policies and governance mechanisms that safeguard technological and data assets.

Each of these stakeholders has an important role to play. At the same time, their vision of data accessibility is limited to their immediate responsibilities. For example, the AI lead of a large financial institution told us that his team needs to source large quantities of data from operational systems. However, many of those systems run on mainframes and were never built to support these data access requirements while simultaneously supporting regular operations. When IT staff, whose responsibility is to keep those operational systems up and running, hear the data access requirements for his AI projects, they are less than receptive. In one instance, he told us, their answer was, “I don’t want fresh-out-of-school geeks to come and retrieve 15 terabytes per day, because everything will crash.”

The AI Life Cycle Undermines Data Accessibility

In addition to the issue of stakeholder diversity, the typical life cycle of AI initiatives pushes teams to focus on the rapid, iterative development of models. This delays important conversations on data accessibility, especially those related to the actual implementation of AI within the organization. Over the course of that life cycle, data accessibility shifts from being disconnected from the organization’s data management structures, mechanisms, and technological infrastructure to being tightly connected to them. The involvement of key stakeholders also changes as AI moves from a mere idea to an actual product or service in use in the organization. (See “Stakeholders and Data in the AI Life Cycle.”) To understand why data accessibility is so often overlooked, we need to examine each of the five phases of the typical AI life cycle that we observed in all six organizations we studied.

Phase 1: Ideation. The ideation phase serves as a filter to identify potential high-level business cases for AI in the organization. Most conversations during this phase are between managers, business consultants, and scientific advisers (who are sometimes also full-time academics). The goal is to create a meeting space for business and science. The resulting business cases should look promising and feasible. In AI consulting companies, this crucial first step serves to educate clients on the potential of AI. During this phase, however, the emphasis is on data existence rather than data accessibility. Discussions revolve around business objectives and the application of AI models to address the organization’s current problems.

Phase 2: Blueprint. Not all use cases generated during the ideation phase will be selected for implementation within a given period because of priorities, resource constraints, or a lack of potential value. During the blueprint phase, a comprehensive use case is generated. This includes details such as clear and measurable business objectives, an action plan that outlines specific AI techniques, and the data elements that should be available to feed AI. During the blueprint phase, data accessibility is still assessed solely on the existence of data, because sights are set on the next phase of the process, which is to build a working prototype. The underlying assumption is that if the data is there, that’s good enough, because it allows the team to move forward.

Phase 3: Proof of concept. During the proof-of-concept phase, data scientists build one or more models to implement the agreed-upon use cases. Most of the work is focused on iteratively creating, training, and testing models to measure their relative performance against one another and to see whether AI actually lives up to expectations with new input. Data is extracted from source systems and transformed by data engineers so that it complies with the format and accuracy requirements of the models under construction. Although the solution may ultimately be delivered through an application with a user interface or tightly integrated within the organization’s business processes (to alter a credit application process in a bank, for example), the proof-of-concept phase typically does not focus on those efforts just yet. Similarly, teams focus on getting the data to advance their work in the short term, giving little consideration to how data will eventually be accessed once the AI goes into production.
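
To illustrate what this extract-and-transform step often looks like in the lab, the sketch below pulls a one-off historical snapshot from a stand-in operational source and writes the flat file that data scientists would train against. The SQLite source, table, and column names are hypothetical, not drawn from the organizations we studied.

```python
# Illustrative proof-of-concept extraction: a one-time snapshot, not a production feed.
# The SQLite file, table, and columns are hypothetical stand-ins for an operational system.
import sqlite3

import pandas as pd


def build_training_snapshot(source_db: str, out_csv: str) -> pd.DataFrame:
    """Extract a historical snapshot and save it as the flat file used for modeling."""
    with sqlite3.connect(source_db) as conn:
        raw = pd.read_sql_query(
            "SELECT customer_id, claim_amount, claim_date, status FROM claims",
            conn,
        )
    # Transform to the format the models expect: typed dates, filled gaps.
    raw["claim_date"] = pd.to_datetime(raw["claim_date"])
    raw["claim_amount"] = raw["claim_amount"].fillna(0.0)
    raw.to_csv(out_csv, index=False)  # the flat file data scientists iterate on
    return raw


# Hypothetical usage: a local copy of operational data, refreshed manually if at all.
# build_training_snapshot("operational_copy.db", "claims_training_snapshot.csv")
```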

Phase 4: Minimum viable product. Once a variant of the proof of concept demonstrates sufficient value, it is refined into a minimum viable product, or MVP. At this point, data scientists and data engineers step back and software engineers take over, given that the AI will eventually leave the lab, be deployed within the organization’s infrastructure, and get integrated with other production systems, if applicable. An unintended consequence of the strong focus on model development in the previous phases is that considerations regarding the accessibility of data in production have taken a back seat. Once software engineers and IT staff become more involved in discussions about the specifications and the integration of the solution to be delivered, questions related to data accessibility may reveal that a crucial feature used by a model requires significant, unplanned work.

Phase 5: Production. In this last phase, the refined MVP that contains the AI is released into production and must now be fed with data sourced directly from production systems. Data may need to be pulled from multiple systems and transformed to generate the required input for the model to support the business case in production. Whether this happens in real time or in batches (for example, to retrain and retest a model at frequent intervals), this is where the real issues related to AI integration emerge, especially with respect to the organization’s data infrastructure. If data cannot be provided, extracted, and integrated by autonomous systems at the required volume or velocity (due to legacy systems, for instance), the AI may lose all of its potential value.
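
As a rough illustration of the batch pattern, the sketch below pulls fresh data from more than one production system, integrates it, and refits a model; a scheduler would run it at the required frequency. The source systems, schema, and model choice are hypothetical, and the tiny in-memory data frames stand in for queries against live CRM and ERP systems.

```python
# Illustrative batch retraining cycle fed by production systems.
# System names, schema, and the churn label are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression


def extract_crm_accounts() -> pd.DataFrame:
    # In production this would query the live CRM (API or read replica).
    return pd.DataFrame({"account_id": [1, 2, 3], "segment": ["A", "B", "A"]})


def extract_erp_invoices() -> pd.DataFrame:
    # Stand-in for a pull from the ERP's reporting interface.
    return pd.DataFrame(
        {"account_id": [1, 2, 3], "overdue_days": [0, 45, 12], "churned": [0, 1, 0]}
    )


def nightly_retrain() -> LogisticRegression:
    """One batch cycle: integrate sources, rebuild features, refit the model."""
    features = extract_crm_accounts().merge(extract_erp_invoices(), on="account_id")
    X = pd.get_dummies(features[["segment", "overdue_days"]])
    y = features["churned"]
    return LogisticRegression().fit(X, y)


# A scheduler (cron, Airflow, or similar) would invoke nightly_retrain() on the agreed
# cadence; if the source systems cannot sustain these pulls, the cycle breaks down.
```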

Four Misconceptions About Data Accessibility for AI

In addition to understanding the different roles and phases of AI development and their impact on data accessibility, it is helpful to understand some basic misconceptions about the nature of data and how it is perceived in many organizations.

Misconception No. 1: Data accessibility is a technical issue. Technology problems, while often complex, can usually be fixed with the right talent and resources. Participants in our research argued that data accessibility is really a management issue that involves technology. AI solutions must start with a clear understanding that complete, accurate, and timely data has no value if it cannot be retrieved quickly and easily. The fact that data is located somewhere across multitudes of databases and spreadsheets does not necessarily mean that it is accessible. Sometimes data accessibility issues exist because data governance or security policies restrict access.

Competing priorities between the business and IT staffs have existed for decades. When you add the priorities of AI teams to the mix, things quickly become messy. If data accessibility is treated only as a technical problem, AI products and services may remain stuck at the proof-of-concept stage until data accessibility challenges are addressed by other teams, causing delays and incurring additional costs. Or they may not live up to their full potential due to missing data that was left out because it was either too complex or too costly to retrieve. In both cases, AI will fail to deliver on its promises, not because of AI models but because of data accessibility.

Misconception No. 2: Data is merely a byproduct of operations. This misconception is often seen in organizations where analytics and AI efforts sit apart from operations — and where AI’s potential to improve or revolutionize processes across the organization has not yet been recognized. As a result, operational systems (such as enterprise resource planning and customer relationship management) consume and produce data, but there is no understanding of the potential value of this data for AI. If analytics or AI teams want to use data from operations, they have to retrieve and leverage it on their own, similar to what traditional data warehouse teams have done for many years.1

Where this misconception prevails, data can be plentiful within the organization but underused by AI. This typically happens because the digital traces of business processes are often fragmented across operational systems, making it challenging to retrieve the data required to re-create a coherent portrait of those processes. In short, the strategic potential of data as an input for value creation is underexploited.

Misconception No. 3: Data accessibility can be addressed in the later phases of the AI life cycle. The five phases of the AI life cycle are designed to push AI teams to work in an agile mode, especially during the proof-of-concept and MVP phases. The very nature of AI as an uncertain endeavor lends itself well to this approach. Teams must be able to experiment with models and pivot on emergent results to find the optimal solution to the organization’s problem. Unfortunately, this also encourages teams to focus almost exclusively on the scientific portion of AI work for the better part of the first three phases. The stakeholders involved during the ideation, blueprint, and proof-of-concept phases are not the ones who deal with data accessibility issues. Data engineers are primarily concerned with creating flat files that data scientists can use to build and train models, and any means within their reach to generate those files — including hacks, work-arounds, and simulated data — is considered fair game.

For an AI-enabled system to add value within the organization, it has to be packaged as a product or service that can be integrated with the organization’s infrastructure. Often, integration concerns are addressed late in the life cycle (see “Stakeholders and Data in the AI Life Cycle.”). Software engineers and IT staff thus become the bearers of bad news. When companies don’t address data accessibility early on, they often end up incurring additional, unforeseen costs. Additionally, projects can stall while the priorities of other stakeholders (usually the IT staff) are shuffled unexpectedly to address data accessibility issues. In some instances, AI initiatives can even fail to materialize in production.

Misconception No. 4: Data in the lab and data in operations are the same. Companies are becoming highly skilled at building AI-enabled proofs of concept. However, the real test is whether they can move past the controlled lab environments of the proof-of-concept phase to the messy production environments. Often, the assumption is that the data retrieval process for the proof-of-concept phase can be replicated at little to no cost once the AI moves through MVP and then into production. But recall that data in the proof of concept comes from a few flat data files that were specifically created — often from historical data snapshots — for the purpose of building and testing models.

In the production phase, AI must be connected to multiple live systems from which it retrieves the input needed to perform its work, sometimes in real time. The features of the data that need to be extracted may be the same, but the way the data is accessed and retrieved is very different. For example, the volume and velocity requirements of data for operations may vary considerably from what is needed to retrain models. In fact, some of the AI consulting businesses we studied purposefully limit their mandates to the development of proofs of concept to avoid the issue of data in production altogether.

When organizations assume that data in the lab and data in production are one and the same, they hide a large part of the complexity of data accessibility. This means that AI initiatives may be quick to start but take considerable, unplanned time and effort to operate in production.

How to Manage Data Accessibility for AI

Data accessibility issues can affect the success of AI in an organization. To alleviate them, we offer three recommendations to better manage data accessibility for AI: Develop stakeholders’ understanding of data accessibility as a business issue, acknowledge the value of organizational data for AI, and consider data accessibility throughout the AI life cycle.

Promote data accessibility as a business issue first and a technology issue second. All stakeholders in AI initiatives must develop a shared understanding of data accessibility as an integral part of data quality, affecting not just IT but also operations and requiring attention throughout the AI life cycle. Stakeholders need to pool their role-specific knowledge about data accessibility in order to build a common understanding of it as a business issue.

Changing how we think about data accessibility can take time and require conversations and collaboration that didn’t occur before. In one of the AI consultancies we studied, data accessibility has become part of the early, high-level discussions that staff members have with their clients and is incorporated in the ideation phase of the AI life cycle. In other cases, ongoing conversations among stakeholders ensure that alignment between the needs of AI teams and the organization’s resources (such as the IT staff) is established and maintained over time. Simply establishing data accessibility as an important business issue at the strategic level will likely not be enough. Ongoing effort and attention are required. Otherwise, data accessibility problems will remain simply technology problems, landing in the IT staff’s backlog of things to fix — if possible.

This also means educating AI team members on the importance of identifying and raising data accessibility issues to management. The technological fix for a data accessibility issue may be simple, but it may require going through a lengthy approval process, and security policies may render data inaccessible. In these cases, there is no technological fix, and the only possible solution, if the business case formulated in the ideation phase supports it, is to engage in meaningful discussions regarding relaxing some aspect of a security policy to support the work of the AI team.

Consider any data as a potential candidate for AI. Data accessibility does not matter only for current AI business cases. The diverse applications of AI to the many problems organizations face mean that any data has the potential to serve as valuable input for an AI initiative. A key element in improving data accessibility throughout the organization is to move beyond the conception that data is solely the byproduct of operations. In other words, the fact that some data has reached the end of its useful life cycle for the execution of a given process does not mean that it cannot contribute to creating value as an input for AI. In one of our cases, years of operational logs routinely collected by heating, ventilation, and air conditioning systems now serve as the input for the creation of preventive-maintenance models.
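
As a rough sketch of how such byproduct logs can feed a model, consider the example below. The column names, the failure label, and the choice of a random forest are assumptions made for illustration, not details of the case we studied.

```python
# Illustrative reuse of operational logs as AI input.
# Assumed columns in the log export: unit_id, supply_temp, vibration, run_hours,
# and a label (failed_within_30d) derived from maintenance records.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train_preventive_maintenance_model(log_csv: str) -> RandomForestClassifier:
    logs = pd.read_csv(log_csv)
    features = logs[["supply_temp", "vibration", "run_hours"]]
    labels = logs["failed_within_30d"]
    return RandomForestClassifier(n_estimators=200).fit(features, labels)


# Hypothetical usage: years of HVAC logs exported from the building-management system.
# model = train_preventive_maintenance_model("hvac_operational_logs.csv")
```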

The vision of a data-driven culture in which employees rely on data to guide their decisions usually focuses on the end product — the use of extracted data — and not the process required to bring the data to these employees. Business lines must understand that their data output potentially feeds input for AI. For example, the work logs produced by traveling service employees are traditionally used to monitor productivity and to ensure that service call quotas are met. But if organizations have access to fine-grained, retrospective data on the type and duration of service calls, they can use this data as an input for AI to optimize and personalize scheduling based on employee expertise. The cross-functional awareness of the dual role of data as both output (in this case, the end time of a call for traveling service employees) and input (the duration of service calls, used by AI to optimize scheduling) can influence the selection of a solution or a vendor, or the configuration of a system.

The most successful business cases we studied were those where operational processes were built with the idea that their supporting systems would eventually serve data to AI. In one instance, the AI lead at a large financial institution told us that process reengineering and system upgrades (such as migrating to cloud-based services) are important requirements to support the incorporation of AI into existing business processes. A critical element supporting this achievement is the use of governance mechanisms that make data retrieval and access easy for both humans and machines.

Address data accessibility at the outset of AI initiatives. The iterative model development in AI life cycles does not preclude thinking about data accessibility early in AI initiatives and bringing in the right expertise near the start. In some of our cases, this meant enlisting the participation of software engineers and IT employees during the blueprint phase so that the high-level parameters of the final AI-embedded product or service would be widely known and concerns about data accessibility could be raised accordingly. More important, doing so ensures that the future integration of AI within the organization’s infrastructure is taken into account while minimizing surprises later in the process. To that end, we encourage managers to make a clear distinction between the task of getting data to build AI and that of making data accessible in production. It’s fine to build AI in a controlled lab environment, but that doesn’t mean that its future use in production should be abstracted away.

A key benefit of this approach is that it allows part of the work to be performed in parallel. For instance, data engineers can be encouraged to have discussions with the IT staff early on to establish a data road map. By the MVP phase of the life cycle, most of the data engineering pipelines will be ready to connect to the production infrastructure. Another possible pattern is staggering tasks related to data accessibility, data engineering, and model building across different iterations, similar to what has been proposed in data-intensive projects.2 This enables synchronicity across activities while incorporating a certain degree of lag that can permit adjustments if needed. Even if the AI initiative does not move past the proof-of-concept or MVP phase after all these efforts, enhanced data accessibility at the organizational level will invariably be useful for future AI initiatives.

The view that data is a key corporate asset has become widespread among business leaders, as has the expectation that the AI-powered systems consuming that data will drive new competitive advantage. But not infrequently, the devil is in the details of implementation. A lack of understanding by all stakeholders of the full dimensions of data quality, and the siloing of AI initiatives away from operations, can limit the impact of AI projects or derail them altogether. Enterprises gaining the most significant benefits from AI understand that to push it outside of R&D and integrate it into their operations, they need to value data as input as much as output and give data accessibility the attention it deserves.

References

1. R. Kimball and M. Ross, “The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling,” 3rd ed. (Indianapolis: John Wiley & Sons, 2013).

2. R. Hughes, “Agile Data Warehousing Project Management: Business Intelligence Systems Using Scrum” (Waltham, Massachusetts: Morgan Kaufmann, 2013); and K. Collier, “Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing” (Boston: Addison-Wesley, 2011).

i. R.Y. Wang and D.M. Strong, “Beyond Accuracy: What Data Quality Means to Data Consumers,” Journal of Management Information Systems 12, no. 4 (spring 1996): 5-33; L.L. Pipino, Y.W. Lee, and R.Y. Wang, “Data Quality Assessment,” Communications of the ACM 45, no. 4 (April 2002): 211-218; and B. Baesens, R. Bapna, J.R. Marsden, et al., “Transformational Issues of Big Data and Analytics in Networked Business,” MIS Quarterly 40, no. 4 (December 2016): 807-818.

Reprint #: 62209
