Deodorizing Your Data
To deal with “smelly” data, try refactoring your analytics processes.
In programming, people describe their sense of underlying fragility in code by noting that it “smells bad.” A bad code smell is an amalgam of signals (such as size, complexity, duplication) that combine to indicate larger, deeper problems. Similarly, analytics in organizations can stink — and it takes more than a spritz of air freshener to solve the problem.
So how do you know if your analytics needs deodorizing? Symptoms abound. Poking around with data may uncover more quality problems than insightful answers. Results may be sketchy and change drastically under the least bit of scrutiny. Different parts of the organization may hold conflicting versions of the truth. Analysis that should be routine is instead redone ad hoc, duplicating effort with every iteration.
Certainly bad data smells are not intended — but they can be prevented by understanding how they develop. Some of the factors contributing to smelly data include:
- Complex realities. Analytics compiles data that are snapshots taken in a complex world — and these snapshots don’t always fit into well-structured or clean models. Furthermore, that world continues to change, even if systems don’t. For example, evolving businesses and requirements led to “14 separate health plans with inconsistent approaches to defining similar types of data” at the health care provider WellPoint, according to a recent MIT SMR case study. Each system likely made sense in isolation or at the time it was developed — but any later attempt to generate analytical results must synthesize each of these disparate sources of data.
- Acquisitions. Organizations often grow through acquisition of other, previously independent organizations, each with idiosyncratic systems. Tom Fontanella, senior IS director at Sanofi, reports that a master data management project that Genzyme undertook before being acquired by Sanofi found that “30-day payment terms [were] expressed as Net 30, 30, 30 Day, 30days, LC30, 030NL …” due to a series of acquisitions over a number of years, often in different areas of the world. This lack of consistency meant that the data on 30-day payment was squirreled away under a range of labels — a malodorous situation indeed.
- Urgency. Operational pressure can be intense.
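To make the payment-terms example from the acquisitions item concrete, here is a minimal sketch in Python of how such labels might be collapsed into one canonical form. It is an illustration only, not the method Genzyme or Sanofi used. The function name `normalize_payment_term`, the canonical `NET<days>` format, and the rule that the first run of digits gives the number of days are all assumptions made for this example. Only the six labels quoted above are used as input.

```python
import re
from collections import Counter
from typing import Optional

# Labels quoted in the article; all are reported to mean 30-day payment terms.
RAW_LABELS = ["Net 30", "30", "30 Day", "30days", "LC30", "030NL"]


def normalize_payment_term(label: str) -> Optional[str]:
    """Map a free-form payment-term label to a canonical 'NET<days>' string.

    Assumption (hypothetical rule): the first run of digits in the label
    is the number of days. Returns None when no digits are present, so
    the value can be set aside for manual review instead of being guessed.
    """
    match = re.search(r"\d+", label)
    if match is None:
        return None
    days = int(match.group())  # int() drops leading zeros: "030" becomes 30
    return f"NET{days}"


def summarize(labels):
    """Count how many raw labels fall under each canonical term."""
    return Counter(normalize_payment_term(label) for label in labels)


if __name__ == "__main__":
    for canonical, count in summarize(RAW_LABELS).items():
        print(canonical, count)
```

A rule this simple would not survive contact with real master data. A label such as LC30 may carry extra meaning beyond the number of days, and labels with no digits at all need a human decision. The point of the sketch is that once the variants are mapped to one label, a question about 30-day payment terms can be answered with a single lookup.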