The Cost of Dirty Data
One of Kenway’s core capabilities is Information Insight. Using a combination of our Data Governance, Data Management and Business Intelligence services, we help our clients better leverage their data as an asset. We group these three services together, rather than presenting them separately, because we believe you must have all three to create a robust data environment, which in turn increases our clients’ return on investment (ROI) on their data initiatives. Let’s take a step back and define some terms.
ROI: In Information Insight engagements, our clients’ primary investment is the effort and time it takes for Kenway to collect, clean and present data according to the clients’ requirements, along with any technology expenditures (e.g. environments, software, etc.). The return on this investment generally comprises the speed, quality and value of the environments, visualizations and other deliverables that enable a client to make faster and more accurate decisions.
Robust Data Environment: A data environment is considered robust when it addresses data governance, data management and business intelligence in equal measure. This gives an organization the ability to understand:
- Where their key data comes from
- How that data is transformed and transferred between environments
- How users can access, analyze, and ultimately, make decisions with the data
As you can see below, a key aspect of a robust data environment is balance. Missing aspects of any of these disciplines hampers the value of your environment as a whole.
As a budding data analytics practitioner whose work has mostly focused on data visualization efforts, I see an obvious question embedded in the above visual: is it possible to quantify the impact of data governance on the number of hours it takes to complete a data visualization project? Specifically, if there is a lack of available, clean and trusted data (that is, a lot of “dirty data”), how does that impact the effort required for the project? And how does that, in turn, impact the project’s ROI?
To answer this question, my colleague, Jon Chua, and I kicked off our quest with some basic data collection. We first took an inventory of all the data visualization projects Kenway has completed over the last six years. We then looked at the number of hours each project took and dug into the details by considering the size and complexity of each project’s scope.
By reaching out to the team members involved in each project, we got an understanding of the client’s data governance maturity level at the start of the project. We also gauged the scope of each project by asking how many dashboards were required and their relative complexity.
Having this data in place allowed us to perform some basic exploration. We started by simply plotting the hours taken to complete a project against the number of dashboard views required by the client.
Typically, we would expect a smooth, steadily rising line, with more dashboard views corresponding to more hours. However, as you can see in the above graph, that is not the case. We instead notice a multi-modal distribution, which suggests that other factors, such as data governance, contribute to accurately estimating the number of hours.
To weigh the impact of data governance, we ran a regression analysis relating the hours required to the number of dashboards and the presence of data governance. Utilizing R to perform this analysis, we got a resulting model of:

Hours = 98 + 13 × (Dashboards) + 25 × (Dashboards × Governance Absent)

where Governance Absent is 1 when no formal data governance is in place and 0 otherwise.
Interpreting the above model, we see a baseline of 98 hours to complete a project even when the required number of visualizations is zero, irrespective of the presence or absence of data governance. We can think of this number as the baseline effort required to complete the up-front tasks necessary to create a visualization, such as onboarding, gathering requirements and defining scope.
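As an illustrative sketch of this kind of fit (the actual analysis was done in R; the data below is synthetic, constructed only to mirror the coefficients discussed here), an ordinary least-squares regression with an interaction term can be run as follows:

```python
import numpy as np

# Synthetic project data (hypothetical, for illustration only):
# (dashboards, governance_in_place, total_hours)
projects = [
    (2, 1, 124), (5, 1, 163), (8, 1, 202),   # governance in place
    (2, 0, 174), (5, 0, 288), (8, 0, 402),   # no formal governance
]

dashboards = np.array([p[0] for p in projects], dtype=float)
no_gov = np.array([1.0 - p[1] for p in projects])
hours = np.array([p[2] for p in projects], dtype=float)

# Design matrix: intercept, dashboards, and dashboards x no-governance
X = np.column_stack([np.ones_like(dashboards), dashboards, dashboards * no_gov])
coef, *_ = np.linalg.lstsq(X, hours, rcond=None)
intercept, per_dash, no_gov_penalty = coef
print(round(intercept), round(per_dash), round(no_gov_penalty))  # 98 13 25
```

The interaction column (dashboards × no-governance) is what lets the per-dashboard slope differ between governed and ungoverned clients while sharing a single intercept.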
Next, consider the scenario where the client has formalized data governance. They have roles and responsibilities defined for data ownership and stewardship, they have a clear understanding of their data sources, and they have mechanisms (i.e. policies, procedures and enforcement) in place to ensure that data quality standards are upheld during data entry. In this case, we observe that the total hours required for each additional dashboard is, on average, 13.
On the other hand, the absence of data governance can be incredibly costly. In that scenario, the number of hours required to create a dashboard nearly triples to 38. In other words, each dashboard required, on average, an additional 25 hours (38 − 13) when data governance was not in place.
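Plugging these averages into a simple cost projection shows how quickly the gap compounds with project size. The helper below is a hypothetical sketch using the baseline and per-dashboard figures from our analysis:

```python
BASELINE_HOURS = 98          # up-front effort: onboarding, requirements, scope
HOURS_PER_DASH_GOV = 13      # per dashboard, with formalized data governance
HOURS_PER_DASH_NO_GOV = 38   # per dashboard, without data governance

def estimated_hours(dashboards: int, governed: bool) -> int:
    """Estimate total project hours from the regression coefficients."""
    rate = HOURS_PER_DASH_GOV if governed else HOURS_PER_DASH_NO_GOV
    return BASELINE_HOURS + rate * dashboards

governed = estimated_hours(10, True)      # 98 + 13*10 = 228
ungoverned = estimated_hours(10, False)   # 98 + 38*10 = 478
print(governed, ungoverned, ungoverned - governed)  # 228 478 250
```

For a hypothetical ten-dashboard engagement, the ungoverned client pays roughly 250 extra hours, more than the entire cost of the governed project.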
Poor data governance, or the absence of it, is one of the leading causes of “dirty data.” Per our analysis, we believe that the effort to collect, catalog and cleanse the data in the absence of data governance has a significant, measurable impact on the overall effort of creating data visualizations. Based on these findings, the up-front cost of setting up a data governance framework at your organization has clear benefits when you factor in the future cost savings!