You likely see instances of “Dirty Data” in your business, but what about the instances you can't see?

Many companies struggle with reporting, budgeting, and forecasting because of data having various problems that makes it hard to use or impossible to use – hence the term “Dirty Data.”

Sometimes, dirty data is easy to spot – for example, you spot a data point on a chart that is significantly different from all other data points, and ultimately learn this was caused by a keying error. Other times, dirty data is buried so deeply it takes a lot of digging to uncover.

This month, I've been looking “under the hood” of a client's main business processing system. Why? Because an analysis they need to do, which should be straightforward in most systems, has turned out to be much more challenging than even I expected. So, I'm going to take you “behind the scenes” with a client to show how this particular type of dirty data originated and why it was so hard to find.

Background

The best way I've found to describe what this client does is compare it to an organization that's familiar to many business persons – the American Institute of CPA's (AICPA). Key functions of the AICPA (and similar professional organizations) are:

  • Provides services to its members.
  • Develops the Uniform CPA Exam, which is part of each state's licensing requirements to become a CPA.
  • Creates specialty certifications, and handles the related application and examination requirements.
  • Advocates to elected officials on behalf of the profession and the public.
  • Maintains a foundation to support charitable initiatives.

This client is basically the AICPA for its profession, except it does not provide membership and the related services. Its focus is on two types of “products” – examinations and certifications – and handles over 70,000 applications for these products every year.

Prior to implementing their current system in 2022, this client was on two different platforms to run its day-to-day examination and certification operations. Neither was well-suited to the volume or complexity of these processes. Thus, in 2020, they selected a different system – basically a turbocharged Help Desk system – as the platform for a new business system

A New Business System

The problem with a Help Desk system? Its main focus is on internal processes, and it generally has a number of predefined steps through which a ticket is routed. True, users outside a company can generally log a help desk ticket, yet the work of resolving the issue is done inside the company and the resolution is reported to the user.

Now, imagine building an Order Entry system on top of a Help Desk system. Here’s a partial list of required new functionality:

  • Additional information has to be presented to an external user to let them select what they want to purchase.
  • Depending on the purchase, the user might need to upload a transcript or list of continuing education courses.
  • To process a user’s payment, the system has to set up the purchase, securely connect the user directly to a “Merchant Provider” to take the credit card payment information and process the charge, then wait for the Merchant Provider to report the transaction status and provide a confirmation code, and finally store these two pieces of information to complete the transaction.
  • An order acknowledgement and payment receipt have to be sent to the user.
  • The order then has to be routed internally for review and processing.

If the order includes an examination, the system also needs to hand off information on the applicant and the desired examination to a testing company, which then communicates directly with the applicant to schedule the examination.

At the appropriate time, the testing company reports back to our client – the applicant's score if the exam was taken, or the applicant's status as “Canceled” or “No Show” if the scheduled test date has passed.

The Danger of Software Customizations

Developing order processing capabilities and web-enabled data transfers with third parties on top of a help desk system is a major Software Customization project. In fact, it's such a major customization that it should be treated as a Software Development project. As I've previously described, such projects should include these major stages involving people with the right skill set at the right times:

  • Functional Design – End User and Business Analyst
  • Technical Design – Business Analyst and Systems Analyst
  • Application Development – Systems Analyst and Application Developer
  • End User Feedback – All these personnel plus other End Users
  • Acceptance Testing – All these personnel plus other End Users
  • Rollout, Shakedown, and Support – All these personnel plus other End Users

In this software development process, the communication chain looks like this:

  • End User → Business Analyst → Systems Analyst → Application Developer.

Each person in the chain receives information from the prior person, organizes it and fleshes it out to meet the needs of the next person in the chain, and then goes over this information with the next person.

For my client's project, the communication chain was basically this (in reality, there were multiple End Users on the project):

  • End User → Application Developer.

If both persons had been well-versed in how business processes need to be documented and reviewed to ensure all necessary situations are covered, and familiar with the workings of modern ERP (Enterprise Resource Planning) systems, things could have gone well. Unfortunately, neither person had these key areas of knowledge, so – without the skills of a Business Analyst and a Systems Analyst on the customization team – the project limped along.

When these customizations were completed and the system was finally rolled out, bugs quickly surfaced and required a lot of effort to address.

Remaining Limitations

Even after the bugs were addressed, there were two major limitations – one known that affected productivity, and the other unknown that produced dirty data:

  • Inability to Process a Refund Through the System – In modern ERP systems, a refund is basically a “negative sale” and can be easily processed. Not so in this customized system. While refunds could be authorized through the system, they had to be manually processed by the accounting staff – and this meant they were not directly linked to the original purchase and payment. This was the known limitation.
  • Incomplete “Closing the Loop” on the Examination Process – If the customized system reliably handled all possible outcomes of the examination process, it would be simple to match information from the testing company's monthly invoice to the test outcomes documented in the customized system. There can be at least 5,000 examination results every month, and if even only 4% of the items on the vendor's monthly invoice – that is, at least 200 per month, or over 2,400 per year – cannot be matched with records in this system because of incomplete data, there can be problems. This limitation was unknown, until my associate and I found it. While I did come up with an alternate method of matching that might cut the 4% down to 2% (and possibly lower), the best option is to find and fix the root cause of the problem.

Why are these two limitations important? Because the first causes inefficiencies in both the operations and accounting groups, while the second hinders the client's ability to match their largest revenues with their largest non-personnel expense accurately.

The latter limitation means that revenues are not necessarily recognized in the same period as expenses, causing problems for the top executives and managers trying to evaluate profitability of the various examinations and certifications.

Dirty Data Takes Many Forms

Dirty data can take many forms – one article from November 2024 lists the following types of dirty data and discusses ways to cleanse each:

  • Duplicate Data
  • Insecure Data
  • Outdated Data
  • Incomplete Data
  • Inaccurate Data
  • Incorrect Data
  • Inconsistent Data
  • Hoarded Data

This client’s situation illustrates one form of dirty data – Incomplete Data for the examination outcomes – and possibly also Inconsistent Data if I am unable to detect a pattern in the Incomplete Data.

What's the best way to prevent dirty data? Design and develop data validation and data completeness capabilities into the software – to do so, you need to ensure any software development or software customization project has the right set of skills involved during each stage of the project.

Team members with the right skills help produce the desired outcome – data that’s crystal clean and 100% complete. Otherwise, you risk having a lot of dirty data to clean up.

Sincerely,

Todd L. Herman

PS – Please check out our most recent video - Wash Your Hands of Dirty Data, for tips on managing your company data.

PPS – Here are some additional articles related to collecting, maintaining, integrating, storing, and analyzing key business data:

 

 

Affiliations