Data scrubbing, sometimes called data cleansing, is the process of detecting and then removing or correcting information in a database that contains some sort of error. The data may be wrong, incomplete, incorrectly formatted, or a duplicate of another entry. Many data-intensive fields of business, such as banking, insurance, retail, transportation, and telecommunications, may use specialized software applications to clean up a database's information.
Errors in databases can result from human error in entering the data, the merging of two databases, a lack of company-wide or industry-wide data coding standards, or old systems that contain inaccurate or outdated data. Before computers had the capabilities to sort through and clean data, most scrubbing was done by hand. Not only was this time-consuming and expensive, but it often led to even more human error.
The need for data scrubbing becomes clear when considering how easily errors can be made. In a database of names and addresses, for example, one entry might be Bobby Johnson of Needham, MA, while another is Bob Johnson of Needham, MA. This variation is most likely an error, with both entries referring to one person. A computer would normally treat the information as though it described two different people, however. Specialized data scrubbing software is able to recognize the discrepancy and fix it.
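The Bobby Johnson and Bob Johnson case can be illustrated with a minimal Python sketch. It is not how any particular scrubbing product works; it simply assumes a small, hypothetical nickname table (`NICKNAMES`) and treats two records as the same person when their normalized name, city, and state agree.

```python
# Hypothetical nickname table; a real tool would rely on a much larger one.
NICKNAMES = {"bobby": "robert", "bob": "robert", "rob": "robert"}


def match_key(name, city, state):
    """Build a comparison key so that nickname variants collapse together."""
    first, *rest = name.lower().split()
    first = NICKNAMES.get(first, first)  # map a known nickname to one form
    return (first, " ".join(rest), city.strip().lower(), state.strip().upper())


def find_duplicates(records):
    """Return (first_seen, duplicate) pairs for records sharing a key."""
    seen = {}
    duplicates = []
    for record in records:
        key = match_key(*record)
        if key in seen:
            duplicates.append((seen[key], record))
        else:
            seen[key] = record
    return duplicates


records = [
    ("Bobby Johnson", "Needham", "MA"),
    ("Bob Johnson", "Needham", "MA"),
    ("Alice Smith", "Boston", "MA"),
]

for original, duplicate in find_duplicates(records):
    print(f"{duplicate[0]} looks like a duplicate of {original[0]}")
```

A plain string comparison would keep both Johnson entries; the normalized key is what lets the program flag them as one person. Matching on a key like this can also merge two genuinely different people who share a name and town, which is one reason flagged pairs are often reviewed before being merged.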
While these small errors may seem trivial, the problem can be multiplied by the millions when corrupt or erroneous data is merged into multiple databases. This so-called "dirty data" has been a problem for as long as there have been computers, but it is becoming more critical as businesses grow more complex and data warehouses merge data from multiple sources. There is no point in having a comprehensive database if that database is filled with errors and disputed information.
Companies that need specialized scrubbing software can either develop it in-house or buy it from a variety of vendors. The software is not cheap and can cost anywhere from $20,000 to $300,000 US Dollars (USD). It often requires some customization to fit the business's specific needs. The software uses algorithms to standardize, correct, match, and consolidate data, and it can work with single or multiple sets of data.
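The four steps named above (standardize, correct, match, consolidate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any vendor's product: the field names, the `STATE_CORRECTIONS` table, and the exact-match rule are all hypothetical choices made for the example.

```python
# Hypothetical correction table mapping known bad values to canonical ones.
STATE_CORRECTIONS = {"Mass.": "MA", "Massachusetts": "MA"}


def standardize(record):
    """Collapse stray whitespace and apply (crude) consistent capitalization."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "city": " ".join(record["city"].split()).title(),
        "state": record["state"].strip(),
        "phone": record.get("phone", "").strip(),
    }


def correct(record):
    """Replace known wrong or non-standard values with canonical ones."""
    fixed = dict(record)
    fixed["state"] = STATE_CORRECTIONS.get(fixed["state"], fixed["state"]).upper()
    return fixed


def match(records):
    """Group records that agree exactly on name, city, and state."""
    groups = {}
    for record in records:
        key = (record["name"], record["city"], record["state"])
        groups.setdefault(key, []).append(record)
    return list(groups.values())


def consolidate(group):
    """Merge a group into one record, filling empty fields from later copies."""
    merged = dict(group[0])
    for record in group[1:]:
        for field, value in record.items():
            if not merged.get(field):
                merged[field] = value
    return merged


def scrub(records):
    cleaned = [correct(standardize(r)) for r in records]
    return [consolidate(group) for group in match(cleaned)]


raw = [
    {"name": "  bob  johnson", "city": "needham", "state": "Mass.", "phone": ""},
    {"name": "Bob Johnson", "city": "Needham ", "state": "MA", "phone": "555-0100"},
]
print(scrub(raw))
```

The order matters in this sketch: the two raw records only match because standardizing and correcting have first brought them to the same form. The customization mentioned above would show up here as business-specific correction tables and matching rules.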
Data scrubbing is sometimes skipped during a data warehouse implementation, but it is one of the most critical steps in producing a good, accurate end product. Because mistakes will always be made in data entry, there will always be a need for this process.
Great article. I agree this is something missed in most data warehouse implementations. I think it's because most of the implementers are in IT and don't see the resulting pain from the business side. In our business we give business user leads (marketing, customer service, data entry) some tools to clean the data themselves at a much lower price point than mentioned here. We use Data Ladder.
I can agree from experience that if the problems aren't addressed, you get a burnout effect in the business where the application/data isn't trusted and there is a large decline in user activity.