Whether you’re mining your own data for ideas to help you grow your business, or you’re a full-time data analyst whose job is to reveal trends and information about others’ organizations, one thing’s certain: data quality is critical to your success.
However, even in the era of “big data” – with software that helps mine data for important discoveries and insights – cleaning up data by hand (aka “data wrangling” or “data munging”) is still a huge part of any project. In fact, it’s estimated that data scientists “spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets,” according to an article in the New York Times.
If you’re a business owner hoping to grow your company, your data informs what you do, how you grow, and where you head next. Therefore, your data must be as close to perfect as possible. But when you’re doing so much data cleansing by hand, making or overlooking errors – especially simple things that you’ve done hundreds of times – is easy to do. Which is why we’re taking a step back and looking at some of the basics that should be part of your best practices.
- Check spelling. Running a spell check is easy to forget, but it’s a simple step that can help make your data easier to understand. Not only can you use a spell checker to find misspelled words, but depending on the data cleansing tools you use, it might be able to help you find values that aren’t consistently used – like product or company names – which you can then add to a custom dictionary.
- Remove spaces. Leading, trailing, or multiple embedded space characters, and nonprinting characters can cause results that are difficult to understand. To avoid them, be sure to remove spaces or unwanted characters before attempting to sort, search or filter your data.
- Create uniform time and date formats. Because there are so many different time and date formats that can be confused with numbered part codes or other strings that contain slash marks, hyphens, periods, and colons, all times and dates should be reformatted.
- Find and replace text. If your data has a common leading string that would be best to remove, like a label followed by a colon and space, or a suffix, like a parenthetic phrase at the end of the string that’s unnecessary, you can use a “find and replace” feature in your spreadsheet to automatically find instances of that text, then replace it with no text or another piece of text.
- Remove duplicate rows. It’s not uncommon to find duplicate rows when you import data, but you have to be aware of the problem to remove them. To do this, filter your data for unique values, confirm that the results are what you want, then simply remove any duplicate values.
- Merge and/or split columns. In order to make your data easier to digest, you might have to split some columns and merge others. For instance, you may want to split a column that contains a full name into two containing a first and last name. Or, you may want to split a column that contains an address field into several columns that contain the individual elements of the address, like street address, city, state, region, and/or postal code. Likewise, you may prefer one column with full names, or a single column for an address, and have to merge columns to organize your data correctly. Simply do what makes sense for your organization – and do it consistently.
In an era where data seems to be everywhere, the importance of using clean data from reliable sources cannot be overstated. After all, if your data is messy, poorly designed, or inaccurate, it won’t give you information you can act on. So once your data is clean, keep it that way. Run periodic checks to make sure that your team is adhering to your standards and best practices. The quality of your data will improve, and you – and your entire team – can act on the data with confidence.