How to Manage Data…Better
I have yet to meet a person for whom the very thought of manual data cleaning or ETL (extract, transform, load) doesn’t elicit a tortured groan. Let’s face it: a root canal is a more attractive prospect than any data project that starts with the word “manual.” These projects are repetitive, mentally taxing, involve no creativity, and often are full of mistakes. Luckily, as explained in my previous post, there’s a better way: automation. When one employs an efficient programmatic strategy and manages his or her expectations for what can be done with the resources available, automating one’s data cleaning with a language such as R is an effective way to save time, energy, money, and sanity.
I will outline a few strategies, tips, and tricks for approaching data cleaning using the R programming language in this post. This is certainly neither an exhaustive list nor is it a precise technical exploration—my goal is to use my experience to add a little color to accompany the monotonous programming manuals floating around the Internet. For a more granular exploration of the topic—complete with technical details—I’d recommend reading the discussion paper An Introduction to Data Cleaning with R by Edwin de Jonge and Mark van der Loo
Managing Your Expectations for Automation-
While reading about the latest trends in data science, I often see data scientists and programmers referred to “wizards” or “ninjas.” While flattering, these titles can give the uninitiated a false sense of what data science-based computer programming actually does. As per Wikipedia (emphasis is mine): “the purpose of programming is to find a sequence of instructions that will automate performing a specific task or solving a given problem.” Ultimately, any program that one writes is only going to be as creative as its author.
Depending on how unstructured the data set is, it is quite possible that only 90-99% of the normally manual work will be automatable. Outliers are an unfortunate reality. The efficiency of the program is purely dependent on the programmer’s skill and his or her familiarity with the data set at hand. Don’t let an imperfect first attempt be discouraging while still learning R. Even an unrefined script can still bring hours or days of work down to minutes.
Searching for Patterns in Strings-
Pattern detection is the meat and potatoes of automating data clean-up. When looking at unstructured data, there is usually some kind of detectable pattern or patterns in unclean data that can be easily amended in order to consider the data cleaned or structured. Below are some strategies to identify and fix these patterns.
When working with string (text) data, look for repeated groups of characters within strings (called substrings) that need to be replaced. Make note of the groups, how frequently they occur, and what they need to be replaced with.
Below are a few examples of kinds of patterns to look for in data cleaning:
- If there is a free text field in a public-facing form for job title and a small but significant portion of respondents enter “Principal” instead of “Principle” it can be difficult to get an accurate count of how many Principles filled out the form.
- Sometimes additional characters can be added to the beginning or end of a string (called prefixes and suffixes), especially when data is entered manually. For example, what if a space had been entered after the end of many of the strings in a database? While invisible to the naked eye, this can severely impact analysis.
- Every space or has been removed or replaced with a “-” or “.” as a delimiter.
- A character limit has cut off the end of the same kind of string in many cells.
- A database contains a lead’s e-mail address, but the analyst needs the lead’s website.
- Certain data fields have been coded inconsistently in Salesforce or Marketing Automation, creating two or three names for only one group or category.
The above can be resolved with R’s stringr package. Stringr is basically a more powerful version of the Find and Replace functions in Microsoft Office. It will allows its user to replace entire strings or substrings based on specific rules it is given. The key is to identify these patterns in the data and devise a substitute pattern that makes the data more correct.
Another thing to look out for in unstructured data is errors as a result of the data being run through a database or software first. Many kinds of software need to translate certain characters into other characters or transform data in different ways depending on what’s in the dataset – usually due to technical limitations. When cleaning datasets like these, the best way approach to take is to reverse-engineer the process that the database of software performed to transform the original data into data it found suitable then write a code to “undo” the problems it caused.
The above are only very basic issues with string data. Writing a code once and then just running it repeatedly to clean various data sets is a far more efficient solution than having to find/replace or (perish the thought) manually fix each entry manually in Excel.
I recommend R to professionals looking to automate part of their manual data work is because it’s very easy to pick up and accomplish simple tasks within the first week or two of using it. The above strategies are very basic solutions to some of the most common data problems and have the potential to save you days of work with only a few hours invested learning the basics of R. In the ever-expanding #MarTech technology-driven world, this is another tool to learn and manage, but if you rely on or manage data, it is well worth it.
Stay tuned for the next post with more tips for better data management with R. Yes, you will want to convert to R after reading these.