Data wrangling

Let us do the data wrangling: you have better things to do

Ever heard the phrase 'Data science is 80% data wrangling'? It took hold after a widely reported 2016 survey of data scientists, and despite vast technological improvements since then, wrangling still accounts for a huge proportion of most data professionals' time.1 That's where Datasight® can help. Our team thrives on wrangling data, and we understand that within every messy dataset lies the key to your next breakthrough. From the initial discovery phase to the final stages of publishing, we're skilled in every aspect of transforming your raw data into an accessible, usable form that's ready for analysis. We can handle the time-consuming data wrangling process for you, freeing you up to focus on your core objectives.

Wrangling data improves its quality and usefulness

Quality data is the foundation of informed decision-making, so those decisions are undermined when your data is messy, incomplete or in formats you can't use. That's where data wrangling comes in. Also known as data munging, data wrangling is like giving your data a makeover: it ends up not just easier on the eyes, but easier to work with and understand.2 While the exact methods vary depending on the type of data and the objectives of your project, data wrangling typically involves the six steps below.3

Step 1: Discovery

The discovery phase, also known as data exploration, is all about getting to know your data. It's like being a detective, looking at the evidence and trying to make sense of it.4 You familiarise yourself with the data with the aim of understanding its underlying structures, patterns and potential uses, along with any issues such as missing, inconsistent or incorrect values. Understanding the breadth, depth and limitations of your data streamlines the subsequent data wrangling steps and ensures the overall quality and reliability of the final dataset.5

Even though discovery is the first step, it's not something you do just once at the start of the data wrangling process. It recurs throughout, as new insights and issues often surface when data is further transformed and cleaned.6 Essentially, the discovery phase sets the groundwork for the success of the entire data wrangling process.
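As a concrete illustration, here is a minimal discovery pass sketched in Python with the pandas library. The dataset, column names and values are all made up for the example; the point is the kind of quick profiling - shape, types, missing values, duplicates, invalid entries - that the discovery phase involves.

```python
import pandas as pd

# Hypothetical sales data with some typical quality issues.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "amount": [19.99, None, 250.0, 8.50],
    "signup_date": ["2021-03-01", "2021-13-01", "2021-04-15", "2021-05-20"],
})

# Quick profile: size, column types, missing values, repeated IDs.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())                             # missing values per column
print(df.duplicated(subset="customer_id").sum())   # duplicated customer IDs

# Parsing dates surfaces invalid entries as NaT (e.g. month 13).
dates = pd.to_datetime(df["signup_date"], errors="coerce")
print(dates.isna().sum())
```

A few lines like these are often enough to reveal the issues (a missing amount, a duplicated ID, an impossible date) that shape the rest of the wrangling process.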

Step 2: Structuring

The structuring phase is all about transforming and reorganising raw data into a format that's ready to be used.7 It's like sorting a messy pile of papers on your desk, where the papers are raw data. Some of the papers are no longer relevant, others are important but out of order, and a few are handwritten while others are auto-generated and contain completely new information. Once they're sorted into piles of similar papers, you may need to create new piles that build on the information, such as using a pile of birth-date papers to create a new pile of age-group papers. It all depends on what makes your data easier to work with and most useful for your project.8 In short, structuring is about turning that pile of papers into a well-organised desk that's ready for work.
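The birth-date-to-age-group example above can be sketched in a few lines of pandas. The names, dates and age brackets below are invented for illustration; the technique - parsing raw text into proper types, then deriving new columns from existing ones - is the general one.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo"],
    "birth_date": ["1959-06-01", "1984-02-17", "2001-11-30"],
})

# Parse the raw text into proper datetime values.
people["birth_date"] = pd.to_datetime(people["birth_date"])

# Derive a new column ("pile") from an existing one: approximate age
# as of a fixed reference date...
as_of = pd.Timestamp("2024-01-01")
people["age"] = (as_of - people["birth_date"]).dt.days // 365

# ...then bucket ages into groups with pd.cut.
people["age_group"] = pd.cut(people["age"], bins=[0, 30, 50, 120],
                             labels=["<30", "30-49", "50+"])
print(people)
```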

Step 3: Cleaning

The cleaning phase is exactly as it sounds - giving your data a good scrub to remove any errors that might throw your analysis off course, such as missing values, duplicated rows, incorrect entries or outright outliers. The process of finding and resolving errors is like finding odd socks in your laundry that need to be sorted out.9 During this phase you need to make decisions about what to do with the odd socks (errors). For example, do you ignore them, make an educated guess about what they should be, or remove them altogether? The decision will depend on factors such as the importance of the variable, the type of analysis, the volume of missing or incorrect data and the potential impact on your analysis.10 Cleaning up data is time-consuming, but it is absolutely necessary to make sure your data is accurate and reliable.
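Those three choices - ignore, impute, or remove - can each be a one-liner in pandas. A minimal sketch on hypothetical order data (the figures and the 10×-median-absolute-deviation outlier threshold are made up for the example):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 104],
    "amount":   [25.0, 40.0, 40.0, None, 9_999.0],
})

# Remove: drop exact duplicate rows (the same order recorded twice).
orders = orders.drop_duplicates()

# Educated guess: impute the missing amount with the column median.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Flag extreme outliers for manual review rather than silently deleting them.
median = orders["amount"].median()
mad = (orders["amount"] - median).abs().median()
orders["outlier"] = (orders["amount"] - median).abs() > 10 * mad
print(orders)
```

Note that the outlier here is flagged, not dropped: whether a 9,999 order is a data-entry error or a genuine bulk purchase is exactly the kind of judgement call this phase requires.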

Step 4: Enriching

The enriching phase involves adding data from other sources to make your dataset more useful. For example, if you're analysing customer behaviour data for an ice-cream brand, you could add external data about weather, demographics and market trends to give your analysis more depth and perspective.11 Enriching is not just about adding more information, though; it's about adding the right information - data that is relevant, reliable and will genuinely enhance your analysis.
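In practice, enriching usually means joining datasets on a shared key. A minimal sketch of the ice-cream example, with invented sales and weather figures joined on date:

```python
import pandas as pd

# Hypothetical daily ice-cream sales...
sales = pd.DataFrame({
    "date": ["2023-07-01", "2023-07-02", "2023-07-03"],
    "units_sold": [120, 340, 95],
})

# ...and external weather data keyed on the same dates.
weather = pd.DataFrame({
    "date": ["2023-07-01", "2023-07-02", "2023-07-03"],
    "max_temp_c": [22, 35, 19],
})

# A left join keeps every sales record even when weather data is missing.
enriched = sales.merge(weather, on="date", how="left")
print(enriched)
```

The choice of a left join reflects the "right information" point: the external data should augment your records, never silently discard them.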

Step 5: Validating

The validating phase is where you check whether your data is consistent and high-quality. It's like having a building inspector check out a house before you buy it to make sure it's in good shape.12 The process of validating data typically involves creating a set of rules that your data must comply with. For example, you might create a rule that all email addresses in your dataset must contain an '@' symbol. Checking the rules could be done manually, but it's usually automated with validation software due to the volume and complexity of the data being managed. In that case, the software would check each email address in your dataset to verify compliance with the rule and flag non-compliant records for manual review.
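The email rule above is simple enough to sketch directly. This example uses pandas on made-up customer records; real validation suites apply many such rules and collect all the violations together.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["ana@example.com", "ben.example.com", "cleo@example.org"],
})

# Rule: every email address must contain exactly one '@' symbol.
valid = customers["email"].str.count("@") == 1

# Flag non-compliant records for manual review instead of deleting them.
flagged = customers[~valid]
print(flagged)
```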

The trick to effective validation is devising a truly comprehensive set of rules that continues to evolve as the data does. Because these checks are usually automated, developing them often involves complex programming and testing, which can be time-consuming.

Step 6: Publishing

Publishing is the final step in data wrangling. It's about sharing your clean, structured and checked data with the people who need to use it. How you share the data can vary, from a well-organised database or a dashboard with visual graphics for quick insights through to a downloadable spreadsheet that people can play around with.13 Publishing isn't just about making data available, though. It's about making sure the data is in the most useful format, accessible only to people with the right permissions, and stripped of personal identifying information where necessary to protect privacy.

In some cases the data might need to be updated regularly to keep it relevant, in which case setting up a smooth data pipeline for automated publishing becomes a necessary part of this step.14 It's important to also publish extra information, called metadata, to help users understand the source and reliability of the data. Metadata often helps explain unusual trends caused by, for example, changes in definitions, scales or systems over time.
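One lightweight way to publish metadata alongside data is a small "sidecar" file. The sketch below, in Python with pandas, writes a hypothetical summary table to CSV together with a JSON metadata file; the filenames, fields and the note about a definition change are all invented for illustration.

```python
import json
import pandas as pd

clean = pd.DataFrame({
    "region": ["North", "South"],
    "avg_sales": [1280.5, 940.2],
})

# Metadata travels alongside the data so users can judge its source,
# freshness and reliability - including known breaks in the series.
metadata = {
    "source": "2023 sales extract (hypothetical)",
    "last_updated": "2024-01-01",
    "row_count": len(clean),
    "notes": "Sales definition changed in 2022; earlier figures not comparable.",
}

clean.to_csv("sales_summary.csv", index=False)
with open("sales_summary.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

In an automated pipeline, a step like this would run after every refresh so the published data and its metadata never drift apart.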


References
1 Press, G. (2016, March 23). Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes. Retrieved from https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
2 Krishnan, K. (2013). Data wrangling: Munging in R with SQL and MongoDB. Packt Publishing Ltd.
3 Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2011, October). Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 3363-3372).
4 O'Neil, C., & Schutt, R. (2013). Doing Data Science. O'Reilly Media.
5 Rahlf, T. (2018). Data Wrangling with R. Springer.
6 Lantz, B. (2019). Machine Learning with R. Packt Publishing.
7 Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.
8 Hull, K. (2019). Machine Learning with R, the tidyverse, and mlr. Manning Publications.
9 Kim, G. (2018). The Data Cleaning Prerequisite: Detection. In Data Cleaning (pp. 19-39). Springer.
10 Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
11 Chapelle, O., Manavoglu, E., & Rosales, R. (2015). Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4), 1-34.
12 Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 1-52.
13 Stodder, D., & Henschen, D. (2016). Visual analytics for making smarter decisions faster. TDWI Best Practices Report, 21(2), 1-36.
14 Kelleher, J. D., & Tierney, B. (2018). Data science. MIT Press.