Data Wrangling refers to the process of sourcing, cleaning and amalgamating data sets so that the data is more straightforward to access and is ready for analysis and insight.
Data Wrangling is an activity that has grown out of the rise of Big Data. As organisations develop big data projects, the sources of data they draw on become more varied, the formats less regular and the quality more questionable.
Wrangling is the process of gathering, joining and cleaning these data sets to reduce the risk that analysis based on them will produce spurious or inaccurate results.
Data Wrangling is a discipline within data science, with Big Data projects often requiring dedicated wranglers or data scientists with wrangling skills.
Data Wrangling can generally be split into three distinct areas:
Data Collection or Acquisition – finding data and gaining access to it, then bringing it within your own big data or data analytics framework.
Amalgamating Data – finding ways to join and combine the disparate data sources that have been collected, so that analysis can be performed across the wider, joined-together data.
Data Cleaning – reviewing data to identify “bad” or inaccurate records and removing them from your data sets. Cleaning also involves reformatting data so that it is easier to process and analyse.
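As a rough illustration of the three areas above, the following Python sketch runs a tiny data set through collection, amalgamation and cleaning using only the standard library. Everything in it is an assumption made for the example: the two in-memory CSV sources, the column names (customer_id, name, amount) and the function names acquire, amalgamate, clean and wrangle are hypothetical, and real projects would typically read from files, APIs or databases and apply cleaning rules specific to their data.

```python
import csv
import io

# Hypothetical raw sources; in practice these might be files, APIs or databases.
CUSTOMERS_CSV = """customer_id,name
1, alice SMITH
2,Bob Jones
3,carol white
"""

ORDERS_CSV = """order_id,customer_id,amount
100,1,25.50
101,2,not available
102,3,
103,1,10
104,9,5.00
"""


def acquire(raw_text):
    """Collection: read one source into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def amalgamate(orders, customers):
    """Amalgamation: attach the customer name to each order via customer_id."""
    names = {c["customer_id"]: c["name"] for c in customers}
    joined = []
    for order in orders:
        row = dict(order)
        row["name"] = names.get(order["customer_id"])  # None if no match
        joined.append(row)
    return joined


def clean(rows):
    """Cleaning: drop rows that cannot be trusted and reformat the rest."""
    cleaned = []
    for row in rows:
        if row["name"] is None:
            continue  # order refers to an unknown customer
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric amount
        cleaned.append({
            "order_id": int(row["order_id"]),
            # Collapse stray whitespace and normalise capitalisation.
            "customer": " ".join(row["name"].split()).title(),
            "amount": amount,
        })
    return cleaned


def wrangle():
    """Run the three stages in order on the sample sources."""
    customers = acquire(CUSTOMERS_CSV)
    orders = acquire(ORDERS_CSV)
    return clean(amalgamate(orders, customers))
```

Calling wrangle() on the sample data keeps only the two orders that have both a known customer and a numeric amount. Dropping bad rows outright is just one possible policy; depending on the analysis, flagging or repairing them may be preferable.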
The net effect of wrangling is to improve accuracy and depth of analysis and insight.