What is Data Wrangling and Why Does it Take So Long?

Why is Data Wrangling Necessary?

  1. Analytic Base Table (ABT): The ABT is the table used for machine learning. Each row represents a unique entity (a person, product, claim, vehicle, batch, etc.), and the columns contain the inputs known about that entity as of a specific point in time: its attributes, history, and relationships with other entities. If the ABT will be used for supervised machine learning, it must also include outcomes that post-date its inputs; supervised learning then searches the table for consistent patterns that predict those outcomes. A minimal sketch of building such a table follows this list.
  2. De-normalized Transactions: Transactional information is used for organizational business operations such as:
  • Presentation of prior contact with a customer, including notes and the actions taken during past calls, to address concerns on a current call
  • A line item in a particular order, including information about the complete order and detailed product information
  • A dental record of a root canal and the historical and current x-rays for comparison, to ensure the tooth is ready for a crown
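
To make the ABT idea concrete, here is a minimal pandas sketch that rolls a de-normalized transaction log up to one row per customer and attaches an outcome that post-dates the inputs. The table, the column names (customer_id, order_date, amount, region), and the dates are all hypothetical, not from any particular system.

```python
import pandas as pd

# Hypothetical transaction log: one row per order line, de-normalized,
# with customer attributes repeated on every row.
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-01-20", "2023-03-01", "2023-03-15"]),
    "amount": [120.0, 80.0, 45.0, 60.0, 30.0],
    "region": ["west", "west", "east", "east", "east"],
})

snapshot_date = pd.Timestamp("2023-03-01")   # inputs must pre-date this
outcome_end   = pd.Timestamp("2023-06-01")   # outcomes observed after it

# Inputs: roll up each customer's history as of the snapshot date.
history = txns[txns["order_date"] < snapshot_date]
abt = history.groupby("customer_id").agg(
    n_orders=("amount", "size"),
    total_spend=("amount", "sum"),
    region=("region", "last"),
)

# Outcome: did the customer order again in the window after the snapshot?
future = txns[(txns["order_date"] >= snapshot_date) &
              (txns["order_date"] < outcome_end)]
abt["reordered"] = abt.index.isin(future["customer_id"]).astype(int)

print(abt)
```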

Data Wrangling Challenges

  1. Clarifying the use case: Because the data required depends completely on the question you are trying to answer, a data mart view is rarely adequate. To clarify the use case one must consider these questions: What subset of entities are relevant? Are you predicting the likelihood of an event or estimating a future amount? Within what time period and under what conditions? If the model results will drive an action, do you know the causal relationship between that action and the desired outcome?
  2. Obtaining access to the data: It is easiest if the data scientist or analyst secures permission to directly access the relevant data. Otherwise, they must submit precise instructions for “scrubbed” data, and hope their request will be granted and executed correctly. Navigating these policy boundaries is difficult and time-consuming.
  3. Resolving key entities: There must be no ambiguity about which entity is relevant. Although this sounds easy, it isn’t. Take, for example, the entity “customer”. Who, exactly, is a customer? There may be a Customer ID in a convenient table, but if a customer leaves and returns, they may get a new Customer ID. Are John Smith, John R. Smith, and J.R. Smith the same customer? If a person has a business account and a personal account, are they two customers or one? Can a person open multiple accounts to mask their identity? Are there other members of the household who use the same account? What about hierarchical organizations of customers? The list goes on and on (a small entity-resolution sketch follows this list).
  4. Identifying the extent and relationships of source data: This is where data warehousing best practices help considerably, especially if appropriate views are already constructed. Without that luxury, discovering how the natural key structures of the entities tie together can require significant data exploration, and the implied rules must then be verified by the data owners. Inevitably, analytic models need to know what the data looked like on a specific date in the past, so the data stores must support “historical snapshots”. And if the meaning of the data (i.e., its metadata) has changed over time, confidence in the analysis may drop off sharply (a point-in-time join sketch follows this list).
  5. Securing relevant treatment data: Since the purpose of a model is to decide which entities should be treated (that is, acted upon), the history of treatments taken is paramount. The treatment depends on the use case and can be anything from which customers to call based on their orders to whether to install a downhole pump to keep a gas well from freezing; the possibilities are endless! Usually, though, the treatment was conceived and implemented outside the domain of the other operational data, in documents and spreadsheets, for example. These dispersed files need to be fully compiled and integrated into the ABT (a small compilation sketch follows this list). Eventually, new processes should be established to store subsequent treatment records in the master data source for use by future analytics projects.
  6. Avoiding selection bias: Too often ignored until models fail, selection bias is a significant problem for data science, and remediating it can be difficult. The training sample must be representative of the population the model will score once it is implemented. Finding the right weights for bootstrapped sampling (an important technique for generating an unbiased sample) may require building a model just for that purpose (a weighted-resampling sketch follows this list).
  7. Feature engineering: Before the supervised learning phase, raw model inputs need to be transformed into features suitable for machine learning. This is the “art” of data science. Which inputs should be nominal and which continuous? How should nominal levels be combined? What deterministic relationships need to be encoded? Date discipline is paramount: safeguards are required to prevent features from leaking information from beyond the “snapshot date” of an observation. Every entity scored in production must also have a history equivalent to those of the entities used to train the model, and the outcome window must be the same length for every observation used in training (a snapshot-window sketch follows this list).
  8. Exploring the data: Correlation analysis should be completed before starting a model build. Before exploring relationships to the ultimate outcome, it is very helpful to remove redundancy in the data; skip this step and you will pay for it later. In wide files, sets of columns are often highly related, meaning they carry redundant information, which only makes feature and model selection more difficult. Such redundancy might show up as a high correlation coefficient, but sometimes it does not: if there are two color columns, one with Spanish names and one with English names, they contain identical, redundant information. Usually the redundancy is more subtle. Variable importance metrics will report both related inputs as important, yet the model will struggle to apportion weight between them. Failing to remedy this before modeling leads to confusing (and overly complex) models (a correlation-pruning sketch follows this list).
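
For challenge 3, a rough sketch of entity resolution using only the standard library’s difflib: normalize names, then flag record pairs that share an email or have highly similar names. The records, fields, and 0.8 threshold are illustrative assumptions; real pipelines typically add blocking keys and a purpose-built matching library.

```python
from difflib import SequenceMatcher

# Hypothetical customer records pulled from two source systems.
records = [
    {"id": "A-101", "name": "John Smith",    "email": "jsmith@example.com"},
    {"id": "A-204", "name": "John R. Smith", "email": "jsmith@example.com"},
    {"id": "B-330", "name": "J.R. Smith",    "email": "john.smith@work.example"},
]

def normalize(name: str) -> str:
    """Strip punctuation and case so trivially different spellings compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def likely_same(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Treat two records as one entity if emails match or names are highly similar."""
    if a["email"] == b["email"]:
        return True
    score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return score >= threshold

# Naive pairwise pass; at scale you would block on keys such as zip code or phone.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if likely_same(records[i], records[j]):
            print(records[i]["id"], "matches", records[j]["id"])
```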
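
For challenge 4, a sketch of a point-in-time (“as of”) join with pandas.merge_asof, which attaches to each observation the attribute value that was in effect on its snapshot date rather than today’s value. Table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical slowly-changing attribute table: each row records what a
# customer's plan looked like starting on effective_date.
plans = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "effective_date": pd.to_datetime(["2022-01-01", "2023-02-01", "2022-06-01"]),
    "plan": ["basic", "premium", "basic"],
})

# Observations, each pinned to its own snapshot date.
obs = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "snapshot_date": pd.to_datetime(["2023-01-15", "2023-03-01", "2023-03-01"]),
})

# merge_asof picks, per observation, the latest plan row in effect on or
# before the snapshot date: what the data looked like then, not today.
pit = pd.merge_asof(
    obs.sort_values("snapshot_date"),
    plans.sort_values("effective_date"),
    left_on="snapshot_date",
    right_on="effective_date",
    by="customer_id",
)
print(pit)
```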
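
For challenge 5, a sketch of sweeping dispersed treatment logs into one table. The treatments/ directory, file layout, and column names are assumptions; the point is simply to compile, deduplicate, and keep provenance before joining to the ABT.

```python
import glob
import pandas as pd

# Hypothetical: each team kept its own spreadsheet of treatments
# (calls made, offers sent, pumps installed) exported under treatments/.
files = glob.glob("treatments/*.csv")

frames = []
for path in files:
    df = pd.read_csv(path, parse_dates=["treatment_date"])
    df["source_file"] = path  # keep provenance for auditing
    frames.append(df[["customer_id", "treatment_date", "treatment", "source_file"]])

treatments = pd.concat(frames, ignore_index=True).drop_duplicates()

# Integration step (abt assumed to carry customer_id and snapshot_date):
# join the compiled treatments onto the ABT and keep only treatments that
# were applied before each observation's snapshot date, so inputs stay
# leak-free.
# merged = abt.merge(treatments, on="customer_id", how="left")
# merged = merged[merged["treatment_date"] < merged["snapshot_date"]]
```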
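
For challenge 6, a sketch of weighted bootstrap resampling: rows are weighted by the ratio of population share to sample share so the resample mirrors the mix the model will see in production. The segment shares here are invented, and in practice the weights may themselves come from a model, as noted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical training sample that over-represents one segment relative
# to the population the model will score.
train = pd.DataFrame({
    "segment": rng.choice(["retail", "wholesale"], size=1000, p=[0.9, 0.1]),
    "x": rng.normal(size=1000),
})

# Known (or estimated) segment shares in the scoring population.
population_share = {"retail": 0.6, "wholesale": 0.4}
sample_share = train["segment"].value_counts(normalize=True)

# Weight each row by population share / sample share, then bootstrap with
# those weights so the resample mirrors the population mix.
weights = train["segment"].map(population_share) / train["segment"].map(sample_share)
balanced = train.sample(n=len(train), replace=True, weights=weights, random_state=0)

print(balanced["segment"].value_counts(normalize=True))
```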
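
For challenge 7, a sketch of leak-safe feature construction: each observation’s inputs come only from events strictly before its snapshot date, and every observation gets an outcome window of identical length. All names and dates are hypothetical.

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2023-01-03", "2023-02-20", "2023-04-02", "2023-01-10", "2023-05-05"]),
    "amount": [50.0, 75.0, 20.0, 10.0, 90.0],
})

# One observation per customer, each with its own snapshot date.
obs = pd.DataFrame({
    "customer_id": [1, 2],
    "snapshot_date": pd.to_datetime(["2023-03-01", "2023-04-01"]),
})
outcome_window = pd.Timedelta(days=90)   # identical for every observation

rows = []
for _, o in obs.iterrows():
    # Features: only events strictly before the snapshot date (no future leak).
    past = events[(events["customer_id"] == o["customer_id"]) &
                  (events["event_date"] < o["snapshot_date"])]
    # Outcome: events inside a fixed-length window starting at the snapshot.
    future = events[(events["customer_id"] == o["customer_id"]) &
                    (events["event_date"] >= o["snapshot_date"]) &
                    (events["event_date"] < o["snapshot_date"] + outcome_window)]
    rows.append({
        "customer_id": o["customer_id"],
        "spend_before_snapshot": past["amount"].sum(),
        "purchased_in_window": int(len(future) > 0),
    })

print(pd.DataFrame(rows))
```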
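
For challenge 8, a sketch of redundancy screening: drop one of any pair of numeric columns whose absolute correlation exceeds a threshold, and detect categorical columns that encode the same information (the English/Spanish color example) even though a correlation coefficient would never reveal it. The data is synthetic and the 0.95 threshold is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)

# Hypothetical wide file: two nearly redundant numeric columns plus a pair
# of categorical columns that encode the same color in two languages.
df = pd.DataFrame({
    "height_cm": base * 10 + 170,
    "height_in": (base * 10 + 170) / 2.54 + rng.normal(scale=0.01, size=n),
    "weight_kg": rng.normal(70, 10, size=n),
    "color_en": rng.choice(["red", "blue"], size=n),
})
df["color_es"] = df["color_en"].map({"red": "rojo", "blue": "azul"})

# Numeric redundancy: flag one column of any pair with |corr| above 0.95.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("numeric columns to drop:", to_drop)

# Categorical redundancy: identical information despite zero string overlap,
# detectable because the mapping between the two columns is one-to-one.
levels_per_group = df.groupby("color_en")["color_es"].nunique()
print("color_es is redundant with color_en:", bool((levels_per_group == 1).all()))
```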

Is Data Wrangling Worth the Effort?
