Home  >  108 Data Analysis  >  108.10.20.20.10 Data Analysis - Raw data problems

Raw data problems

Cleaning and preparing data is always the most difficult part of an analysis.

There are so many ways that data can be messy. Here are just a few:

  • Don’t rely on a “first order date” column in the raw data. Just calculate it using MIN()
  • Different business units
  • Transactions that are returns but not marked as negative numbers
  • How do columns in raw data relate to each other? one-to-one? one-to-many?
  • Duplicated id’s
  • Duplicated rows - same as above
  • Transaction detail inserted into same table as rolled-up transaction info
  • Yearly contracts sometimes need to be split into 12 monthly transactions for analysis purposes
  • Outsized transaction amounts, much higher than should be possible
  • Dual key tables
  • Lack of proper id’s
  • Missing data
  • Slightly “off” data that can be merged/corrected
  • Product names change over time, causing errors of confusion in the data, making it look like customes switched products but did not
  • Product bundles change over time, making it look like customers changed plans significantly but it was the company’s change, not the customer’s.
  • Data structures change over time - e.g. they used to lump up-sell revenue in with primary products now they split it out into another table
  • Currency is in multiple foreign denominations
  • Trial customers are sometimes counted, sometimes not counted in outputs
  • The transaction files are roll-ups. They are not actual transactions on a timeline.
  • Some rows missing data in required columns

Graph: