Raw data problems
Cleaning and preparing data is always the most difficult part of an analysis.
There are so many ways that data can be messy. Here are just a few:
- Don’t rely on a “first order date” column in the raw data. Just calculate it using
MIN()
- Different business units
- Transactions that are returns but not marked as negative numbers
- How do columns in raw data relate to each other? one-to-one? one-to-many?
- Duplicated id’s
- Duplicated rows - same as above
- Transaction detail inserted into same table as rolled-up transaction info
- Yearly contracts sometimes need to be split into 12 monthly transactions for analysis purposes
- Outsized transaction amounts, much higher than should be possible
- Dual key tables
- Lack of proper id’s
- Missing data
- Slightly “off” data that can be merged/corrected
- Product names change over time, causing errors of confusion in the data, making it look like customes switched products but did not
- Product bundles change over time, making it look like customers changed plans significantly but it was the company’s change, not the customer’s.
- Data structures change over time - e.g. they used to lump up-sell revenue in with primary products now they split it out into another table
- Currency is in multiple foreign denominations
- Trial customers are sometimes counted, sometimes not counted in outputs
- The transaction files are roll-ups. They are not actual transactions on a timeline.
- Some rows missing data in required columns
Graph:
- 108.10.20.20.10 Data Analysis - Raw data problems to 108.10.20.20 Data Analysis - Clean and map data
- 108.10.20.20 Data Analysis - Clean and map data to 108.10.20.20.10 Data Analysis - Raw data problems