GIGAMIND

Folder:
108 Data Analysis
File:
108.10.20.20.20 Data Analysis - Describe the data in words

Describe the data in words

Nearly every time a new data set comes in, it will be completely different, completely brand new, and not structured similar to anything you've ever seen before.

In order to do a proper data analysis, however, this data needs to be completely understood. When there are multiple people working on an analysis, it needs to be completely understood by every person. This is not an easy task, for a few reasons.

The first reason it is difficult is because the data will be different than what has been seen before. The second reason is that whichever person is in charge of 108.10.20 Data Analysis - Step 2 ingest clean validate will, by default, have the best understanding because of the nature of the work.

One way to get everybody on the same page is to document the data at different stages of cleaning. Each column of the data will have a specific format and purpose. It is helpful to document our understanding of these things so that
- the next person who looks at the data will understand it
- helpful if you need to come back in the future to look at the data again
- helpful if you get an updated data set and need to map the new data in with the old data

E.g.

customer_id             Unique customer identifier. Foreign key to customer table
store_id                Unique store identifier. Foreign key to store table
universal_indicator     Flag to indicate if a customer is active both online and offline
warning_flag            Unused column, previously used to indicate that customer was about to churn
missing_data            Flag to indicate if the transaction record is incomplete and should be ignored
amt                     Total amount of order before taxes
amtat                   Total amount of order after taxes are added

As it is easy to see from this example, there are often column names that only make sense in-house. There are often columns that don't matter for our analysis. There are often columns that are ambiguous, incomplete, bad format, etc etc. The only way to know this, and the only way to gain a shared understanding of these issues is to take the time to document.


Source:
  • Me