Home > 108 Data Analysis > 108.10.20.20.20 Data Analysis - Describe the data in words

Describe the data in words

Nearly every time a new data set comes in, it will be completely different, completely brand new, and not structured similar to anything you’ve ever seen before.

In order to do a proper data analysis, however, this data needs to be completely understood. When there are multiple people working on an analysis, it needs to be completely understood by every person. This is not an easy task, for a few reasons.

The first reason it is difficult is because the data will be different than what has been seen before. The second reason is that whichever person is in charge of 108.10.20 Data Analysis - Step 2 ingest clean validate will, by default, have the best understanding because of the nature of the work.

One way to get everybody on the same page is to document the data at different stages of cleaning. Each column of the data will have a specific format and purpose. It is helpful to document our understanding of these things so that - the next person who looks at the data will understand it - helpful if you need to come back in the future to look at the data again - helpful if you get an updated data set and need to map the new data in with the old data

E.g. customer_id Unique customer identifier. Foreign key to customer table store_id Unique store identifier. Foreign key to store table universal_indicator Flag to indicate if a customer is active both online and offline warning_flag Unused column, previously used to indicate that customer was about to churn missing_data Flag to indicate if the transaction record is incomplete and should be ignored amt Total amount of order before taxes amtat Total amount of order after taxes are added

As it is easy to see from this example, there are often column names that only make sense in-house. There are often columns that don’t matter for our analysis. There are often columns that are ambiguous, incomplete, bad format, etc etc. The only way to know this, and the only way to gain a shared understanding of these issues is to take the time to document.

Graph:

108.10.20.20.20 Data Analysis - Describe the data in words to 108.10.20.20 Data Analysis - Clean and map data
108.10.20.20.20 Data Analysis - Describe the data in words to 108.10.20 Data Analysis - Step 2 ingest clean validate
108.10.20.20 Data Analysis - Clean and map data to 108.10.20.20.20 Data Analysis - Describe the data in words

Travis Giggy

Describe the data in words

Graph: