GIGAMIND

Folder:
108 Data Analysis
File:
108.40.20.20 Data Analysis - Raw data understanding and summary statistics

Raw data understanding and summary statistics

The quality of data is always the most important factor for success in a data analysis project. It is usually also the hardest step of the process which takes the most time as well. There are two key moments when we need to understand the quality and shape of the data:

  1. When we get a fresh data set in, the first thing that needs to be done is to understand the high level shape and content of this data.
  2. When we finish cleaning and mapping the data into a format we can analyze, we must confirm via 108.10.20.30.10 Data Analysis - External data validation that it is correct.

Do the minimum amount of exploration to get comfort with the answer to these questions:
- Do we understand what each column and file of data represents? This is important because we are going to make a lot of assumptions based on this data. It's a good idea to schedule a call with the client to make sure we have a thorough understanding of the data format.
- The 108.10.10.10 Data Analysis - Client working relationship is important.
- Will this data be sufficient to answer the questions we proposed in 108.10.10 Data Analysis - Step 1 smart questions and proper data?
- If not, what other data is needed? Reach out to get the data you need.

Do these things to get a feel for the data

Here are some low hanging fruit, but not necessarily a complete list, for data summaries:

  • Check the packaging - # rows, # columns, data types for each column
  • Look at the top and bottom of your data - first and last 20-ish rows
  • Check your counts (group by, sum, make sure you have what you expected to have for each column)
  • Validate with at least one external data source 108.10.20.30.10 Data Analysis - External data validation
  • Make a plot

For customer-specific data sets, check these things:

All of the below, overal rollup and by major segments like product and year. When possible, check these numbers against an 108.10.20.30.10 Data Analysis - External data validation
- Total distinct customers
- Total transactions
- Total revenue

A good reference on summary statistics

Might want to take this course to get a better understanding, but the free video is pretty good
https://www.coursera.org/lecture/big-data-machine-learning/data-exploration-through-summary-statistics-ZLttd


Source:
  • Me