
An idea for raw data mapping and validation

The high level

  1. Import the raw data, in its raw format, into a database.
  2. Use Python or another tool to pull column names, data types, and summary statistics (see the sketch after this list).
  3. Write documentation for what each column means.
  4. Map the columns - Which column represents the client id? Which column is the total price? Which columns represent segments (and how are those segments to be grouped)? Etc.
  5. Use test suites to verify. Are there duplicate transactions? Are there unknown data types? How many rows don't conform? Etc. (The same sketch includes a few such checks.)
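
A minimal sketch of steps 2 and 5, assuming the raw data has already been landed in a table reachable through SQLAlchemy. The connection string, the table name raw_transactions, and the total_price column used in the checks are placeholders, not the real schema.

```python
# Sketch of steps 2 and 5: profile the raw table, then run a few sanity checks.
# The connection string, table name, and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/rawdata")   # placeholder connection
df = pd.read_sql_table("raw_transactions", engine)         # placeholder table name

# Step 2: column names, inferred data types, and basic summary statistics
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "n_unique": df.nunique(),
})
print(profile)
print(df.describe(include="all").transpose())

# Step 5: a few example checks - duplicates and rows that don't conform
price = pd.to_numeric(df["total_price"], errors="coerce")  # placeholder column name
checks = {
    "duplicate_rows": int(df.duplicated().sum()),
    "unparseable_total_price": int(price.isna().sum() - df["total_price"].isna().sum()),
    "negative_total_price": int((price < 0).sum()),
}
print(checks)
```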

Some more thoughts on the same topic from other locations

What if we were required to generate the Level Set outputs programmatically? This is probably very similar to what I've already done in Gscript for Gem and other analyses last year: it stores all the SQL required to create every single level set output, which is executed at the same time and put into a format to be imported into the Google Sheet.

  - This mindset of doing all the SQL up front prevents it from being done ad hoc and allows things like segments and error checking to happen at the time of output generation.
  - Doing an entire level set data pull at the same time, every time, prevents errors of forgetfulness. In one example, during a project at Lucky8 we were getting different data for a specific column after we received a data update from the company. It turned out that the first time I cleaned the data I forgot to run one of the minor steps of the cleaning process, and the second time I did not, which caused a lot of headaches. If every single step of the Transform process were coded to run in a sequence, this would not happen.
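
A rough sketch of that mindset, assuming a SQLAlchemy-style connection. The query names, the SQL bodies, and the output paths are hypothetical stand-ins for whatever the real level set requires.

```python
# Sketch: keep every level-set query in code and run the whole set in one pass,
# so no Transform or output step can be skipped or run ad hoc.
# The connection string, query names, and SQL bodies are hypothetical placeholders.
import os

import pandas as pd
from sqlalchemy import create_engine

LEVEL_SET_SQL = {
    # output name -> SQL that produces that level-set output
    "revenue_by_cohort": "SELECT cohort, SUM(total_price) AS revenue "
                         "FROM clean_transactions GROUP BY cohort",
    "orders_by_segment": "SELECT segment, COUNT(*) AS orders "
                         "FROM clean_transactions GROUP BY segment",
}

def run_level_set(engine, out_dir="level_set_outputs"):
    """Execute every stored query in a fixed order and write one CSV per output,
    ready to be imported into the Google Sheet."""
    os.makedirs(out_dir, exist_ok=True)
    for name, sql in LEVEL_SET_SQL.items():
        frame = pd.read_sql_query(sql, engine)
        frame.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)

if __name__ == "__main__":
    run_level_set(create_engine("sqlite:///analysis.db"))  # placeholder database
```

Because the whole dictionary runs every time, a forgotten step shows up as a missing or wrong query in the code rather than as silently different numbers after a data refresh.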

At Lucky8, I had a normalized data structure that was the same every time - a good idea in concept but difficult in practice. When I was doing analysis last year, I switched to a more flexible data structure but tried to implement structure on the code side instead: /Users/travis/Documents/Great_Heights/cohort-analysis/technical/hotroc_flask/scripts/resolve/script_insert_cohort_data_resolve.py. This worked well: it stored the level set SQL in code so it can be replicated, generated the necessary SQL at run-time, and seemed adaptable to new data sets. This is worth pursuing further.

  - The mapping of new data sets into our core structure was what kept me up at night, and it often made me dislike my job.
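
One hedged idea of what "structure on the code side" could look like for the mapping problem (step 4 above): each new data set gets an explicit, reviewable mapping into the core fields, and the code refuses to load anything that does not map cleanly. The core field names and the lucky8_export entry below are made-up examples, not the actual cohort-analysis schema.

```python
# Sketch: declare how each raw data set maps onto the core structure, then apply
# and verify that mapping in one place. Field and column names are illustrative only.
import pandas as pd

CORE_FIELDS = ["client_id", "transaction_date", "total_price", "segment"]

# One mapping per incoming data set: raw column name -> core field name
RAW_TO_CORE = {
    "lucky8_export": {
        "CustomerNumber": "client_id",
        "InvoiceDate": "transaction_date",
        "GrossAmount": "total_price",
        "Channel": "segment",
    },
}

def map_to_core(raw: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename raw columns to the core field names and fail loudly if any are missing."""
    mapped = raw.rename(columns=RAW_TO_CORE[source])
    missing = [f for f in CORE_FIELDS if f not in mapped.columns]
    if missing:
        raise ValueError(f"{source} is missing core fields: {missing}")
    return mapped[CORE_FIELDS]
```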

