Recently I was on a data science/investigation project. In short, it’s a complex project but I used some basic techniques to decompose it, and found some data gold. Here’s what I learned…
Be Friends with the Data
You need to know your data and be friends with your data. You will spend time, not just time but quality time, with your data (just like friends do). Like any other friendships, you will know the characteristics and personality of your data. You may discover some “data gold” in your data. In the meantime, you may find data quality issues that you either fix or avoid.
Five Data Techniques that I Used to Befriend my Data
Trust me, you need a whiteboard, or a large piece of paper, to write down your findings and to make sense of it. I prefer a whiteboard because I am able to see the big picture by stepping away from it. In my project, I mocked up the data records on the whiteboard to visualize how I can overlay datasets and how to summarize the data to find trends. I also make a flow chart to understand how this is being processed and transformed during the business workflow. I found some data gold in the flow chart and later on I was able to formulate some hypotheses because of that.
Diagram 1: A whiteboard diagram that captures the flow of data from end to end. I was able to pinpoint places where things may go wrong (I called them data gold in the process (see the orange boxes)). I also used it to plan how to provision the end results in Hortonworks (see the blue area).
- Scope your Date Range
The dataset has 39 months of data, which is an odd number. To make it simple, I defined a data range so that I am analyzing 3 full years of data; I made Jan 1, 2015 as the start date and Dec 31, 2017 as the end date. With this approach, date buckets are defined in a very user-friendly fashion, as they are terms that are very easy to understand and everyone can relate to those date buckets. For example, in my analysis, with the years clearly defined, I am able to create a year dimension in the analysis and the users can compare different years by themselves (and I don’t need to do as much explaining).
- Visualize the Distribution of Key Fields
DO NOT visualize all fields! Pick your battles here! I want to know the distinct values of key fields and their distribution. Fields with 3 to 10 distinct values are perfect candidates for grouping and aggregation. Distribution is very useful; I can see if the data in the field is skewed. For example, I saw right-skewness in some fields and that gives me insights on how to filter out the extremes, and gives me an idea when to use average verse median.
Diagram 2: A histogram showing the distribution of a data element. The data element is highly skewed so I used medium to describe the data. Moreover, the traditional 80/20 rule is not very effective in detecting the extreme values.
- Dumbed-Down ER diagram
A dumbed-down ER diagram is a good way to visualize “things” and their relationships in your dataset (things are entities like people, accounts, events, stores, products, etc). In my case, my dataset has 6 entities. I used the ER diagram to capture what they are, and how they are related to each other, as well as if the relationships are mandatory. This is very helpful because I can analyze and summarize the data at different levels, and I know how to debug any data issues by examining different levels.
- Know thy Data Granularity
This is a MUST if you are working on datasets with mixed grains. In my case, the activity data that I got is very robust. Too robust, in that it has different records with totally different granularity. I need to understand what each activity record means – does it mean the start of an activity? The duration of an activity? Or a series of related activities in a parent and child relationship? For me, having an understand of data granularity allows me to process different data of different data grains accordingly. I actually separate the activity records into separate HIVE tables and I treat them differently, in other words, I normalize the data to some degree, so that they are their own entities.
Diagram 3: This screenshot shows a mix of data grains in the source data. Records in pale yellow are point-in-time records. Records in orange are duration records with a start and end date.
In part 2 of the series I will explain the techniques I used to make the analysis approachable and digestible for business sponsors/executives. Stay tuned.