Blog Series – Simplifying Data Science (Part 2)

Recently I worked on a data science/investigation project. In short, it was a complex project, but I used some basic techniques to decompose it and found some data gold. Here’s what I learned…

This is part 2 of a blog series that shares some of the lessons I learned from my data science/investigation project. Feel free to take a look at part 1 of the series.

I used these 5 simple techniques:

80/20 and Scoring

Executives like 80/20 and scoring; in fact, everyone knows and is familiar with the concepts. Everyone knows that if someone is in the top 20% of the class, s/he must be at least an A- student (meaning one of the top students). I use the same concept in my data science. The execs want to know “what is high” and “what is abnormal”. In my analysis, I ranked the customers associated with the suspicious activities and identified the top offenders by picking the top 20%. I went one step further: I found that a customer has to have 3 abnormal activity records per month to qualify for the top 20%. Combined with the median of 0.4 abnormal activities per month, this painted a powerful picture for the execs to visualize “what is high” and “what is abnormal”. The typical (median) customer has 0.4 abnormal activities per month; the top offenders have 3 or more. When you see people with 3 or more offenses, be aware: they are doing something outside the normal range.
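To make the 80/20 cut concrete, here is a minimal sketch in Python. The customer ids and monthly counts are made up for illustration (they are not the project's data), and the helper name top_offender_cutoff is hypothetical. It ranks customers by abnormal records per month, takes the top 20%, and reports the entry threshold next to the median.

```python
import math
import statistics


def top_offender_cutoff(monthly_counts, top_share=0.20):
    """Return (cutoff, median) for a dict of customer -> abnormal records per month.

    cutoff is the lowest value that still falls inside the top `top_share`
    of customers when ranked from highest to lowest.
    """
    ranked = sorted(monthly_counts.values(), reverse=True)
    n_top = max(1, math.ceil(len(ranked) * top_share))
    cutoff = ranked[n_top - 1]
    return cutoff, statistics.median(ranked)


# Hypothetical data: average abnormal activity records per month, per customer.
monthly_counts = {
    "c01": 5.0, "c02": 3.0, "c03": 1.0, "c04": 0.5, "c05": 0.4,
    "c06": 0.4, "c07": 0.4, "c08": 0.2, "c09": 0.1, "c10": 0.0,
}

cutoff, median = top_offender_cutoff(monthly_counts)
top_offenders = sorted(c for c, v in monthly_counts.items() if v >= cutoff)

print(f"median: {median}, top-20% cutoff: {cutoff}, top offenders: {top_offenders}")
```

With these made-up numbers, the two figures an exec needs ("normal" is the median, "high" is the cutoff) fall out of a single ranking.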

I further identified four other factors that are indicators of offenders. I normalized each factor so that the top offenders get a normalized score of 100 while the least offending customers get a score of 1. Again, I am using a common-sense concept: you get a score from 1 to 100 that represents how likely a customer is to engage in suspicious activities. At the end, I combined the scores for all five factors and normalized the result from 1 to 100 as the final score. The beauty of this design is that the end users can change the weighting of the scores on the fly by defining a calculated column in Excel or in their favourite BI tool. The executives totally got the concept. In fact, my sponsor was able to use the same simple concepts to sell it to the EVPs.
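The scoring idea can be sketched as follows. This assumes simple min-max scaling and a weighted sum, which is one plausible reading of "normalize" and "combine" and not necessarily the exact method used in the project. The factor names and values are hypothetical, and only two factors are shown to keep it short.

```python
def scale_1_to_100(values):
    """Min-max scale a list of numbers so the lowest maps to 1 and the highest to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All customers look the same on this factor; give everyone the floor score.
        return [1.0 for _ in values]
    return [1 + 99 * (v - lo) / (hi - lo) for v in values]


def final_scores(factors, weights=None):
    """Scale each factor to 1..100, combine with weights, then rescale to 1..100.

    factors: dict of factor name -> list of raw values (one per customer, same order).
    weights: dict of factor name -> weight; defaults to equal weights.
    """
    names = list(factors)
    weights = weights or {name: 1.0 for name in names}
    scaled = {name: scale_1_to_100(factors[name]) for name in names}
    n_customers = len(factors[names[0]])
    combined = [
        sum(weights[name] * scaled[name][i] for name in names)
        for i in range(n_customers)
    ]
    return scale_1_to_100(combined)


# Hypothetical raw factor values for three customers.
factors = {
    "abnormal_per_month": [0.0, 0.4, 3.0],
    "other_factor": [10, 20, 30],
}

scores = final_scores(factors)
reweighted = final_scores(factors, {"abnormal_per_month": 3.0, "other_factor": 1.0})
print(scores, reweighted)
```

Because the weights are just multipliers on already-scaled columns, the same re-weighting can be done by an end user as a calculated column on the per-factor scores.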

Group Data

Executives love groupings. For me, grouping is a must. I cannot analyze a scatter of data; I can only analyze data that is grouped logically. My recommendation is to work with 3 to 7 groups. Anything above that will most likely give everyone (both you and the end users) a headache. Anything below that leaves each group too diluted to tell you much. Knowing the top offenders from step one helps to visualize them, but we don’t really know the inner workings of those people. My plan for the next version of the data analysis is to use some kind of classification or clustering method to put the top offenders into logical buckets. At the end of the day, I want to tell the execs that the top offenders do not come from one gigantic group: there are X gangs among those top offenders.

There is some confusion about classification vs clustering. Here’s a very good post to visualize the difference between classification and clustering. In short, classification involves supervised learning, in which you label the data with different classes and then use an algorithm to learn rules that classify future data into those classes. Clustering is unsupervised learning: you use the algorithm to put members into groups based on similarity. The algorithm may find 3 groups or it may find 10, depending on how you set the “catchiness”, that is, how similar members must be to land in the same group.
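A toy sketch of the difference, using only the Python standard library. The nearest-centroid classifier and the threshold-based "leader" clustering below are stand-ins chosen for brevity, not the methods planned for the project, and the points and labels are invented. The threshold parameter plays the role of the “catchiness” setting.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def train_nearest_centroid(labeled):
    """Supervised: learn one centroid per label from (point, label) pairs."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}


def classify(model, point):
    """Assign a new point to the label whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], point))


def leader_cluster(points, threshold):
    """Unsupervised: a point joins the first group whose first member (the
    "leader") is within `threshold`; otherwise it starts a new group."""
    groups = []
    for p in points:
        for group in groups:
            if math.dist(group[0], p) <= threshold:
                group.append(p)
                break
        else:
            groups.append([p])
    return groups


# Classification: the classes are given up front via labels.
labeled = [((0, 0), "low"), ((0, 1), "low"), ((5, 5), "high"), ((5, 6), "high")]
model = train_nearest_centroid(labeled)
print(classify(model, (1, 1)), classify(model, (4, 5)))

# Clustering: no labels; the number of groups depends on the threshold.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
for threshold in (0.5, 2.0, 8.0):
    print(threshold, len(leader_cluster(points, threshold)))
```

The same six points come out as six, three, or two groups depending on the threshold, which is why the 3-to-7 groups guideline usually means tuning that setting instead of accepting the first result.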

Profiles and Examples

Executives see the world through profiles and examples. In my analysis, I pulled two sets of profiles and examples. I did that for two reasons. First, having two sets of profiles and examples allows the execs to compare and contrast them, which helps to communicate the data analysis. Second, the stats in the two profiles are solid proof that there are at least two different groups of customers in the customer universe. I prepared examples to defend my analysis, and they were especially useful. At some points in my analysis, the execs asked me: “Can you show me the data?” With the examples handy, I was able to pull them up and show why the scoring model thinks a given customer is a top offender. I even went one step further: I extracted all activities for those examples and presented them in chronological order. The sorted activities serve as an audit trail that shows what the customer did over time and what makes that particular customer a suspect. Execs, like everyone, are “see it to believe it” people, so be prepared to show and tell your thoughts.
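The audit trail amounts to a filter and a sort. A minimal sketch, assuming each activity is a record with a customer id, a date, and a description; these field names and the sample rows are hypothetical.

```python
from datetime import date


def audit_trail(activities, customer_id):
    """Return one customer's activities in chronological order."""
    rows = [a for a in activities if a["customer_id"] == customer_id]
    return sorted(rows, key=lambda a: a["date"])


# Hypothetical activity records, deliberately out of order.
activities = [
    {"customer_id": "c01", "date": date(2017, 3, 9), "event": "activity C"},
    {"customer_id": "c02", "date": date(2017, 1, 5), "event": "activity X"},
    {"customer_id": "c01", "date": date(2017, 1, 20), "event": "activity A"},
    {"customer_id": "c01", "date": date(2017, 2, 14), "event": "activity B"},
]

trail = audit_trail(activities, "c01")
for row in trail:
    print(row["date"].isoformat(), row["event"])
```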

Present like Partner Content

Executives like catchy headlines. Make sure you structure your presentation like the “partner content” found on news sites (for example, scroll down on a news site to the area labeled “partner content” or “paid content”). “Partner content” has two basic components: a catchy title and some kind of infographic or photo. With that in mind, I structured my final presentation around a one-pager that summarizes my findings. The one-pager uses the same concept as “partner content”: I used catchy bullet points and a simple infographic to communicate the message. In fact, my sponsor used the same one-pager in his meeting with the EVPs. He did have a slide deck handy with some detailed information, but for the most part, they spent their time on the one-pager.

Create a “Big Flattened Table” for Self-service

Executives like to kick the tires. They like to see the data for themselves and think it through. That is why I packaged all the results and summarized them into two flattened tables in Hive. The two self-service tables contain the same data. The only difference is that the first table contains aggregates whereas the other contains the granular details. In the end, the execs created filters, pivot tables and charts on top of the two tables to understand the results and build trust in them. They also asked their analysts to do additional data analyses in Alteryx on the two flattened tables. After that, they had a greater appreciation for the analysis.
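The relationship between the two tables can be illustrated in a few lines of Python. The actual tables lived in Hive, so this is only a sketch of the idea, and the column names are hypothetical. One flat table holds a row per activity; the other is the same data rolled up to a row per customer.

```python
from collections import defaultdict


def build_aggregate(detail_rows):
    """Roll granular activity rows up to one flat row per customer."""
    totals = defaultdict(lambda: {"activity_count": 0, "abnormal_count": 0})
    for row in detail_rows:
        agg = totals[row["customer_id"]]
        agg["activity_count"] += 1
        agg["abnormal_count"] += int(row["is_abnormal"])
    return [{"customer_id": cid, **agg} for cid, agg in sorted(totals.items())]


# Hypothetical granular table: one flat row per activity.
detail_rows = [
    {"customer_id": "c01", "activity": "A", "is_abnormal": True},
    {"customer_id": "c01", "activity": "B", "is_abnormal": False},
    {"customer_id": "c01", "activity": "C", "is_abnormal": True},
    {"customer_id": "c02", "activity": "X", "is_abnormal": False},
]

aggregate_rows = build_aggregate(detail_rows)
for row in aggregate_rows:
    print(row)
```

Keeping both levels flat (no nested structures, no joins required) is what lets end users point a pivot table or BI filter at either table directly.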

I am getting some questions on how to ensure a data analysis/investigation is relevant to the business. In part 3 of the series I will explain the approach I used to make sure the analysis is business-driven. Stay tuned.
