I’ve been intrigued by the open source statistical language R for a few years. Part of my job when I worked in sales at Google was large-scale data analysis, for which I used MySQL. But the auction statisticians used R to understand the dynamics of the money-printing machines, AdWords and AdSense.
I wanted to get a better sense of the language and its power but couldn’t find the right data set to analyze. Two weeks ago, one of my portfolio companies, Expensify, provided me access to their data and I kicked off my journey into the world of data science.
I had posed many questions in board meetings and had wondered separately about the relationships between different factors in the business. Expensify is a SaaS business selling expense management solutions. It’s a metrics-rich business, and the company had a great analytics database that I used.
My Setup
I used MySQL over SSH to pull data from the Expensify database. Then I dumped this data to a CSV file. Lastly, I loaded the CSV into RStudio, an Eclipse-like IDE (integrated development environment) for R. This worked great.
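For readers who haven't loaded a CSV into R before, here is a minimal sketch of that last step. The file contents and column names (customer_id, plan, revenue) are invented for illustration and are not Expensify's schema; the sketch writes a tiny stand-in file first so it runs on its own.

```r
# Hypothetical export: the column names and values below are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "customer_id,plan,revenue",
  "1,free,0",
  "2,team,450",
  "3,corporate,2400"
), csv_path)

# read.csv() returns a data frame, R's table-like structure.
customers <- read.csv(csv_path, stringsAsFactors = FALSE)

str(customers)      # column names and types at a glance
summary(customers)  # quick per-column summary statistics
```

With a real export, csv_path would simply point at the dumped file.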
Analyses I performed
Next, I wrote down all the questions I had and started digging through the data. I was looking to answer these questions:
- Customer segmentation: What are the major customer segments we have and how large are they? How do usage patterns differ across small, medium and large customers?
- Feature use analysis: How many receipts does the average user upload? How many receipts are in the average expense report? How many expense reports does each user file per month?
- Conversion funnel: How long does it take for 50%, 75% and 90% of new users to convert to paid? What are the triggers for a user to convert to paid?
- Customer support: What is the impact of customer support on conversion funnel? What are the characteristics of customers who require little support but generate large revenues?
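Several of these questions map onto one or two base R calls. The sketch below uses made-up numbers, hypothetical column names (employees, receipts, days_to_convert) and arbitrary size thresholds, so it shows the shape of the analysis rather than any real result. It also assumes every user in the table eventually converted, which a real funnel analysis could not take for granted.

```r
# Made-up values for illustration only; not Expensify data.
users <- data.frame(
  employees       = c(3, 12, 250, 40, 8, 1200, 75, 5),
  receipts        = c(4, 20, 310, 55, 9, 900, 80, 6),
  days_to_convert = c(2, 10, 45, 14, 5, 60, 21, 7)
)

# Customer segmentation: bucket companies by size (thresholds are assumptions).
users$segment <- cut(users$employees,
                     breaks = c(0, 10, 100, Inf),
                     labels = c("small", "medium", "large"))
table(users$segment)  # how many customers fall in each segment

# Feature use: average receipts per user within each segment.
aggregate(receipts ~ segment, data = users, FUN = mean)

# Conversion funnel: days by which 50%, 75% and 90% of users had converted.
quantile(users$days_to_convert, probs = c(0.5, 0.75, 0.9))
```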
The question I was most curious about in my investigation was what advantages R has over MySQL. Sure, R has lots of fancy statistical components, but most startups don’t really need this firepower. I was surprised to discover that R has three big advantages over MySQL.
First, I could filter data much more quickly in R than in MySQL. Instead of writing massive WHERE clauses in MySQL, you can subset a data frame using syntax like customers[customers$revenue > 1000, ]. It’s simple.
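A minimal sketch of this kind of filter on a made-up data frame (the name, revenue and seats columns are hypothetical). Note that numeric literals in R take no thousands separator, so the threshold is written 1000.

```r
# Made-up customers for illustration.
customers <- data.frame(
  name    = c("a", "b", "c", "d"),
  revenue = c(250, 1800, 999, 4200),
  seats   = c(2, 15, 8, 60)
)

# Rows where the condition is TRUE, all columns (note the trailing comma).
big <- customers[customers$revenue > 1000, ]

# subset() expresses the same idea and reads well with several conditions.
big_teams <- subset(customers, revenue > 1000 & seats >= 20)

big
big_teams
```

The logical vector inside the brackets plays the role of the WHERE clause, and conditions can be combined with & and |.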
Second, I could get a sense of the data much faster using things like correlation matrices (there’s a graphical example below), which just aren’t supported in MySQL. R gives you the power and flexibility of Excel’s pivot tables, but applied to MySQL-scale data sets.
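As a rough illustration, cor() computes the pairwise correlations for every numeric column of a data frame in one call. The data here is simulated and the column names are hypothetical; reports is deliberately built from receipts so that one pair comes out clearly correlated.

```r
set.seed(42)  # make the simulated data reproducible
n <- 200
receipts <- rpois(n, lambda = 20)

# Simulated usage metrics; none of this is real customer data.
metrics <- data.frame(
  receipts        = receipts,
  reports         = receipts / 5 + rnorm(n),  # depends on receipts by construction
  support_tickets = rpois(n, lambda = 2)      # generated independently
)

# Pearson correlation for every pair of columns, rounded for readability.
corr <- cor(metrics)
round(corr, 2)
```

The result is a symmetric matrix with ones on the diagonal, which makes it quick to scan for pairs worth a closer look.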
Last, R’s visualizations communicated the trends much better than a MySQL table or Excel chart ever could. There are many kinds (marimekko, sankey, and even the basic xy plot). I’ve shown you two of my favorites below.
Visualization is one of the best parts of R because it lets you see instantly where the relationships and correlations are. Here I used a venture capital dataset to inform last week’s blog post on the 10 data points about the VC market. I was trying to figure out where the relationships are across 10 variables: money raised, money invested, number of investments, S&P close, S&P annual change/performance, IPOs (number and value), M&A (number and value).
The first chart, the correlation matrix, shows the kind of relationship and the second, the heatmap, shows the strength of the linear relationship with whiter blocks indicating higher correlations. Each of these was one command in R.
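The post doesn't name the two commands, so the following is only one plausible pairing from base R: pairs() for a scatterplot matrix and heatmap() on the output of cor(). The data is simulated and the column names are stand-ins for the VC variables. Passing heat.colors() as the palette is an assumption chosen so that lighter cells mean higher correlation, as in the description above.

```r
set.seed(1)
n <- 30
market <- rnorm(n)  # shared driver so that some columns move together

# Simulated stand-in for the VC dataset; not real market data.
vc <- data.frame(
  money_raised   = 20 + 5 * market + rnorm(n),
  money_invested = 25 + 6 * market + rnorm(n),
  sp_close       = 1000 + 150 * market + rnorm(n, sd = 100),
  ipo_count      = 50 + rnorm(n, sd = 10)
)

# When run as a script, draw to a null device instead of opening a window.
if (!interactive()) pdf(NULL)

# Chart 1: scatterplot matrix, one panel per pair of variables.
pairs(vc)

# Chart 2: heatmap of the correlation matrix; scale = "none" keeps raw values.
hm <- heatmap(cor(vc), symm = TRUE, scale = "none", col = heat.colors(12))

if (!interactive()) invisible(dev.off())
```

By default heatmap() also reorders rows and columns by clustering, so related variables end up next to each other.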
If you’re hiring a data scientist, make sure they can use R. If you’re looking to understand the trends in your business better, learn R. If you have nothing better to do this weekend, learn R.
In just a few weeks, I’ve become a huge believer in the power of the platform to deeply understand customer trends and patterns. The best businesses are metrics-oriented, and having this competency in-house is incredibly powerful.
Flowing Data – Nathan Yau, who used to work on the New York Times visualization team, shows how visualizations are best done. Great inspiration for communicating with data.
The Art of R Programming – Fantastic overview and deep explanations of R. A great manual. It even includes a discussion of performance questions.
RStudio – a good IDE for R programming. Feels a lot like Eclipse.