A little data analysis project I worked on together with Chris Eidhof. Here we show how to set up a database connection with RMySQL, analyze the data in RStudio, and finally visualize the results with the package ggplot2.
“The ideal trivium education, and the least harmful one to society and pupils, would be mathematics, logic, and Latin; a double dose of Latin authors to compensate for the severe loss of wisdom that comes from mathematics; just enough mathematics and logic to control verbiage and rhetoric.”
~Nassim Nicholas Taleb, The Bed of Procrustes
Most of what qualifies as research-oriented business intelligence comes down to identifying (previously unknown) patterns. Typically an analyst or business user sees an anomaly and drills down through charts and tables to investigate, ideally identifying the cause at as granular a level as possible. Subsequently, action is taken. However, how scientifically correct is this approach?
So say that for an online toy store we’ve identified that sales for a certain toy are particularly high for Chinese women aged 20-25. How can we know whether this is a regularity or merely happened by chance? The answer is that we cannot know for certain based on the current data alone. At this point it is still a hypothesis, inspired by a merely exploratory analysis. We would need to set up an experiment to test it. Note that we cannot use our historical data for this test, as this data generated our hypothesis in the first place. One way to run the experiment is to compare, from now until a certain date in the future, the sales of the toy for Chinese women aged 20-25 with the sales for (a sample of) the rest of the population.
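To make the comparison concrete, here is a minimal sketch of how such a test could look once the new data is in. It assumes we count, for the experiment period only, how many visitors in each group bought the toy, and it uses a one-sided two-proportion z-test, which is just one reasonable choice and not the only one. The function name and all numbers are hypothetical and purely for illustration.

```python
import math


def two_proportion_z_test(buyers_a, visitors_a, buyers_b, visitors_b):
    """One-sided two-proportion z-test.

    Tests whether the purchase rate in group A (the segment from our
    hypothesis) is higher than in group B (a sample of everyone else).
    Returns the z statistic and the one-sided p-value.
    """
    rate_a = buyers_a / visitors_a
    rate_b = buyers_b / visitors_b

    # Pooled rate under the null hypothesis that both groups buy equally often.
    pooled = (buyers_a + buyers_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        raise ValueError("No variation in the data; the test is undefined.")

    z = (rate_a - rate_b) / std_err
    # P(Z >= z) for a standard normal variable, via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts collected AFTER the hypothesis was formed.
z, p = two_proportion_z_test(buyers_a=30, visitors_a=400,
                             buyers_b=120, visitors_b=3000)
print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")
```

The important part is not the formula but the input: the counts must come from data gathered after the hypothesis was formulated, otherwise the test merely confirms the pattern that suggested it.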
Many analytics and business intelligence products are lacking in this respect, because 1) they have no support for testing hypotheses at all, or (more commonly) 2) they do not have the workflow systems in place for testing on new data. So the reality is that in most analyses the final step of testing is skipped. This leads to acting on spurious patterns and in effect basing decisions on thin air. It is important to educate analysts and business users that being “data-driven” entails more than making decisions based on just looking at numbers and pretty visualizations.
Hi everybody! Finally, I’m doing something with this domain. It’s been mine for over a year now, but I never really put anything up here. Now is the time. Some of the things I intend to publish here:
- My thoughts and interesting links on behavioural science, applied statistics, and empirical skepticism
- How the above translate into business opportunities
- Academic publications
- Projects I’m working on, especially in R (the programming language)
This is the site as it is now. As for the exact makeup and direction, I’ll figure that out as I go. Stochastic tinkering, as Nassim Nicholas Taleb, author of my all-time favourite “The Black Swan: The Impact of the Highly Improbable”, would call it. It is a reference I will certainly call upon frequently. So stay tuned, more will follow soon!