Computing for Data Analysis

Tags: programming

Two mathematical shames haunt me from my college days. I never really understood calculus as much as I understood how to make Mathematica solve calculus problems, and I never took a class in probability and statistics.

I’ve been working on that second source of shame over the last few weeks. In addition to self-study through Statistics, I stumbled upon Computing for Data Analysis shortly before the last session began. From there I also picked up S Programming, but let’s talk about the course rather than the books.

The course does not explicitly cover statistics, as some students had hoped it would. It does, however, provide a solid introduction to many aspects of the R programming language, which is often used for statistical analysis. Even without any statistics experience, I kept up without trouble, and my R experience will help as I dive deeper into stats in the coming weeks.

The course is short, and the instructors estimate an average of 3-5 hours per week. Their estimate was spot on for me, but I already have a couple of programming languages under my belt, so this was incremental learning. People with limited software development experience struggled considerably with the programming assignments and spent more than the estimated time on the course. The last assignment involved extracting data from an ugly HTML file with regular expressions: it was like an easy day at the office for me, but only because I’ve extracted data from uglier files before. Your mileage may vary.
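To give a feel for that last assignment, here’s a rough sketch of the kind of regex extraction it called for, using only base R string functions. The file name and pattern are invented for illustration; the assignment’s actual data and markup were different.

```r
# Hypothetical example: pull numeric values out of a messy HTML table.
lines <- readLines("ugly.html")

# Find every <td>number</td> occurrence, one line at a time.
matches <- regmatches(lines, gregexpr("<td>([0-9.]+)</td>", lines))

# Strip the tags and convert what's left to numbers.
values <- as.numeric(gsub("</?td>", "", unlist(matches)))

summary(values)
```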

As I started to learn R during the class, and started talking about it in an attempt to recruit friends to join, Tal Yarkoni published The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch.

A Python guy learning R while an R guy blogged about Python taking over his programming toolbox amused my friends. I preach frequently about picking tools and sticking with them to increase the velocity with which interesting solutions find their way into users’ lives, but this was study time, not work time. R transitions quite easily from an interactive language to a “traditional” programming language, and the language design aids that continuous transition. For instance, most programmers spend a considerable amount of time working within the confines of a “for” loop, but “for” loops get ugly in an interactive environment. The vectorized nature of data and functions in R means that, although occasionally useful, writing “for” loops is usually unnecessary and stylistically improper.
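Here’s a small sketch of what I mean, using made-up temperature data: the loop version works, but the vectorized version says the same thing in one line.

```r
# Hypothetical data: convert a vector of temperatures from Celsius to Fahrenheit.
celsius <- c(12.5, 18.3, 21.0, 7.8)

# The "for" loop version: fine, but verbose at the interactive prompt.
fahrenheit_loop <- numeric(length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit_loop[i] <- celsius[i] * 9 / 5 + 32
}

# The vectorized version: arithmetic applies element-wise to the whole vector.
fahrenheit <- celsius * 9 / 5 + 32

identical(fahrenheit_loop, fahrenheit)  # TRUE
```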

If you’d like to learn more along these lines, there’s a new Data Science Specialization offered by Johns Hopkins through Coursera. The course that I just finished has been replaced by the “R Programming” course, and you can mix and match classes if you aren’t interested in the certificate at the end.

If instead you’d just like to click a link to a new tool, check out knitr. It’s sort of like the IPython Notebook, but it feels like document creation may be a little more natural, and knitr supports multiple languages. I’ve only casually used both, so I’d welcome more enlightened opinions on them.
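If you want a quick taste of knitr without writing a full document, its spin() function turns a plain R script into a report, with #' comments becoming prose. This is just one small way in; the file name below is made up.

```r
library(knitr)

# Write a tiny script: #' comments become prose, code output gets woven in.
writeLines(c(
  "#' A tiny report generated by knitr::spin()",
  "summary(cars)   # 'cars' ships with base R"
), "report.R")

spin("report.R")  # produces report.Rmd and report.md next to the script
```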