Today I completed Module 4 of the self-paced edX class Microsoft: DAT208x Introduction to Python for Data Science – NumPy.

Just a quick moment of celebration: Yay! I’m two-thirds of the way through a programming class in a topic that might have made me run away screaming in college.

The material is getting harder and denser, as it probably should in a class that covers a lot of new material.

NumPy is short for Numerical Python. It seems to be pronounced “numb pie” instead of “numb P,” which makes me think of paraphrasing the Gumby theme song to “He can analyze any data set ... Num-py!”

But I digress. This is the module where I really started to see the power of Python and realize I may need to study some aspects of statistics more.

NumPy’s main strengths, in my view, are 1) the ability to work on entire tables of data at once with no need for loop code, 2) its built-in package of statistical functions, and 3) relatively easy subsetting of arrays.
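Here is a minimal sketch of those three strengths. The height and weight values are made up for illustration, not taken from the course data, and the variable names are my own.

```python
import numpy as np

# Hypothetical sample data: heights in metres, weights in kilograms
heights = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
weights = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

# 1) Whole-array arithmetic: one expression, no loop
bmi = weights / heights ** 2

# 2) Built-in statistical functions
mean_bmi = np.mean(bmi)
median_bmi = np.median(bmi)

# 3) Subsetting with a boolean mask: keep only heights above 1.75 m
tall = heights[heights > 1.75]

print(bmi)
print(mean_bmi, median_bmi)
print(tall)
```

The expression `heights > 1.75` produces an array of True/False values, and using it as an index keeps only the matching elements.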

The MS course also started to go into a few data analysis techniques apart from programming. Two examples:

- When you first get your data, it is very helpful to print the mean and median of each of the variables in your data. If the mean and median are far apart, and especially if the mean is an unrealistic value (say 2000 inches for human height) it may represent a flaw in data gathering and/or retrieval.
- It offered some tips on testing a guess/hypothesis, working through an example of whether soccer goalkeepers were generally taller than other players. It also offered an example of checking whether there was a correlation between height and weight.
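Both tips can be sketched in a few lines of NumPy. All of the numbers below are invented for illustration (including the deliberate 2000-inch error), and the position codes are my own shorthand, so treat this as a rough sketch rather than the course's actual exercise.

```python
import numpy as np

# Hypothetical heights in inches; the last value is a deliberate entry error
heights_in = np.array([66.0, 70.0, 68.0, 72.0, 64.0, 2000.0])

mean_h = np.mean(heights_in)      # 390.0, dragged far upward by the bad value
median_h = np.median(heights_in)  # 69.0, barely affected
print(mean_h, median_h)

# Hypothetical soccer players: position code, height (cm), weight (kg)
positions = np.array(["GK", "D", "M", "GK", "A", "D"])
player_heights = np.array([191, 184, 178, 188, 180, 183])
player_weights = np.array([87, 80, 72, 85, 75, 79])

# Compare the median height of goalkeepers with everyone else
gk_median = np.median(player_heights[positions == "GK"])
other_median = np.median(player_heights[positions != "GK"])
print(gk_median, other_median)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between height and weight
corr = np.corrcoef(player_heights, player_weights)[0, 1]
print(corr)
```

A big gap between mean and median, like the one in the first example, is the cue to go back and inspect the raw values.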

I also learned how to generate simulated data by passing parameters to a randomizing function.
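As a rough sketch of what that looks like, here is one way to simulate heights, assuming they follow a normal distribution. The mean, standard deviation, sample size, and seed are values I picked for illustration.

```python
import numpy as np

# Seeded generator so the simulated data is reproducible
rng = np.random.default_rng(seed=42)

# loc = mean, scale = standard deviation, size = number of samples
sim_heights = rng.normal(loc=1.75, scale=0.20, size=5000)

# The sample statistics should land close to the parameters passed in
print(np.mean(sim_heights))
print(np.std(sim_heights))
```

Changing `loc`, `scale`, or `size` gives a different simulated dataset to practice on.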

At this point, I think a number of things you can do with data in Python are similar to what can be done in Excel. But I get the sense that Python will handle much larger datasets than Excel can. It may also be easier to report the results compactly. Also, an examination of the documentation at www.numpy.org may yield functionality not available in Excel.

I haven’t yet established a home Python environment, but this lesson gave me an incentive to do so. I have a few datasets I’d like to play with, though at this point we haven’t covered importing data files into the Python environment.

The next module, which I will likely do Sunday or Monday, is on plotting data, something I’m very much looking forward to.