I'm taking the Coursera Regression class to keep my skills sharp, and to get more comfortable using Python for Data Science instead of R. I used to use R for my data tasks, but it was always a frustrating experience. There are some things that R definitely does better, such as graphics, but now it's straightforward to call R from Python for these one-off tasks, while the same can't be said for calling Python from R.

The Coursera class uses R, so I translate all of the code into Python and usually get the same result, but not always, and here is one example.

The original point of this example is to show that if you have a linear regression in the form

$$Y_i = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3... \beta_nx_n + \epsilon$$

then if you create another regressor $$x_j$$ that is a linear combination of any of the other $$x_i$$'s, then you are not adding any new information and you can throw out the new $$x_j$$. The technical way to say this is that the new regressor has perfect collinearity with existing regressors. See this article for more information on collinearity.

For example, in the swiss dataset I create another regressor z that is the linear combination of two of the existing regressors.

R gives a NA value for the coefficient of z, which is what we expect.

Python's statsmodels ols method does not recognize that z has perfect collinearity with the other regressors, though there is a warning about the zero eigenvalue and the strong possibility of multicollinearity.

     The smallest eigenvalue is 3.87e-11. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.


I originally did this work in this IPython notebook. In the IPYthon notebook, there is no warning about the eigenvalues, so you have to be careful to check for collinearity yourself. Beacuse Python does not elimiate z, it also gives it a $$\beta$$ value, and therefore some of the $$\beta$$ values for the other regressors are also different!

If you're curious for more details, see the question I posted about this topic on StatsExchange.com.

The table below summarizes the differences and you can see how 3 of the coefficients are different.

In the end, the lesson is that be sure you need to understand some of the details of what your stats package is doing. I like the idea of running models in Python and R, to make sure you get the same answer, and if not, I know then to investigate further.