Data Science: Linear Regression in Python

A linear regression models a linear relationship between an outcome variable Y, such as house price, and an explanatory variable X, such as the number of rooms. It usually aims to estimate the expected value of Y given X.
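Written as a formula for a single explanatory variable, the idea looks roughly like this. The symbols \beta_0 (intercept) and \beta_1 (slope) are notation introduced here for illustration:

```latex
E[Y \mid X = x] = \beta_0 + \beta_1 x
```

With several explanatory variables, each one gets its own coefficient.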

If you are doing this in Python with a package rather than by hand, start by importing pandas, numpy and seaborn, then load your data, which will probably be a csv file. I take a quick look at the data at this point just to get an idea of what’s in it. In the example dataset from Udemy, I have average area income, average house age, average number of rooms etc. and the y variable, price.
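A minimal sketch of the loading step might look like the following. Since the Udemy file is not reproduced here, the csv content, the column names and the values are all made up for illustration; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Made-up stand-in for the csv file: column names and values are illustrative only.
csv_text = """avg_area_income,avg_house_age,avg_rooms,price
62000,5.5,6.8,1050000
71000,6.1,7.2,1400000
58000,4.9,6.1,900000
66000,7.0,8.0,1250000
"""

# With a real file this would be pd.read_csv("your_file.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first few rows
df.info()         # column names, dtypes and non-null counts
```

head() and info() are usually enough for a first look at what columns you have and whether anything is missing.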

You can also use the pandas describe method now to look at summary statistics such as the mean, standard deviation etc. A nice seaborn function is seaborn.pairplot, which draws a scatter plot for every pair of numeric variables and, by default, a histogram of each variable on the diagonal.
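Here is a rough sketch of both steps. The data is synthetic (generated with numpy, not the Udemy dataset), and the column names and the helper plot_pairs are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: names and numbers are made up for illustration.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "avg_rooms": rng.uniform(3, 9, n),
    "avg_house_age": rng.uniform(1, 40, n),
})
df["price"] = (
    50_000
    + 30_000 * df["avg_rooms"]
    - 1_000 * df["avg_house_age"]
    + rng.normal(0, 5_000, n)
)

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)


def plot_pairs(data):
    """Draw seaborn's grid of pairwise scatter plots (hypothetical helper)."""
    import seaborn as sns  # imported here so the summary above runs without a plotting stack

    return sns.pairplot(data)
```

Calling plot_pairs(df) in a notebook should show the grid, which makes it easy to spot which variables move together with price.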

Start ‘training’ a linear regression model:

This language still seems very strange to me, as in economics you’d probably write out the code for a regression manually in Python.

  1. Make a subset of the dataframe containing the explanatory variables and call it X
  2. Make a subset of the dataframe for the outcome variable, called y, which will be a vector of house prices in my case.
  3. Split the data into training and test sets that you will use later (import train_test_split from sklearn.model_selection)
  4. Import LinearRegression from sklearn.linear_model
  5. Make a LinearRegression object and assign it to a variable called lm
  6. Fit the model on the training data: lm.fit(X_train, y_train)
  7. Look at the output: the intercept (lm.intercept_) and the coefficients (lm.coef_)
  8. Prediction: use lm.predict(X_test) to predict house prices from the explanatory variables in the test dataframe.
  9. Compare the predictions to the actual prices in the test set (y_test)
  10. Assess your predictions, for example with the mean absolute error or the root mean squared error
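The steps above can be sketched end to end as follows. This is only an illustration under some assumptions: the data is synthetic rather than the Udemy dataset, the column names are made up, and the 30% test size and the random seeds are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: names and numbers are made up for illustration.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "avg_rooms": rng.uniform(3, 9, n),
    "avg_house_age": rng.uniform(1, 40, n),
})
df["price"] = (
    50_000
    + 30_000 * df["avg_rooms"]
    - 1_000 * df["avg_house_age"]
    + rng.normal(0, 5_000, n)
)

# Steps 1-2: explanatory variables X and outcome y
X = df[["avg_rooms", "avg_house_age"]]
y = df["price"]

# Step 3: hold back 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Steps 4-6: create the model and fit it on the training data only
lm = LinearRegression()
lm.fit(X_train, y_train)

# Step 7: estimated intercept and one coefficient per explanatory variable
print("intercept:", lm.intercept_)
print(pd.Series(lm.coef_, index=X.columns, name="coefficient"))

# Step 8: predict prices for the rows the model has not seen
predictions = lm.predict(X_test)

# Step 9: put actual and predicted prices side by side
comparison = pd.DataFrame({"actual": y_test, "predicted": predictions})
print(comparison.head())

# Step 10: summarise the prediction errors
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("MAE:", mae)
print("RMSE:", rmse)
```

Because the toy prices were generated from a linear rule plus noise, the fitted coefficients should land near the values used to generate them. Real data will rarely be that tidy.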

I am working on how to do this from scratch, as I think it would be useful to know.
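One possible starting point for a from-scratch version is ordinary least squares with plain numpy. This is a sketch rather than a finished implementation: the function names fit_ols and predict_ols are hypothetical, the data is noise-free toy data, and np.linalg.lstsq is used for the linear algebra instead of writing the solver by hand.

```python
import numpy as np

# Noise-free toy data, so the true coefficients (2, 3, -1) are known in advance.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]


def fit_ols(X, y):
    """Return [intercept, coef_1, ..., coef_k] minimising the sum of squared residuals."""
    # A leading column of ones lets the first coefficient act as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict_ols(beta, X):
    """Predicted outcome for each row of X given the fitted coefficients."""
    return beta[0] + X @ beta[1:]


beta = fit_ols(X, y)
print(beta)  # close to [2, 3, -1] up to floating point error
```

The same coefficients could be compared against lm.intercept_ and lm.coef_ from sklearn as a sanity check.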