A linear regression finds a linear relationship between a random variable Y (house price, say) and an explanatory variable X (number of rooms, say). It usually aims to estimate the expected value of Y given X.
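Written out for a single explanatory variable, that looks something like the following. The symbols β0 (intercept), β1 (slope) and ε (error term) are notation I am introducing here, and the zero conditional mean of the error is the usual textbook assumption rather than anything specific to this dataset.

```latex
\[
  Y = \beta_0 + \beta_1 X + \varepsilon,
  \qquad
  \mathbb{E}[\varepsilon \mid X] = 0
  \quad\Longrightarrow\quad
  \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X
\]
```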
If you are doing this in Python with a package, rather than by hand, you need to import pandas, numpy and seaborn, then import your data, which will probably be a csv file. I take a quick look at the data at this point just to get an idea of what’s in it. In the example dataset from Udemy, I have average house area income, average house age, average number of rooms etc., and the y variable, price.
You can also use the describe method now to look at summary statistics such as the mean and standard deviation. A nice seaborn function is seaborn.pairplot, which draws a scatter plot for every pair of numeric variables and, by default, a histogram of each variable on the diagonal.
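The loading and first-look steps above might look roughly like this. It is only a sketch: I generate stand-in data so it runs without a file, and the column names, the numbers used to generate the data, the filename in the comment and the helper show_pairplot are all made up for illustration, not taken from the Udemy dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data so the sketch runs on its own. With a real file you would
# instead do something like: df = pd.read_csv("housing.csv")  (hypothetical name)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.normal(68000, 10000, n),   # hypothetical column names
    "house_age": rng.normal(6, 1, n),
    "rooms": rng.normal(7, 1, n),
})
df["price"] = (
    20 * df["income"]
    + 150000 * df["house_age"]
    + 120000 * df["rooms"]
    + rng.normal(0, 100000, n)                # random noise
)

# Quick look at the first five rows
print(df.head())

# Count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)


def show_pairplot(frame):
    """Draw pairwise scatter plots, with a histogram of each variable on the diagonal."""
    # Imported here so the numeric part above runs even without seaborn installed.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.pairplot(frame)
    plt.show()
```

Calling show_pairplot(df) in a notebook or script then gives the grid of plots.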
The next step is to start ‘training’ a linear regression model. This is still very strange language to me, as in economics you would probably write out the code for a regression by hand in Python. The steps are:
- Make a subset of the dataframe containing the explanatory variables and call it X
- Make a subset of the dataframe for the outcome variable and call it y, which will be a vector of house prices in my case.
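- Split the data into training data and test data; the test data is used later (import train_test_split from sklearn.model_selection). This gives you X_train, X_test, y_train and y_test.
- Import LinearRegression from sklearn.linear_model
- Make an instance of LinearRegression and call it lm
- Fit the model on the training data: lm.fit(X_train, y_train)
- Look at the output: the estimated intercept is in lm.intercept_ and the coefficients are in lm.coef_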
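- Prediction: make predictions of house prices with lm.predict(X_test), using X_test from before, so all the explanatory variables that are in the test dataframe.
- Compare the predictions to the actual prices in y_test, for example with a scatter plot of one against the other
- Assess your predictions, for example with the mean absolute error, the root mean squared error or R squared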
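Put together, the steps above can be sketched as follows. Again I use synthetic stand-in data so the block runs on its own; the column names, the coefficients used to generate the data, the 30% test share and the random_state value are my own assumptions, not something from the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical columns and numbers)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "income": rng.normal(68000, 10000, n),
    "house_age": rng.normal(6, 1, n),
    "rooms": rng.normal(7, 1, n),
})
df["price"] = (
    20 * df["income"]
    + 150000 * df["house_age"]
    + 120000 * df["rooms"]
    + rng.normal(0, 100000, n)
)

# Explanatory variables and outcome variable
X = df[["income", "house_age", "rooms"]]
y = df["price"]

# Hold back 30% of the rows for testing (assumed share)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# 'Train' the model, i.e. estimate intercept and coefficients by least squares
lm = LinearRegression()
lm.fit(X_train, y_train)

# Look at the output
print("intercept:", lm.intercept_)
coef_table = pd.DataFrame(lm.coef_, index=X.columns, columns=["coefficient"])
print(coef_table)

# Predict prices for the held-back rows
predictions = lm.predict(X_test)

# Compare predictions with the actual prices
comparison = pd.DataFrame({"actual": y_test, "predicted": predictions})
print(comparison.head())

# Assess the predictions with a few common error measures
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print("MAE:", mae, "RMSE:", rmse, "R squared:", r2)
```

Each coefficient reads the way it would in an economics regression: the estimated change in price for a one-unit change in that variable, holding the others fixed.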
I am working on how to do this from scratch, as I think it would be useful to know.
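As a starting point, here is a minimal sketch of what "from scratch" might look like, assuming ordinary least squares solved through the normal equations in numpy. The function names fit_ols and predict_ols and the toy numbers are made up for illustration; this is one possible approach, not necessarily how sklearn does it internally.

```python
import numpy as np


def fit_ols(X, y):
    """Return [intercept, slope_1, ..., slope_k] that minimise squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)              # single explanatory variable
    # Add a column of ones so the first coefficient is the intercept
    design = np.column_stack([np.ones(len(X)), X])
    # Normal equations: (X'X) beta = X'y
    return np.linalg.solve(design.T @ design, design.T @ y)


def predict_ols(beta, X):
    """Predict y from explanatory variables X using fitted coefficients."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    return beta[0] + X @ beta[1:]


# Toy, noise-free example: price = 50 + 20 * rooms
rooms = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
price = 50.0 + 20.0 * rooms

beta = fit_ols(rooms, price)
print("intercept and slope:", beta)
print("predicted price for 8 rooms:", predict_ols(beta, [8.0]))
```

Solving the normal equations directly is fine for small, well-behaved data like this; np.linalg.lstsq is a more numerically robust alternative when the explanatory variables are close to collinear.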