Linear Regression for Predicting Home Prices

 Introduction

In this blog post, we will explore how to create a linear regression model in Python to predict home prices using the Boston Housing Dataset. This dataset describes housing in the Boston area through 13 feature variables and a corresponding price for each record. Our objective is to develop a predictive model that can accurately estimate prices from these features.



Code

import pandas as pd

import numpy as np

from sklearn.datasets import fetch_openml

 

# Pin the dataset version and request pandas objects explicitly

boston = fetch_openml(name='boston', version=1, as_frame=True)

df = pd.DataFrame(boston.data, columns=boston.feature_names)

df['PRICE'] = boston.target

 

X = df.drop('PRICE', axis=1)

y = df['PRICE']

 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X_train, y_train)

 

# Convert X_test to a numpy array

X_test_array = np.array(X_test)

 

y_pred = lr.predict(X_test_array)

 

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

 

print('MSE:', mse)

print('R2:', r2)


  1. Understanding the Boston Housing Dataset

The Boston Housing Dataset is a popular dataset used for regression tasks. It consists of 506 samples, each describing a census tract in the Boston area rather than an individual house, and contains 13 feature variables such as crime rate, average number of rooms per dwelling, and accessibility to highways. The target variable, which we aim to predict, is the median home value in the tract, expressed in thousands of dollars.

 

  2. Loading and Preparing the Dataset

To begin, we import the necessary libraries and load the Boston Housing Dataset with the `fetch_openml` function from `sklearn.datasets`. The data is stored in a pandas DataFrame, with the target added as a `PRICE` column. We then separate the input features (X) from the target variable (y): `drop` removes the `PRICE` column to form X, and y is that column on its own.
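
The feature/target separation can be tried on a tiny hand-made DataFrame without downloading anything. This is only a sketch: the three rows below are invented for illustration and are not values from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame; the numbers are made up for illustration.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "CRIM": [0.02, 0.30, 0.05],
    "PRICE": [24.0, 19.5, 33.1],
})

# drop returns a new DataFrame without the target column; toy is left unchanged.
X_toy = toy.drop("PRICE", axis=1)
y_toy = toy["PRICE"]

print(list(X_toy.columns))
print(y_toy.shape)
```

Because `drop` does not modify the original frame, `toy` still contains `PRICE` afterwards, which is why the post can reuse `df` for both X and y.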

 

  3. Splitting the Dataset

To evaluate how the model performs on data it has not seen, we split the dataset into training and testing sets using the `train_test_split` function from `sklearn.model_selection`. We allocate 80% of the data for training and 20% for testing. The rows are shuffled before splitting, and setting `random_state=42` makes that shuffle reproducible from run to run.
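
To see what the split produces, here is a small sketch that uses random stand-in data with the same shape as the housing data (506 rows, 13 features). The arrays are synthetic and the variable names are placeholders; only the `train_test_split` call mirrors the one in the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 506 rows and 13 features of random numbers.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same rows again.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

With 506 rows, a 20% test share is rounded up to 102 rows, leaving 404 for training.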

 

  4. Building the Linear Regression Model

We use linear regression, which models the target as a weighted sum of the input features plus an intercept. We import the `LinearRegression` class from `sklearn.linear_model`, instantiate it, and train it on the training data by calling the `fit` method, which estimates the weights by ordinary least squares.
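
A quick way to build intuition for what `fit` does is to train on data whose true relationship is known. The sketch below assumes a made-up, noise-free target with weights 3 and -2 and an intercept of 5; none of these numbers come from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))

# Noise-free target built from known (invented) weights and intercept.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

model = LinearRegression().fit(X_demo, y_demo)

# The fitted attributes should recover the weights and intercept used above.
print(model.coef_)
print(model.intercept_)
```

After fitting on the housing data, the same `coef_` and `intercept_` attributes on `lr` hold one weight per feature and the intercept.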

 

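  5. Making Predictions

Once the model is trained, we make predictions on the test set. We convert the test features to a NumPy array and pass it to the model's `predict` method, storing the predicted values in `y_pred`. The conversion is optional: `predict` also accepts the DataFrame `X_test` directly, which avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame is given a plain array.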
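
Under the hood, a prediction from a fitted linear model is just the feature values multiplied by the learned weights, plus the intercept. The following sketch checks this on synthetic data; the weights and noise level are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))

# Invented weights plus a little noise, for demonstration only.
y_demo = X_demo @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict by hand: weighted sum of the features plus the intercept.
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
```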

 

  6. Evaluating the Model

To assess the performance of our model, we use two common metrics: Mean Squared Error (MSE) and R-squared (R2), calculated with the `mean_squared_error` and `r2_score` functions from `sklearn.metrics`. MSE measures the average squared difference between the predicted and actual values, while R2 represents the proportion of the variance in the target variable that the model's predictions explain on the data being evaluated.
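
Both metrics are short formulas, and computing them by hand makes the definitions concrete. In this sketch the four "actual" and "predicted" values are made up; the point is only that the manual results match what scikit-learn returns.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 30.7, 35.4])

# MSE: mean of the squared errors.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```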

 

  7. Results and Interpretation

After evaluating our model, we obtain the following results:

- Mean Squared Error (MSE): 24.29

- R-squared (R2): 0.67

 

The MSE of 24.29 is the average squared difference between the predicted and actual prices. Because the errors are squared, it is expressed in squared price units, and lower values indicate better performance. The R2 value of 0.67 suggests that approximately 67% of the variance in the test-set prices is explained by the model. R2 values closer to 1 indicate a better fit.
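
Since MSE is in squared units, taking its square root (the RMSE) gives a figure in the same units as the prices, which is easier to read. The sketch below simply applies that to the reported MSE; reading the result as thousands of dollars assumes the target is recorded in thousands of dollars, as it is in the standard version of this dataset.

```python
import math

mse = 24.29  # MSE reported in the results above

# RMSE is the square root of MSE, in the same units as the target.
rmse = math.sqrt(mse)
print(round(rmse, 2))  # 4.93
```

Under that assumption, a typical prediction error is on the order of 4.9 thousand dollars.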

 

Conclusion

In this blog post, we explored how to create a linear regression model to predict home prices using the Boston Housing Dataset. Using the 13 feature variables provided in the dataset, we built a model that estimates prices with a moderate level of accuracy, explaining about 67% of the variance on the test set. The MSE and R2 metrics helped us evaluate the model's performance and understand its predictive power. With further exploration and enhancements, this model can be refined to provide more accurate predictions for housing prices, aiding in real estate decision-making and analysis.
