Predicting home sales prices from square footage (36 points). In this problem, we will consider predicting house sale prices (SalePrice) from square footage (SqrFeet) using a dataset of 𝑛 = 506 houses.
a. (2 points) Download the dataset homes.csv and load it into R. Fit a simple linear regression model to this data with SalePrice as the response variable and SqrFeet as the predictor. What is the fitted equation, and what is the coefficient of determination for this model?
b. (3 points) Plot a scatterplot of the data (SqrFeet, SalePrice), along with the best fit line from part (a) on top of this scatterplot. Make sure to add an appropriate title and labels to your plot and ensure that the best fit line is distinguishable from the scatterplot points.
c. (2 points) In our dataset, the houses were randomly selected, so we can assume that the independence assumption is met. Use diagnostic plots to check that the other linear model assumptions are met for the model in part (a), and check for possible outliers. Report the diagnostic plots in your homework submission. Do any of the assumptions appear to be violated? If so, which ones? Are there any potential outliers?
d. Sometimes an effective way to fix violations of the model assumptions and outliers in linear regression is to log transform the response variable and/or the predictor variable. We will consider a log transformation of just the response variable.
- i. (1 point) In your dataframe, create a new column called logSalePrice by log-transforming SalePrice. Report the first 10 observations in the updated dataframe using the head() function.
- ii. (1 point) Fit a new simple linear regression model with logSalePrice as the response and SqrFeet as the predictor. What are the least squares estimates of 𝛽₀ and 𝛽₁?
- iii. (1 point) What is the coefficient of determination for this model? How does it compare to your answer in part (a)? Explain. (Note: For SLR, we can compare the 𝑅² between different models since we only have one predictor.)
- iv. (3 points) Plot a scatterplot of (SqrFeet, logSalePrice), along with the best fit line from part (ii) on top of this scatterplot. Make sure to add an appropriate title and labels to your plot and ensure that the best fit line is distinguishable from the scatterplot points. What do you observe about this new log-transformed model?
e. (2 points) Use diagnostic plots to check that the linear regression assumptions (besides the independent errors assumption) are met for the model fit in part (d), and check for possible outliers. Report the diagnostic plots in your homework submission. What do you observe? How do these plots compare to the plots in part (c)?
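The workflow in parts (a) through (e) can be sketched in base R roughly as follows. This is a minimal sketch, not a reference solution: it assumes homes.csv is in the working directory and contains columns named exactly SalePrice and SqrFeet, and it assumes the natural log is the intended transformation. The helper names analyze_homes, plot_fit, and plot_diagnostics are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: fits the models from parts (a) and (d).
# Assumes `homes` has numeric columns SalePrice and SqrFeet.
analyze_homes <- function(homes) {
  # (a) simple linear regression on the original scale
  fit_raw <- lm(SalePrice ~ SqrFeet, data = homes)

  # (d)(i) natural-log transform of the response (an assumption;
  # the problem does not state the base of the logarithm)
  homes$logSalePrice <- log(homes$SalePrice)

  # (d)(ii) refit with the transformed response
  fit_log <- lm(logSalePrice ~ SqrFeet, data = homes)

  list(
    data    = homes,
    fit_raw = fit_raw,
    fit_log = fit_log,
    r2_raw  = summary(fit_raw)$r.squared,  # coefficient of determination, part (a)
    r2_log  = summary(fit_log)$r.squared   # coefficient of determination, part (d)(iii)
  )
}

# Hypothetical helper: scatterplot with the fitted line drawn on top,
# as in parts (b) and (d)(iv).
plot_fit <- function(fit, data, response, main) {
  plot(data$SqrFeet, data[[response]],
       xlab = "Square footage (SqrFeet)", ylab = response,
       main = main, pch = 16, col = "grey50")
  abline(fit, col = "red", lwd = 2)  # contrasting color keeps the line visible
}

# Hypothetical helper: the four default lm diagnostic plots in a 2x2 grid,
# as in parts (c) and (e).
plot_diagnostics <- function(fit) {
  old <- par(mfrow = c(2, 2))
  on.exit(par(old))  # restore the previous plotting layout
  plot(fit)
}

# Usage, guarded so the script still runs if the file is absent.
if (file.exists("homes.csv")) {
  homes <- read.csv("homes.csv")
  res <- analyze_homes(homes)

  print(coef(res$fit_raw))     # intercept and slope, part (a)
  print(res$r2_raw)
  print(head(res$data, 10))    # first 10 rows, part (d)(i)
  print(coef(res$fit_log))     # intercept and slope, part (d)(ii)
  print(res$r2_log)

  plot_fit(res$fit_raw, res$data, "SalePrice", "SalePrice vs. SqrFeet")
  plot_diagnostics(res$fit_raw)
  plot_fit(res$fit_log, res$data, "logSalePrice", "log(SalePrice) vs. SqrFeet")
  plot_diagnostics(res$fit_log)
}
```

The default plot() method for lm objects draws residuals vs. fitted, a normal Q-Q plot, a scale-location plot, and residuals vs. leverage, which together cover the linearity, normality, and equal-variance checks as well as a first look at outliers. Interpreting the plots is still left to the reader.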