Predicting health care charges of insurance beneficiaries (30 points). Health insurance companies often have to cover a sizable portion of the health care costs of their primary beneficiaries. Thus, it is of practical interest for these companies to predict the cost of medical bills from individual patient characteristics. In this problem, we will consider predicting the medical costs (or charges) in dollars for 𝑛 = 1338 primary beneficiaries. The response variable is charges, while the predictors are age, sex, BMI, number of dependents (children), an indicator variable for whether the primary beneficiary is a smoker, and the region of the U.S.A. where the beneficiary resides.
a. (3 points) Download the dataset insurance.csv and read it into R. In order to register the categorical variables as factors, please use the following R code:
insurance.dat <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)
Moreover, please make sure that the baseline group for smoker is no. If necessary, re-level the smoker variable so that no is the baseline. Next, set a seed so that you can reproduce the same results for this problem each time you run your code. Then split the data into a training set containing roughly 80% of the data and a test set containing the remaining 20%. Print the first 10 observations of the training set and of the test set.
(NOTE: Students may get slightly different answers depending on how they created their training and test sets. This problem is not graded on getting exactly the same results, but on whether you understand what is going on.)
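One possible way to carry out the split in part (a) is sketched below. The helper name split_train_test, the seed value 123, and the use of floor() to size the training set are illustrative assumptions, not requirements of the assignment; any reproducible random split of roughly 80/20 is acceptable.

```r
# Hypothetical helper (not part of the assignment): random train/test split.
# train_frac and seed are arbitrary illustrative defaults.
split_train_test <- function(dat, train_frac = 0.8, seed = 123) {
  set.seed(seed)                                   # makes the split reproducible
  n <- nrow(dat)
  train_idx <- sample(n, size = floor(train_frac * n))
  list(train = dat[train_idx, , drop = FALSE],     # sampled rows
       test  = dat[-train_idx, , drop = FALSE])    # all remaining rows
}

# Intended usage, assuming insurance.dat was read in as shown in part (a):
# insurance.dat$smoker <- relevel(insurance.dat$smoker, ref = "no")
# parts <- split_train_test(insurance.dat)
# head(parts$train, 10)
# head(parts$test, 10)
```

relevel() only changes which factor level is treated as the baseline; it does not alter the data themselves.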
b. (2 points) Fit linear regression to your training data with charges as the response variable and all other variables as predictors. What is the adjusted R-squared, $R_a^2$?
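As a rough sketch of part (b), the full model can be fit with lm() and the adjusted R-squared read from its summary. The helper names fit_full_model and adjusted_r2 are hypothetical, and train stands for whatever training data frame you created in part (a).

```r
# Hypothetical helpers (names are illustrative, not from the assignment).
# Fit charges on every other column of the training data.
fit_full_model <- function(train) {
  lm(charges ~ ., data = train)
}

# Extract the adjusted R-squared stored in the model summary.
adjusted_r2 <- function(fit) {
  summary(fit)$adj.r.squared
}

# Intended usage, assuming `train` is the training set from part (a):
# fit <- fit_full_model(train)
# adjusted_r2(fit)
```

The formula charges ~ . uses every remaining column as a predictor, so the training data frame should contain only the response and the intended covariates.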
c. (4 points) Based on your model in part (b), which covariates are significantly associated with charges, conditionally on the other ones in the model? Give the interpretations of the 95% confidence intervals for the statistically significant covariates in the model.
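For part (c), it can help to put the estimates, p-values, and confidence intervals side by side. The sketch below assumes a fitted lm object; the helper name coef_table and the rule of flagging p-values below 1 - level are illustrative choices, not something prescribed by the assignment.

```r
# Hypothetical helper: combine coefficient estimates, confidence intervals,
# and p-values from a fitted lm object into one data frame.
coef_table <- function(fit, level = 0.95) {
  est <- coef(summary(fit))            # Estimate, Std. Error, t value, Pr(>|t|)
  ci  <- confint(fit, level = level)   # lower and upper confidence limits
  out <- data.frame(
    estimate = est[, "Estimate"],
    lower    = ci[rownames(est), 1],
    upper    = ci[rownames(est), 2],
    p_value  = est[, "Pr(>|t|)"]
  )
  # Flag terms whose p-value falls below 1 - level (0.05 for a 95% interval).
  out$significant <- out$p_value < (1 - level)
  out
}

# Intended usage, assuming `fit` is the model from part (b):
# coef_table(fit)
```

Each interval is interpreted conditionally on the other covariates: it describes the expected change in charges for a one-unit change in that covariate (or relative to the baseline level, for a factor) with the remaining predictors held fixed.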
d. (3 points) Based on your model in part (b), which covariates are not determined to be significantly associated with charges, conditionally on the other ones in the model? Are any of the non-region covariates marginally associated with charges? Explain what could be going on here.
e. (2 points) The continuous covariates in this dataset are age and bmi. Report the pairwise correlation between these covariates. What can be concluded based on this correlation analysis?
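The correlation in part (e) can be computed with cor(). The sketch below is a minimal example; the helper name pairwise_cor is hypothetical, and it assumes the columns are named age and bmi as in the problem statement.

```r
# Hypothetical helper: correlation matrix for the continuous covariates.
# Rows with missing values in these columns are dropped before computing.
pairwise_cor <- function(dat, vars = c("age", "bmi")) {
  cor(dat[, vars], use = "complete.obs")
}

# Intended usage, assuming `train` is the training set from part (a):
# pairwise_cor(train)
```

The off-diagonal entry is the Pearson correlation between age and bmi; how close it is to zero indicates how much linear overlap the two covariates share.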