Predicting health care charges of insurance beneficiaries (30 points). Health insurance companies often have to cover a sizable portion of the health care costs of their primary beneficiaries. Thus, it is of practical interest for these companies to predict the cost of medical bills from individual patient characteristics. In this problem, we will consider predicting the medical costs (or charges) in dollars for 𝑛 = 1338 primary beneficiaries. The response variable is charges, while the predictors are age, sex, BMI, number of dependents (children), an indicator variable for whether the primary beneficiary is a smoker, and the region of the U.S.A. where the beneficiary resides.

a. (3 points) Download the dataset insurance.csv and read it into R. In order to register the categorical variables as factors, please use the following R code:

insurance.dat <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)

Moreover, make sure that the baseline group for smoker is no; if necessary, re-level the smoker variable so that no is the baseline. Next, set a seed so that your results for this problem are reproducible each time you run your code. Then split the data into a training set containing roughly 80% of the data and a test set containing the remaining 20%. Print the first 10 observations of the training set and of the test set.

(NOTE: Students may get slightly different answers depending on how they created their training and test sets. This problem is not graded on getting exactly the same results, but on whether you understand what is going on.)
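
The steps in part (a) might be sketched as follows. This is only one possible approach: the helper name split_train_test, the seed value 123, and the object names insurance.train and insurance.test are assumptions made for illustration, and any other seed or splitting method is equally acceptable. With 𝑛 = 1338, an 80% split taken with floor() gives 1070 training rows and 268 test rows.

```r
# Hypothetical helper: randomly split a data frame into training and test sets.
split_train_test <- function(dat, train_frac = 0.8, seed = 123) {
  set.seed(seed)  # arbitrary seed, used only to make the split reproducible
  n <- nrow(dat)
  train_idx <- sample(seq_len(n), size = floor(train_frac * n))
  list(train = dat[train_idx, , drop = FALSE],
       test  = dat[-train_idx, , drop = FALSE])
}

# This part runs only if insurance.csv is in the working directory.
if (file.exists("insurance.csv")) {
  insurance.dat <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)
  # Make "no" the baseline level of smoker (harmless if it already is).
  insurance.dat$smoker <- relevel(insurance.dat$smoker, ref = "no")
  parts <- split_train_test(insurance.dat)
  insurance.train <- parts$train
  insurance.test  <- parts$test
  print(head(insurance.train, 10))
  print(head(insurance.test, 10))
}
```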

b. (2 points) Fit a linear regression model to your training data with charges as the response variable and all other variables as predictors. What is the adjusted $R^2$, denoted $R_a^2$?
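
A minimal sketch of part (b), assuming the training set from part (a) is stored in a data frame called insurance.train (a hypothetical name) and using a hypothetical helper fit_full_model. The formula charges ~ . regresses charges on every other column, and summary() reports the adjusted R-squared.

```r
# Hypothetical helper: regress charges on every other column of `train`.
fit_full_model <- function(train) {
  lm(charges ~ ., data = train)
}

# insurance.train is an assumed name for the training set from part (a).
if (exists("insurance.train")) {
  fit <- fit_full_model(insurance.train)
  print(summary(fit)$adj.r.squared)  # adjusted R-squared
}
```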

c. (4 points) Based on your model in part (b), which covariates are significantly associated with charges, conditional on the other covariates in the model? Interpret the 95% confidence intervals for the statistically significant covariates in the model.
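
For part (c), the p-values and 95% confidence intervals can be read side by side. The sketch below assumes the fitted model from part (b) is stored in an object called fit (a hypothetical name); coef_table is likewise a hypothetical helper, built only from summary() and confint().

```r
# Hypothetical helper: coefficient estimates, p-values, and CI bounds in one table.
coef_table <- function(fit, level = 0.95) {
  est <- summary(fit)$coefficients      # Estimate, Std. Error, t value, Pr(>|t|)
  ci  <- confint(fit, level = level)    # lower and upper confidence bounds
  cbind(est, ci[rownames(est), , drop = FALSE])
}

# fit is an assumed name for the lm object from part (b).
if (exists("fit") && inherits(fit, "lm")) {
  print(coef_table(fit))
}
```

A coefficient whose interval excludes zero is significant at the 5% level; the interval bounds the expected change in charges per unit change in that covariate, holding the others fixed.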

d. (3 points) Based on your model in part (b), which covariates are not significantly associated with charges, conditional on the other covariates in the model? Are any of the non-region covariates marginally associated with charges? Explain what could be going on here.
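
One way to check marginal association in part (d) is to regress charges on each covariate by itself and look at the overall F-test. This is a sketch under assumptions: marginal_pvalues is a hypothetical helper, insurance.train is an assumed name for the training set, and the covariate names passed to it must match the column names in the data.

```r
# Hypothetical helper: p-value of the F-test from a one-covariate regression
# of charges on each listed covariate.
marginal_pvalues <- function(train, covariates) {
  sapply(covariates, function(v) {
    fit_v <- lm(reformulate(v, response = "charges"), data = train)
    anova(fit_v)[1, "Pr(>F)"]
  })
}

# insurance.train is an assumed name for the training set from part (a).
if (exists("insurance.train")) {
  covs <- setdiff(names(insurance.train), c("charges", "region"))
  print(marginal_pvalues(insurance.train, covs))
}
```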

e. (2 points) The continuous covariates in this dataset are age and bmi. Report the pairwise correlation between these covariates. What can be concluded based on this correlation analysis?
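
The correlation in part (e) can be computed with cor(). The sketch below uses a hypothetical helper age_bmi_cor; whether to compute the correlation on the full data or only on the training set is not stated in the problem, so applying it to insurance.train (an assumed name) is a choice made here for illustration.

```r
# Hypothetical helper: Pearson correlation between the age and bmi columns.
age_bmi_cor <- function(dat) {
  cor(dat$age, dat$bmi)
}

# insurance.train is an assumed name for the training set from part (a).
if (exists("insurance.train")) {
  print(age_bmi_cor(insurance.train))
}
```

A correlation near zero would suggest that collinearity between age and bmi is not a concern for the model in part (b).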


