Midterm Take Home Test
Dr DH Jones
- This is an unproctored examination in the form of a real data analysis project.
- All work you submit must be your own: you may not consult, for any reason, with other students, staff, faculty, or live internet entities.
- You may use lecture notes, books, or internet libraries.
- You will be required to download and electronically sign the honor certificate, and then upload it to Canvas.
- Your exam will not be graded until you have met the honor certificate requirement.
Uploading your work
- There will be eight (8) Canvas assignment slots: one for the honor certificate, and seven for the test questions.
- Therefore, in effect, you must prepare seven source files, one for each question, with each file containing the code to load and rename the data.
- You will upload your answers to each question individually to Canvas.
- Your files must be in HTML format as generated from RStudio.
- Please do not email your answers to the professor.
- Show all your R code for each question that calls for coding.
- If the coding is missing, you will not receive credit for that portion of the test.
- The dataset for analysis is GaltonFamilies in the HistData package; you will analyze it using statistical linear models.
- In the 1880s, Francis Galton, inventor of the concept of correlation, assembled the dataset as part of his ground-breaking research on, and applications of, regression analysis.
R code for loading and renaming the data
- For each question, use the following R code to obtain and rename the data.
# install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE) # After the first compile, you may comment out this line.
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
- The variables are:
## "family" "father" "mother" "midparentHeight"
## "children" "childNum" "gender" "childHeight"
1 Data pre-processing (8 points)
- Obtain the summary of the data GaltonFamilies.
- Are there any data that should be coded missing?
- Which variables are numeric, integer, or factor?
- What is the R command for obtaining the levels of a factor?
- Use this command to determine the levels of gender.
- Are the labels sufficiently informational?
- Remove the family and childNum columns.
- Produce the summary table of the modified dataframe.
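The kind of inspection these steps call for can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: `toy`, `id`, `group`, and `height` are hypothetical names, not part of the exam data.

```r
# Hypothetical toy data frame; NOT the exam data.
toy <- data.frame(
  id     = 1:6,
  group  = factor(c("a", "b", "a", "b", "a", "b")),
  height = c(60.5, 62.0, 61.2, 64.8, 59.9, 63.3)
)

# Class of each column (integer, factor, numeric, ...)
col_classes <- sapply(toy, class)
print(col_classes)

# Levels of a factor column
grp_levels <- levels(toy$group)
print(grp_levels)

# Drop a column by name, keeping all the others
toy2 <- toy[, !(names(toy) %in% c("id"))]
summary(toy2)
```

Selecting by name with `%in%` rather than by position keeps the code valid if the column order changes.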
2 Correlation plots and Scatterplots (8 points)
- Obtain the correlation matrix of all the numeric and integer variables.
- Obtain the correlation plot of all the numeric and integer variables.
- Obtain the scatterplot matrix of all the variables in Galton2, with gender as the first variable and childHeight as the output variable.
- Which variables look like potential predictors of childHeight?
- Which pairs of predictors look redundant?
- Obtain the scatterplot of childHeight vs midparentHeight, coloring the points by gender.
- Add to this plot title = "Original Galton Data" and subtitle = "Scatterplot".
- Add loess regression lines for each gender group to this plot.
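As a rough illustration of the correlation-matrix and scatterplot-matrix steps, the sketch below uses the built-in `iris` data rather than the exam data, and base R graphics rather than any particular plotting package. The dataset and the choice of `pairs()` are assumptions made for the example.

```r
# Built-in iris data stands in for the exam data (assumption).
# is.numeric() is TRUE for both numeric and integer columns.
num_cols <- sapply(iris, is.numeric)

# Correlation matrix of the numeric columns only
cmat <- cor(iris[, num_cols])
print(round(cmat, 2))

# Base-R scatterplot matrix, points colored by the factor column
pairs(iris[, num_cols], col = iris$Species, main = "Scatterplot matrix")
```

Pairs of variables whose correlation is close to 1 in absolute value are the usual candidates for redundancy.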
3 Interaction Model (8 points)
- Fit the interaction model g1:
childHeight ~ gender + midparentHeight + gender:midparentHeight.
- Using title "Residual Plot", obtain the scatterplot of the residuals vs fitted values. Don't print it, save it in
- Using title "Residual Plot", obtain the scatterplot of the residuals vs midparentHeight. Don't print it, save it in
- Using title "Boxplot", obtain the boxplot of the residuals vs gender. Don't print it, save it in
- Using title "QQ-plot", obtain the QQ-plot, with a red qq-line, of the residuals. Don't print it, save it in
- Using a 2×2 grid, plot all four plots using gridExtra.
- What patterns do the above plots reveal if any?
- Obtain the coefficients, standard error of estimate, t-value of estimate, and p-value of estimate for model g1.
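The mechanics of fitting an interaction model and pulling out its coefficient table can be sketched on the built-in `mtcars` data. This is an illustration only: the variables `mpg`, `am`, and `wt` are stand-ins for the exam variables, and the diagnostics use base graphics instead of the saved-plot and gridExtra workflow the question asks for.

```r
# Interaction model on built-in mtcars (stand-in for the exam data).
fit <- lm(mpg ~ factor(am) + wt + factor(am):wt, data = mtcars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- summary(fit)$coefficients
print(ctab)

# Residuals and fitted values for diagnostic plots
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Two base-R diagnostics side by side
op <- par(mfrow = c(1, 2))
plot(diag_df$fitted, diag_df$resid, main = "Residual Plot",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(diag_df$resid, main = "QQ-plot")
qqline(diag_df$resid, col = "red")
par(op)  # restore the previous graphics settings
```

The same `summary(...)$coefficients` call works for the main effects and constant models, since each is an `lm` fit.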
4 Main Effects Model (3 points)
- Fit the main effects model g2:
childHeight ~ gender + midparentHeight.
- Obtain the coefficients, standard error of estimate, t-value of estimate, and p-value of estimate for model g2.
- Interpret the value of the coefficient for gender.
5 Constant Model (3 points)
- Fit the constant model g0:
childHeight ~ 1.
- Obtain the coefficients, standard error of estimate, t-value of estimate, and p-value of estimate for model g0.
- Interpret the value of the coefficient.
6 Comparing models g.sm versus g.big (13 points)
- Fit the following model for childHeight, g.big: \(\text{childHeight} = \beta_0 + \beta_1\,\text{gender} + \beta_2\,\text{father} + \beta_3\,\text{mother} + \beta_4\,\text{children}\). What are the estimated coefficients, standard errors of estimate, t-values of estimate, and p-values of estimate?
- Fit the following model for childHeight, g.sm: \(\text{childHeight} = \beta_0 + \beta_1\,\text{gender} + \beta_2\,\text{father}\). What are the estimated coefficients, standard errors of estimate, t-values of estimate, and p-values of estimate?
- In terms of the beta coefficients, what are the null and alternative hypotheses for the test of model g.sm versus model g.big?
- Compute the Analysis of Variance Table for this test based on the data.
- Using \(\alpha = 0.001\), based on the p-value, what is the decision rule and conclusion of the hypothesis test of g.sm versus g.big?
- Compute the fit plot for g.big with the following specifications:
- 45-degree line in red
- title "Fit Plot"
- y-axis label "Child Height"
- x-axis label "Fitted Values"
- What is the Pearson correlation between the g.big fitted values and the actual values?
- Is the correlation strong, moderate or weak?
- What does this indicate for the model?
- What is the \(R^2\) for the g.big model?
- What does \(R^2\) mean for the fitted model?
- Theoretically, what is the relation between \(R^2\) and the Pearson correlation (actual y-values vs fitted values)?
- Show that this relation holds for the computed values of the
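The mechanics of a nested-model comparison, and of checking \(R^2\) against the correlation between actual and fitted values, can be sketched on the built-in `mtcars` data. The models `small` and `big` below are hypothetical stand-ins, not the exam models; the check relies on the standard result that, for a least-squares fit with an intercept, \(R^2\) equals the squared Pearson correlation between the actual and fitted values.

```r
# Nested models on built-in mtcars (stand-ins for g.sm and g.big).
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test of the smaller model against the larger one
aov_tab <- anova(small, big)
print(aov_tab)
p_value <- aov_tab[2, "Pr(>F)"]

# R^2 of the larger model vs. squared correlation of actual and fitted
r2  <- summary(big)$r.squared
rho <- cor(mtcars$mpg, fitted(big))
print(c(r_squared = r2, cor_squared = rho^2))
print(all.equal(r2, rho^2))
```

The decision rule then compares `p_value` with the chosen significance level; the null hypothesis sets the coefficients of the extra terms in the larger model to zero.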