EXAMINATION PAPER

Exam in: STA-3300
Date: Wednesday 27 November 2013
Time: 09:00-13:00
Place: Åsgårdsv. 9
Approved aids:
- Calculator
- All printed and written

The exam contains 20 pages, including this cover page.
Contact person: Georg Elvebakk
Phone: 77646532
IMPORTANT: Each of the points a), b), c) etc. counts for 10% of the final grade. SPSS printouts are listed on pages 5 to 18, but some of the numbers are replaced by "?". There is a Norwegian translation (without tables) on pages 19 and 20. Unless otherwise specified, use a 5% significance level for tests and 95% confidence intervals.

Problem 1

The data in this problem are from a sample of cars for sale at www.finn.no on one day in November 2013. On that day there were a total of 42 cars of the model Opel Meriva 1.6l for sale. The following variables were collected for each car:

$Y$: The asking price in 1000 NOK.
$X_1$: The mileage of the car in 1000 km.
$X_2$: The age of the car in whole years.

The total dataset is given in the SPSS printouts. In this problem the relationship between price and mileage is of particular interest, but we also want to see whether age influences this relationship. In the SPSS printouts three different models are fitted: 1) with $X_1$ and $X_2$, 2) with $X_1$ and $X_2$ and interaction, and 3) with only $X_1$.

a) Formulate a multiple regression model for $Y$ as a function of $X_1$ and $X_2$. Report the fitted (estimated) model from the printouts, and explain what the estimated parameters tell us. Perform an overall test for the relationship between $Y$ and the group of $X_1$ and $X_2$.

We will now investigate whether $X_1$ and $X_2$ interact. The model with interaction:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + E$

b) Explain what it means that $X_1$ (mileage) interacts with $X_2$ (age). Perform a test to check whether there is significant interaction.

Finally we want to see if $X_2$ might be a confounder for $X_1$.

c) Explain what it means that $X_2$ is a confounding variable for the relationship between $Y$ and $X_1$. Based on your knowledge of used car prices, would you expect this to be the case? Use the SPSS output to argue whether $X_2$ is a confounder or not, and give your best estimate of the slope of the relationship between $Y$ and $X_1$.
Problem 2

In this problem we will use data from a vocabulary experiment with kindergarten children. $n = 69$ children have been given the Peabody Picture Vocabulary Test (PPVT), and this is the dependent variable (response) in the experiment. The goal is to see how PPVT can be modelled with several simpler tests, called paired-associate tests. (The details are not important, but in these tests a number of pairs are first presented, and the children then have to fill in the missing second half of each pair; the score is the number of correct pairs.) The 5 tests in this experiment were called Named (N), Still (S), Named still (NS), Named action (NA) and Sentence still (SS).

Variables recorded for each child (data given in SPSS printouts):

$Y$: PPVT.
$X_1$: N test.
$X_2$: S test.
$X_3$: NS test.
$X_4$: NA test.
$X_5$: SS test.

The goal here is to use the variables $X_1$, $X_2$, $X_3$, $X_4$ and $X_5$ to model $Y$ (PPVT). We will start by fitting a linear regression model of $Y$ as a function of only $X_4$.

a) Based on information in the SPSS printouts, would you say that $X_4$ is the most natural independent variable to use in a one-variable model for $Y$? Explain. Formulate the model and report the fitted (estimated) model. What proportion of the variation in $Y$ is explained by this model?

In the next point you can use that $\bar{X}_4 = 23.275$ and $S_{X_4} = 6.223$.

b) Compute the estimated mean value for $Y$ for children with $X_4 = 30$. Also find a 90% confidence interval for the mean (line) when $X_4 = 30$. Give an interpretation of what this interval tells you.

We will now see whether adding other X-variables to our model (in addition to $X_4$) will improve it. In the SPSS printouts two additional models are fitted:

$Y = \beta_0 + \beta_4 X_4 + \beta_5 X_5 + E$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + E$

c) Compute $r^2_{Y,X_5 \mid X_4}$ and $r^2_{Y,(X_1,X_2,X_3) \mid X_4,X_5}$ and explain what they measure. Then do the two tests where the null hypotheses are that the corresponding correlations are 0.
Which model seems to be the best?

d) In SPSS a forward selection procedure is performed. Explain briefly the steps in this particular case. Report the fitted final model. Find $R^2$, $C_p$ and MSE for the maximum model and for the model from the forward selection procedure. From this and earlier results, which model would you choose?

Forward selection gave us a suggested model. We want to check this model to see if the model assumptions are OK, and if there are other possible problems.

e) Use the plots and diagnostics in the SPSS printouts to check the model assumptions. Explain how you check each assumption. Are there troublesome outliers in the data?
Problem 3

In this problem we will use data from an agricultural experiment in Kansas. We have results from a total of 48 experiments. The response, $Y$, is the amount of wheat that was harvested from a plot of land (yield of wheat). The goal of the experiment is to evaluate the effect of four different fertilizers. In addition we want to see if the type of wheat matters, so three different types are chosen from the types of wheat grown in Kansas.

Results ($Y_{ijk}$):

                Type 1   Type 2   Type 3   Fertilizer means
Fertilizer 1    72.8     53.9     57.6
                49.5     49.7     71.1
                50.4     69.3     55.1
                73.2     40.6     77.0
  Cell means:   61.48    53.38    65.20    60.02
Fertilizer 2    44.3     71.1     84.8
                55.1     73.2     86.2
                44.6     82.9     83.9
                77.1     73.1     74.8
  Cell means:   55.28    75.08    82.43    70.93
Fertilizer 3    45.6     64.0     93.3
                63.0     54.1     77.8
                46.2     64.1     69.0
                42.7     72.5     86.0
  Cell means:   49.38    63.68    81.53    64.86
Fertilizer 4    64.7     54.3     98.7
                41.4     79.6     63.9
                76.8     61.5     79.0
                45.9     89.5     71.8
  Cell means:   57.20    71.23    78.35    68.93
Type means      55.83    65.84    76.88

Total mean: $\bar{Y} = 66.18$

The wheat yield is $Y_{ijk}$ where $i = 1, 2, 3, 4$ (fertilizers), $j = 1, 2, 3$ (types of wheat) and $k = 1, \ldots, 4$ (experiments per combination). The total mean and the means of each cell, fertilizer and wheat type are given. We will use a two-way ANOVA model to analyze these data.

a) Would you consider the two factors (fertilizer and type) fixed or random? Explain. Formulate a model for this experiment, and explain what the elements represent.

b) Formulate and do the tests that are of interest for this experiment, and draw your conclusions. Find (approximately) the p-values for these tests. Explain what a p-value tells you (in general).
PROBLEM 1
Model with X1 and X2:

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   27723.622        2    ?             ?        .?
   Residual     12828.783        39   ?
   Total        40552.405        41
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   171.392    9.459                                          18.120   .000
   X1           -.194      .071               -.288                       -2.729   .009
   X2           -8.718     1.442              -.639                       -6.045   .000
a. Dependent Variable: Y

Model with only X1:

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   15705.181        1    15705.181     25.283   .000 (a)
   Residual     24847.224        40   621.181
   Total        40552.405        41
a. Predictors: (Constant), X1
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   130.486    9.082                                          14.368   .000
   X1           -.418      .083               -.622                       -5.028   .000
a. Dependent Variable: Y
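The columns of a regression ANOVA table are tied together by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The following is a minimal Python sketch of that arithmetic, not part of the original exam. The function name `anova_row` is hypothetical, and the input values are the "only X1" table as read from the printout, assuming the regression and residual sums of squares add up to the printed total.

```python
def anova_row(ss_regression, df_regression, ss_residual, df_residual):
    """Return (MS regression, MS residual, F) for a regression ANOVA table."""
    ms_regression = ss_regression / df_regression  # mean square = SS / df
    ms_residual = ss_residual / df_residual
    f_statistic = ms_regression / ms_residual      # overall F statistic
    return ms_regression, ms_residual, f_statistic


# Model with only X1 (assumed reading of the printout: 15705.181 + 24847.224 = 40552.405).
ms_reg, ms_res, f_stat = anova_row(15705.181, 1, 24847.224, 40)
print(f"MS residual = {ms_res:.3f}, F = {f_stat:.3f}")
```

The same function can be applied to any of the other ANOVA tables where entries are replaced by "?".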
Model with X1, X2 and X1*X2 (interaction variable):

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   27907.621        3    9302.540      27.956   .000 (a)
   Residual     12644.784        ?    ?
   Total        40552.405        41
a. Predictors: (Constant), X1X2, X2, X1
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   183.356    18.691                                         9.810    .000
   X1           -.348      .219               -.518                       -1.586   .121
   X2           -10.499    2.800              -.769                       -3.750   .001
   X1X2         .021       .028               .325                        .744     .462
a. Dependent Variable: Y
PROBLEM 2
Bivariate correlations:

Correlations
                            Y        X1       X2       X3       X4       X5
Y    Pearson Correlation    1        .384**   .235     .361**   .571**   .550**
     Sig. (2-tailed)                 .001     .052     .002     .000     .000
     N                      69       69       69       69       69       69
X1   Pearson Correlation    .384**   1        .246*    .514**   .490**   .461**
     Sig. (2-tailed)        .001              .041     .000     .000     .000
     N                      69       69       69       69       69       69
X2   Pearson Correlation    .235     .246*    1        .342**   .553**   .429**
     Sig. (2-tailed)        .052     .041              .004     .000     .000
     N                      69       69       69       69       69       69
X3   Pearson Correlation    .361**   .514**   .342**   1        .682**   .656**
     Sig. (2-tailed)        .002     .000     .004              .000     .000
     N                      69       69       69       69       69       69
X4   Pearson Correlation    .571**   .490**   .553**   .682**   1        .723**
     Sig. (2-tailed)        .000     .000     .000     .000              .000
     N                      69       69       69       69       69       69
X5   Pearson Correlation    .550**   .461**   .429**   .656**   .723**   1
     Sig. (2-tailed)        .000     .000     .000     .000     .000
     N                      69       69       69       69       69       69
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Model with X4:

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .?     .?         .315                13.77668
a. Predictors: (Constant), X4

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   6137.440         1    6137.440      32.337   .000 (a)
   Residual     12716.386        67   189.797
   Total        18853.826        68
a. Predictors: (Constant), X4
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   36.600     6.465                                          5.662    .000
   X4           1.527      .268               .571                        5.687    .000
a. Dependent Variable: Y
Model with X4 and X5:

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .604 (a)   .365       .346                13.46497
a. Predictors: (Constant), X5, X4
b. Dependent Variable: Y

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   6887.667         2    3443.833      18.995   .000 (a)
   Residual     11966.159        66   181.305
   Total        18853.826        68
a. Predictors: (Constant), X5, X4
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   33.815     6.465                                          5.230    .000
   X4           .967       .380               .362                        2.546    .013
   X5           .797       .392               .289                        2.034    .046
a. Dependent Variable: Y
Model with X1, X2, X3, X4 and X5:

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .635 (a)   .403       .355                13.36866
a. Predictors: (Constant), X5, X2, X1, X3, X4

ANOVA (b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   7594.406         5    1518.881      8.499    .000 (a)
   Residual     11259.420        63   178.721
   Total        18853.826        68
a. Predictors: (Constant), X5, X2, X1, X3, X4
b. Dependent Variable: Y

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   34.324     6.791                                          5.054    .000
   X1           .594       .517               .134                        1.149    .255
   X2           -.515      .446               -.136                       -1.155   .253
   X3           -.635      .442               -.209                       -1.439   .155
   X4           1.278      .451               .478                        2.832    .006
   X5           .933       .414               .338                        2.254    .028
a. Dependent Variable: Y
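Two nested regression models can be compared using only their residual sums of squares. The following is a minimal Python sketch of the general partial F statistic and the squared partial correlation, not part of the original exam and not an official solution. The function name `nested_comparison` is hypothetical, and the residual sums of squares are read from the printouts under the assumption that regression and residual sums add to the total 18853.826.

```python
def nested_comparison(sse_reduced, sse_full, df_added, df_full):
    """Compare a reduced model with a full model that adds df_added terms.

    Returns (partial F, squared partial correlation of the added terms).
    """
    extra_ss = sse_reduced - sse_full              # extra sum of squares
    mse_full = sse_full / df_full                  # residual mean square, full model
    partial_f = (extra_ss / df_added) / mse_full
    r2_partial = extra_ss / sse_reduced            # share of remaining variation explained
    return partial_f, r2_partial


# Adding X5 to the model with X4 (assumed SSE values 12716.386 and 11966.159).
f_x5, r2_x5 = nested_comparison(12716.386, 11966.159, df_added=1, df_full=66)

# Adding X1, X2, X3 to the model with X4 and X5 (assumed SSE 11259.420 for the full model).
f_rest, r2_rest = nested_comparison(11966.159, 11259.420, df_added=3, df_full=63)

print(f"X5 given X4:           F = {f_x5:.3f}, r2 = {r2_x5:.4f}")
print(f"X1,X2,X3 given X4,X5:  F = {f_rest:.3f}, r2 = {r2_rest:.4f}")
```

For a single added variable the partial F equals the square of that variable's t statistic in the larger model, which gives a quick consistency check against the coefficient table.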
Forward selection:

Variables Entered/Removed (a)
Model   Variables Entered   Variables Removed   Method
1       X4                  .                   Forward (Criterion: Probability-of-F-to-enter <= ,050)
2       X5                  .                   Forward (Criterion: Probability-of-F-to-enter <= ,050)
a. Dependent Variable: Y

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       ?          ?          .315                13.77668
2       .604 (b)   .365       .346                13.46497
a. Predictors: (Constant), X4
b. Predictors: (Constant), X4, X5

ANOVA (c)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   6137.440         1    6137.440      32.337   .000 (a)
   Residual     12716.386        67   189.797
   Total        18853.826        68
2  Regression   6887.667         2    3443.833      18.995   .000 (b)
   Residual     11966.159        66   181.305
   Total        18853.826        68
a. Predictors: (Constant), X4
b. Predictors: (Constant), X4, X5
c. Dependent Variable: Y
Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B          Std. Error         Beta                        t        Sig.
1  (Constant)   36.600     6.465                                          5.662    .000
   X4           1.527      .268               .571                        5.687    .000
2  (Constant)   33.815     6.465                                          5.230    .000
   X4           .967       .380               .362                        2.546    .013
   X5           .797       .392               .289                        2.034    .046
a. Dependent Variable: Y

Excluded Variables (c)
                                                             Collinearity Statistics
Model    Beta In     t        Sig.   Partial Correlation    Tolerance
1  X1    .137 (a)    1.192    .237   .145                   .760
   X2    -.116 (a)   -.961    .340   -.117                  .694
   X3    -.052 (a)   -.378    .706   -.047                  .534
   X5    .289 (a)    2.034    .046   .243                   .477
2  X1    .099 (b)    .869     .388   .107                   .736
   X2    -.128 (b)   -1.087   .281   -.134                  .692
   X3    -.156 (b)   -1.103   .274   -.136                  .479
a. Predictors in the Model: (Constant), X4
b. Predictors in the Model: (Constant), X4, X5
c. Dependent Variable: Y
Plots and diagnostics: [plots not reproduced]
Residuals Statistics (a)
                                    Minimum    Maximum    Mean      Std. Deviation   N
Predicted Value                     50.8360    90.2902    72.1304   10.06425         69
Std. Predicted Value                -2.116     1.804      .000      1.000            69
Standard Error of Predicted Value   1.637      3.858      2.724     .684             69
Adjusted Predicted Value            50.2970    90.8877    72.1428   10.0755          69
Residual                            -27.7005   31.07037   .00000    13.26548         69
Std. Residual                       -2.057     2.307      .000      .985             69
Stud. Residual                      -2.092     2.374      .000      1.006            69
Deleted Residual                    -28.644    32.8998    -.01233   13.82979         69
Stud. Deleted Residual              -2.148     2.464      .001      1.018            69
Mahal. Distance                     .019       4.598      1.971     1.445            69
Cook's Distance                     .000       .111       .014      .020             69
Centered Leverage Value             .000       .068       .029      .021             69
a. Dependent Variable: Y
PROBLEM 3

Between-Subjects Factors
                    N
Fertilizer   1.00   12
             2.00   12
             3.00   12
             4.00   12
Type         1.00   16
             2.00   16
             3.00   16

Tests of Between-Subjects Effects
Dependent Variable: Yield
Source              Type III Sum of Squares   df   Mean Square   F    Sig.
Corrected Model     5709.436 (a)              ?    ?             ?
Intercept           210237.977                ?    ?             ?
Fertilizer          837.402                   ?    ?             ?    .?
Type                3545.551                  ?    ?             ?    ?
Fertilizer * Type   1326.482                  6    ?             ?    .?
Error               5194.558                  36   ?
Total               221141.970                48
Corrected Total     10903.993                 47
a. R Squared = ,524 (Adjusted R Squared = ,378)
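The sums of squares in a balanced two-way ANOVA can be recomputed directly from the raw data, which is a useful check on a printout. The following is a minimal Python sketch, not part of the original exam. The names `YIELDS` and `two_way_sums_of_squares` are hypothetical, and the yields are entered as read from the Problem 3 data table, assuming the digits that make the printed cell means come out correctly.

```python
# YIELDS[i][j] holds the 4 plots for fertilizer i+1 and wheat type j+1
# (values as read from the data table; an assumption of this sketch).
YIELDS = [
    [[72.8, 49.5, 50.4, 73.2], [53.9, 49.7, 69.3, 40.6], [57.6, 71.1, 55.1, 77.0]],
    [[44.3, 55.1, 44.6, 77.1], [71.1, 73.2, 82.9, 73.1], [84.8, 86.2, 83.9, 74.8]],
    [[45.6, 63.0, 46.2, 42.7], [64.0, 54.1, 64.1, 72.5], [93.3, 77.8, 69.0, 86.0]],
    [[64.7, 41.4, 76.8, 45.9], [54.3, 79.6, 61.5, 89.5], [98.7, 63.9, 79.0, 71.8]],
]


def two_way_sums_of_squares(data):
    """Sums of squares for a balanced two-way layout with replication."""
    a, b, n = len(data), len(data[0]), len(data[0][0])
    values = [y for row in data for cell in row for y in cell]
    grand = sum(values) / len(values)
    cell = [[sum(c) / n for c in row] for row in data]           # cell means
    row_mean = [sum(r) / b for r in cell]                        # fertilizer means
    col_mean = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]  # type means
    ss_rows = b * n * sum((m - grand) ** 2 for m in row_mean)
    ss_cols = a * n * sum((m - grand) ** 2 for m in col_mean)
    ss_cells = n * sum((m - grand) ** 2 for r in cell for m in r)
    ss_total = sum((y - grand) ** 2 for y in values)
    return {
        "fertilizer": ss_rows,
        "type": ss_cols,
        "interaction": ss_cells - ss_rows - ss_cols,
        "error": ss_total - ss_cells,
        "total": ss_total,
    }


ss = two_way_sums_of_squares(YIELDS)
for source, value in ss.items():
    print(f"{source:12s} {value:12.3f}")
```

Dividing each sum of squares by its degrees of freedom gives the mean squares, from which the F ratios in the printout follow.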
NORSK OVERSETTING:

Oppgave 1

Dataene i denne oppgava er fra bilannonser på www.finn.no en dag i november 2013. Det var den dagen totalt 42 biler av modellen Opel Meriva 1.6l til salgs. Følgende variabler blei notert for hver bil:

$Y$: Pris i 1000 NOK.
$X_1$: Kilometerstand i 1000 km.
$X_2$: Alder i heile år.

Datasettet er gitt i SPSS-utskriftene. I denne oppgava er relasjonen mellom pris og kilometerstand av spesiell interesse, men vi vil også undersøke om alder påvirker denne relasjonen. I SPSS-utskriftene er tre ulike modeller tilpassa: 1) med $X_1$ og $X_2$, 2) med $X_1$ og $X_2$ og samspill og 3) bare med $X_1$.

a) Formuler en multippel regresjonsmodell for $Y$ som funksjon av $X_1$ og $X_2$. Oppgi den tilpassa (estimerte) modellen fra utskriftene, og forklar hva de estimerte parametrene forteller oss. Gjør en overall test for relasjonen mellom $Y$ og gruppa av $X_1$ og $X_2$.

Vi ønsker nå å undersøke om det er samspill mellom $X_1$ og $X_2$. Modell med samspillsledd:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + E$

b) Forklar hva det betyr at $X_1$ (kilometerstand) samspiller med $X_2$ (alder). Utfør en test for å sjekke om samspillet er signifikant.

Til slutt vil vi undersøke om $X_2$ kan være en konfunder for $X_1$.

c) Forklar hva det betyr at $X_2$ er en konfunderende variabel for relasjonen mellom $Y$ og $X_1$. Basert på kjennskapet ditt til bruktbilpriser, vil du tru dette kan stemme? Bruk SPSS-utskriftene til å argumentere for om $X_2$ er en konfunder eller ikke. Oppgi ditt beste estimat for stigningstallet for relasjonen mellom $Y$ og $X_1$.

Oppgave 2

I denne oppgava vil vi bruke data fra et vokabulareksperiment med barnehagebarn. $n = 69$ barn har blitt gitt the Peabody Picture Vocabulary Test (PPVT), og dette er den avhengige variabelen i eksperimentet. Målet er å undersøke om PPVT kan modelleres ved hjelp av enklere tester, kalt par-assosierte tester. (Detaljene er uviktige, men disse testene presenterer en rekke par, og måler hvor mange av disse barna får rett når den ene halvdelen mangler.)
De 5 testene i eksperimentet blei kalt Named (N), Still (S), Named still (NS), Named action (NA) og Sentence still (SS).

Variabler notert for hvert barn (data i SPSS-utskriftene):

$Y$: PPVT.
$X_1$: N-test.
$X_2$: S-test.
$X_3$: NS-test.
$X_4$: NA-test.
$X_5$: SS-test.

Målet er å bruke variablene $X_1$, $X_2$, $X_3$, $X_4$ og $X_5$ til å modellere $Y$ (PPVT). Vi begynner med å tilpasse en lineær regresjonsmodell for $Y$ som funksjon av $X_4$.
a) Basert på informasjonen i SPSS-utskriftene, vil du si at $X_4$ er den mest naturlige variabelen å bruke i en modell med bare en X-variabel? Forklar. Formuler modellen og oppgi den tilpassa (estimerte) modellen. Hvor stor andel av variasjonen i $Y$ blir forklart av modellen?

I neste delpunkt kan du bruke at $\bar{X}_4 = 23.275$ og $S_{X_4} = 6.223$.

b) Finn tilpassa verdi for $Y$ for barn med $X_4 = 30$. Finn også et 90% konfidensintervall for populasjonssnittet (linja) når $X_4 = 30$. Forklar hva dette intervallet sier deg.

Vi ønsker nå å undersøke om modellen kan forbedres ved å legge til andre X-variabler (utover $X_4$). I utskriftene er følgende to utvida modeller tilpassa:

$Y = \beta_0 + \beta_4 X_4 + \beta_5 X_5 + E$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + E$

c) Finn $r^2_{Y,X_5 \mid X_4}$ og $r^2_{Y,(X_1,X_2,X_3) \mid X_4,X_5}$ og forklar hva de måler. Utfør så de to hypotesetestene der nullhypotesene er at korrelasjonene er 0. Hvilken modell ser ut til å være å foretrekke?

d) I SPSS er det utført en forlengsutvelgingsprosedyre. Forklar kort stega i denne prosedyren i dette tilfellet. Oppgi den tilpassa endelige modellen. Finn $R^2$, $C_p$ og MSE for maksimalmodellen og for modellen fra forlengsutvelgingsprosedyra. Basert på disse og tidligere resultater, hvilken vil du velge?

Forlengsutvelginga ga oss et modellforslag. Vi vil sjekke om modellforutsetningene er OK for denne modellen, og om det kan være andre mulige problem.

e) Bruk plott og diagnostika i SPSS-utskriftene til å sjekke modellforutsetningene. Forklar hvordan du sjekker de enkelte forutsetningene. Er det utliggere i datasettet som kan volde problem?

Oppgave 3

I denne oppgava vil vi bruke data fra et jordbrukseksperiment i Kansas. Vi har resultat fra 48 forsøk. Responsen, $Y$, er mengde hvete (avling) som blei høsta fra et mål land. Målet med forsøket er å undersøke effekten av 4 ulike typer gjødsel.
I tillegg vil vi også undersøke om hvetetype spiller noen rolle, derfor er tre ulike typer hvete valgt blant de hvetetypene som er vanlige å bruke i Kansas.

Hveteavlinga er $Y_{ijk}$ hvor $i = 1, 2, 3, 4$ (gjødsel), $j = 1, 2, 3$ (hvetetype) og $k = 1, \ldots, 4$ (forsøk per kombinasjon). I tabellen er det oppgitt totalt gjennomsnitt og gjennomsnitt for ulike gjødsler og hvetetyper. Vi vil bruke en tovegs ANOVA-modell til å analysere dataene.

a) Vil du si at de to faktorene gjødsel og hvetetype er fikserte eller stokastiske (random)? Forklar. Formuler en modell for dette eksperimentet, og forklar hva elementene i modellen er.

b) Formuler og utfør de testene som er interessante for dette forsøket, og trekk konklusjonene dine. Finn omtrentlige p-verdier for testene. Forklar generelt hva en p-verdi forteller deg.