NTNU, TRONDHEIM
Norwegian University of Science and Technology (Norges teknisk-naturvitenskapelige universitet)
Department of Sociology and Political Science (Institutt for sosiologi og statsvitenskap)

EXAMINATION IN SVSOS 36 REGRESSION ANALYSIS

Academic contact during the exam: Kristen Ringdal
Tel.: 73 59 7 0
Exam date: 6 December
Exam duration: 6 hours
Credits (vekttall): 5
Language versions: Bokmål, Nynorsk, English
Number of pages, attachments + cover page: 9
Number of pages, Bokmål: 1
Number of pages, Nynorsk: 1
Number of pages, English: 1
Number of pages in total: 12
Grading date: 0 January 2003
Grading telephone: 85 4804

Permitted aids:
Calculator (hand calculator)
Norwegian-English / English-Norwegian dictionary
Hamilton, Lawrence C. 1992. «Regression with Graphics». Belmont: Duxbury.
Hardy, Melissa A. 1993. «Regression with Dummy Variables». QASS 93. London: Sage.
Breen, Richard 1996. «Regression Models: Censored, Sample Selected, or Truncated Data». QASS 111. London: Sage.
Printed lecture notes from Ringdal's lectures.
ENGLISH

QUESTION 1 Concepts (counts 20%)
a) Explain the following concepts: censored, sample selected, and truncated samples.
b) Explain the discrimination problem in logistic regression.
c) What is the main advantage of robust regression over ordinary (OLS) regression?

QUESTION 2 Regression analysis (counts 40%)
a) Which of the first two models is the better one? What is the purpose of comparing the first two models?
b) Which of the last two models is the better one? Specify the population model for the better one (do not state the assumptions).
c) Make a conditional effect plot from Model 3 to describe the effect of education.
d) Find, based on Model 4, predicted Y-values for each region. Show the result in a plot.
e) Evaluate the following hypotheses:
H1: The effect of education is identical for men and women.
H2: There are no regional differences in attitudes to income inequality.
H3: Men are less equality oriented than women.

QUESTION 3 Logistic regression (counts 40%)
a) Find the best model and write the equation for the population (include the assumptions).
b) Describe the relationship between age and receiving disability pension.
c) Find the odds ratio of receiving disability pension between a person with 7 years and a person with 20 years of education.
d) Find the odds ratio of receiving disability pension between persons who reported having experienced conflicts in their childhood home and those who did not report such experience, for men and women separately. What do the two odds ratios tell us?
e) Find the predicted probability of receiving disability pension for a 50-year-old woman with 9 years of education, living in a large town, who has experienced conflicts in her childhood home.
Documentation and tables for question 2 Regression analysis

In the survey ISSP92, we find four questions on attitudes to income inequality.

Please show how much you agree or disagree with each statement... (Please tick one box on each line).

1. Differences in income in Norway are too large.
2. It is the responsibility of the government to reduce the differences in income between people with high incomes and those with low incomes.
3. The government should provide a job for everyone who wants one.
4. The government should provide everyone with a guaranteed basic income.

The answer categories were the following:

1. Strongly agree
2. Agree
3. Neither agree nor disagree
4. Disagree
5. Strongly disagree
8. Can't choose, don't know (set to missing)
9. NA (set to missing)

On the basis of these four questions, the mean score for each respondent was used as a scale to indicate attitudes to income inequality. High scores mean that the respondent prefers inequality. In the tables, the scale is named INEQUAL.

A histogram of INEQUAL:

[Histogram of INEQUAL (Inequality scale, p76-79). Std. Dev = .82, Mean = 2.18, N = 1511.00]
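The scale construction described above can be sketched as follows. This is only an illustration: the function name `inequal_score` is hypothetical, and averaging over the valid answers when some items are missing is an assumption, since the documentation does not say how partly answered item sets were handled.

```python
MISSING_CODES = {8, 9}  # "Can't choose, don't know" and "NA" are set to missing


def inequal_score(answers):
    """Mean of the valid answers (1-5) to the four items, or None if none are valid.

    Assumption: respondents with some missing items get the mean of the
    items they did answer.
    """
    valid = [a for a in answers if a not in MISSING_CODES]
    if not valid:
        return None
    return sum(valid) / len(valid)


print(inequal_score([1, 2, 2, 3]))  # 2.0
print(inequal_score([1, 2, 8, 3]))  # 2.0 (the 8 is ignored)
```

A low score therefore marks a respondent who agrees with the equality-oriented statements, and a high score one who prefers inequality.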
REGION Regionar

                                Frequency  Percent  Valid Percent  Cumulative Percent
Valid  1 Sentrale Østland             440     28.6           28.6                28.6
       2 Indre Østland, Telem.        284     18.5           18.5                47.1
       3 Agder, Rogaland              222     14.4           14.4                61.5
       4 Vestlandet                   268     17.4           17.4                78.9
       5 Trøndelag                    146      9.5            9.5                88.4
       6 Nord-Norge                   178     11.6           11.6               100.0
       Total                         1538    100.0          100.0

Region is decomposed into a set of dummies REGION2-REGION6, corresponding to the categories in the table above, with the first region as the reference category.

ED

                   Frequency  Percent  Valid Percent  Cumulative Percent
Valid    9.00            346     22.5           24.1                24.1
         11.00           558     36.3           38.9                63.0
         14.00           390     25.4           27.2                90.2
         18.00           140      9.1            9.8               100.0
         Total          1434     93.2          100.0
Missing  System          104      6.8
Total                   1538    100.0

Education is coded in years as the midpoint of the years in four educational categories. ED11, ED14, and ED18 are dummy variables with the value 1 for the three corresponding categories of education in the table above. ED_MALE is the product ED*MALE. MALE=1 for male respondents, =0 for female respondents. AGE is measured in years.

Descriptive Statistics

                                      N  Minimum  Maximum     Mean  Std. Deviation
INEQUAL Inequality scale, p76-79   1411     1.00     5.00   2.1922          .81826
MALE                               1411      .00     1.00    .5053          .50015
AGE Age in years                   1411    16.00    79.00   39.879         16.3867
ED                                 1411     9.00    18.00  12.0354         2.67802
ED_MALE                            1411      .00    18.00   6.1843         6.42524
REGION2                            1411      .00     1.00    .1843          .38784
REGION3                            1411      .00     1.00    .1446          .35180
REGION4                            1411      .00     1.00    .1736          .37893
REGION5                            1411      .00     1.00    .0964          .29522
REGION6                            1411      .00     1.00    .1141          .31805
Valid N (listwise)                 1411
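The coding of the predictors can be illustrated with a short sketch. The function name `code_respondent` and the key name `ED_MALE` for the product term are assumptions made for the illustration; only the REGION2-REGION6 names and the choice of the first region as reference are taken from the documentation above.

```python
def code_respondent(region, ed, male):
    """Return the region dummies and the education-by-gender product term.

    region: 1-6 as in the REGION table (region 1 is the reference category)
    ed:     years of education
    male:   1 for male respondents, 0 for female respondents
    """
    # One dummy per non-reference region: 1 if the respondent lives there, else 0.
    row = {f"REGION{k}": int(region == k) for k in range(2, 7)}
    # Product term; the key name is an assumption for this sketch.
    row["ED_MALE"] = ed * male
    return row


print(code_respondent(region=5, ed=14, male=1))
# {'REGION2': 0, 'REGION3': 0, 'REGION4': 0, 'REGION5': 1, 'REGION6': 0, 'ED_MALE': 14}
```

A respondent from the reference region gets 0 on all five dummies, so each region coefficient is a contrast with region 1. The product term is 0 for all women, which lets the slope for education differ between men and women.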
Model 1

Model Summary

Model      R   R Square  Adjusted R Square  Std. Error of the Estimate
1       .364a      .133               .131                      .76280

a. Predictors: (Constant), MALE, ED, AGE Age in years

ANOVA(b)

             Sum of Squares    df  Mean Square       F   Sig.
Regression          125.376     3       41.792  71.825  .000a
Residual            818.678  1407         .582
Total               944.054  1410

a. Predictors: (Constant), MALE, ED, AGE Age in years
b. Dependent Variable: INEQUAL Inequality scale, p76-79

Coefficients(a)

                   Unstandardized Coefficients  Standardized Coefficients
                      B     Std. Error                 Beta                      t   Sig.
(Constant)         .731           .118                                       6.171   .000
MALE               .212           .041                 .129                  5.187   .000
AGE Age in years   .003           .001                 .056                  2.188   .029
ED                 .103           .008                 .338                 13.200   .000

a. Dependent Variable: INEQUAL Inequality scale, p76-79
Model 2

Model Summary

Model      R   R Square  Adjusted R Square  Std. Error of the Estimate
2       .377a      .142               .139                      .75930

a. Predictors: (Constant), ED18 17+ years of education, AGE Age in years, MALE, ED14 13-16 years of education, ED11 10-12 years of education

ANOVA(b)

             Sum of Squares    df  Mean Square       F   Sig.
Regression          134.022     5       26.804  46.492  .000a
Residual            810.031  1405         .577
Total               944.054  1410

a. Predictors: (Constant), ED18 17+ years of education, AGE Age in years, MALE, ED14 13-16 years of education, ED11 10-12 years of education
b. Dependent Variable: INEQUAL Inequality scale, p76-79

Coefficients(a)

                                 Unstandardized Coefficients  Standardized Coefficients
                                     B     Std. Error                 Beta                      t   Sig.
(Constant)                       1.523           .085                                      17.900   .000
MALE                              .209           .041                 .127                  5.129   .000
AGE Age in years                  .004           .001                 .084                  3.014   .003
ED11 10-12 years of education     .313           .058                 .187                  5.352   .000
ED14 13-16 years of education     .696           .061                 .379                 11.369   .000
ED18 17+ years of education       .862           .080                 .314                 10.840   .000

a. Dependent Variable: INEQUAL Inequality scale, p76-79
Model 3-4

Model Summary

Model      R   R Square  Adjusted R Square  Std. Error of the Estimate
3       .370a      .137               .134                      .76129
4       .388b      .151               .145                      .75652

a. Predictors: (Constant), ED_MALE, AGE Age in years, MALE, ED
b. Predictors: (Constant), ED_MALE, AGE Age in years, MALE, ED, REGION3, REGION5, REGION6, REGION4, REGION2

ANOVA(c)

Model                Sum of Squares    df  Mean Square       F   Sig.
3      Regression           129.193     4       32.298  55.729  .000a
       Residual             814.861  1406         .580
       Total                944.054  1410
4      Regression           142.221     9       15.802  27.611  .000b
       Residual             801.833  1401         .572
       Total                944.054  1410

a. Predictors: (Constant), ED_MALE, AGE Age in years, MALE, ED
b. Predictors: (Constant), ED_MALE, AGE Age in years, MALE, ED, REGION3, REGION5, REGION6, REGION4, REGION2
c. Dependent Variable: INEQUAL Inequality scale, p76-79

Coefficients(a)

                          Unstandardized Coefficients  Standardized Coefficients
Model                         B     Std. Error                 Beta                      t   Sig.
3      (Constant)          .987           .155                                       6.383   .000
       MALE               -.258           .188                -.158                 -1.376   .169
       AGE Age in years    .003           .001                 .053                  2.080   .038
       ED                  .082           .011                 .269                  7.238   .000
       ED_MALE             .039           .015                 .307                  2.567   .010
4      (Constant)         1.113           .162                                       6.862   .000
       MALE               -.244           .186                -.149                 -1.307   .191
       AGE Age in years    .002           .001                 .043                  1.681   .093
       ED                  .079           .011                 .260                  6.998   .000
       ED_MALE             .039           .015                 .306                  2.571   .010
       REGION2            -.135           .061                -.064                 -2.219   .027
       REGION3             .009           .065                 .004                   .144   .885
       REGION4            -.034           .062                -.016                  -.548   .584
       REGION5            -.208           .075                -.075                 -2.761   .006
       REGION6            -.264           .071                -.103                 -3.711   .000

a. Dependent Variable: INEQUAL Inequality scale, p76-79
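One common way to compare two nested OLS models, such as Models 3 and 4, is an F-test on the reduction in the residual sum of squares. The sketch below is a hedged illustration of that arithmetic, using the sums of squares 814.861 (1406 df) and 801.833 (1401 df) as read from the ANOVA table; the function names are hypothetical and not part of the SPSS output.

```python
def r_squared(ss_regression, ss_total):
    """Share of the total sum of squares accounted for by the model."""
    return ss_regression / ss_total


def f_change(ss_res_small, df_res_small, ss_res_large, df_res_large):
    """F statistic for the predictors added in the larger of two nested models."""
    added = df_res_small - df_res_large          # number of added predictors
    ms_added = (ss_res_small - ss_res_large) / added
    ms_residual = ss_res_large / df_res_large    # residual mean square, larger model
    return ms_added / ms_residual


print(round(r_squared(142.221, 944.054), 3))             # 0.151
print(round(f_change(814.861, 1406, 801.833, 1401), 2))  # 4.55
```

The resulting F has 5 and 1401 degrees of freedom and would be compared with the critical value from an F table at the chosen significance level.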
Documentation and tables for question 3 Logistic Regression

The analysis is based on the 1995 Level of Living Survey carried out by Statistics Norway. The dependent variable is whether the respondent receives disability pension or not. The documentation of the variables to be used in the logistic regression analysis follows below:

DISABLE

                            Frequency  Percent  Valid Percent  Cumulative Percent
Valid    .00 Not disabled        3414     91.8           93.2                93.2
         1.00 Disabled            249      6.7            6.8               100.0
         Total                   3663     98.5          100.0
Missing  System                    57      1.5
Total                            3720    100.0

MALE=1 for males, =0 for female respondents
AGE is age in years
AGE2 is age*age
ED is years of education
MOTSET=1 for respondents who reported conflicts (motsetninger) at home during childhood, MOTSET=0 otherwise.
DENSITY is population density:

DENSITY

                                  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    1.00 Sparsely populated        762     20.5           21.0                21.0
         2.00 -1999 inhabitants         542     14.6           15.0                36.0
         3.00 2000-19999                931     25.0           25.7                61.7
         4.00 20000+                   1387     37.3           38.3               100.0
         Total                         3622     97.4          100.0
Missing  System                          98      2.6
Total                                  3720    100.0
Logistic Regression

Case Processing Summary

Unweighted Cases(a)                            N  Percent
Selected Cases    Included in Analysis      3422     92.0
                  Missing Cases              298      8.0
                  Total                     3720    100.0
Unselected Cases                               0       .0
Total                                       3720    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value      Internal Value
.00 Not disabled                 0
1.00 Disabled                    1

Categorical Variables Codings

                                             Parameter coding
                                  Frequency    (1)    (2)    (3)
DENSITY  1.00 Sparsely populated        734   .000   .000   .000
         2.00 -1999 inhabitants         514   .000   .000   .000
         3.00 2000-19999                885   .000   .000   .000
         4.00 20000+                   1289   .000   .000   .000

Block 0: Beginning Block
(tables not shown for this block)

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                Chi-square  df  Sig.
Step 1  Step       293.613   6  .000
        Block      293.613   6  .000
        Model      293.613   6  .000

Model Summary

Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1              1408.265                  .082                 .210
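The two pseudo R Square values in the Model Summary can be reproduced from the model chi-square and the -2 log likelihood. The sketch below uses the usual textbook definitions of the Cox & Snell and Nagelkerke measures with n = 3422 cases; the function name is hypothetical, and the formulas are not stated in the output itself.

```python
import math


def pseudo_r_squares(model_chi_square, minus2ll_model, n):
    """Return (Cox & Snell, Nagelkerke) pseudo R squares.

    model_chi_square: -2LL of the null model minus -2LL of the fitted model
    minus2ll_model:   -2 log likelihood of the fitted model
    n:                number of cases included in the analysis
    """
    minus2ll_null = minus2ll_model + model_chi_square
    cox_snell = 1.0 - math.exp(-model_chi_square / n)
    # Largest value Cox & Snell can take with this null model.
    max_cox_snell = 1.0 - math.exp(-minus2ll_null / n)
    return cox_snell, cox_snell / max_cox_snell


cs, nk = pseudo_r_squares(293.613, 1408.265, 3422)
print(round(cs, 3), round(nk, 3))  # 0.082 0.21
```

Nagelkerke's measure rescales Cox & Snell so that its maximum is 1, which is why it is always the larger of the two.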
Variables in the Equation

                                B   S.E.     Wald  df  Sig.  Exp(B)
Step 1(a)  MALE              .057   .155     .136   1  .713   1.059
           AGE               .489   .048  101.924   1  .000   1.630
           AGE2             -.004   .000   98.745   1  .000    .996
           ED               -.182   .029   38.457   1  .000    .833
           MOTSET            .769   .255    9.117   1  .003   2.158
           MALE by MOTSET   -.823   .437    3.546   1  .060    .439
           Constant       -13.612  1.378   97.574   1  .000    .000

a. Variable(s) entered on step 1: MALE, AGE, AGE2, ED, MOTSET, MALE * MOTSET.

Block 2: Method = Enter

Omnibus Tests of Model Coefficients

                Chi-square  df  Sig.
Step 1  Step         9.127   3  .028
        Block        9.127   3  .028
        Model      302.740   9  .000

Model Summary

Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1              1399.138                  .085                 .216

Classification Table(a)

                                            Predicted
                                    DISABLE                 Percentage
Observed                            Not disabled  Disabled     Correct
Step 1  DISABLE  Not disabled               3189         0       100.0
                 Disabled                    233         0          .0
        Overall Percentage                                        93.2

a. The cut value is .500

Variables in the Equation

                                B   S.E.     Wald  df  Sig.  Exp(B)
Step 1(a)  MALE              .084   .156     .289   1  .591   1.087
           AGE               .488   .048  102.049   1  .000   1.630
           AGE2             -.004   .000   98.958   1  .000    .996
           ED               -.197   .030   43.625   1  .000    .821
           MOTSET            .746   .256    8.530   1  .003   2.109
           MALE by MOTSET   -.903   .440    4.219   1  .040    .405
           DENSITY                          8.868   3  .031
           DENSITY(1)        .291   .243    1.442   1  .230   1.338
           DENSITY(2)        .339   .219    2.387   1  .122   1.403
           DENSITY(3)        .589   .201    8.579   1  .003   1.802
           Constant       -13.808  1.382   99.789   1  .000    .000

a. Variable(s) entered on step 1: DENSITY.
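Two relationships in this output are easy to check by hand: the Exp(B) column is the exponential of B, and the Block 2 chi-square is the drop in -2 log likelihood from Block 1 to Block 2. The sketch below illustrates both, together with the standard conversion from a logit to a probability; the function names are hypothetical, and the logit passed to `probability` is an arbitrary illustrative value rather than one taken from the tables.

```python
import math


def odds_ratio(b, delta=1.0):
    """Odds ratio implied by a logit coefficient b for a change of delta in the predictor."""
    return math.exp(b * delta)


def probability(logit):
    """Predicted probability for a given value of the linear predictor (logit)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Exp(B) for MOTSET in Block 1: exp(.769)
print(round(odds_ratio(0.769), 3))     # 2.158

# Likelihood-ratio chi-square for adding DENSITY: difference in -2 log likelihood
print(round(1408.265 - 1399.138, 3))   # 9.127

# A logit of 0 corresponds to a probability of one half (illustrative value only)
print(probability(0.0))                # 0.5
```

For a predictor that differs by more than one unit between two persons, the same function applies with `delta` set to that difference.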