HVORFOR? STATA vs SPSS. Hvordan lære STATA. Introduksjon til STATA. Introduksjon til STATA. Pål Romundstad. STATA fordeler: STATA:



Like dokumenter
HVORFOR? STATA vs SPSS. Hvordan lære STATA. Hvordan lære STATA. Introduksjon til STATA. Introduksjon til STATA. Pål Romundstad. STATA noen fordeler:

Introduksjon til STATA. Pål Romundstad. Introduksjon til STATA HVORFOR?

Unit Relational Algebra 1 1. Relational Algebra 1. Unit 3.3

Neural Network. Sensors Sorter

Slope-Intercept Formula

Databases 1. Extended Relational Algebra

Exercise 1: Phase Splitter DC Operation

Datahandling and presentation. Themes. Respekt og redelighet Masterseminar, Frode Volden

SAS FANS NYTT & NYTTIG FRA VERKTØYKASSA TIL SAS 4. MARS 2014, MIKKEL SØRHEIM

Minimumskrav bør være å etablere at samtykke ikke bare må være gitt frivillig, men også informert.

Start MATLAB. Start NUnet Applications Statistical and Computational packages MATLAB Release 13 MATLAB 6.5

5 E Lesson: Solving Monohybrid Punnett Squares with Coding

Medisinsk statistikk, KLH3004 Dmf, NTNU Styrke- og utvalgsberegning

Introduksjon til SPSS. Johan Håkon Bjørngaard Institutt for samfunnsmedisin, NTNU

UNIVERSITETET I OSLO ØKONOMISK INSTITUTT

Mål: SPSS. Litteratur. Noen statistikk-programpakker. Dokumentasjon fra SPSS Inc. Introduksjon til IBM SPSS Statistics 20

SVM and Complementary Slackness

Moving Objects. We need to move our objects in 3D space.

Generalization of age-structured models in theory and practice

Hvordan føre reiseregninger i Unit4 Business World Forfatter:

1 User guide for the uioletter package

Regresjonsmodeller. HEL 8020 Analyse av registerdata i forskning. Tom Wilsgaard

Information search for the research protocol in IIC/IID

Kategoriske data, del I: Kategoriske data - del 2 (Rosner, ) Kategoriske data, del II: 2x2 tabell, parede data (Mc Nemar s test)

Accuracy of Alternative Baseline Methods

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

Eksamensoppgave i SOS3003 Anvendt statistisk dataanalyse i samfunnsvitenskap

Andrew Gendreau, Olga Rosenbaum, Anthony Taylor, Kenneth Wong, Karl Dusen

HONSEL process monitoring

UNIVERSITETET I OSLO ØKONOMISK INSTITUTT

Oppgave 1a Definer følgende begreper: Nøkkel, supernøkkel og funksjonell avhengighet.

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

Dynamic Programming Longest Common Subsequence. Class 27

KROPPEN LEDER STRØM. Sett en finger på hvert av kontaktpunktene på modellen. Da får du et lydsignal.

Logistisk regresjon 1

PSi Apollo. Technical Presentation

Passasjerer med psykiske lidelser Hvem kan fly? Grunnprinsipper ved behandling av flyfobi

Du må håndtere disse hendelsene ved å implementere funksjonene init(), changeh(), changev() og escape(), som beskrevet nedenfor.

Model Description. Portfolio Performance

GeWare: A data warehouse for gene expression analysis

Estimating Peer Similarity using. Yuval Shavitt, Ela Weinsberg, Udi Weinsberg Tel-Aviv University

DATAUTFORSKNING I EG, EG 7.1 OG EGENDEFINERTE FUNKSJONER SAS FANS I STAVANGER 4. MARS 2014, MARIT FISKAAEN

EXAMINATION PAPER. Exam in: STA-3300 Applied statistics 2 Date: Wednesday, November 25th 2015 Time: Kl 09:00 13:00 Place: Teorifagb.

Perpetuum (im)mobile

Satellite Stereo Imagery. Synthetic Aperture Radar. Johnson et al., Geosphere (2014)

of color printers at university); helps in learning GIS.

Little Mountain Housing

Call function of two parameters

Eksamensoppgave i SOS3003 Anvendt statistisk dataanalyse i samfunnsvitenskap Examination paper for SOS3003 Applied Social Statistics

Checking Assumptions

Software applications developed for the maritime service at the Danish Meteorological Institute

Kom i gang med Stata for Windows på UiO - hurtigstart for begynnere

UNIVERSITETET I OSLO

AvtaleGiro beskrivelse av feilmeldinger for oppdrag og transaksjoner kvitteringsliste L00202 levert i CSV fil

Quantitative Spectroscopy Chapter 3 Software

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 12. februar 2019

Sensorveiledning SPED1200 vår 2018

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

IN2010: Algoritmer og Datastrukturer Series 2

Elektronisk innlevering/electronic solution for submission:

UNIVERSITETET I OSLO

SAS-feil kavalkade. Viggo Skar Oslo Universitetssykehus HF (OUS)

2A September 23, 2005 SPECIAL SECTION TO IN BUSINESS LAS VEGAS

TFY4170 Fysikk 2 Justin Wells

Analyse av kontinuerlige data. Intro til hypotesetesting. 21. april Seksjon for medisinsk statistikk, UIO. Tron Anders Moger

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 17. november 2017

Eksamen PSY1010 / PSYC1100 Forskningsmetode I

Utvikling av skills for å møte fremtidens behov. Janicke Rasmussen, PhD Dean Master Tel

TMA4329 Intro til vitensk. beregn. V2017

Speed Racer Theme. Theme Music: Cartoon: Charles Schultz / Jef Mallett Peanuts / Frazz. September 9, 2011 Physics 131 Prof. E. F.

XML enabled database. support for XML in Microsoft SQL Server 2000 & Martin Malý

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 26. april 2018

Følgende endringer er gjort i denne versjonen:

Endelig ikke-røyker for Kvinner! (Norwegian Edition)

Trigonometric Substitution

EXAM TTM4128 SERVICE AND RESOURCE MANAGEMENT EKSAM I TTM4128 TJENESTE- OG RESSURSADMINISTRASJON

MID-TERM EXAM TDT4258 MICROCONTROLLER SYSTEM DESIGN. Wednesday 3 th Mars Time:

Checking Assumptions

Logistisk regresjon 2

Den europeiske byggenæringen blir digital. hva skjer i Europa? Steen Sunesen Oslo,

BYFE/EMFE 1000, 2012/2013. Numerikkoppgaver uke 33

DATAØVING 1 INTRODUKSJON TIL STATA I

Physical origin of the Gouy phase shift by Simin Feng, Herbert G. Winful Opt. Lett. 26, (2001)

TUSEN TAKK! BUTIKKEN MIN! ...alt jeg ber om er.. Maren Finn dette og mer i. ... finn meg på nett! Grafiske lisenser.

Examination paper for (BI 2015) (Molekylærbiologi, laboratoriekurs)

TriCOM XL / L. Energy. Endurance. Performance.

Dean Zollman, Kansas State University Mojgan Matloob-Haghanikar, Winona State University Sytil Murphy, Shepherd University

Surgical Outcome of Drug-Resistant Epilepsy in Prasat Neurological Institute

2018 ANNUAL SPONSORSHIP OPPORTUNITIES

Lineære modeller i praksis

Verifiable Secret-Sharing Schemes

STILLAS - STANDARD FORSLAG FRA SEF TIL NY STILLAS - STANDARD

TUSEN TAKK! BUTIKKEN MIN! ...alt jeg ber om er.. Maren Finn dette og mer i. ... finn meg på nett! Grafiske lisenser.

Issues and challenges in compilation of activity accounts

PSY 1002 Statistikk og metode. Frode Svartdal April 2016

Kursus 02402/02323 Introduktion til statistik. Forelæsning 13: Et overblik over kursets indhold. Peder Bacher

TDT4117 Information Retrieval - Autumn 2014

Fakultet for informasjonsteknologi, Institutt for matematiske fag EKSAMEN I EMNE ST2202 ANVENDT STATISTIKK

Trust in the Personal Data Economy. Nina Chung Mathiesen Digital Consulting

Eksamensoppgave i GEOG Geografi i praksis - Tall, kart og bilder

Transkript:

Introduksjon til STATA Introduksjon til STATA Pål Romundstad HVORFOR? SPSS Fordeler: STATA vs SPSS Generelt lettere å bruke Enkelt å produsere tabeller og grafer Det aller meste dekkes av SPSS, men for epidemiologiske analyser og medisinsk statistikk er det mangler STATA fordeler: Mange flere prosedyrer enn SPSS Bedre dokumentasjon Rask Oppdateres over nettet Nedlasting av brukergenererte program (søk online via stata) Større påvirkningsmuligheter i analysene Mata- matrise orientert programmering Lav pris Styrkeberegninger Kalkulator Immediate commands - analyser på aggregert nivå Diagnostisk testing Standardiserte rater Overlevelse/Cox bedre og flere prosedyrer, parametriske modeller for overlevelse Maximum likelihood estimatorer Telle prosedyrer (poisson, negative binomial ++ Clusteranalyser Korrelerte utkom (eks: flernivå, repeterte målinger) Robust SE Huber-White correction for heteroskedascity Random effects models/gee STATA: Missing data analyser med multippel imputering (ICE) Vekting (pweights vs. aweights and iweights) Likelihood ratio Survey analyser mulig å estimere basert på komplekse survey design Omfattende ANOVA/GLM rutiner Gode rutiner for longitudinale panel data Brukergenererte prosedyrer på nettet Hvordan lære STATA Diverse kurs (Norge og utland) På egenhånd online: Karolinska: http://www.cpc.unc.edu/services/computer/presentations /statatutorial UCLA: http://www.ats.ucla.edu/stat/stata Statas hjemmeside med link til andre: http://www.stata.com/links/resources1.html ++++ Bøker: se Statas hjemmeside 1

Hvordan lære STATA Mange eksempler i dag er hentet fra kurs i Stata gitt av Nicola Orsini Karolinska Institutet, Stocholm: http://nicolaorsini.altervista.org/sc/ntnu/is.htm Hensikten med denne introduksjonen Bli kjent med hvordan Stata fungerer Bli i stand til å generere enkel beskrivende statistikk Lære å lage nye og endre variabler Kunne lage krysstabeller Gjøre enkle statistiske tester og lineær regresjon Vise noen eksempler på grafer Snakker STATA med andre program SPSS kan generere/lese STATA filer Stattransfer er et program som lett konverterer mellom aktuelle programpakker inkludert SAS, SPSS, Stata, Access, S+, R, Excel Ting å være klar over Som SPSS er Stata record orientert Både vindusmeny og kommandoorientert Missing defineres numerisk med. og ses av stata som et høyt tall. Missing string som varname>10 blir true for missing Stata er case sensitiv (skiller mellom store og små bokstaver) Ting å være klar over Tids/datoformat: Stata bruker = og == Ett likhetstegn brukes ved tilordning, to betyr lik equals Du kan bruke forkortelser for de fleste kommandoene EKS: generate = ge, di = display, tab= tabulate.. Ved større filer må minne settes høyere enn default- Eks1: set memory 500000 Eks2: set mem 20m, permanently Ses numerisk i forhold til 01.01.1960-1=31.12.1959 1=02.01.1960 Dato format: %d Eksempel: Display %d 17177 gir datoen 11jan1977 2

Stata vinduer filtyper Command line her legges kommandoer inn Results resultatet av kommandoene (output) Variables oversikt over variabler i minnet Review oversikt over tidligere brukte kommandoer Oversikt over fila: browse eller edit.dta fil: datasettet (tilsvarer.sav i spss).do fil: kommandofil for å gjøre flere kommandoer på rad (tilsvarer syntax-.sps i spss).ado fil: kommandofil i STATA.gph fil: graph fil.smcl fil: Log fil.hlp fil: hjelpefil som forklarer rutiner og kommandoer Hjelp i Stata On-line Hvis du kjenner kommandonavnet help kommandonavn Hvis du ikke kjenner kommandonavnet findit keywords search keywords Var. 1 2 3 4 5 6 7 8 Low Birth Weight Data (lowbwt.dta) Beskrivelse Verdier Identification Code ID Number Low Birth Weight 1 bwt<=2500 g, 0 bwt>2500 g Age of Mother Years Weight of Mother at Last Menstrual Pounds Race 1 = White, 2=Black, 3=Other Smoking Status During Pregnancy 0 = No, 1= Yes History of Premature Labor 0 = None, 1=One, 2=Two, etc. History of Hypertension 0 = No, 1 = Yes Navn id low age lwt race smoke ptl ht FAQ (Frequently Asked Questions) http://ww.stata.com/support/faqs UCLA (Resources to help you learn and use Stata) http://www.ats.ucla.edu/stat/stata/ Manualer og bøker 9 10 11 Presence of Uterine Irritability Physician Visits First Trimester Birth Weight 0 = No, 1 = Yes 0 = None, 1=One 2=Two, etc. Grams ui ftv bwt Source and abstract of the data Hente inn datasett i Stata Source Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts. Descriptive abstract Identify risk factors associated withgiving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Hente datasett (extension.dta) til arbeidsminne use filename [, clear ] clear option erases all data currently in memory and proceeds with loading the new data from the disk or from a web server. /correctly specify the location of the file use "c:\is\lowbwt.dta", clear use http://nicolaorsini.altervista.org/data/lowbwt.dta Alternativ: bruk vindusmeny til å lete frem fil 3

Estimeringskommandoer Generelt: Analysekommandoer Commando varname(s) if in using, options Eks: Tabulate wheese smoker if age>30, col [prefix:] [command] [varlist] [if] [in] [weight[ [,options] Hakeparantes angir valgfri input if begrenser analysen til en undergruppe der if statementet er true options angir spesielle ønsker i output (modify the default) Eks: tabulate wheese smoker if age>30, col regress fødselsvekt maternalage if sex==1 logistic sga maternalage if prematur==0, or Output Se på data Finnes brukergenererte programmer som lager fine artikkelferdige output Kan kopieres direkte til word (bruk tegnsett 8 pkts courier) Grafer kan kombineres og kopieres på flere måter list creates a list of values of specified variables and observations (as alternative to browse) describe provides information on the size of the dataset and the names, labels and types of variables. codebook summarizes a variable in a format designed for printing a codebook (missing value; unique values; descriptive statistics). Run a command in a subset of the data Save a dataset List the variables and observations if age is greater than 30 years list if age > 30 List the variables id, low, and race, if race is equal to 1(White women) list id low race if race == 1 List all the variables for the first 10 observations of the data list in 1/10 To save the dataset available in the current memory type save filename [, replace] the replace option indicates that if filename already exists,stata will overwrite it // Example save lowbwt2.dta, replace 4

Summary statistics summarize calculates and displays a variety of summary statistics. If no varlist is specified, then summary statistics are calculated for all the variables in the dataset summarize [varlist] [if] [in] [, detail ] where detail produces additional statistics, including skewness, kurtosis,various percentiles and the four smallest and largest values.. su bwt, detail Example of summarize Histogram Box-plot A histogram is a graphical method for displaying the shape of a distribution The height of each bar corresponds to its class frequency. The command histogram allows you to control any aspect of the graph (y-axis and x-axis). Eks: histogram bwt, frequency histogram bwt, frequency width(500) start(500) ylabel(0(5)50) xlabel(500(500)5000) normal histogram bwt, percent width(500) start(500) ylabel(0(3)24, angle(h)) xlabel(500(500)5000) normal addlabel A box plot provides a visual summary of many important aspects of a distribution The box stretches from the lower hinge (defined as the 25th percentile) to the upper hinge (the 75th percentile) and therefore contains the middle half of the values The median is shown as a line across the box. graph hbox bwt graph hbox bwt, over(smoke) Table of counts Kakediagram tabulate produces one- and two-way tables of frequency counts along with various measures of association ( help tabulate). // One-way tabulate varname, options // Two-way tabulate varname1 varname2, options graph pie, over(race) /// plabel(_all percent, color(white)) /// betyr at stata skal fortsette å lese på neste linje It has several options: chi2 to calculate the Pearson-Chi Square test col to display column percentages row to display row percentages +++ Eks: tabulate race 5

The Chi-square test tabulate low smoke, col Let s compare the proportion of low birth weight in the population of smokers and non-smokers women. graph pie, over(low) by(smoke) /// plabel(_all percent, color(white)) We test the null hypothesis that the two proportions are identical. graph bar (mean) low, over(smoke) /// blabel(bar, format(%3.2f)) /// ytitle(risk low birth weight) tabulate low smoke, chi2 tabulate low smoke, chi2 expected Without the full data available Without the full data available tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabi 86 44 \ 29 30, chi2 col tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabi 86 44 \ 29 30, chi2 col 5 misklassifisert Without the full data available Tables of summary statistics tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabstat displays summary statistics for a series of numeric variables in a single tables ( help tabstat ) tabstat varlist [, statistics(statname) by(varname)] where statname are descriptive statistics (mean, min, max,sd, var ). Without the by() option is a useful alternative to summarize because it allows you to specify the list of statistics tabi 86 44 \ 29 30, chi2 col 5 misklassifisert tabi 86 44 \ 24 35, chi2 col tabstat bwt, by(race) stat(mean sd n) 6

Create a new variable-generate The command generate creates a new variable. The syntaxis: generate newvar = exp [if] [in] where exp could be any reasonable combination of variables, numbers, operators (help operators) and functions (help functions) Operators 1. Arithmetic: + (addition); - (subtraction); / (division); * (multiplication); ^ (raise to a power) and prefix! (negation). Any arithmetic operation on missing values or an impossible arithmetic operation yields a missing value. 2. String: + (concatenation of two strings), i.e. Name + Surname. 3. Relational: > (greater than); < (less than); >= (greater than or equal); <= (less than or equal); == (equal);!= (not equal) 4. Logical: & (and); (or);! (not) Some examples of generate Replacing values * generate birth weight in kilograms gen bwtkg = bwt/1000 * generate the log of birth weight gen logbwt = log(bwt) * generate body mass index gen bmi = weight_kg/height_mt^2 The command replace changes values of a variable that already exists replace varname = exp [if] [in] * Identify white women with 0 and non-white women with 1 replace white = 0 if race == 1 replace white = 1 if race == 2 race == 3 How to categorize a variable Select variables and observations The most general approach is to copy or clone the variable you want to categorize gen white = race and run one or more replacements to get what you want replace white = 0 if race == 1 replace white = 1 if race == 2 race == 3 tabulate race2 drop eliminates variables that are explicitly listed keep keeps variables that are explicitly listed (opposite of drop) drop varlist drop if exp keep varlist keep if exp 7

Example Sort observations * eliminate the variables ptl and ht drop ptl ht * keep 3 variables in memory: id, low, and age keep id low age * keep only women who smoke keep if smoke == 1 sort varlist arranges the observations of the current data into ascending order based on the values of the variables in varlist gsort [+ -] varname arranges observations to be in ascending [+] or descending [-] order of the specified variables // Examples sort age gsort age Prefix by Stata som kalkulator by varlist: stata_kommando bysort varlist: stata_kommando repeats the command for each group of observations for which the values of the variables in varlist are the same. The prefix bysort gets the sort and by commands in one line display exp displays strings and values of scalar expressions. // Example display sqrt(4)+ 2*log(1) display exp(0.5) display "The square root of 4 is " sqrt(4) * for each level of race tabulate low and smoke bysort race : table low smoke Confidence interval for a proportion Confidence interval of the mean Calculation of an exact 95% CI for the proportion of women who smoke ci smoke, binomial Immediate form: cii cii 189 74 cii 189 74, agresti Calculation a 95% CI of the mean birth weight ci bwt Let s calculate a 95% CI of the mean birth weight among smoking and non-smoking women bysort smoke: ci bwt 8

Immediate form of ci Comparisons of two independent samples Let s see how to calculate a 95% CI of the mean birth weight if you know only the number of observations, mean, and standard deviation of the sample. cii 189 2945 729 Let s see how to test the null hypothesis that the mean birth weights in the population of smokers and non-smokers women are identical The underlying populations are independent (unpaired) and normally distributed. Is it likely that the observed difference in sample means 3054 vs 2773 is the result of chance variation?, Or should we conclude that the discrepancy is due to true difference in populations means? Two sample t-test Regression model for continuous outcome ttest bwt, by(smoke) ttest bwt, by(smoke) unequal To investigate the relationship between a continuous outcome variable and a continuous covariate or predictor we can combine the use of: two-way graphs simple correlation and linear regression For instance, investigate the association between birth weight (bwt) and mother s weight (lwt) Twoway graph twoway plot [if] [in] [, twoway_options] where plot is defined (plottype varlist, options) and where plottype (help twoway) can be, among many others plottype description ---------------------------------------------------------- scatter scatterplot line line plot area line plot with shading lowess LOWESS line plot lfit linear prediction plot qfit quadratic prediction plot bar bar plot Scatter plot scatter is the most powerful graph command (help scatter) Can be modified through options and the dialog box can help specifying options (db scatter). The most common options are: msymbol() to specify the shape of marker mcolor() to specify the color of marker ylabel() to specify labels on y-axis xlabel() to specify labels on x-axis ytitle() to specify a title on y-axis xtitle() to specify a title on x-axis scheme() to specify the overall look of the graph legend() to specify the legend of the graph 9

Overlay two plots twoway scatter bwt lwt Addsomeoptions twoway scatter bwt lwt, /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(horizontal)) /// xlabel(50(25)250) /// msymbol(o) /// mcolor(red) twoway /// (scatter bwt lwt) /// (lfit bwt lwt), /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(h)) /// xlabel(50(25)250) /// legend(label(1 "Scatter") label(2 "Trend")) /// text(4500 210 "r=0.2 p=0.01") /// scheme(s1mono) Overlay two plots by groups Simple linear regression twoway /// (scatter bwt lwt) /// (lfit bwt lwt), /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(h)) /// xlabel(50(25)250) /// legend(label(1 "Scatter") label(2 "Trend")) /// text(4500 210 "r=0.2 p=0.01") /// scheme(s1mono) /// by(smoke) regress bwt lwt The first variable is the outcome/response followed by the list of covariates or predictors Regression with categorical covariates To calculate the population mean (95% CI) birth weight among black women xi:regress bwt i.race lincom _cons + _Irace_2 Prefix xi generates dummy variables for the categories Can be used for a variety of commands and types of regression 10

Regression models for binary response Odds ratios, risk ratios, risk differences, and rate ratios can be estimated with generalized linear models (help glm). Regression models are much more flexible than tables for epidemiologist and allow the construction of multivariable models. binreg fits generalized linear models for the binomial family. It estimates odds ratios, risk ratios, and risk differences binreg low smoke, rr xi:binreg low smoke i.race, rr Several options for generation of odds ratios binreg low smoke, or logistic low smoke logit low smoke, or glm low smoke, family(binomial) eform 11