HVORFOR? STATA vs SPSS. Hvordan lære STATA. Hvordan lære STATA. Introduksjon til STATA. Introduksjon til STATA. Pål Romundstad. STATA noen fordeler:

Like dokumenter
Introduksjon til STATA. Pål Romundstad. Introduksjon til STATA HVORFOR?

HVORFOR? STATA vs SPSS. Hvordan lære STATA. Introduksjon til STATA. Introduksjon til STATA. Pål Romundstad. STATA fordeler: STATA:

Unit Relational Algebra 1 1. Relational Algebra 1. Unit 3.3

Slope-Intercept Formula

Neural Network. Sensors Sorter

SAS FANS NYTT & NYTTIG FRA VERKTØYKASSA TIL SAS 4. MARS 2014, MIKKEL SØRHEIM

Databases 1. Extended Relational Algebra

Exercise 1: Phase Splitter DC Operation

Datahandling and presentation. Themes. Respekt og redelighet Masterseminar, Frode Volden

Minimumskrav bør være å etablere at samtykke ikke bare må være gitt frivillig, men også informert.

Start MATLAB. Start NUnet Applications Statistical and Computational packages MATLAB Release 13 MATLAB 6.5

5 E Lesson: Solving Monohybrid Punnett Squares with Coding

Introduksjon til SPSS. Johan Håkon Bjørngaard Institutt for samfunnsmedisin, NTNU

Mål: SPSS. Litteratur. Noen statistikk-programpakker. Dokumentasjon fra SPSS Inc. Introduksjon til IBM SPSS Statistics 20

UNIVERSITETET I OSLO ØKONOMISK INSTITUTT

Medisinsk statistikk, KLH3004 Dmf, NTNU Styrke- og utvalgsberegning

Hvordan føre reiseregninger i Unit4 Business World Forfatter:

Moving Objects. We need to move our objects in 3D space.

Generalization of age-structured models in theory and practice

Andrew Gendreau, Olga Rosenbaum, Anthony Taylor, Kenneth Wong, Karl Dusen

Accuracy of Alternative Baseline Methods

1 User guide for the uioletter package

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

SVM and Complementary Slackness

Information search for the research protocol in IIC/IID

Regresjonsmodeller. HEL 8020 Analyse av registerdata i forskning. Tom Wilsgaard

Kategoriske data, del I: Kategoriske data - del 2 (Rosner, ) Kategoriske data, del II: 2x2 tabell, parede data (Mc Nemar s test)

GeWare: A data warehouse for gene expression analysis

Dynamic Programming Longest Common Subsequence. Class 27

HONSEL process monitoring

Perpetuum (im)mobile

Oppgave 1a Definer følgende begreper: Nøkkel, supernøkkel og funksjonell avhengighet.

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 12. februar 2019

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

PSi Apollo. Technical Presentation

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 17. november 2017

Eksamensoppgave i SOS3003 Anvendt statistisk dataanalyse i samfunnsvitenskap

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 26. april 2018

Du må håndtere disse hendelsene ved å implementere funksjonene init(), changeh(), changev() og escape(), som beskrevet nedenfor.

Model Description. Portfolio Performance

UNIVERSITETET I OSLO

Estimating Peer Similarity using. Yuval Shavitt, Ela Weinsberg, Udi Weinsberg Tel-Aviv University

Kom i gang med Stata for Windows på UiO - hurtigstart for begynnere

XML enabled database. support for XML in Microsoft SQL Server 2000 & Martin Malý

Passasjerer med psykiske lidelser Hvem kan fly? Grunnprinsipper ved behandling av flyfobi

SAS-feil kavalkade. Viggo Skar Oslo Universitetssykehus HF (OUS)

Software applications developed for the maritime service at the Danish Meteorological Institute

KROPPEN LEDER STRØM. Sett en finger på hvert av kontaktpunktene på modellen. Da får du et lydsignal.

UNIVERSITETET I OSLO ØKONOMISK INSTITUTT

DATAUTFORSKNING I EG, EG 7.1 OG EGENDEFINERTE FUNKSJONER SAS FANS I STAVANGER 4. MARS 2014, MARIT FISKAAEN

Little Mountain Housing

Logistisk regresjon 1

Introduksjon til SPSS

Begynnerkurs i Stata. UiO vår 2019, Knut Waagan 1 / 95

Quantitative Spectroscopy Chapter 3 Software

Call function of two parameters

UNIVERSITETET I OSLO

Satellite Stereo Imagery. Synthetic Aperture Radar. Johnson et al., Geosphere (2014)

TMA4329 Intro til vitensk. beregn. V2017

Maple Basics. K. Cooper

IN2010: Algoritmer og Datastrukturer Series 2

Sensorveiledning SPED1200 vår 2018

EXAMINATION PAPER. Exam in: STA-3300 Applied statistics 2 Date: Wednesday, November 25th 2015 Time: Kl 09:00 13:00 Place: Teorifagb.

Endelig ikke-røyker for Kvinner! (Norwegian Edition)

of color printers at university); helps in learning GIS.

2A September 23, 2005 SPECIAL SECTION TO IN BUSINESS LAS VEGAS

EN Skriving for kommunikasjon og tenkning

BYFE/EMFE 1000, 2012/2013. Numerikkoppgaver uke 33

AvtaleGiro beskrivelse av feilmeldinger for oppdrag og transaksjoner kvitteringsliste L00202 levert i CSV fil

Speed Racer Theme. Theme Music: Cartoon: Charles Schultz / Jef Mallett Peanuts / Frazz. September 9, 2011 Physics 131 Prof. E. F.

Elektronisk innlevering/electronic solution for submission:

Følgende endringer er gjort i denne versjonen:

MID-TERM EXAM TDT4258 MICROCONTROLLER SYSTEM DESIGN. Wednesday 3 th Mars Time:

Dean Zollman, Kansas State University Mojgan Matloob-Haghanikar, Winona State University Sytil Murphy, Shepherd University

TFY4170 Fysikk 2 Justin Wells

Physical origin of the Gouy phase shift by Simin Feng, Herbert G. Winful Opt. Lett. 26, (2001)

Eksamensoppgave i SOS3003 Anvendt statistisk dataanalyse i samfunnsvitenskap Examination paper for SOS3003 Applied Social Statistics

Examination paper for (BI 2015) (Molekylærbiologi, laboratoriekurs)

Issues and challenges in compilation of activity accounts

TUSEN TAKK! BUTIKKEN MIN! ...alt jeg ber om er.. Maren Finn dette og mer i. ... finn meg på nett! Grafiske lisenser.

Exploratory Analysis of a Large Collection of Time-Series Using Automatic Smoothing Techniques

Eksamensoppgave i PSY3100 Forskningsmetode - Kvantitativ

Checking Assumptions

Trust in the Personal Data Economy. Nina Chung Mathiesen Digital Consulting

EXAM TTM4128 SERVICE AND RESOURCE MANAGEMENT EKSAM I TTM4128 TJENESTE- OG RESSURSADMINISTRASJON

DATAØVING 1 INTRODUKSJON TIL STATA I

Eksamensoppgave i GEOG1004 Geografi i praksis Tall, kart og bilder

2018 ANNUAL SPONSORSHIP OPPORTUNITIES

Eksamensoppgave i GEOG Geografi i praksis - Tall, kart og bilder

Compello Fakturagodkjenning Versjon 10 Software as a service. Tilgang til ny modulen Regnskapsføring

Besvar tre 3 av følgende fire 4 oppgaver.

Trigonometric Substitution

Eksamen PSY1010 / PSYC1100 Forskningsmetode I

TUSEN TAKK! BUTIKKEN MIN! ...alt jeg ber om er.. Maren Finn dette og mer i. ... finn meg på nett! Grafiske lisenser.

Emneevaluering GEOV272 V17

TriCOM XL / L. Energy. Endurance. Performance.

Compello Fakturagodkjenning Versjon 10.5 As a Service. Tilgang til Compello Desktop - Regnskapsføring og Dokument import

UNIVERSITETET I OSLO

BEGYNNERKURS I SPSS. Anne Schad Bergsaker 24. november 2017

Rom-Linker Software User s Manual

Transkript:

Introduksjon til STATA Introduksjon til STATA Pål Romundstad HVORFOR? STATA vs SPSS STATA noen fordeler: SPSS Fordeler: Generelt lettere å bruke Enkelt å produsere tabeller og grafer Det aller meste dekkes av SPSS, men for epidemiologiske analyser og medisinsk statistikk er det mangler Rettet mot medisinsk statistikk / epidemiologi Oppdateres over nettet Nedlasting av brukergenererte program (søk online via stata) Mata- matrise orientert programmering Styrkeberegninger Kalkulator Immediate commands - analyser på aggregert nivå Flere muligheter Hvordan lære STATA Diverse kurs (Norge og utland) På egenhånd online: Karolinska: http://www.cpc.unc.edu/services/computer/presentations /statatutorial UCLA: http://www.ats.ucla.edu/stat/stata Statas hjemmeside med link til andre: http://www.stata.com/links/resources1.html ++++ Bøker: se Statas hjemmeside Hvordan lære STATA Mange eksempler i dag er hentet fra kurs i Stata gitt av Nicola Orsini Karolinska Institutet, Stocholm: http://nicolaorsini.altervista.org/ 1

Hensikten med denne introduksjonen Bli kjent med hvordan Stata fungerer Bli i stand til å generere enkel beskrivende statistikk Lære å lage nye og endre variabler Kunne lage krysstabeller Noen eksempler på enkle statistiske tester og lineær regresjon Vise noen eksempler på grafer Snakker STATA med andre program SPSS kan generere/lese STATA filer Finnes nedlastbare program for lesing av SPSS filer: eks: usespss Stattransfer er et program som lett konverterer mellom aktuelle programpakker inkludert SAS, SPSS, Stata, Access, S+, R, Excel Stata egner seg ikke som «data entry» program Ting å være klar over Som SPSS er Stata record orientert Både vindusmeny og kommandobasert Missing defineres numerisk med. og ses av stata som et høyt tall. Missing string som BMI>10 blir true for missing Stata er case sensitiv (skiller mellom store og små bokstaver) Ting å være klar over Stata bruker = og == Ett likhetstegn brukes ved tilordning, to betyr likhet i logiske utrykk Du kan bruke forkortelser for de fleste kommandoene EKS: generate = ge, display=di, tabulate=tab=ta.. Ved større filer må minne settes høyere enn default- Eks1: set memory 500000 Eks2: set mem 20m, permanently Tids/datoformat: Ses numerisk i forhold til 01.01.1960-1=31.12.1959 1=02.01.1960 0=? Dato format: %d Stata vinduer Command line her legges kommandoer inn Results resultatet av kommandoene (output) Variables oversikt over variabler i minnet Review oversikt over tidligere brukte kommandoer Oversikt over fila: browse eller edit Eksempel: Display %d 17177 gir datoen 11jan1977 2

filtyper Hjelp i Stata.dta fil: datasettet (tilsvarer.sav i spss).do fil: kommandofil for å gjøre flere kommandoer på rad (tilsvarer syntax-.sps i spss).ado fil: kommandofil i STATA.gph fil: graph fil.smcl fil: Log fil.hlp fil: hjelpefil som forklarer rutiner og kommandoer On-line Hvis du kjenner kommandonavnet help kommandonavn Hvis du ikke kjenner kommandonavnet findit keywords search keywords FAQ (Frequently Asked Questions) http://ww.stata.com/support/faqs UCLA (Resources to help you learn and use Stata) http://www.ats.ucla.edu/stat/stata/ Manualer og bøker Low Birth Weight Data (lowbwt.dta) Var. Beskrivelse Verdier Navn 1 Identification Code ID Number id 2 Low Birth Weight 1 bwt<=2500 g, 0 bwt>2500 g low 3 Age of Mother Years age 4 Weight of Mother at Last Menstrual Pounds lwt 5 Race 1 = White, 2=Black, 3=Other race 6 Smoking Status During Pregnancy 0 = No, 1= Yes smoke 7 History of Premature Labor 0 = None, 1=One, 2=Two, etc. ptl 8 History of Hypertension 0 = No, 1 = Yes ht 9 Presence of Uterine Irritability 0 = No, 1 = Yes ui 10 Physician Visits First Trimester 0 = None, 1=One 2=Two, etc. ftv 11 Birth Weight Grams bwt Source and abstract of the data Source Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts. Descriptive abstract Identify risk factors associated withgiving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Hente inn datasett i Stata kommandoer Hente datasett (extension.dta) til arbeidsminne use filename [, clear ] clear option erases all data currently in memory and proceeds with loading the new data from the disk or from a web server /correctly specify the location of the file use "c:\is\lowbwt.dta", clear use http://nicolaorsini.altervista.org/data/lowbwt.dta Alternativ: bruk vindusmeny til å lete frem fil Commando variabelnavn if in using, options Eksempel: Tabulate low smoke if age>30, col 3

Generelt: Analysekommandoer [prefix:] [command] [varlist] [if] [in] [weight] [,options] Hakeparantes angir valgfri input if begrenser analysen til en undergruppe der if statementet er true options angir spesielle ønsker i output (modify the default) Output Finnes brukergenererte programmer som lager fine artikkelferdige output Kan kopieres direkte til word (bruk tegnsett 8 pkts courier) Grafer kan kombineres og kopieres på flere måter Noen eks: tabulate smoke if age>30 regress fødselsvekt maternalage if sex==1 logistic sga maternalage if prematur==0, or Se på data Run a command in a subset of the data list creates a list of values of specified variables and observations (as alternative to browse) describe provides information on the size of the dataset and the names, labels and types of variables. codebook summarizes a variable in a format designed for printing a codebook (missing value; unique values; descriptive statistics). List the variables and observations if age is greater than 30 years list if age > 30 crosstabulates the variables low and race if id is larger than 15 Tabulate low race if id> 15 List all the variables for the first 10 observations of the data list in 1/10 Save a dataset Summary statistics To save the dataset available in the current memory type save filename [, replace] the replace option indicates that if filename already exists,stata will overwrite it // Example save lowbwt2.dta, replace summarize calculates and displays a variety of summary statistics. If no varlist is specified, then summary statistics are calculated for all the variables in the dataset summarize [varlist] [if] [in] [, detail ] where detail produces additional statistics, including skewness, kurtosis,various percentiles and the four smallest and largest values. 4

Example of summarize Do-fil Syntaksfil : sekvens av kommandoer som tekstfil Kan lagres- bruke dette som dokumentasjon og sikkerhet!. su bwt, detail Kommentarer * for kommentarer på en enkelt linje // for kommentarer etter en kommando på samme linje /* for kommentarer som går over flere linjer */ Tre skråstreker /// betyr at stata skal lese på neste linje også Histogram Box-plot A histogram is a graphical method for displaying the shape of a distribution The height of each bar corresponds to its class frequency/density/percent. The command histogram allows you to control any aspect of the graph (y-axis and x-axis). Eks: histogram bwt, frequency histogram bwt, frequency width(500) start(500) ylabel(0(5)50) xlabel(500(500)5000) normal histogram bwt, percent width(500) start(500) ylabel(0(3)24, angle(h)) xlabel(500(500)5000) normal addlabel A box plot provides a visual summary of a distribution The box stretches from the the 25th percentile to the 75th percentile - contains the middle half of the values The median is shown as a line across the box. graph hbox bwt graph hbox bwt, over(smoke) Table of counts Kakediagram tabulate produces one- and two-way tables of frequency counts along with various measures of association ( help tabulate). // One-way tabulate varname, options // Two-way tabulate varname1 varname2, options graph pie, over(race) /// plabel(_all percent, color(white)) /// betyr at stata skal fortsette å lese på neste linje It has several options: chi2 to calculate the Pearson-Chi Square test col to display column percentages row to display row percentages +++ Eks: tabulate race 5

The Chi-square test tabulate low smoke, col Let s compare the proportion of low birth weight in the population of smokers and non-smokers women. graph pie, over(low) by(smoke) /// plabel(_all percent, color(white)) We test the null hypothesis that the two proportions are identical. graph bar (mean) low, over(smoke) /// blabel(bar, format(%3.2f)) /// ytitle(risk low birth weight) tabulate low smoke, chi2 tabulate low smoke, chi2 expected Without the full data available Without the full data available tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabi 86 44 \ 29 30, chi2 col tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabi 86 44 \ 29 30, chi2 col 5 misklassifisert Without the full data available Tables of summary statistics tabi is an immediate commands that displays r x c table using the values specified. tabi #11 #12 [...] \ #21 #22 [...] [\...] Immediate commands are useful when full data are not available, like in a review of published analysis or sensitivity analysis of observed data It has all the options of tabulate tabstat displays summary statistics for a series of numeric variables in a single tables ( help tabstat ) tabstat varlist [, statistics(statname) by(varname)] where statname are descriptive statistics (mean, min, max,sd, var ). Without the by() option is a useful alternative to summarize because it allows you to specify the list of statistics tabi 86 44 \ 29 30, chi2 col 5 misklassifisert tabi 86 44 \ 24 35, chi2 col tabstat bwt, by(race) stat(mean sd n) 6

Create a new variable-generate The command generate creates a new variable. The syntaxis: generate newvar = expression [if] [in] where expression could be any reasonable combination of variables, numbers, operators (help operators) and functions (help functions) Operators 1. Arithmetic: + (addition); - (subtraction); / (division); * (multiplication); ^ (raise to a power) and prefix! (negation). Any arithmetic operation on missing values or an impossible arithmetic operation yields a missing value. 2. String: + (concatenation of two strings), i.e. Name + Surname. 3. Relational: > (greater than); < (less than); >= (greater than or equal); <= (less than or equal); == (equal);!= (not equal) 4. Logical: & (and); (or);! (not) Some examples of generate Replacing values * generate birth weight in kilograms gen bwtkg = bwt/1000 * generate the log of birth weight gen logbwt = log(bwt) * generate body mass index gen bmi = weight_kg/height_mt^2 The command replace changes values of a variable that already exists replace varname = exp [if] [in] * Identify white women with 0 and non-white women with 1 replace white = 0 if race == 1 replace white = 1 if race == 2 race == 3 How to categorize a variable Select variables and observations The most general approach is to copy or clone the variable you want to categorize gen white = race and run one or more replacements to get what you want replace white = 0 if race == 1 replace white = 1 if race == 2 race == 3 tabulate race2 drop eliminates variables that are explicitly listed keep keeps variables that are explicitly listed (opposite of drop) drop varlist drop if exp keep varlist keep if exp 7

Example Sort observations * eliminate the variables ptl and ht drop ptl ht * keep 3 variables in memory: id, low, and age keep id low age * keep only women who smoke keep if smoke == 1 sort varlist arranges the observations of the current data into ascending order based on the values of the variables in varlist gsort [+ -] varname arranges observations to be in ascending [+] or descending [-] order of the specified variables // Examples sort age gsort age Prefix by Stata som kalkulator by varlist: stata_kommando bysort varlist: stata_kommando repeats the command for each group of observations for which the values of the variables in varlist are the same. The prefix bysort gets the sort and by commands in one line display exp displays strings and values of scalar expressions. // Example display sqrt(4)+ 2*log(1) display exp(0.5) display "The square root of 4 is " sqrt(4) * for each level of race tabulate low and smoke bysort race : table low smoke Confidence interval for a proportion Confidence interval of the mean Calculation of an exact 95% CI for the proportion of women who smoke ci smoke, binomial Immediate form: cii cii 189 74 cii 189 74, agresti Calculation a 95% CI of the mean birth weight ci bwt Let s calculate a 95% CI of the mean birth weight among smoking and non-smoking women bysort smoke: ci bwt 8

Immediate form of ci Comparisons of two independent samples Let s see how to calculate a 95% CI of the mean birth weight if you know only the number of observations, mean, and standard deviation of the sample. cii 189 2945 729 Let s see how to test the null hypothesis that the mean birth weights in the population of smokers and non-smokers women are identical The underlying populations are independent (unpaired) Is it likely that the observed difference in sample means 3054 vs 2773 is the result of chance variation? Or should we conclude that the discrepancy is due to true difference in populations means? Two sample t-test Regression model for continuous outcome ttest bwt, by(smoke) ttest bwt, by(smoke) unequal To investigate the relationship between a continuous outcome variable and a continuous covariate or predictor we can combine the use of: two-way graphs simple correlation and linear regression For instance, investigate the association between birth weight (bwt) and mother s weight (lwt) Twoway graph twoway plot [if] [in] [, twoway_options] where plot is defined (plottype varlist, options) and where plottype (help twoway) can be, among many others plottype description ---------------------------------------------------------- scatter scatterplot line line plot area line plot with shading lowess LOWESS line plot lfit linear prediction plot qfit quadratic prediction plot bar bar plot Scatter plot scatter is the most powerful graph command (help scatter) Can be modified through options and the dialog box can help specifying options (db scatter). The most common options are: msymbol() to specify the shape of marker mcolor() to specify the color of marker ylabel() to specify labels on y-axis xlabel() to specify labels on x-axis ytitle() to specify a title on y-axis xtitle() to specify a title on x-axis scheme() to specify the overall look of the graph legend() to specify the legend of the graph 9

Overlay two plots twoway scatter bwt lwt Add some options twoway scatter bwt lwt, /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(horizontal)) /// xlabel(50(25)250) /// msymbol(o) /// mcolor(red) twoway /// (scatter bwt lwt) /// (lfit bwt lwt), /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(h)) /// xlabel(50(25)250) /// legend(label(1 "Scatter") label(2 "Trend")) /// text(4500 210 "r=0.2 p=0.01") /// scheme(s1mono) Overlay two plots by groups Simple linear regression twoway /// (scatter bwt lwt) /// (lfit bwt lwt), /// ytitle("birth weight, grams") /// xtitle("mother's weight, pounds") /// ylabel(500(500)5000, angle(h)) /// xlabel(50(25)250) /// legend(label(1 "Scatter") label(2 "Trend")) /// text(4500 210 "r=0.2 p=0.01") /// scheme(s1mono) /// by(smoke) regress bwt lwt The first variable is the outcome/response followed by the list of covariates or predictors Regression with categorical covariates To estimate the estimated population mean (95% CI) of birth weight among black women xi:regress bwt i.race lincom _cons + _Irace_2 Prefix xi generates dummy variables for the categories Can be used for a variety of commands and types of regression 10

Regression models for binary response Odds ratios, risk ratios, risk differences, and rate ratios can be estimated with generalized linear models (help glm). Regression models are much more flexible than tables for epidemiologist and allow the construction of multivariable models. binreg fits generalized linear models for the binomial family. It estimates odds ratios, risk ratios, and risk differences binreg low smoke, rr xi:binreg low smoke i.race, rr Several options for generation of odds ratios binreg low smoke, or logistic low smoke logit low smoke, or glm low smoke, family(binomial) eform 11