Getting Started in R~StataNotes on Exploring Data (v. 1.0) Oscar Torres-Reyna [email protected] Fall 2010 http://dss.princeton.edu/training/ What is R/Stata? What is R? • “R is a language and environment for statistical computing and graphics”* • R is offered as open source (i.e. free) What is Stata? • It is a multi-purpose statistical package to help you explore, summarize and analyze datasets. • A dataset is a collection of several pieces of information called variables (usually arranged by columns). A variable can have one or several values (information for one or several cases). • Other statistical packages are SPSS and SAS. Features Stata SPSS SAS R Learning curve Steep/gradual Gradual/flat Pretty steep Pretty steep User interface Programming/point-and-click Mostly point-and-click Programming Programming Data manipulation Very strong Moderate Very strong Very strong Data analysis Powerful Powerful Powerful/versatile Powerful/versatile Graphics Very good Very good Good Excellent Affordable (perpetual Expensive (but not need to Expensive (yearly Open source Cost licenses, renew only when renew until upgrade, long renewal) upgrade) term licenses) NOTE: The R content presented in this document is mostly based on an early version of Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage; and from class notes from the ICPSR’s workshop Introduction to the R Statistical Computing Environment taught by John Fox during the summer of 2010. * http://www.r-project.org/index.html This is the R screen in Multiple-Document Interface (MDI)… This is the R screen in Single-Document Interface (SDI)… . this window ????? Files will be saved here Property of each Write commands here variable here PU/DSS/OTR . Stata 12/13+ screen Variables in dataset here Output here History of commands. search("regression") the work ‘regression’. workspace in R Getting help ?plot # Get help for an object. A window will pop-up. R Stata Working directory getwd() # Shows the working directory (wd) pwd /*Shows the working directory*/ setwd("C:/myfolder/data") # Changes the wd cd c:\myfolder\data /*Changes the wd*/ setwd("H:\\myfolder\\data") # Changes the wd cd “c:\myfolder\stata data” /*Notice the spaces*/ Installing packages/user-written programs install.packages("ABC") # This will install the ssc install abc /*Will install the user-defined package –-ABC--. mirror site to download from (the closest to where you are) and click ok. in this case for help tab /* Get help on the command ‘tab’*/ the –-plot– function. You can also hsearch regression /* Search the help files for type: help. library(ABC) # Load the package –-ABC-– to your It also searcher your computer. You can also type: help(plot) search regression /* Search the keywords for the word ‘regression’*/ ??regression # Search the help pages for anything that has the word "regression". help(package=car) # View documentation in package ‘car’. It will be ready to run. findit abc /*Will do an online search for program ‘abc’ or programs that include ‘abc’. select a program ‘abc’. It provides more options than ‘search’*/ apropos("age") # Search the word "age" in the objects available in the current R session. You can also type: library(help="car“) help(DataABC) # Access codebook for a dataset called ‘DataABC’ in the package ABC 6 . tab. R Stata Data from *. header=TRUE. copy.txt". in the command line type: mydata <. Data from/to *.read. edit sep="\t") /*The data editor will pop-up and paste the data (Ctrl-V).csv file.raw mydata <. na.csv("c:\mydata\mydatafile.table(("C:/myfolder/abc. sep="\t". insheet using "c:\mydata\mydatafile. Select the link for to include variable names Data from *. go /* Select the table from the excel file.choose().read.csv(file. comma-separated) # In the example above. Click on ‘Browse’ to find the file mydata <. header=TRUE. file = "test. variables have spaces and /* See insheet above */ missing data is coded as ‘-9’ infile var1 var2 str7 var3 using abc.read. Go to File->Import->”ASCII data created by spreadsheet”.txt".csv (copy-and-paste) # Select the table from the excel file.table("clipboard".txt (space . header = TRUE) and then OK. copy.csv # Reading the data directly /* In the command line type */ mydata <. go to the R Console and type: to Stata.csv".csv" .strings = "-9") /* Variables with embedded spaces must be enclosed in quotes */ # Export the data # Export data 7 write.csv" header=TRUE) /* Using the menu */ # The will open a window to search for the *.table(mydata. sep = "\t") outsheet using "c:\mydata\abc.read. Note: This does not work with SPSS portable files # (*.spss /* From Stata type: */ -------------------------------------------------- saveold mydata.labels’ Convert variables with value labels into R factors with those levels.data.frame’ return a data frame.0 or later*/ Source: type ?read. http://dss. this only once) */ library(foreign) # Load package –-foreign-.raw data file 8 . -------------------------------------------------- # # ‘use. /* To read the *. that can be read by SPSS v15.edu/training/mydat a.packages("foreign") # Need to install /* Need to install the program ‘usespss’ (you do package –-foreign–.por) # ‘use. codefile="test2. R Stata Data from/to SPSS install.missings = to. package=“SPSS") # Provides a syntax file (*.value.princeton.sps".edu/training/mydata.sav type (in one line): read.frame = TRUE.raw". usespss using to.labels=TRUE.foreign(mydata.princeton. ssc install usespss mydata.frame) /* For additional information type */ # Where: help usespss # # ‘to.sps) to read the *.9 for SPSS write.value. datafile="test2.data.sav".first (you do this only once).sav use. You need user-defined missing values be used to set the to save the data file as a Stata file version 9 corresponding values to NA.missings’ logical: should information on /* Stata does not convert files to SPSS.spss <.dta /* Saves data to v.spss("http://dss. use.data. xpt) # It works usesas using "c:\mydata.xpt) you can use ‘fdause’ (or go to File->Import).sas7bcat type (in one line): */ sasxport. -------------------------------------------------- /* You can export a dataset as SAS XPORT by menu write.read.sas) to read the *. package=“SAS") fdasave "c:/myfolder/mydata1. */ library(foreign) # Load package –-foreign-- fdause "c:/myfolder/mydata. which you can install by typing: */ # Using package –-Hmisc— ssc install usesas library(Hmisc) mydata <.princeton.foreign(mydata.xport("http://dss.sas <- read.sas <. (go to File->Export) or by typing */ datafile="test2.edu/training/myd ata. /* To read the *.sas7bdat” -------------------------------------------------.sas".xport("c:/myfolder/mydata. codefile="test2.xpt") can use the program ‘usesas’. you can use SAS Universal Viewer (freeware from SAS) to read SAS files and save them as *.xpt“ mydata.edu/training/myda /* Type help fdause for more details */ ta.xpt“ # Provide a syntax file (*.xpt) /*If you have a file in SAS XPORT format (*.princeton.xpt") # Does not work for files online /* If you have SAS installed in your computer you mydata. 9 .csv.csv removes variable/value labels.raw data /* Type help fdasave for more details */ NOTE: As an alternative. make sure you have the codebook available.get(http://dss.raw". R Stata Data from/to SAS # To read SAS XPORT format (*. Saving the file as *. dta("http://dss.dates. to read Stata 13+ 10 see slide on page 29.dta(mydata. or type: */ mydata <- read.dta") Or mydata.factors.princeton.dta <- read.edu/training/mydata.dta" ts. Warn if a variable is /* To save a dataset as Stata file got File -> specified with value labels and those value labels Save As. . replace /*If already saved as Stata file*/ write.factors=TRUE." in R names? # warn. save mydata.labels. Use Stata value labels to create factors? (version 6.dta") # Direct export to Stata save. /* To open a Stata file go to File -> Open.edu/training/studen use "c:\myfolder\mydata. file type */ convert.dta". file = "test. Convert "_" in Stata -------------------------------------------------- variable names to ". /* If you need to load a subset of a Stata data convert.missing. # convert. -------------------------------------------. package="Stata") # Provide a do-file to read the *.dta".dta" # Where (source: type ?read.dta("http://dss.underscore. datafile="test1.edu/training/mydata use "http://dss. convert.princeton.dta" .raw".princeton. R Stata Data from/to Stata library(foreign) # Load package –-foreign-.labels=TRUE) use var1 var2 using "c:\myfolder\mydata. warn. codefile="test1.dates=TRUE.foreign(mydata. clear # convert.can only read Stata 12 or older versions.underscore=TRUE.raw data NOTE: Package -foreign.0 or later).do".dta) use id city state gender using "mydata. Convert Stata dates to Date class # convert. replace /*If the fist time*/ write. or type: */ are not present in the file.missing. RData") /* Stata can’t read R data files */ load("mydata.rda") /* Add path to data if necessary */ ------------------------------------------------- save.RData save(object1. R Stata Data from/to R load("mydata.image("mywork.rda") # Saving selected objects 11 . file=“mywork.RData") # Saving all objects to file *. object2. fwf(file="http://dss.dct With infile we run the dictionary by typing: Start of var(t+1) – End of var(t) . /* Using infix */ read.princeton. negative sign excludes _column(24) var2 %2f " Label for var2 " variables not wanted (you must include these). -10.names=c("w".2 var2 1 24 25 F2. "sex").edu/training/m ydata."y"."x3". /*Do not forget to close the brackets and press enter after the last bracket*/ # To get the widths for unwanted spaces use the formula: Save it as mydata.dat { # Reading ASCII record form.1 infile using c:\data\mydata For other options check http://dss. 2). "http://dss. 2. infix var1 1-7 var2 24-25 str2 var3 26-27 var4 32- width=c(7.edu/trainingmydata. -110.dat" col.edu/training/DataPrep101.0 var5 1 44 45 A2 var6 1 156 158 A3 var7 1 165 166 A2 12 . 2. 33 str2 var5 44-45 var6 156-158 var7 165-166 using 3."x2".princeton. 2. numbers represent the _column(1) var1 %7.pdf *Thank you to Scott Kostyshak for useful advice/code. -4.dat". -16. 2. R Stata Data from ACII Record form mydata.princeton.dat <.2f " Label for var1 " width of variables. Var Rec Start End Format Data locations usually available in codebooks var1 1 1 7 F7.0 var3 1 26 27 A2 var4 1 32 33 F2."x1". -------------------------------------------------- n=1090) /* Using infile */ dictionary using c:\data\mydata. _column(26) str2 var3 %2s " Label for var3 " _column(32) var4 %2f " Label for var4 " _column(44) str2 var5 %2s " Label for var5 " # To get the width of the variables you must have _column(156) str3 var5 %3s " Label for var6 " _column(165) str2 var5 %2s " Label for var7 " a codebook for the data set available (see an } example below). "age". -6. omit(mydata) 13 . of missing per row /* For missing values per observation see the mydata[mydata$age=="& ". command*/ mydata[mydata$age==999.cases(mydata). n=10) # Last 10 rows tail(mydata.NA The function complete.na(mydata))# Number of missing in dataset tabmiss /* # of missing.frame()) mydata[1:10. n= -10) # All rows but the first 10 mydata <. Need to install.na(data))*length(data)# No."age"] <.omit() returns the object with listwise deletion of missing values.1:3] # First 10 rows of data of the first 3 variables edit(mydata) # Open data editor Missing data sum(is.edit(data. Also try findit variable tabmiss and follow instructions */ rowMeans(is. n= -10) # All rows but the last 10 edit*/ tail(mydata) # Last 6 rows browse /* Browse data */ tail(mydata. # create new dataset without missing data newdata <.] The function na. # list rows of data that have missing values mydata[!complete.na.na(data))# Number of missing per scc install tabmiss. ] # First 10 rows of the mydata[1:10."age"] <. n=10)# First 10 rows of dataset edit /* Open data editor (double-click to head(mydata. R Stata Exploring data str(mydata) # Provides the structure of the describe /* Provides the structure of the dataset dataset*/ summary(mydata) # Provides basic descriptive summarize /* Provides basic descriptive statistics and frequencies statistics for numeric data*/ names(mydata) # Lists variables in the dataset ds /* Lists variables in the dataset */ head(mydata) # First 6 rows of dataset list in 1/6 /* First 6 rows */ head(mydata.NA # NOTE: function ‘rowmiss’ and the ‘egen’ Notice hidden spaces. type rowSums(is.cases() returns a logical vector indicating which cases are complete. R Stata Renaming variables #Using base commands edit /* Open data editor (double-click to edit) fix(mydata) # Rename interactively.rename(mydata.rename(mydata.Status="Status")) mydata <. c(First. c(Newspaper..readership. rename oldname newname names(mydata)[3] <.wk.rename(mydata.times.score. c(Height..rename(mydata.grade. rename firstname first rename studentstatus status library(reshape) rename averagescoregrade score rename heightin height mydata <.Name="Last")) rename newspaperreadershiptimeswk read mydata <. c(Student.="Score")) mydata <."First" rename lastname last # Using library –-reshape-.rename(mydata.="Height")) mydata <.="Read")) Variable labels Use variable names as variable labels /* Adding labels to variables */ label variable w "Weight" label variable y "Output" label variable x1 "Predictor 1" label variable x2 "Predictor 2" label variable x3 "Predictor 3" label variable age "Age" label variable sex "Gender" 14 . c(Average. c(Last..in.rename(mydata.Name="First")) mydata <. "female")) "Approve somewhat" 3 "Disapprove somewhat" 4 "Disapprove strongly" 5 "Not sure" 6 "Refused" # Use ordered() for ordinal data label define well 1 "Very well" 2 "Fairly well" 3 "Fairly badly" 4 "Very badly" 5 "Not sure" 6 mydata$var2 <.2. label define approve 1 "Approve strongly" 2 labels = c("male". replace label values x1 approve label values x2 well label values x3 partyid tab x3 destring x3. label define partyid 1 "Party A" 2 "Party B" 3 "Strongly disagree")) "Equally party A/B" 4 "Third party candidates" 5 "Not sure" 6 "Refused" mydata$var8 <.ordered(mydata$var2.factor(mydata$sex. levels = label define gender 1 "Male" 2 "Female“ c(1. "Somewhat disagree".3. "Somewhat disagree". R Stata Value labels # Use factor() for nominal data /* Step 1 defining labels */ mydata$sex <. labels = c("Strongly agree".2.4).4).2). levels = "Refused" c(1. "Somewhat agree". levels = c(1. "Somewhat agree". labels = c("Strongly agree".3. replace ignore(&) label values x3 partyid tab x3 label values sex gender tab1 y x1 x2 x3 age sex 15 . /* Step 2 applying labels */ "Strongly disagree")) # Making a copy of the same variable label values y approve label values x1 approve tab x1 d x1 destring x1.ordered(mydata$var2. pdf 16 .length(x).1)) mydata$idgroup <.seq(dim(mydata)[1]) gen id = _n # Creating a variable with the total number of /* Creating a variable with the total number of observations observations */ mydata$total <.princeton.mydata[order(mydata$group).dim(mydata)[1] gen total = _N /* Creating a variable with a sequence of numbers /* Creating a variable with a sequence of numbers from 1 to n per category (where ‘n’ is the total from 1 to n per category (where ‘n’ is the total number of observations in each category)(1) */ number of observations in each category) */ mydata <. mydata$group.stata.tapply(mydata$group. R Stata Creating ids/sequence of numbers # Creating a variable with a sequence of numbers /* Creating a variable with a sequence of numbers or to index or to index */ # Creating a variable with a sequence of numbers /* Creating a variable with a sequence of numbers from 1 to n (where ‘n’ is the total number of from 1 to n (where ‘n’ is the total number of observations) observations) */ mydata$id <. function(x) seq(1.edu/training/StataTutorial.cgi?_n (1) Thanks to Alex Acs for the code http://dss.com/help.unlist(idgroup) For more info see: http://www.] bysort group: gen id = _n idgroup <. Rhistory") log close /*Close the log*/ log using mylog.log.smcl).recode(mydata$Age. 20:29='20to29'. R Stata Recoding variables library(car) recode age (18 19 = 1 "18 to 19") /// mydata$Age.as.rec <.mydata$var2 <. The best way is to save it as a text file (*.log /*Start the log*/ loadhistory(file="mylog.factor(mydata$Age. generate(agegroups) label(agegroups) 30:39='30to39'") mydata$Age.log)*/ # Load the commands used in a previous session log using mylog. Notice that the file has to have the processor*/ extension *.Rhistory 17 .Rhistory with any word /*You can read mylog.).rec) Dropping variables mydata$Age.rec <.log using any word processor.rec <.log.NULL drop var1 mydata$var1 <. (20/28 = 2 "20 to 29") /// "18:19='18to19'.log) or to a Stata read-only savehistory(file="mylog.Rhistory") file (*. (30/39 = 3 "30 to 39") (else=. append /*Add to an existing # Display the last 25 commands log*/ log using mylog.NULL drop var1-var10 Keeping track of your work # Save the commands used during the session /* A log file helps you save commands and output to a text file (*. replace /*Replace an history() existing log*/ # You can read mylog. 2) # Round col % to 2 digits round(100* prop.table(mydata$Read.table(readgender. 2) # Round col prop to 2 digits bysort studentstatus: tab read gender.mydata$Gender) prop.2).table(readgender.table(readgender) # Tot proportions tab read gender.table(readgender.2) # Col proportions prop.test(readgender) # Do chisq test Ho: no relathionship /*Crosstabs where chi2 (The null hypothesis (Ho) fisher. colum row round(prop.1) # Row proportions tab read gender. 2) # Round col prop to 2 digits round(100* prop. col row /*Crosstabs*/ prop.table(readgender.2)) # Round col % to whole numbers addmargins(readgender) # Adding row/col margins #install.table(readgender. R Stata Categorical data: Frequencies/Crosstab s table(mydata$Gender) tab gender /*Frequencies*/ table(mydata$Read) tab read readgender <.test(readgender) # Do fisher'exact test is that there is no relationship) and V (measure Ho: no relationship of association goes from 0 to 1)*/ round(prop.2).table(readgender.packages("vcd") library(vcd) assocstats(majorgender) # NOTE: Chi-sqr = sum (obs-exp)^2/exp Degrees of freedom for Chi-sqr are (r-1)*(c-1) # NOTE: Chi-sqr contribution = (obs-exp)^2/exp # Cramer's V = sqrt(Chi-sqr/N*min) Where N is sample size and min is a the minimun of (r-1) or (c-1) 18 .2). col row chi2 V chisq. xtabs(~Read+Major+Gender.mydata$ones.mydata$Gender.packages("vcd") library(vcd) assocstats(majorgender) 19 .digits=2. dnn=c("Major".digits=2) tab major gender. col row chi2 V CrossTable(mydata$Major. data=mydata) #install.dnn=c("Major". expected=TRUE.packages("gmodels") tab gender /*Frequencies*/ library(gmodels) tab major mydata$ones <.digits=2.mydata$Gender. col row /*Crosstabs*/ CrossTable(mydata$Major."Gender")) bysort studentstatus: tab gender major.1 # Create a new variable of ones tab major gender. digits=2) /*Crosstabs which chi2 (The null hypothesis (Ho) CrossTable(mydata$Gender. digits=2) is that there is no relationship) and V (measure of association goes from 0 to 1)*/ CrossTable(mydata$Major.mydata$ones."Gender")) chisq. R Stata Categorical data: Frequencies/Crosstab s install. colum row CrossTable(mydata$Major.mydata$Gender) # Null hipothesis: no association # 3-way crosstabs test <.test(mydata$Major. basic=TRUE."Score". stat. sd.desc(mydata[. detail stat.desc(mydata[. max. p=0. kurtosis.95) -----------------------------------------------.subset(mydata2. mean. statistics(mean. Age >= 20 & Age <= 30) /*Type help tabstat for a list of all statistics*/ mydata4 <. norm=TRUE. var. ------------------------------------------------ basic=TRUE. Age)) table gender.95) tabstat age sat score heightin read /*Gives the stat. sd. First. sd. mean. p=0.subset(mydata2.desc(mydata[10:14]. variables min. summarize age. max) mydata2 <. max*/ library(pastecs) summarize.1:14] tabstat age sat score heightin read.c("Age".desc(mydata) percentiles*/ stat. ------------------------------------------------ select=c(ID. desc=TRUE. min.mydata2[1:30. median) by(gender) mydata3 <. mean only*/ norm=TRUE."Score")]. detail "Read")]) summarize sat. # Selecting the first 30 observations and first 14 statistics(n. skewness.subset(mydata2. sum(sat) 20 .c("Age"."SAT". min. mean. by(gender) # Selection using the --subset— tabstat age sat score heightin read."SAT". R Stata Descriptive Statistics install. variance. median. Last. desc=TRUE. Gender=="Female" & Status=="Graduate" & Age == 30) bysort studentstatus: tab gender major. Age >= 20 & Age <= 30.subset(mydata2. Gender=="Female" & Status=="Graduate" & Age >= 30) tab gender major. contents(freq mean age mean score) mydata5 <."Height". tabstat age sat score heightin read. sum(sat) /*Categorical and continuous*/ mydata6 <. detail /*N.packages("pastecs") summarize /*N. median. detail /*N. min. stderr) 21 . mean(SAT)) percentiles*/ median(mydata$SAT) summarize age. min. mean only*/ table(mydata$Country)))[1] tabstat age sat score heightin read. i. statistics(n. sd. tab gender major. by(gender) range(mydata$SAT) # Range tabstat age sat score heightin read.tapply(incomes. From table gender. detail summarize sat. c(. kurtosis.6. with(mydata. median. var(mydata$SAT) # Variance min..max(mydata$SAT) # From help: "Determines the location. index of the (first) minimum or maximum of a numeric vector” stderr <. var. mean. median) by(gender) quantile(mydata$SAT.. statef. quantile(mydata$SAT) statistics(mean. sum(sat) length(mydata) # Number of variables when a dataset is specify which.min(mydata$SAT) # From help: "Determines the location. contents(freq mean age mean score) help: "Returns Tukey's five number summary (minimum. index of the (first) minimum or maximum of a numeric vector" which. R Stata Descriptive Statistics mean(mydata) # Mean of all numeric variables. max. lower-hinge. upper-hinge. sd. mean(mydata$SAT) variance. sum(sat) /*Categorical and maximum) for the input data ~ boxplot" continuous*/ length(mydata$SAT) # Num of observations when a variable is specify bysort studentstatus: tab gender major.. mean. sd.function(x) sqrt(var(x)/length(x)) incster <. max*/ using --sapply--('s' for simplify) summarize. skewness.3. i.e. detail table(mydata$Country) # Mode by frequencies -> tabstat age sat score heightin read /*Gives the max(table(mydata$Country)) / names(sort(. max) sd(mydata$SAT) # Standard deviation /*Type help tabstat for a list of all statistics*/ max(mydata$SAT) # Max value min(mydata$SAT) # Min value tabstat age sat score heightin read.e. mean. same summarize /*N..9)) fivenum(mydata$SAT) # Boxplot elements. na. major=mydata$Major. mean. status=mydata$Status). mean) variance. max).by=list(sex=mydata$Gender). mean. mean."SAT")]. na. max.tapply(mydata$SAT. max) mean only*/ cbind(mean.digits=1) min.rm=TRUE) table gender. sd. sd. detail /*N. min. detail sd <. mean. sd. mean. bysort studentstatus: tab gender major. status=mydata$Status).mydata["Gender"]. mean. statistics(mean.rm=TRUE) aggregate(mydata$SAT. major=mydata$Major. status=mydata$Status). mean <.by=list(sex=mydat a$Gender).rm=TRUE) tab gender major.rm=TRUE to remove missing values in the percentiles*/ estimation summarize age. kurtosis. contents(freq mean age mean score) aggregate(mydata[c("Age".tapply(mydata$SAT. round(cbind(mean.mydata$Gender. na. na.mydata$Gender. sd.rm=TRUE) 22 . skewness. sd. mean.by=list(sex=mydata$Gender.rm=TRUE) aggregate(mydata. max) t1 /*Type help tabstat for a list of all statistics*/ # Descriptive statistics by groups using -. detail median <. sd) summarize sat. sd.digits=1) statistics(n. mean. continuous*/ na. R Stata Descriptive Statistics # Descriptive statiscs by groups using --tapply-.by=list(sex=mydata$Gender. sum(sat) major=mydata$Major. min. median. max). max) tabstat age sat score heightin read. sum(sat) /*Categorical and aggregate(mydata.rm=TRUE) aggregate(mydata[c("SAT")]. median."SAT")]. median) tabstat age sat score heightin read /*Gives the max <.tapply(mydata$SAT.round(cbind(mean. var. t1 <.mydata$Gender. summarize /*N. median.tapply(mydata$SAT. by(gender) aggregate— tabstat age sat score heightin read. median. max*/ summarize. na.mydata$Gender.by=list(sex=mydata$Gend er. # Add na. mean. tabstat age sat score heightin read. median) by(gender) aggregate(mydata[c("Age". 1)) with(Prestige. col="green") with(Prestige. R Stata Histograms library(car) hist sat head(Prestige) hist sat. adjust=0. lwd=2) lines(density(income. with(Prestige. col="green") # Braces indicate a compound command allowing several commands with 'with' command par(mfrow=c(1. by(gender) hist(Prestige$income. breaks="FD". main="Female". main="Male".com/help/toolbox/stats/bquc g6n.col="green") hist(mydata$SAT[mydata$Gender=="Male"]. col="green") lines(density(income). hist(income)) # Histogram of income with a nicer title. freq=FALSE.5). breaks="FD". xlab="SAT". based on the sample size and the spread of the data" http://www. xlab="SAT". breaks="FD". normal hist(Prestige$income) hist sat.lwd=1) rug(income) 23 }) . col="green")) # Applying Freedman/Diaconis rule p. hist(income.mathworks.html) box() hist(Prestige$income. 2)) hist(mydata$SAT[mydata$Gender=="Female"].120 ("Algorithm that chooses bin widths and locations automatically. breaks="FD") # Conditional histograms par(mfrow=c(1. breaks="FD". { hist(income. "gray")) 24 ."Male"). breaks="FD". col="green") hist sat. normal hist(mydata$SAT. add=TRUE) legend("topright". breaks="FD". R Stata Histograms # Histograms overlaid hist sat hist sat. col="gray". fill=c("green". c("Female". by(gender) hist(mydata$SAT[mydata$Gender=="Male"]. mydata$Age) twoway (lfitci sat age) || (scatter sat age.names to identify. R Stata Scatterplots # Scatterplots. mlabel(last) by(major. col="blue") # regression line (y~x) twoway scatter sat age. "row.ch/R-manual/R.paste(mydata$Last. /*Reverse order shaded area cover dots*/ devel/library/base/html/row. title("Figure 1. mlabel(last) by(gender. title("SAT scores by age") ytitle("SAT") row. identify(mydata$SAT." (source link below). mlabel(last) || lfit sat age plot(mydata$Age. total) 25 . # "Use attr(x. mydata$SAT). Useful to 1) study the mean and twoway scatter y x variance functions in the regression of y on x p. mlabel(last)).names") if you need an integer twoway (lfitci sat age) || (scatter sat age) value.names(mydata) <.names(mydata)) twoway scatter sat age. 2)to identify outliers and leverage points.html twoway (lfitci sat age) || (scatter sat age. SAT/Age") # plot(x.y) identify(mydata$Age. main=“Age/SAT".)" http://stat. mydata$SAT. lines(lowess(mydata$Age. a character vector of length the number of rows with no duplicates nor /* Adding confidence intervals */ missing values. mydata$First) mlabel(last)) row.128. ylab=“SAT". "All data frames have || lfit sat age a row names attribute. total) row. mlabel(last) by(major.ethz.y) twoway scatter sat age.names(mydata)) twoway scatter sat age. mlabel(last) || lfit sat age plot(mydata$Age. twoway scatter sat age. twoway scatter sat age. mlabel(last) plot(mydata$SAT) # Index plot twoway scatter sat age. mydata$Age. mydata$Names <.names. total) # On row. col="red") scatterplot smoothing */ abline(lm(mydata$SAT~mydata$Age). || lowess sat age /* locally weighted xlab=“Age".mydata$Names plot(mydata$SAT. mydata$SAT. mydata$SAT) twoway scatter sat age. col="green") yline(1800) xline(30) # lowess line (x. mlabel(last) || lfit sat age. lwd=3. col=gray(c(0. data=mydata) twoway scatter sat age.0. twoway scatter sat age. mlabel(last) by(major. boxplots=FALSE. mlabel(last) || lfit sat age. mlabel(last) by(major. small sample bigger (~0. boxplots= FALSE. id. total) 26 . SAT/Age") library(car) twoway scatter sat age. data=Prestige) twoway scatter sat age. mlabel(last)). mlabel(last) by(gender.0.5. mlabel(last) || lfit sat age data=mydata) || lowess sat age /* locally weighted scatterplot smoothing */ scatterplot(SAT~Age. data=mydata) twoway (lfitci sat age) || (scatter sat age) /*Reverse order shaded area cover dots*/ scatterplot(prestige~income|type.7) twoway scatter sat age.n=4. id. yline(1800) xline(30) scatterplot(prestige~income. span=0. data=Prestige) twoway (lfitci sat age) || (scatter sat age. total) # By groups twoway scatter sat age.method="identify". twoway (lfitci sat age) || (scatter sat age. mlabel(last) scatterplot(SAT~Age. big sample smaller twoway scatter y x (~0.method="identify". title("Figure 1. span=0. title("SAT scores by age") ytitle("SAT") data=Prestige) twoway scatter sat age. data=mydata) /* Adding confidence intervals */ scatterplot(SAT~Age|Gender. total) || lfit sat age scatterplot(SAT~Age|Gender.7)). R Stata Scatterplots # Rule on span for lowess.6. boxplots=FALSE.3). data=mydata) twoway scatter sat age. id. span=0. id.75. mlabel(last) || lfit sat age scatterplot(SAT~Age.75. mlabel(last)) scatterplot(prestige~income|type.method="identify". id. R Stata Scatterplots (multiple) scatterplotMatrix(~ prestige + income + education graph matrix sat age score heightin read + women.n=0. cex.labels the font size (Dalgaard:186) 3D Scatterplots library(car) scatter3d(prestige ~ income + education.7.9) # gap controls the space between subplot and cex. data=Duncan) 27 . span=0. id. half data=Prestige) pairs(Prestige) # Pariwise plots. graph matrix sat age score heightin read. gap=0. Scatterplots of all variables in the dataset pairs(Prestige.labels=0.n=3. cex=0. factor=2) ~ where # represents the size of the noise as a jitter(education. plot(jitter(vocabulary. { /*Use mydata. lty="dashed") graph matrix y x1 x2 x3 lines(lowess(education.dat and the jitter option*/ plot(jitter(vocabulary) ~ jitter(education). factor=2) ~ represented one or 1.5) }) scatter y x1.5) half 28 . many of the points # cex makes the point half the size. data=Vocab) with(Vocab. percentage of the graphical area. Stata’s help page.5) graph matrix y x1 x2 x3. 134 would be on top of each other. jitter(7) msize(0. jitter(13) msize(0. type: help scatter*/ col="gray". scatter y x1. lwd=3) scatter y x1. f=0. data=Vocab) /*"scatter will add spherical random noise to your data before plotting if you specify jitter(#). factor=2). jitter(13) msize(0.2).5.5) || lfit y x1 graph matrix x1 x1 x2 x3. jitter(5) graph matrix y x1 x2 x3. jitter(13) msize(0. R Stata Scatterplots (for categorical data) plot(vocabulary ~ education. lwd=3. factor=2).5) twoway scatter y x1. jitter(7) title(xyz) vocabulary. jitter(13) msize(0.dat*/ abline(lm(vocabulary ~ education). data=Vocab) /*Categorical data using mydata.” Source: jitter(education. p.000 observations. making it impossible to tell whether the plotted point plot(jitter(vocabulary. This can be data=Vocab) useful for creating graphs of categorical data when the data not jittered. From Stata 13+ to R Library foreign can only read Stata 12 or older.dta") 29 . To read Stata 13 or newer versions into R.dta") You can convert the file back to Stata (a version that Stata 9‐12 can read) by using the function write.dta13(): mydata <.(see also package –haven-). one option is to use package ‐readstata13.read. file="mydata.dta13("stata13file.dta(mydata.dta() in package –foreign-: library(foreign) write. To install it type: install.packages("readstata13") Then load it. type: library(readstata13) Use the function read. csv") * Increasing # Increasing sort unem mydata = mydata[ order(mydata$unemp). ] sort date mydata = mydata[ order(mydata$date). ] OTR .csv("http://www. decreasing = TRUE).edu/~otorres/quarterly.csv read.princeton. Sort R Stata mydata = insheet using http://www.edu/~otorres/quarterly. ] * Decreasing # Decreasing gsort – date mydata = mydata[ order(mydata$date.princeton. ] # Increasing year but decreasing gdp mydata = mydata[ order(mydata$year. -mydata$gdp). ] * Incresing year but decreasing gdp # xtfrm() when variable is factor (decreasing) gsort year –gdp mydata = mydata[ order(-xtfrm(mydata$date)). ucla.edu/dss • John Fox’s site http://socserv.ca/jfox/ • Quick-R http://www.edu/stat/stata/ • DSS .edu/training/ • Princeton DSS Libguides http://libguides.princeton.ucla.princeton.mcmaster.ats.Stata http://dss/online_help/stats_packages/stata/ • DSS .statmethods.edu/stat/R/ • UCLA Resources to learn and use Stata http://www.princeton.References/Useful links • DSS Online Training Section http://dss.edu/online_help/stats_packages/r 29 .R http://dss.ats.net/ • UCLA Resources to learn and use R http://www. Second Edition / John Fox . Sanford Weisberg. Sage. Achim Zeileis. New York : Radius Press. Greene. 2008 • Applied Econometrics with R / Christian Kleiber. Boston: Pearson Addison Wesley. Keohane. A guide to Analysis Using R / Thomas Lumley. 2nd ed. Watson. : Prentice Hall. 2008 • Complex Surveys. Upper Saddle River. 1994. Jennifer Hill. 2006 30 . Thomson Books/Cole. Springer. Princeton University Press. 2007. Springer. Joseph Hilbe. 2010 • Applied Regression Analysis and Generalized Linear Models / John Fox. • Data analysis using regression and multilevel/hierarchical models / Andrew Gelman. 2010 • Introduction to econometrics / James H. 2008 • R for Stata Users / Robert A. 2007. New York : Cambridge University Press. c1986 • Statistics with Stata (updated for version 9) / Lawrence Hamilton. N. Muenchen.J. Cambridge University Press. • Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King. 2008 • Introductory Statistics with R / Peter Dalgaard. Cambridge . 2011 • Data Manipulation with R / Phil Spector. 1989 • Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam Kachigan. Robert O.References/Recommended books • An R Companion to Applied Regression.. Wiley. Stock. Sidney Verba. Springer. • Econometric analysis / William H. 2008. Mark W. • Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King. 6th ed.. Springer. Sage Publications.