Anda di halaman 1dari 6

Data Processing

with Stata 14.1 Cheat Sheet


For more info see Statas reference manual (stata.com)

Basic Syntax
All Stata functions have the same format (syntax):
[byvarlist1:]
command
[varlist2]
apply the
command across
each unique
combination of
variables in
varlist1

Useful Shortcuts
keyboard buttons
F2
describe data
Ctrl + 8
open the data editor
clear
delete data in memory

Ctrl + 9
open a new .do file
Ctrl + D
highlight text in .do file,
then ctrl + d executes it
in the command line

Arithmetic

PgUp

+ combine (strings)

Tab
cls

scroll through previous commands

add (numbers)

autocompletes variable name after typing part

subtract

clear the console (where results are displayed)

* multiply

Set up
pwd
print current (working) directory
cd "C:\Program Files (x86)\Stata13"
change working directory
dir
display filenames in working directory
fs *.dta
List all Stata data in working directory underlined parts
are shortcuts
capture log close
use "capture"
close the log on any existing do files or "cap"
log using "myDoFile.txt", replace
create a new log file to record your work and results
search mdesc
packages contain
find the package mdesc to install extra commands that
expand Statas toolkit
ssc install mdesc
install the package mdesc; needs to be done once

Import Data
sysuse auto, clear
for many examples, we
load system data (Auto data) use the auto dataset.
use "yourStataFile.dta", clear
load a dataset from the current directory frequently used
commands are
import excel "yourSpreadsheet.xlsx", /*
highlighted in yellow
*/ sheet("Sheet1") cellrange(A2:H11) firstrow
import an Excel spreadsheet
import delimited"yourFile.csv", /*
*/ rowrange(2:11) colrange(1:8) varnames(2)
import a .csv file
webuse set "https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data"
webuse "wb_indicators_long"
set web-based directory and load data from the web

[=exp]

[ifexp]

[inrange]

column to save output as condition: only apply to


apply
a new variable apply the function specific rows
command to
if something is true

bysort rep78 : summarize

price

[weight]

[usingfilename]

[,options]

apply
weights

pull data from a file


(if not loaded)

special options
for command

In this example, we want a detailed summary


with stats like kurtosis, plus mean and median

if foreign == 0 & price <= 9000, detail

To find out more about any command like what options it takes type helpcommand

AT COMMAND PROMPT
PgDn

function: what are


you going to do
to varlists?

/ divide
^ raise to a power

Basic Data Operations


Logic
&

and
! or ~ not
| or

== tests if something is equal


= assigns a value to a variable

== equal
!= not
or
~= equal

if foreign != 1 & price >= 10000


make
Chevy Colt
Buick Riviera
Honda Civic
Volvo 260

foreign
0
0
1
1

price
3,984
10,372
4,499
11,995

< less than


<= less than or equal to
> greater than
>= greater or equal to
if foreign != 1 | price >= 10000
make
Chevy Colt
Buick Riviera
Honda Civic
Volvo 260

foreign
0
0
1
1

price
3,984
10,372
4,499
11,995

Explore Data
VIEW DATA ORGANIZATION

describe make price


display variable type, format,
and any value/variable labels
count
count if price > 5000
number of rows (observations)
Can be combined with logic
ds, has(type string)
lookfor "in."
search for variable types,
variable name, or variable label
isid mpg
check if mpg uniquely
identifies the data

SEE DATA DISTRIBUTION

codebook make price


overview of variable type, stats,
number of missing/unique values
summarize make price mpg
print summary statistics
(mean, stdev, min, max)
for variables
inspect mpg
show histogram of data,
number of missing or zero
observations
histogram mpg, frequency
plot a histogram of the
distribution of a variable

BROWSE OBSERVATIONS WITHIN THE DATA

Missing values are treated as the largest

browse or
Ctrl + 8 positive number. To exclude missing
values, use the !missing(varname) syntax
open the data editor
list make price if price > 10000 & !missing(price)
clist ... (compact form)
list the make and price for observations with price > $10,000
display price[4]
display the 4th observation in price; only works on single values
gsort price mpg (ascending)
gsort price mpg (descending)
sort in order, first by price then miles per gallon
duplicates report
assert price!=.
finds all duplicate values in each variable
verify truth of claim
levelsof rep78
display the unique values for rep78

Tim Essam (tessam@usaid.gov) Laura Hughes (lhughes@usaid.gov)


follow us @StataRGIS and @flaneuseks

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

Change Data Types

Stata has 6 data types, and data can also be missing:


no data

true/false

numbers

words

missing
string int long float double
byte
To convert between numbers & strings:
gen foreignString = string(foreign)
"1"
1 tostring foreign, gen(foreignString)
"1"
decode foreign , gen(foreignString)
"foreign"
1

gen foreignNumeric = real(foreignString)


"1"
"1"
destring foreignString, gen(foreignNumeric)
encode foreignString, gen(foreignNumeric) "foreign"

recast double mpg


generic way to convert between types

Summarize Data
include missing values create binary variable for every rep78
value in a new variable, repairRecord

tabulate rep78, mi gen(repairRecord)


one-way table: number of rows with each value of rep78
tabulate rep78 foreign, mi
two-way table: cross-tabulate number of observations
for each combination of rep78 and foreign
bysort rep78: tabulate foreign
for each value of rep78, apply the command tabulate foreign
tabstat price weight mpg, by(foreign) stat(mean sd n)
create compact table of summary statistics
displays stats
formats numbers for all data

table foreign, contents(mean price sd price) f(%9.2fc) row


create a flexible table of summary statistics
collapse (mean) price (max) mpg, by(foreign) replaces data
calculate mean price & max mpg by car type (foreign)

Create New Variables

generate mpgSq = mpg^2 gen byte lowPr = price < 4000


create a new variable. Useful also for creating binary
variables based on a condition (generate byte)
generate id = _n
bysort rep78: gen repairIdx = _n
_n creates a running index of observations in a group
generate totRows = _N
bysort rep78: gen repairTot = _N
_N creates a running count of the total observations per group
pctile mpgQuartile = mpg, nq = 4
create quartiles of the mpg data
see help egen
egen meanPrice = mean(price), by(foreign)
calculate mean price for each group in foreign for more options
geocenter.github.io/StataTraining

Disclaimer: we are not affiliated with Stata. But we like it.

updated January 2016


CC BY 4.0

Data Transformation

with Stata 14.1 Cheat Sheet


For more info see Statas reference manual (stata.com)

Select Parts of Data (Subsetting)


SELECT SPECIFIC COLUMNS
drop make
remove the 'make' variable
keep make price
opposite of drop; keep only variables 'make' and 'price'

FILTER SPECIFIC ROWS

drop if mpg < 20


drop in 1/4
drop observations based on a condition (left)
or rows 1-4 (right)
keep in 1/30
opposite of drop; keep only rows 1-30
keep if inrange(price, 5000, 10000)
keep values of price between $5,000 $10,000 (inclusive)
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
keep the specified values of make
sample 25
sample 25% of the observations in the dataset
(use set seed # command for reproducible sampling)

webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data


webuse "coffeeMaize.dta"
load demo dataset

MELT DATA (WIDE LONG)

reshape variables starting


with coffee and maize

unique id create new variable which captures


variable (key) the info in the column names

reshape long coffee@ maize@, i(country) j(year)


new variable
convert a wide dataset to long
TIDY DATASETS have
WIDE
LONG (TIDY) each observation
country
year coffee maize
melt
coffee maize maize
in its own row and
country coffee
2011
2012
2011 2012
Malawi
2011
each variable in its
Malawi
2012
Malawi
Rwanda
2011
own column.
Rwanda
Rwanda
Uganda
Uganda

cast

Uganda

CAST DATA (LONG WIDE)

2012
2011
2012

what will be create new variables


unique id with the year added
variable (key) to the column name

create new variables named


coffee2011, maize2012...

reshape wide coffee maize, i(country) j(year)


convert a long dataset to wide

When datasets are


tidy, they have a
consistent,
standard format
that is easier to
manipulate and
analyze.

xpose, clear varname


transpose rows and columns of data, clearing the data and saving
old column names as a new variable called "_varname"

Replace Parts of Data


rename (rep78 foreign) (repairRecord carType)
rename one or multiple variables

REPLACE MISSING VALUES

useful for cleaning survey datasets


mvdecode _all, mv(9999)
replace the number 9999 with missing value in all variables
mvencode _all, mv(9999)
useful for exporting data
replace missing values with the number 9999 for all variables

Label Data
Value labels map string descriptions to numbers. They allow the
underlying data to be numeric (making logical tests simpler)
while also connecting the values to human-understandable text.
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
define a label and apply it the values in foreign
note: data note here
place note in dataset

Tim Essam (tessam@usaid.gov) Laura Hughes (lhughes@usaid.gov)


follow us @StataRGIS and @flaneuseks

id

blue

webuse coffeeMaize2.dta, clear


save coffeeMaize2.dta, replace load demo data
webuse coffeeMaize.dta, clear

pink
id

CHANGE ROW VALUES

replace price = 5000 if price < 5000


replace all values of price that are less than $5,000 with 5000
recode price (0 / 5000 = 5000)
change all prices less than 5000 to be $5,000
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2)
change the values and value labels then store in a new
variable, foreign2

blue

pink

blue

pink

should
contain
the same
variables
(columns)

append using "coffeeMaize2.dta", gen(filenum)


add observations from "coffeeMaize2.dta" to
current data and create variable "filenum" to
track the origin of each observation

MERGING TWO DATASETS TOGETHER


must contain a
common variable
id blue pink (id) id brown

ONE-TO-ONE

id

blue

pink brown _merge


3
3
3

MANY-TO-ONE
id

blue

pink

id brown

id

blue

pink brown _merge

_merge code
1 row only
(master) in ind2
2 row only
(using) in hh2
3 row in
(match) both

3
1
3

3
1
2

webuse ind_age.dta, clear


save ind_age.dta, replace
webuse ind_ag.dta, clear
merge 1:1 id using "ind_age.dta"
one-to-one merge of "ind_age.dta"
into the loaded dataset and create
variable "_merge" to track the origin
webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
merge m:1 hid using "hh2.dta"
many-to-one merge of "hh2.dta"
into the loaded dataset and create
variable "_merge" to track the origin

FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID


reclink match records from different data sets using probabilistic matching ssc install reclink
jarowinkler create distance measure for similarity between two strings ssc install jarowinkler

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

FIND MATCHING STRINGS

display strmatch("123.89", "1??.?9")


return true (1) or false (0) if string matches pattern
display substr("Stata", 3, 5)
return the string located between characters 3-5
list make if regexm(make, "[0-9]")
list observations where make matches the regular
expression (here, records that contain a number)
list if regexm(make, "(Cad.|Chev.|Datsun)")
return all observations where make contains
"Cad.", "Chev." or "Datsun"
compare the given list against the first word in make

TRANSFORM STRINGS

Combine Data
id

GET STRING PROPERTIES

display length("This string has 29 characters")


return the length of the string
* user-defined package
charlist make
display the set of unique characters within a string
display strpos("Stata", "a")
return the position in Stata where a is first found

list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")


return all observations where the first word of the
make variable contains the listed words

ADDING (APPENDING) NEW DATA

CHANGE COLUMN NAMES

label list
list all labels within the dataset

Manipulate Strings

Reshape Data

display regexr("My string", "My", "Your")


replace string1 ("My") with string2 ("Your")
replace make = subinstr(make, "Cad.", "Cadillac", 1)
replace first occurrence of "Cad." with Cadillac
in the make variable
display stritrim(" Too much Space")
replace consecutive spaces with a single space
display trim(" leading / trailing spaces ")
remove extra spaces before and after a string
display strlower("STATA should not be ALL-CAPS")
change string case; see also strupper, strproper
display strtoname("1Var name")
convert string to Stata-compatible variable name
display real("100")
convert string to a numeric or missing value

Save & Export Data

compress
compress data in memory
Stata 12-compatible file
save "myData.dta", replace
saveold "myData.dta", replace version(12)
save data in Stata format, replacing the data if
a file with same name exists
export excel "myData.xls", /*
*/ firstrow(variables) replace
export data as an Excel file (.xls) with the
variable names as the first row
export delimited "myData.csv", delimiter(",") replace
export data as a comma-delimited file (.csv)

geocenter.github.io/StataTraining

Disclaimer: we are not affiliated with Stata. But we like it.

updated March 2016


CC BY 4.0

Data Visualization

BASIC PLOT SYNTAX:

with Stata 14.1 Cheat Sheet

y3

main plot-specific options;


see help for complete set

graph bar (count), over(foreign, gap(*0.5)) intensity(*0.5)


graph hbar draws horizontal bar charts

graph bar (percent), over(rep78) over(foreign)

23
20

graph hbar ...

(asis) (percent) (count) over(<variable>, <options: gap(*#)


relabel descending reverse>) cw missing nofill allcategories
percentages stack bargap(#) intensity(*#) yalternate xalternate

DISCRETE X, CONTINUOUS Y

graph bar (median) price, over(foreign)

twoway pcspike wage68 ttl_exp68 wage88 ttl_exp88


(sysuse nlswide1)

Parallel coordinates plot

twoway scatter mpg weight, jitter(7)

twoway pccapsym wage68 ttl_exp68 wage88 ttl_exp88


Slope/bump plot
(sysuse nlswide1)

vertical, horizontal

half jitter(#) jitterseed(#)


diagonal [aweights(<variable>)]

17
10

vertical horizontal headlabel

THREE VARIABLES
twoway contour mpg price weight, level(20) crule(intensity)

twoway scatter mpg weight, mlabel(mpg)


scatter plot with labelled values

3D contour plot

twoway connected mpg price, sort(price)

regress price mpg trunk weight length turn, nocons


matrix regmat = e(V)
ssc install plotmatrix
plotmatrix, mat(regmat) color(green)

ccuts(#s) levels(#) minmax crule(hue | chue| intensity)


scolor(<color>) ecolor (<color>) ccolors(<colorlist>) heatmap
interp(thinplatespline | shepard | none)

jitter(#) jitterseed(#) sort cmissing(yes | no)


connect(<options>) [aweight(<variable>)]

scatter plot with connected lines and symbols


jitter(#) jitterseed(#) sort
see also
connect(<options>) cmissing(yes | no)

graph hbar ...

bar plot (asis) (percent) (count) (stat: mean median sum min max ...)

over(<variable>, <options: gap(*#) relabel descending reverse


sort(<variable>)>) cw missing nofill allcategories percentages
stack bargap(#) intensity(*#) yalternate xalternate

line

twoway area mpg price, sort(price)


line plot with area shading

heatmap

dot plot (asis) (percent) (count) (stat: mean median sum min max ...)

over(<variable>, <options: gap(*#) relabel descending reverse


sort(<variable>)>) cw missing nofill allcategories percentages
linegap(#) marker(#, <options>) linetype(dot | line | rectangle)
dots(<options>) lines(<options>) rectangles(<options>) rwidth

graph hbox mpg, over(rep78, descending) by(foreign) missing


graph box draws vertical boxplots
box plot

over(<variable>, <options: total gap(*#) relabel descending reverse


sort(<variable>)>) missing allcategories intensity(*#) boxgap(#)
medtype(line | line | marker) medline(<options>) medmarker(<options>)

ssc install vioplot

violin plot over(<variable>, <options: total missing>)>) nofill

vertical horizontal obs kernel(<options>) bwidth(#)


barwidth(#) dscale(#) ygap(#) ogap(#) density(<options>)
bar(<options>) median(<options>) obsopts(<options>)

JUXTAPOSE (FACET)

Plot Placement

twoway scatter mpg price, by(foreign, norescale)

SUPERIMPOSE

total missing colfirst rows(#) cols(#) holes(<numlist>)


compact [no]edgelabel [no]rescale [no]yrescal [no]xrescale
[no]iyaxes [no]ixaxes [no]iytick [no]ixtick [no]iylabel
[no]ixlabel [no]iytitle [no]ixtitle imargin(<options>)

twoway mband mpg weight || scatter mpg weight

sort cmissing(yes | no) vertical, horizontal


base(#)

plot median of the y values


bands(#)

scatter y3 y2 y1 x, marker(i o i) mlabel(var3 var2 var1)


plot several y values for a single x value

graph twoway scatter mpg price in 27/74 || scatter mpg price /*


*/ if mpg < 15 & price > 12000 in 27/74, mlabel(make) m(i)

Laura Hughes (lhughes@usaid.gov) Tim Essam (tessam@usaid.gov)


follow us @flaneuseks and @StataRGIS

ssc install binscatter

binscatter weight mpg, line(none)

plot a single value (mean or median) for each x value

vertical, horizontal base(#) barwidth(#)

medians nquantiles(#) discrete controls(<variables>)


linetype(lfit | qfit | connect | none) aweight[<variable>]

FITTING RESULTS

twoway dot mpg rep78

twoway lfitci mpg weight || scatter mpg weight

dot plot

vertical, horizontal base(#) ndots(#)


dcolor(<color>) dfcolor(<color>) dlcolor(<color>)
dsize(<markersize>) dsymbol(<marker type>)
dlwidth(<strokesize>) dotextend(yes | no)

calculate and plot linear fit to data with confidence intervals

level(#) stdp stdf nofit fitplot(<plottype>) ciplot(<plottype>)


range(# #) n(#) atobs estopts(<options>) predopts(<options>)

twoway lowess mpg weight || scatter mpg weight

twoway dropline mpg price in 1/5

dropped line plot

calculate and plot lowess smoothing

twoway rcapsym length headroom price

calculate and plot quadriatic fit to data with confidence intervals

bwidth(#) mean noweight logit adjust

vertical, horizontal base(#)

twoway qfitci mpg weight, alwidth(none) || scatter mpg weight


level(#) stdp stdf nofit fitplot(<plottype>) ciplot(<plottype>)
range(# #) n(#) atobs estopts(<options>) predopts(<options>)

range plot (y1 y2) with capped lines


vertical horizontal

see also rcap

REGRESSION RESULTS

twoway rarea length headroom price, sort


vertical horizontal sort
cmissing(yes | no)

combine 2+ saved graphs into a single plot

combine twoway plots using ||

twoway bar price rep78

bar plot

range plot (y1 y2) with area shading

graph combine plot1.gph plot2.gph...

mat(<variable) split(<options>) color(<color>) freq

SUMMARY PLOTS

graph dot (mean) length headroom, over(foreign) m(1, ms(S))

vioplot price, over(foreign)

save

scatter plot of each combination of variables

jitter(#) jitterseed(#) sort cmissing(yes | no)


connect(<options>) [aweight(<variable>)]

(asis) (percent) (count) over(<variable>, <options: gap(*#)


relabel descending reverse>) cw missing nofill allcategories
percentages stack bargap(#) intensity(*#) yalternate xalternate

plot size

scatter plot

bar plot

annotations

xline(xint) yline(yint) text(y x "annotation")


axes

graph matrix mpg price weight, half


y2

kdensity mpg, bwidth(3)

facet

by(var)

TWO+ CONTINUOUS VARIABLES


y1

bin(#) width(#) density fraction frequency percent addlabels


addlabopts(<options>) normal normopts(<options>) kdensity
kdenopts(<options>)

grouped bar plot

[if], <plot options>

custom appearance

histogram

DISCRETE

plot-specific options

[in]

<marker, line, text, axis, legend, background options> scheme(s1mono) play(customTheme) xsize(5) ysize(4) saving("myPlot.gph", replace)

histogram mpg, width(5) freq kdensity kdenopts(bwidth(5))

bwidth kernel(<options>
normal normopts(<line options>)

y1 y2 yn x

titles

sysuse auto, clear

smoothed histogram

variables: y first

title("title") subtitle("subtitle") xtitle("x-axis title") ytitle("y axis title") xscale(range(low high) log reverse off noline) yscale(<options>)

For more info see Statas reference manual (stata.com)

ONE VARIABLE
CONTINUOUS

graph <plot type>

twoway rbar length headroom price


range plot (y1 y2) with bars

vertical horizontal barwidth(#) mwidth


msize(<marker size>)

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

regress price mpg headroom trunk length turn


ssc install coefplot
coefplot, drop(_cons) xline(0)
Plot regression coefficients

baselevels b(<options>) at(<options>) noci levels(#)


keep(<variables>) drop(<variables>) rename(<list>)
horizontal vertical generate(<variable>)

regress mpg weight length turn


margins, eyex(weight) at(weight = (1800(200)4800))
marginsplot, noci
Plot marginal effects of regression
horizontal noci

geocenter.github.io/StataTraining

updated February 2016

Disclaimer: we are not affiliated with Stata. But we like it.

CC BY 4.0

Plotting in Stata 14.1

Apply Themes

ANATOMY OF A PLOT
title

annotation
200

Customizing Appearance

For more info see Statas reference manual (stata.com)

plots contain many features

y-axis

y-axis title

150

graph region
inner graph region
inner plot region

marker label

marker

grid lines

3
0

y-line

outer region
inner region
scatter price mpg, graphregion(fcolor("192 192 192") ifcolor("208 208 208"))

20

40

x-axis

specify the fill of the background in RGB or with a Stata color

scatter price mpg, plotregion(fcolor("224 224 224") ifcolor("240 240 240"))

60

80

x-axis title

LINES / BORDERS
arguments for the plot
objects (in green) go in the
options portion of these
commands (in orange)

SYNTAX

marker

<marker
options>

for example:
scatter price mpg, xline(20, lwidth(vthick))

mcolor(none)

COLOR

mcolor("145 168 208")

specify the fill and stroke of the marker


in RGB or with a Stata color

mfcolor("145 168 208")

msize(medium)

SIZE / THICKNESSS

ehuge
vhuge
huge
vlarge

color("145 168 208")

Oh

Dh

Th

Sh

oh

dh

th

sh

none i

jitter(#)

jitterseed(#)

randomly displace the markers

set seed

lcolor(none)

marker mlcolor("145 168 208")

tlcolor("145 168 208")


glcolor("145 168 208")

lwidth(medthick)

marker

specify the thickness


(stroke) of a line:

tick marks
grid lines

vvvthick
vvthick
vthick
thick
medthick
medium

medsmall
small
vsmall
tiny
vtiny

specify the marker symbol:

titles

marker label

lcolor("145 168 208")

specify the stroke color of the line or border

line

axes
grid lines

solid
dash
dot

medthin
thin
vthin
vvthin
vvvthin
none

lpattern(dash)
glpattern(dash)

specify the
line pattern

longdash

longdash_dot

shortdash

shortdash_dot

dash_dot

blank

axes off no axis/labels


noline
tick marks noticks
tick marks tlength(2)
grid lines nogrid nogmin nogmax
axes

tick marks xlabel(#10, tposition(crossing))


number of tick marks, position (outside | crossing | inside)

Laura Hughes (lhughes@usaid.gov) Tim Essam (tessam@usaid.gov)


follow us @flaneuseks and @StataRGIS

title(...)
subtitle(...)
xtitle(...)
ytitle(...)

annotation

text(...)

specify the color of the text

USING A SAVED THEME

twoway scatter mpg price, scheme(customTheme)


Create custom themes by

help scheme entries saving options in a .scheme file


see all options for setting scheme properties
adopath ++ "~/<location>/StataThemes"
set path of the folder (StataThemes) where custom
.scheme files are saved
set as default scheme

set scheme customTheme, permanently


change the theme
axis labels

xlabel(...)
ylabel(...)
legend

legend(...)

install William Buchanans package to generate custom


schemes and color palettes (including ColorBrewer)

USING THE GRAPH EDITOR

twoway scatter mpg price, play(graphEditorTheme)

color(none)

Select the
Graph Editor

marker label mlabcolor("145 168 208")

labcolor("145 168 208")

axis labels

mlwidth(thin)
tlwidth(thin)
glwidth(thin)

Schemes are sets of graphical parameters, so you dont


have to specify the look of the graphs every time.

net inst brewscheme, from("https://wbuchanan.github.io/brewscheme/") replace

<marker
options>

medium

msymbol(Dh)

APPEARANCE

medlarge

tick marks

axes

marker

tick marks

tick marks

TEXT

<line options> <marker xscale(...) grid lines


options> yscale(...)
xline(...)
xlabel(...)
yline(...)
legend
ylabel(...)
legend(region(...))

grid lines

specify the marker size:

large

POSITION

mfcolor(none)

specify the fill of the marker

line

100

y2
Fitted values

legend

specify the fill of the plot background in RGB or with a Stata color

SYMBOLS

line

50

y-axis labels

plot region

10

100

y-axis title

titles

subtitle

specify the size of the text:


size(medsmall)
marker label mlabsize(medsmall)
axis labels labsize(medsmall)

Click
Record

28 pt. vhuge
20 pt.
16 pt.
14 pt.
12 pt.
11 pt.

10 pt. medsmall
8 pt. small
huge
6 pt.
vsmall
4 pt.
tiny
vlarge
2 pt.
half_tiny
large
1.3 pt. third_tiny
medlarge 1 pt. quarter_tiny
medium 1 pt minuscule

Double click on
symbols and areas
on plot, or regions
on sidebar to
customize
Unclick
Record

marker label mlabel(foreign)


label the points with the values
of the foreign variable
axis labels

nolabels

axis labels

format(%12.2f )

no axis labels
change the format of the axis labels

legend

off

legend

label(# "label")

turn off legend

change legend label text

marker label mlabposition(5)


label location relative to marker (clock position: 0 12)

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

Save theme
as a .grec file

Save Plots
graph twoway scatter y x, saving("myPlot.gph") replace
save the graph when drawing
graph save "myPlot.gph", replace
save current graph to disk
graph combine plot1.gph plot2.gph...
combine 2+ saved graphs into a single plot
graph export "myPlot.pdf", as(.pdf)
see options to set
export the current graph as an image file size and resolution
geocenter.github.io/StataTraining

Disclaimer: we are not affiliated with Stata. But we like it.

updated June 2016


CC BY 4.0

Examples use auto.dta (sysuse auto, clear)


unless otherwise noted

univar price mpg, boxplot


ssc install univar
calculate univariate summary, with box-and-whiskers plot
stem mpg
return stem-and-leaf display of mpg
frequently used commands are
summarize price mpg, detail
highlighted in yellow
calculate a variety of univariate summary statistics
for Stata 13: ci mpg price, level (99)
ci mean mpg price, level(99)
r compute standard errors and confidence intervals
correlate mpg price
return correlation or covariance matrix
pwcorr price mpg weight, star(0.05)
return all pairwise correlation coefficients with sig. levels
mean price mpg
estimates of means, including standard errors
proportion rep78 foreign
estimates of proportions, including standard errors for
e
categories identified in varlist
ratio
estimates of ratio, including standard errors
total price
estimates of totals, including standard errors

TIME SERIES OPERATORS


L.
F.
D.
S.

lag x t-1
lead x t+1
difference x t-x t-1
seasonal difference x t-xt-1

tabulate foreign rep78, chi2 exact expected


tabulate foreign and repair record and return chi2
and Fishers exact statistic alongside the expected values
ttest mpg, by(foreign)
estimate t test on equality of means for mpg by foreign
r prtest foreign == 0.5
one-sample test of proportions
ksmirnov mpg, by(foreign) exact
Kolmogorov-Smirnov equality-of-distributions test
ranksum mpg, by(foreign) exact
equality tests on unmatched data (independent samples)
anova systolic drug webuse systolic, clear
analysis of variance and covariance
e pwmean mpg, over(rep78) pveffects mcompare(tukey)
estimate pairwise comparisons of means with equal
variances include multiple comparison adjustment

USEFUL ADD-INS

measure something

CATEGORICAL VARIABLES

identify a group to which


an observations belongs

INDICATOR VARIABLES
denote whether
T F
something is true or false

1900

SURVIVAL ANALYSIS

SURVEY DATA

stores results as e -class

more details at http://www.stata.com/manuals14/u25.pdf

o.
#

omit a variable or indicator


specify interactions

regress price io(2).rep78


regress price mpg c.mpg#c.mpg

specify rep78 variable to be an indicator variable


set the third category of rep78 to be the base category
set the base to most frequently occurring category for rep78
treat mpg as a continuous variable and
specify an interaction between foreign and mpg
set rep78 as an indicator; omit observations with rep78 == 2
create a squared mpg term to be used in regression

##

specify factorial interactions

regress price c.mpg##c.mpg

create all possible interactions with mpg (mpg and mpg2)

1970

1980

1990

webuse nhanes2b, clear

svyset psuid [pweight = finalwgt], strata(stratid)


declare survey design for a dataset
r
svydescribe
report survey data details
svy: mean age, over(sex)
estimate a population mean for each subpopulation
svy, subpop(rural): mean age
estimate a population mean for rural areas
e
svy: tabulate sex heartatk
report two-way table with tests of independence
svy: reg zinc c.age##c.age female weight rural
estimate a regression using survey weights

regress price mpg weight, robust


estimate ordinary least squares (OLS) model
on mpg weight and foreign, apply robust standard errors
regress price mpg weight if foreign == 0, cluster(rep78)
regress price only on domestic cars, cluster standard errors
rreg price mpg weight, genwt(reg_wt)
estimate robust regression to eliminate outliers
probit foreign turn price, vce(robust)
ADDITIONAL MODELS
estimate probit regression with
built-in Stata principal components analysis
pca
command
robust standard errors
factor analysis
factor
poisson nbreg
count outcomes
logit foreign headroom mpg, or
censored data
tobit
estimate logistic regression and
instrumental variables
ivregress ivreg2
report odds ratios
difference-in-difference
bootstrap, reps(100): regress mpg /* rddiff sscuser-written
install ivreg2 regression discontinuity
*/ weight gear foreign
xtabond xtabond2 dynamic panel estimator
estimate regression with bootstrapping psmatch2
propensity score matching
jackknife r(mean), double: sum mpg synth
synthetic control analysis
Blinder-Oaxaca decomposition
jackknife standard error of sample mean oaxaca
EXAMPLE
regress price i.rep78
regress price ib(3).rep78
fvset base frequent rep78
regress price i.foreign#c.mpg i.foreign

id 4

webuse drugtr, clear

DESCRIPTION
specify indicators
specify base indicator
command to change base
treat variable as continuous

id 3

stset studytime, failure(died)


r declare survey design for a dataset
stsum
summarize survival-time data
stcox drug age
e
estimate a cox proportional hazard model

OPERATOR
i.
ib.
fvset
c.

Tim Essam (tessam@usaid.gov) Laura Hughes (lhughes@usaid.gov)


follow us @StataRGIS and @flaneuseks

1950

2-period lag x t-2


2-period lead x t+2
difference of difference xt-xt1-(xt1-xt2)
lag-2 (seasonal difference) xtxt2

id 2

compact time series into means, sums and end-of-period values


tscollap
carryforward carry non-missing values forward from one obs. to the next
identify spells or runs in time series
tsspell

Estimation with Categorical & Factor Variables


CONTINUOUS VARIABLES

L2.
F2.
D2.
S2.

1850

id 1

100

webuse nlswork, clear

xtset id year
declare national longitudinal data to be a panel
xtdescribe
xtline plot
report panel aspects of a dataset
wage relative to inflation
r
xtsum hours
summarize hours worked, decomposing
standard deviation into between and
within components
xtline ln_wage if id <= 22, tlabel(#3)
plot panel data as a line plot
xtreg
ln_w c.age##c.age ttl_exp, fe vce(robust)
e
estimate a fixed-effects model with robust standard errors
4

200

1 Estimate Models

Statistical Tests

PANEL / LONGITUDINAL

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

2 Diagnostics

not appropriate after robust cluster( )

estat hettest test for heteroskedasticity


r
ovtest test for omitted variable bias
vif
report variance inflation factor
dfbeta(length)
Type help regress postestimation plots
calculate measure of influence for additional diagnostic plots
rvfplot, yline(0)
avplots
plot residuals
plot all partialmpg
rep78
against fitted
regression leverage
values
plots in one graph
Fitted values
weight
headroom

3 Postestimation

price

Summarize Data

webuse sunspot, clear

tsset time, yearly


declare sunspot data to be yearly time series
tsreport
r
report time series aspects of a dataset
generate lag_spot = L1.spot
create a new variable of annual lags of sun spots tsline plot
Number of sunspots
tsline spot
plot time series of sunspots
e
arima spot, ar(1/2)
estimate an auto-regressive model with 2 lags

price

Results are stored as either r -class or e -class. See Programming Cheat Sheet

TIME SERIES

price

For more info see Statas reference manual (stata.com)

By declaring data type, you enable Stata to apply data munging and analysis functions specific to certain data types

price

with Stata 14.1 Cheat Sheet

Declare Data

Residuals

Data Analysis

commands that use a fitted model

regress price headroom length Used in all postestimation examples


display _b[length]
display _se[length]
return coefficient estimate or standard error for mpg
from most recent regression model
margins, dydx(length) returns e-class information when post option is used
return the estimated marginal effect for mpg
r
margins, eyex(length)
return the estimated elasticity for price
predict yhat if e(sample)
create predictions for sample on which model was fit
predict double resid, residuals
calculate residuals based on last fit model
test mpg = 0
r test linear hypotheses that mpg estimate equals zero
lincom headroom - length
test linear combination of estimates (headroom = length)
geocenter.github.io/StataTraining

Disclaimer: we are not affiliated with Stata. But we like it.

updated June 2016


CC BY 4.0

Programming

with Stata 14.1

Cheat Sheet

For more info see Statas reference manual (stata.com)

1 Scalars

both r- and e-class results contain scalars

scalar x1 = 3
create a scalar x1 storing the number 3
scalar a1 = I am a string scalar
create a scalar a1 storing a string

2 Matrices

Scalars can hold


numeric values or
arbitrarily long strings

DISPLAYING & DELETING BUILDING BLOCKS

[scalar | matrix | macro | estimates] [list | drop] b


list contents of object b or drop (delete) object b
[scalar | matrix | macro | estimates] dir
list all defined objects for that class
matrix list b
matrix dir
scalar drop x1
list contents of matrix b list all matrices delete scalar x1

GLOBALS

public or private variables storing text

available through Stata sessions

LOCALS

R- AND E-CLASS: Stata stores calculation results in two* main classes:

return results from general commands


such as summary or tabulate

return results from estimation


commands such as regress or mean

To assign values to individual variables use:


r individual numbers or strings
e rectangular array of quantities or expressions
e pointers that store text (global or local)

1 SCALARS
2 MATRICES
3 MACROS

Loops: Automate Repetitive Tasks

ANATOMY OF A LOOP

objects to repeat over


temporary variable used
only within the loop
requires local macro notation

* theres also s- and n-class

PUBLIC

available only in programs, loops, or .do files PRIVATE

mean price
e ereturn list
returns list of scalars, macros,
matrices and functions

summarize price, detail


r return list
returns a list of scalars

scalars:
r(N)
r(mean)
r(Var)
r(sd)

=
=
=
=

74
6165.25...
86995225.97...
2949.49...
...

Results are replaced


each time an r-class
/ e-class command
is called

scalars:

e(df_r)
e(N_over)
e(N)
e(k_eq)
e(rank)

=
=
=
=
=

73
1
73
1
1

generate meanN = e(N)


create a new variable equal to
obs. in estimation command
preserve create a temporary copy of active dataframe
restore points to test
restore restore temporary copy to original point set
code that changes data
generate p_mean = r(mean)
create a new variable equal to
average of price

ACCESSING ESTIMATION RESULTS

After you run any estimation command, the results of the estimates are
stored in a structure that you can save, view, compare, and export

regress price weight


Use estimates store
estimates store est1
to compile results
store previous estimation results est1 in memory for later use
ssc install estout
eststo est2: regress price weight mpg
eststo est3: regress price weight mpg foreign
estimate two regression models and store estimation results
estimates table est1 est2 est3
print a table of the two estimation results est1 and est2

EXPORTING RESULTS

see also while

Stata has three options for repeating commands over lists or values:
foreach, forvalues, and while. Though each has a different first line,
the syntax is consistent:
foreach x of varlist var1 var2 var3 {

Many Stata commands store results in types of lists. To access these, use return or
ereturn commands. Stored results can be scalars, macros, matrices or functions.

matrix ad2 = a , d
matrix ad1 = a \ d
row bind matrices
column bind matrices
matselrc b x, c(1 3) findit matselrc
select columns 1 & 3 of matrix b & store in new matrix x
mat2txt, matrix(ad1) saving(textfile.txt) replace
export a matrix to a text file
ssc install mat2txt

global pathdata "C:/Users/SantasLittleHelper/Stata"


define a global variable called pathdata
cd $pathdata add a $ before calling a global macro
change working directory by calling global macro
global myGlobal price mpg length
summarize $myGlobal
summarize price mpg length using global

basic components of programming

4 Access & Save Stored r- and e-class Objects

e-class results are stored as matrices

matrix a = (4\ 5\ 6)
matrix b = (7, 8, 9)
create a 3 x 1 matrix
create a 1 x 3 matrix
matrix d = b' transpose matrix b; store in d

3 Macros

Building Blocks

The estout and outreg2 packages provide numerous, flexible options for making tables
after estimation commands. See also putexcel command.

command `x', option


...
}

open brace must


appear on first line

command(s) you want to repeat


can be one line or many

close brace must appear


on final line by itself

FOREACH: REPEAT COMMANDS OVER STRINGS, LISTS, OR VARIABLES


foreach x in|of [ local, global, varlist, newlist, numlist ] {
list types: objects over which the
Stata commands referring to `x'
commands will be repeated
}
loops repeat the same command
STRINGS
over different arguments:
sysuse "auto.dta", clear
foreach x in auto.dta auto2.dta {
same as...
tab rep78, missing
sysuse "`x'", clear
tab rep78, missing
sysuse "auto2.dta", clear
}
tab rep78, missing
LISTS
foreach x in "Dr. Nick" "Dr. Hibbert" {
display length("Dr. Nick")
display length ( "` x '" )
display length("Dr. Hibbert")
}
When calling a command that takes a string,
surround the macro name with quotes.

VARIABLES
foreach x in mpg weight {
summarize `x'
}

must define list type

foreach x of varlist mpg weight {


summarize `x'
}

foreach in takes any list


as an argument with
elements separated by
spaces
foreach of requires you
to state the list type,
which makes it faster

summarize mpg
summarize weight

FORVALUES: REPEAT COMMANDS OVER LISTS OF NUMBERS


iterator

forvalues i = 10(10)50 {
display `i'
numeric values over
which loop will run
}

DEBUGGING CODE

Use display command to


show the iterator value at
each step in the loop

ITERATORS

display 10
display 20
...

i = 10/50
10, 11, 12, ...
i = 10(10)50
10, 20, 30, ...
i = 10 20 to 50 10, 20, 30, ...

local myLocal price mpg length


esttab est1 est2, se star(* 0.10 ** 0.05 *** 0.01) label
see also capture and scalar _rc
set trace on (off )
create local variable called myLocal with the
create summary table with standard errors and labels
trace
the
execution
of
programs
for
error
checking
strings price mpg and length
esttab using auto_reg.txt, replace plain se
sysuse auto, clear
PUTTING
IT
ALL
TOGETHER
summarize ` myLocal ' add a ` before and a ' after local macro name to call
export summary table to a text file, include standard errors
pull out the first word
summarize contents of local myLocal
generate
car_make
=
word(make,
1)
from the make variable
outreg2 [est1 est2] using auto_reg2.txt, see replace
levelsof rep78, local(levels)
calculate unique groups of
export summary table to a text file using outreg2 syntax
levelsof car_make, local(cmake)
define the
create a sorted list of distinct values of rep78,
car_make and store in local cmake
local i to be
local i = 1
store results in a local macro called levels
an iterator
Additional Programming Resources
store the length of local
local cmake_len : word count `cmake'
local varLab: variable label foreign can also do with value labels
cmake in local cmake_len
store the variable label for foreign in the local varLab
 bit.ly/statacode
foreach x of local cmake {
download all examples from this cheat sheet in a .do file
display in yellow "Make group `i' is `x'"
TEMPVARS & TEMPFILES special locals for loops/programs
ssc install adolist
adoupdate
adolist
if `i' == `cmake_len' {
initialize
a
new
temporary
variable
called
temp1
tempvar temp1
Update user-written .ado files
List/copy user-written .ado files
tests the position of the
display "The total number of groups is `i'"
save squared mpg values in temp1
generate `temp1' = mpg^2
iterator,
executes
contents
 net install package, from (https://raw.githubusercontent.com/username/repo/master) in brackets when the
summarize the temporary variable temp1
}
summarize `temp1'
install a package from a Github repository
condition is true
local i = `++i'
increment iterator by one
tempfile myAuto create a temporary file to see also
https://github.com/andrewheiss/SublimeStataEnhanced
}
save `myAuto'
be used within a program tempname
configure Sublime text for Stata 11-14
Tim Essam (tessam@usaid.gov) Laura Hughes (lhughes@usaid.gov)
follow us @StataRGIS and @flaneuseks

inspired by RStudios awesome Cheat Sheets (rstudio.com/resources/cheatsheets)

geocenter.github.io/StataTraining

Disclaimer: we are not affiliated with Stata. But we like it.

updated June 2016

CC BY 4.0

Anda mungkin juga menyukai