STATA Manual Guide

STATA WORKSHOP DAY 2 (BASIC ANALYSIS BY USING STATA SOFTWARE)
Introduction
This note is a guide to the STATA course. It works through a sample dataset in STATA format
(student_analysis.dta), which continues from the previous workshop (Day 1). Having explored and
cleaned the data, we now focus on basic statistical analysis.
Data file: student_analysis.dta
Analysis
1. One sample t-test
2. Independent t-test
3. Paired t-test
4. One-way Analysis of Variance
5. Categorical Test
6. Estimation of Risks
7. Correlation
Opening Log File (filename_date.log or filename_date.smcl)

Open a log file before doing anything else, because the log file captures all the output shown in
the Results window, including the commands.
1. Select: File > Log > Begin
2. A dialog box of Begin logging Stata output is open

Notes:
Two formats for saving a log file:
(1) *.smcl The results look the same as those displayed in the Results window.
(2) *.log The results are plain text; the file can be opened in Notepad on other computers without
having the STATA software.
Command:
. log using "C:\Users\filename_date.log"
Or
. log using "C:\Users\filename_date.smcl"
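The logging steps above can be sketched as a short sequence of commands. This is only an illustration: the file name analysis_log.log is hypothetical, and the built-in auto dataset stands in for student_analysis.dta.

```stata
* Start a plain-text log in the current working directory.
* "analysis_log.log" is a hypothetical name; replace overwrites an existing file.
log using "analysis_log.log", text replace

* Load an example dataset shipped with Stata and produce some output.
sysuse auto, clear
summarize mpg

* Stop logging and close the file.
log close
```

Closing the log with log close at the end of a session ensures that everything run in between is written to the file.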
ANALYSIS 1: One Sample t-test
Research question: The researchers randomly recruited 438 students and assessed their systolic
pressure. The standard population mean systolic pressure is 120 mmHg.
Step 1: Hypothesis
H0: The mean systolic pressure is equal to 120 mmHg.
HA: The mean systolic pressure is different from 120 mmHg.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Random sample
2. Independent sample
Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)
2. A dialog box is open
Command: ttest variable == testvalue

3. Output:
Step 5: Interpretation
The 95% confidence interval of the mean systolic pressure (114.69, 116.60) does not include 120.
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean systolic pressure is different from 120 mmHg. The mean
systolic pressure is 115.65 mmHg (95% CI: 114.69, 116.60), lower than the population mean systolic pressure.
Step 7: Presentation of results
Table 1: Comparison of mean systolic pressure to the population mean of 120 mmHg (n = 438)
Parameter           Mean (SD)        95% Confidence Interval   t-statistic^a (df)   p-value
Systolic pressure   115.65 (10.21)   114.69, 116.60            -8.93 (437)          < 0.001
^a One sample t-test analysis was applied
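As a cross-check, the one sample t-test can be rerun from the summary statistics alone using the immediate command ttesti, which needs no dataset. The sketch below uses the n, mean and SD reported in Table 1; because these inputs are rounded, the t-statistic may differ slightly in the last decimal from the value obtained from the raw data.

```stata
* Immediate one-sample t-test: ttesti #obs #mean #sd #hypothesised_mean
* Inputs are the rounded values from Table 1 (n = 438, mean = 115.65, SD = 10.21).
ttesti 438 115.65 10.21 120
```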
ANALYSIS 2: Independent t-test

Research question: The researchers wish to know whether mean BMI differs between males and females.
Step 1: Hypothesis
H0: The mean BMI of males and females is the same.
HA: The mean BMI of males and females is different.
Step 2: Level of significance
α = 0.05
Step 3: Assumptions
1. Random sample
2. Two samples are independent
3. Two populations are normally distributed
There are two ways of checking normal distribution:
(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram
b) A dialog box is open
Command: histogram outcomevar, normal by(groupingvar)

c) Output:
d) Interpretation: Normality assumption is met.

(ii) Box and whisker plot
a) Select: Graphics > Box plot
b) A dialog box is open
Command: graph box outcomevar, by(groupingvar)
c) Output:
d) Interpretation: Normality assumption is met

4. Two populations have the same variances (homogeneity of variances)
Checking homogeneity of variances through the variance-comparison (F) test:
a) Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Variance-comparison test
b) A dialog box is open
Command: sdtest outcomevar, by(groupingvar)

c) Output:
Notes:
Hypothesis of the variance-comparison test
H0: The variances between groups are the same
HA: The variances between groups are different (two-tailed)
The sdtest command performs the F test for equality of variances; Levene's test is a separate
command (robvar).
d) Interpretation:
p-value > 0.05 (do not reject H0)
The variances are equal. Thus, the assumption of homogeneity of variances is met.
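The assumption checks in Step 3 can be run together as a short sequence. The sketch below is an illustration on the auto dataset shipped with Stata, with mpg standing in for the outcome variable and foreign for the grouping variable; these are not the variables in student_analysis.dta. Note that sdtest gives the F test for equal variances, while robvar reports Levene's statistic (labelled W0) along with two robust variants.

```stata
* Example data: mpg is the outcome, foreign is the two-level grouping variable.
sysuse auto, clear

* Normality by group: histogram with an overlaid normal curve, then box plot.
histogram mpg, normal by(foreign)
graph box mpg, by(foreign)

* Equality of variances: F test (sdtest) and Levene's test (robvar).
sdtest mpg, by(foreign)
robvar mpg, by(foreign)
```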
Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)
2. A dialog box is open:
Command: ttest outcomevar, by(groupingvar)

3. Output:
Step 5: Interpretation
The 95% confidence interval of the mean difference in BMI does not include zero.
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean BMI differs between males and females. The mean BMI of
males (23.97 ± 0.26) is higher than that of females (20.89 ± 0.22) (mean ± standard error).
Step 7: Presentation of results
Table 2: Mean comparison of BMI between genders (n = 438)
Group (n)      Mean (SD)      Mean difference (95% CI)   t-statistic^a (df)   p-value
Male (196)     23.97 (3.59)   3.08 (2.41, 3.75)          9.07 (436)           < 0.001
Female (242)   20.89 (3.49)
^a Independent t-test was applied
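The independent t-test can also be checked from the group summaries using the two-sample form of ttesti. The sketch below uses the group sizes, means and SDs from Table 2 and assumes equal variances (the default), in line with the assumption check in Step 3. Small differences from the table are possible because the inputs are rounded.

```stata
* Immediate two-sample t-test: ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
* Group 1 = males (n = 196), group 2 = females (n = 242); values from Table 2.
ttesti 196 23.97 3.59 242 20.89 3.49
```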
ANALYSIS 3: Paired t-test

Research question: The researchers conducted an extensive aerobic activity and wish to assess the
change in BMI before and after the activity.
Step 1: Hypothesis
H0: There is no difference in mean BMI before and after the aerobic activity.
HA: The mean BMI before and after the aerobic activity is different.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Random sample
2. The two samples are dependent.
3. The differences between the paired observations are normally distributed.
Since this analysis involves paired samples, normality is checked on the difference between the two
samples.
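a) Select: Data > Create or change data > Create new variable
b) A dialog box is open
Command: generate newvardiff = prevar - postvar

c) Output:
There are two ways of checking normality of distribution:
(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram
b) A dialog box is open
Command: histogram newvardiff, normal

c) Output:
d) Interpretation: Normality assumption of BMI difference is met.

(ii) Box and whisker plot
a) Select: Graphics > Box plot
b) A dialog box is open
Command: graph box newvardiff

c) Output:
d) Interpretation: Normality assumption of BMI difference is met.

Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)
2. A dialog box is open
Command: ttest prevar == postvar

3. Output:
Step 5: Interpretation
The 95% confidence interval of the mean difference in BMI does not include zero.
p-value < 0.001, reject H0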
Step 6: Conclusion
At the 5% level of significance, the mean BMI before and after the aerobic activity is different.
The mean BMI before the activity is higher than the mean BMI after it. The aerobic activity is
effective in lowering BMI.
Step 7: Presentation of results
Table 3: Comparison of mean BMI before and after an extensive aerobic activity (n = 438).
Group    Mean (SD)      Mean difference (95% CI)   t-statistic^a (df)   p-value
Before   22.27 (3.85)   0.86 (0.78, 0.94)          22.01 (437)          < 0.001
After    21.41 (3.97)
^a Paired t-test was applied.
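The paired workflow (create the difference, check its distribution, then run the test) can be sketched end to end. The example below is an illustration on the bpwide dataset shipped with Stata, which holds before and after blood-pressure readings; bp_diff is a name chosen here for the new variable, and none of these variables belong to student_analysis.dta.

```stata
* Example data with one row per subject and two measurements.
sysuse bpwide, clear

* Difference between the paired measurements (before minus after).
generate bp_diff = bp_before - bp_after

* Normality of the difference: histogram with normal curve, then box plot.
histogram bp_diff, normal
graph box bp_diff

* Paired t-test on the two measurements.
ttest bp_before == bp_after
```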
ANALYSIS 4: One-way Analysis of Variance (One-way ANOVA)

Research question: The researchers wish to know whether mean height differs among four races
(Malay, Chinese, Indian, and Others).
Step 1: Hypothesis
H0: The mean height is the same among the races (Malay, Chinese, Indian, and Others).
HA: The mean height differs for at least one race (Malay, Chinese, Indian, and Others).
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Random sample
2. Independent samples
3. The populations are normally distributed
There are two ways of checking normal distribution:
(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram
b) A dialog box is open
Command: histogram outcomevar, normal by(groupingvar)

c) Output:
d) Interpretation: Normality assumption is met.

(ii) Box and whisker plot
a) Select: Graphics > Box plot
b) A dialog box is open
Command: graph box outcomevar, by(groupingvar)
c) Output:
d) Interpretation: Normality assumption is met

4. The populations have the same variances (homogeneity of variances)
Checking homogeneity of variances through Bartlett's test:
a) Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA
b) A dialog box is open
Command: oneway outcomevar groupingvar

c) Output:
Notes:
Hypothesis of Bartlett's test
H0: The variances between groups are the same
HA: The variances between groups are different (one-tailed)
d) Interpretation:
p-value > 0.05 (do not reject H0)
The variances are equal. Thus, the assumption of homogeneity of variances is met.
Step 4: Statistical test
1. Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA
2. A dialog box is open
Command: oneway outcomevar groupingvar
3. Output:
Step 5: Interpretation
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean height differs among races.
Additional analysis: Post-hoc analysis
Post-hoc analysis identifies which pairs of groups differ significantly in mean height.
Possible comparison pairs of height among races (six pairs in total):
- Malay vs Chinese
- Malay vs Indian
- Malay vs Others
- Indian vs Chinese
- Indian vs Others
- Chinese vs Others
a) Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA
b) A dialog box is open
Command: oneway outcomevar groupingvar, bonferroni

c) Output:
The output table shows the p-value of each comparison pair among races.
d) Interpretation:
Malay vs Chinese:- p-value = 0.001, reject H0
Malay vs Indian:- p-value = 0.001, reject H0
Malay vs Others:- p-value > 0.008, do not reject H0
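Chinese vs Indian:- p-value > 0.008, do not reject H0
Chinese vs Others:- p-value > 0.008, do not reject H0
Indian vs Others:- p-value > 0.008, do not reject H0
Bonferroni correction:
The corrected alpha value (α / number of pairs) = 0.05 / 6 ≈ 0.008
Note: the bonferroni option of oneway reports p-values that are already adjusted for the number of
comparisons, so the reported p-values are normally compared with α = 0.05 rather than 0.008.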
Step 6: Conclusion
At the 5% level of significance, mean height differs significantly among races. Significant
differences are reported between Malay and Chinese, and between Malay and Indian.
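The number of comparison pairs and the Bonferroni-corrected alpha can be worked out directly in Stata, and the ANOVA with post-hoc comparisons can be tried on example data. The sketch below is an illustration only: it uses the auto dataset shipped with Stata, with mpg as the outcome and rep78 (five repair categories) as the grouping variable, not the height and race variables of the workshop data.

```stata
* Four groups give comb(4, 2) pairwise comparisons; divide alpha by that number.
display comb(4, 2)
display 0.05 / comb(4, 2)

* One-way ANOVA with a table of group means and Bonferroni multiple comparisons.
* The output also includes Bartlett's test for equal variances.
sysuse auto, clear
oneway mpg rep78, tabulate bonferroni
```

The tabulate option adds the group means, standard deviations and frequencies needed for a results table.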
Step 7: Presentation of results

1. Select: Statistics > Summaries, tables, and tests > Other tables > Tables of means, std. dev. and frequencies
2. A dialog box is open
Command: oneway outcomevar groupingvar, bonferroni

3. Output:
Table 4: Comparison of mean height among races (n = 438)

Races     n     Mean (SD)       F-statistic^a (df)   p-value^b
Malay     261   162.85 (8.06)   9.59 (3)             < 0.001
Chinese   138   165.99 (7.85)
Indian    28    168.89 (9.19)
Others    11    169.75 (5.52)
^a One-way ANOVA test was applied
^b Malay vs Chinese:- p-value = 0.001, reject H0; Malay vs Indian:- p-value = 0.001, reject H0; Malay vs Others:- p-value > 0.008, do not reject H0; Chinese vs Indian:- p-value > 0.05, do not reject H0; Chinese vs Others:- p-value > 0.05, do not reject H0; Indian vs Others:- p-value > 0.95, do not reject H0.
ANALYSIS 5: Categorical Test

Research question: The researchers would like to determine whether there is an association between
hypertension (normal or hypertensive) and gender (male or female).
Step 1: Hypothesis
H0: There is no association between hypertension and gender.
HA: There is an association between hypertension and gender.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Independent samples
2. Two variables are categorical.
3. If fewer than 20% of the cells have an expected frequency < 5, use the Chi-square test. If 20% or
more of the cells have an expected frequency < 5, use Fisher's exact test.
Checking expected frequency assumption:
1. Select: Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association
2. A dialog box is open
Command: tabulate indepvar depvar, expected

3. Output:
0 cells (0%) have an expected frequency of less than 5; therefore, the Chi-square test will be applied.
4. Interpretation: Assumption is met.

Step 4: Statistical test (Pearson Chi-square test)
1. Select: Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association
2. A dialog box is open
Command: tabulate indepvar depvar, chi2 row
3. Output:
Step 5: Interpretation
χ² statistic = 4.29
p-value < 0.05, reject H0
Step 6: Conclusion
There is an association between hypertension and gender. The proportion with hypertension is higher
among males (7.1%) than among females (2.9%).
Step 7: Presentation of results
Table 5: Association between hypertension and gender (n = 438)
Gender   Hypertension, n (%)          χ² statistic^a (df)   p-value
         Hypertensive   Normal
Male     14 (7.1)       182 (92.9)    4.29 (1)              0.038
Female   7 (2.9)        235 (97.1)
^a Pearson Chi-square test was applied; level of significance was set at 5%.
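The cell counts in Table 5 are enough to rerun the test with the immediate command tabi, which takes the table row by row without needing a dataset. The sketch below requests expected frequencies, row percentages and the Pearson Chi-square in one call; the second call shows how Fisher's exact test would be requested if the expected-frequency assumption were not met.

```stata
* tabi takes the cell counts row by row, with \ separating the rows.
* Row 1: males (hypertensive, normal); row 2: females (hypertensive, normal).
tabi 14 182 \ 7 235, expected chi2 row

* Fisher's exact test, for tables with small expected frequencies.
tabi 14 182 \ 7 235, exact
```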
ANALYSIS 6: Estimation of Risk

Research question: The researchers wish to estimate the risk of hypertension by gender.
Step 1: Hypothesis
H0: There is an equal risk of hypertension between males and females (OR = 1)
HA: The risk of hypertension is increased (OR > 1) or reduced (OR < 1) in males compared with
females.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumption
1. There are cases and non-cases (disease: hypertension).
2. There are exposed and non-exposed observations (factor: gender).
Step 4: Statistical test
1. Select: Statistics > Epidemiology and related > Tables for epidemiologists > Cohort study risk-ratio etc. calculator
2. The dialog box of csi (Cohort studies) is open.
Command: csi a b c d, or
3. Output:
Step 5: Interpretation
OR = 2.58 (95% CI: 1.05, 6.35). The 95% CI does not include 1, reject H0.
Step 6: Conclusion
The odds of hypertension among males are 2.58 times the odds among females.
Step 7: Presentation of results
Table 6: Association between hypertension and gender (n = 438)
Gender   Hypertension, n (%)          OR (95% CI)         χ² statistic^a (df)   p-value
         Hypertensive   Normal
Male     14 (7.1)       182 (92.9)    2.58 (1.05, 6.35)   4.29 (1)              0.038
Female   7 (2.9)        235 (97.1)
^a Pearson Chi-square test was applied; level of significance was set at 5%.
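In csi a b c d, the four numbers are the exposed cases (a), unexposed cases (b), exposed non-cases (c) and unexposed non-cases (d). The sketch below fills them in from Table 6, treating males as the exposed group and hypertensive students as the cases; that mapping is an assumption made here for illustration. The last line repeats the odds ratio by hand as (a × d) / (b × c), which comes to about 2.58.

```stata
* a = 14 hypertensive males, b = 7 hypertensive females,
* c = 182 normal males, d = 235 normal females.
csi 14 7 182 235, or

* Odds ratio by hand: (a*d) / (b*c).
display (14*235) / (7*182)
```

The confidence interval depends on the method csi uses for it, so it may not match an interval computed by a different method.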
ANALYSIS 7: Correlation
Research question: The researchers wish to identify the relationship between weight and systolic pressure.
Step 1: Hypothesis
H0: There is no relationship between weight (kg) and systolic pressure (mmHg).
HA: There is a relationship between weight (kg) and systolic pressure (mmHg).
1. Distribution of weight by histogram
a) Select: Graphics > Histogram
b) A dialog box is open
Command: histogram var, normal
c) Output:
d) Interpretation:
Distribution of weight is approximately normal.
2. Distribution of weight by box and whisker plot
a) Select: Graphics > Box plot
b) A dialog box is open
Command: graph box var
c) Output:
d) Interpretation:
Distribution of weight is approximately normal.
3. Distribution of systolic pressure by histogram and box and whisker plot.
Interpretation: Distribution of systolic pressure is normal.

4. Relationship between weight and systolic pressure
a) Select: Graphics > Twoway graph (scatter, line, etc.)
b) A dialog box is open
Command: twoway (scatter var1 var2)

c) Output:
d) Interpretation: The scatter plot is elliptical in shape.

5. Direction of relationship between weight and systolic blood pressure
a) Select: Graphics > Twoway graph (scatter, line, etc.)
b) A dialog box is open
Command:
. twoway (scatter vary varx) (lfit vary varx)
c) Output:
d) Interpretation: There is a positive correlation between weight and systolic blood pressure.
6. Strength of relationship between weight and systolic blood pressure
a) Select: Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Pairwise correlations
b) A dialog box is open
Command: pwcorr var1 var2, sig star(5)
c) Output:
Notes:
Correlation coefficient (r)
r ≤ 0.25: poor relationship
0.26 ≤ r ≤ 0.50: fair
0.51 ≤ r ≤ 0.75: good
0.76 ≤ r ≤ 1.00: excellent
Step 3: Conclusion
There is a significant, positive and fair correlation between weight and systolic blood pressure
(r=0.47, p<0.001).
Table 7: Correlation between weight and systolic blood pressure (n = 438)
Variable                   r^a    p-value
Weight (kg)                0.47   < 0.001
Systolic pressure (mmHg)
^a Pearson correlation test was applied; level of significance was set at 5%.
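The correlation steps can be tried out as one sequence. The sketch below is an illustration on the auto dataset shipped with Stata, with mpg and weight standing in for the two continuous variables; they are not the weight and systolic pressure variables of the workshop data.

```stata
sysuse auto, clear

* Distribution of each variable.
histogram mpg, normal
histogram weight, normal

* Shape and direction of the relationship: scatter plot with a fitted line.
twoway (scatter mpg weight) (lfit mpg weight)

* Strength of the relationship: Pearson r with its p-value,
* starred when significant at the 5% level.
pwcorr mpg weight, sig star(.05)
```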
Saving History of Commands in filename.do

1. Go to Review window > Select all the commands > right click > Send to Do-file Editor
2. A Do-file Editor window is open with the list of commands
3. Select: File > Save
4. A dialog box of Save Stata Do-file is open
Organising a filename.do
1. Open analysis_date.do > Create comments and notes as below
Notes:
Green text: Comments/ notes
/* : opening symbol for hiding selected commands
*/ : closing symbol for hiding selected commands
2. Click on Execute (do) to rerun all commands or selected commands
3. STATA reruns the previous work
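An organised do-file might look like the sketch below. The contents are illustrative only: they use the auto dataset shipped with Stata rather than the workshop data, and the choice of commands is an assumption made for the example.

```stata
* analysis_date.do: illustrative layout only
* Lines starting with * are comments and appear in green in the Do-file Editor.
sysuse auto, clear

* Analysis 1: one-sample t-test
ttest mpg == 20

/* The commands between the opening and closing symbols are skipped when the file runs.
ttest mpg, by(foreign)
*/

* Analysis 7: correlation
pwcorr mpg weight, sig
```

Moving the opening and closing symbols is a quick way to switch individual analyses on or off when rerunning the file.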