
Determining the Association between Distance and Traffic on Opinions Regarding School Times

Abstract:
This project set out to determine whether a student’s distance from LASA, or the time
the student spends in traffic on the way to school, is associated with their opinion on
school start times. After analysis, p-values of 0.916 and 0.948 led to the conclusion that
there is no significant evidence of an association between distance from LASA and
opinions or between time spent in traffic and opinions.

Robin Sam
Period 5B: AP Statistics
20 May 2019

Table of Contents
Introduction
Data Collection
Data Analysis
Conclusion
Appendix

Introduction
Earlier this school year, the LASA administration was seriously considering changing the
start time of the school day for the 2019-2020 school year. For this reason, I thought it would be
useful to research what students’ opinions were on changing the school times for next year, and
whether these opinions were associated with factors such as the distance traveled to get to
school and the traffic on the way to school. If this project produces significant results, there will
be sufficient evidence to support the claim that the distance a student lives from school is
associated with the student’s opinion on school times. The time a student spends in traffic on
the way to school (recorded in minutes) serves as a rough measure of how bad traffic is in Austin
overall and can help the administration decide how to change school start times. I chose
distance from school and traffic on the way to school because I think they are the factors most
likely to be associated with a student’s opinion on start times. Students who live farther away
may want more time to sleep and therefore prefer a later start, and students who face a lot of
traffic on the way may want a later start to avoid being counted tardy. Although the time at which
a student goes to sleep may also be associated with their opinion, I expect the distance to
school to be more strongly associated with it, since how early a student has to wake up
depends on how far away they live.

Data Collection
I began collecting data by creating a survey. The first question, "How far do you live
from LASA? (in miles)", determined the distance students lived from LASA. Underneath the
question was a link that made it easier for students to find their distance: they only had to enter
their address and record the shortest route, which kept the collection uniform. The link was
included to prevent students from recording incorrect or exaggerated distances because they
did not know LASA's address. A disclaimer stated that my survey was not recording addresses,
so that students would not refuse to take it out of concern that their addresses were being
collected in some way. This question had bounds that prevented answers lower than 0 miles or
higher than 50 miles to keep the answers accurate.

The other questions were "How much time on average do you think you spend in traffic
on the way to school? (in minutes)", to measure the traffic that students experience on the way
to school, and "When should school days start next year compared to this year?", to determine
students' opinions on school start times for the next school year. Traffic was defined as "any
time spent on the road where vehicles are frequently braking" to prevent confusion about what
counts as traffic. The time recorded was each student's own estimate of how long they spend in
traffic, and a bound prevented times less than 0 minutes. The opinion question had three
options, "Earlier time", "Same time", and "Later time", to make student answers uniform. Here is
the survey I made to record data.

I then contacted all the English teachers at LASA to give them the survey. I chose the
English department because all students take English regardless of their grade level, which
makes English classes a good cross-section of the student population. I asked the teachers who
allowed me to survey their classes to pick one of their classes at random, so that no teacher
would pick the class they considered the smartest or the most apt at taking the survey. I also
asked the teachers to request that every student in the randomly picked class take my survey, to
reduce nonresponse bias as much as possible.

Data Analysis
To conduct my analysis, I used ANOVA tests, which determine whether the mean of a
quantitative variable differs across several categorical groups. The null hypothesis is that there
is no difference among the group means, and the alternative hypothesis is that at least one
group mean is different. ANOVA compares the variation between groups to the variation within
groups. This ratio is called the F-statistic. If the ratio is very high (i.e. the variation between
groups is large relative to the variation within groups), the F-statistic produces a p-value small
enough to conclude that there is a significant difference in means across the groups. As seen in
the image below, there is a significant difference in the means across the three groups, which
results in a very high F-statistic and a very small p-value. This lets us reject the null hypothesis
in favor of the alternative hypothesis and conclude that there is an association between the
quantitative variable and the qualitative variable.
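
The between-versus-within comparison described above can be sketched in R on a small made-up data set. The values and group labels below are invented for illustration and are not the survey data. The sums of squares follow the standard one-way ANOVA definitions, with k - 1 and n - k degrees of freedom, which the description above does not spell out.

```r
# Toy data, made up for illustration only (not the survey data)
values <- c(1, 2, 3, 6, 7, 8, 11, 12, 13)
group  <- factor(rep(c("A", "B", "C"), each = 3))

k <- nlevels(group)   # number of groups
n <- length(values)   # total number of observations

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Variation between groups: group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Variation within groups: observations around their own group mean
ss_within <- sum((values - ave(values, group))^2)

# F-statistic: ratio of the two mean squares
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

# The same statistic as reported by aov()
f_from_aov <- summary(aov(values ~ group))[[1]][["F value"]][1]

f_by_hand   # 75
f_from_aov  # 75
```

Here the group means (2, 7 and 12) are far apart relative to the spread inside each group, so the ratio is large, which is the situation the paragraph above describes.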

To test for an association between distance from LASA and opinions, an ANOVA test
was conducted with three categorical groups: Earlier Time, Same Time and Later Time (based
on the options for school times next year). The null hypothesis is that the “population means of
distances for all opinion groups are the same” and the alternative hypothesis is that the
“population means of distances for all opinion groups are not the same (at least 1 is different)”.
The random condition for ANOVA was met because the data came from a random sample that
is representative of the LASA population: every student takes an English class, and random
classes were chosen from each English teacher’s schedule. The normality condition for ANOVA
was met because the histograms (Fig 3-5 in appendix) for each group were roughly normal and
the sample sizes were all greater than or equal to 30, except for the Earlier Time group, which
was of size 10 and whose histogram was skewed right, so proceed with caution. The equal
variance condition was met, as seen in the ANOVA representation (Fig 1), where the spread of
distances is almost the same across all three groups. The independence condition was also met
because the three groups were not related to each other in any way: the surveys were
completed individually and without collaboration.
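
The sample-size and equal-variance checks above were made by reading the histograms and the boxplot. A numeric version could look like the sketch below. The data frame `toy` is a hypothetical stand-in with made-up values, and only its column names (`Opinion`, `Distance`) mirror the code in the appendix. The comparison of the largest and smallest standard deviations against a factor of 2 is a commonly taught rule of thumb and an assumption here, not a check the analysis above reports.

```r
# Hypothetical stand-in for the survey data; all values are made up
toy <- data.frame(
  Opinion  = rep(c("Earlier time", "Same time", "Later Time"), times = c(3, 4, 5)),
  Distance = c(2, 5, 9,  3, 6, 8, 11,  1, 4, 7, 10, 12)
)

# Number of responses in each opinion group
group_sizes <- table(toy$Opinion)

# Groups too small to lean on the "n >= 30" guideline for normality
small_groups <- names(group_sizes)[group_sizes < 30]

# Standard deviation of distance within each opinion group
group_sds <- tapply(toy$Distance, toy$Opinion, sd)

# Rule of thumb (assumption): largest SD should be under twice the smallest
sd_ratio <- max(group_sds) / min(group_sds)

group_sizes
small_groups
sd_ratio < 2
```

With the real survey data in place of `toy`, the same lines would show which groups fall below 30 responses and how similar the group spreads are.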

Fig 1.

The ANOVA test was then conducted using R, giving an F-value of 0.088 and a
corresponding p-value of 0.916. The F-value is a test statistic used to determine if group
means are equal and can be calculated by the following equation:

$$F = \frac{\text{variation between sample means}}{\text{variation within the samples}}$$

The F-value can then be used to find a p-value. The traditional alpha value of 0.05 was used.
Because the p-value is greater than the alpha value, the null hypothesis fails to be rejected and
there is not sufficient evidence to support the alternative hypothesis.
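
As a rough arithmetic check, the reported F-value and p-value can be reproduced from the mean squares and degrees of freedom in the R output in the appendix (4.81 and 54.58, with 2 and 95 degrees of freedom). The variable names below are illustrative; `pf()` is base R's F distribution function.

```r
# Values copied from the ANOVA table for Distance ~ Opinion in the appendix
ms_between <- 4.81    # Mean Sq for Opinion
ms_within  <- 54.58   # Mean Sq for Residuals
df_between <- 2       # Df for Opinion
df_within  <- 95      # Df for Residuals

# F-statistic: between-group mean square over within-group mean square
f_value <- ms_between / ms_within

# p-value: upper-tail area of the F distribution beyond the observed F
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 3)  # 0.088
round(p_value, 3)  # 0.916
```
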
Additional analysis was then conducted to see if there is an association between time
spent in traffic on the way to LASA and opinions on school start times. The null hypothesis is
that the “population means of time spent in traffic for all opinion groups are the same” and the
alternative hypothesis is that the “population means of time spent in traffic for all opinion groups
are not the same (at least 1 is different)”. The condition checks for time spent in traffic and
opinions mirror those done for distance and opinions. The random condition for ANOVA was
met as shown in the previous analysis. The normality condition was met because the histograms
(Fig 6-8 in appendix) for each group were roughly normal and the sample sizes were all greater
than or equal to 30, except for the Earlier Time group, which was of size 10 and whose histogram
was skewed right, so proceed with caution. The equal variance condition was met, as seen in
the ANOVA representation (Fig 2), where the spread of times is almost the same across all
three groups. The independence condition for ANOVA was met as shown in the previous
analysis.

Fig 2.

The ANOVA test was then conducted using R, giving an F-value of 0.054 and a
corresponding p-value of 0.948. Since the p-value is greater than the alpha value of 0.05, the
null hypothesis fails to be rejected and there is not sufficient evidence to support the alternative
hypothesis.

Conclusion
After analyzing the data that I collected, I came to the conclusion that neither distance
from LASA nor the time spent in traffic on the way to school shows a significant association with
opinions on school times. The p-values for the tests of association between distance and
opinion and between time and opinion were 0.916 and 0.948 respectively. Both were greater
than the alpha value of 0.05 that was used, so no association could be established. As a result,
my research was not of much use, as I was not able to establish any significant relationship
between the distance or time students travel to LASA and their opinions on school times for
next year.

One possible problem is that some people may have taken the survey more than once,
since students could take more than one English class with different teachers. This would cause
some data to be repeated, but since my form was anonymous, there was no effective way of
determining which data points were repeats. Repeated data points could affect my results.
Another concern is that perhaps not all students in the randomly chosen classes took the
survey, which could prevent the sample from accurately representing the entire LASA
population. A further problem I encountered was that a few of the English teachers didn’t
respond to my request to give my survey to one of their classes. This could also have made my
data less representative of the LASA population, as I received less data than I should have.

If this study were repeated, I would probably not make the survey anonymous, so that I
could filter out repeated data points. I would also ask teachers to make the survey a completion
grade so that all students in the randomly chosen classes reliably respond. I could also follow
up more often with teachers who don’t respond to improve my data’s representation of the
LASA student body.

Appendix
Graphics (used for conditions):

Fig 3.

Fig 4.
Fig 5.

Fig 6.

Fig 7.
Fig 8.

Raw Data:
Link to spreadsheet with data:
https://docs.google.com/spreadsheets/d/1vgfUuuHsHq7k54kRW915DD0jUv032jqOR742cSmp83A/edit?usp=sharing

Code:
> library("ggplot2")
>
> #Split data into three groups of Earlier Time, Same Time & Later Time
> Earlier <- Survey_Spring[Survey_Spring$Opinion == "Earlier time",]
> Same <- Survey_Spring[Survey_Spring$Opinion == "Same time",]
> Later <- Survey_Spring[Survey_Spring$Opinion == "Later Time",]
>
> #Histogram to check Normality of Distance for people responding Earlier time
> a <- ggplot(Earlier, aes(Distance))
> a + geom_histogram(binwidth = 0.5) + labs(x = "Distance (in miles)", title = "Histogram for Distance from LASA (Earlier Time)") + scale_x_continuous(breaks = seq(0, 50, by = 5))
>
> #Histogram to check Normality of Distance for people responding Same time
> b <- ggplot(Same, aes(Distance))
> b + geom_histogram(binwidth = 0.5) + labs(x = "Distance (in miles)", title = "Histogram for Distance from LASA (Same Time)") + scale_x_continuous(breaks = seq(0, 50, by = 5))
>
> #Histogram to check Normality of Distance for people responding Later time
> c <- ggplot(Later, aes(Distance))
> c + geom_histogram(binwidth = 0.5) + labs(x = "Distance (in miles)", title = "Histogram for Distance from LASA (Later Time)") + scale_x_continuous(breaks = seq(0, 50, by = 5))
>
> #ANOVA representation
> ggplot(Survey_Spring, aes(x = Opinion, y = Distance)) + geom_boxplot() + labs(title = "ANOVA Representation of Opinions vs Distance")
>
> #ANOVA Test
> #Null hypothesis = population mean of distances for all opinion groups are the same
> #Alternative hypothesis = population mean of distances for all opinion groups are not the same (at least 1 is different)
> result <- aov(Distance ~ Opinion, data = Survey_Spring) #compute anova test
> summary(result) #p-value or Pr(>F) is 0.916 which is greater than 0.05 so fail to reject null hypothesis and not enough evidence to support alternative hypothesis
            Df Sum Sq Mean Sq F value Pr(>F)
Opinion      2     10    4.81   0.088  0.916
Residuals   95   5185   54.58
>
>
>
> ####################Additional Analysis#######################
> #I would like to see if minutes spent in traffic affects opinion
>
> #Split data into three groups of Earlier Time, Same Time & Later Time just like before
> Earlier_m <- Survey_Spring[Survey_Spring$Opinion == "Earlier time", ]
> Same_m <- Survey_Spring[Survey_Spring$Opinion == "Same time",]
> Later_m <- Survey_Spring[Survey_Spring$Opinion == "Later Time",]
>
> #Histogram to check Normality of Time spent in traffic for people responding Earlier time
> c <- ggplot(Earlier_m, aes(Time))
> c + geom_histogram(binwidth = 0.5) + labs(x = "Time (in minutes)", title = "Histogram for Time spent in Traffic (Earlier Time)") + scale_x_continuous(breaks = seq(0, 110, by = 5))
>
> #Histogram to check Normality of Time spent in traffic for people responding Same time
> d <- ggplot(Same_m, aes(Time))
> d + geom_histogram(binwidth = 0.5) + labs(x = "Time (in minutes)", title = "Histogram for Time spent in Traffic (Same Time)") + scale_x_continuous(breaks = seq(0, 110, by = 10))
>
> #Histogram to check Normality of Time spent in traffic for people responding Later time
> e <- ggplot(Later_m, aes(Time))
> e + geom_histogram(binwidth = 0.5) + labs(x = "Time (in minutes)", title = "Histogram for Time spent in Traffic (Later Time)") + scale_x_continuous(breaks = seq(0, 110, by = 5))
>
> #ANOVA representation
> ggplot(Survey_Spring, aes(x = Opinion, y = Time)) + geom_boxplot() + labs(title = "ANOVA Representation of Opinions vs Time")
>
> #ANOVA Test
> #Null hypothesis = population mean of time spent in traffic for all opinion groups are the same
> #Alternative hypothesis = population mean of time spent in traffic for all opinion groups are not the same (at least 1 is different)
> result_m <- aov(Time ~ Opinion, data = Survey_Spring) #compute anova test
> summary(result_m) #p-value or Pr(>F) is 0.948 which is greater than 0.05 so fail to reject null hypothesis and not enough evidence to support alternative hypothesis
            Df Sum Sq Mean Sq F value Pr(>F)
Opinion      2     42    20.9   0.054  0.948
Residuals   95  36937   388.8

Works Cited:
Editor, Minitab Blog. “Understanding Analysis of Variance (ANOVA) and the F-Test.” Minitab Blog,
blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test.
“Research Point: ANOVA Explained.” Edanz Editing, 15 Apr. 2013,
www.edanzediting.com/blogs/statistics-anova-explained.
