Statistics of Sleep in Mammalian Species

ACMS30600 Final Project Francisco Huizar
Introduction:
The data set contains values pertaining to sleep in sixty-two different mammalian species. The
data set includes the following for each species: body weight in kilograms, brain
weight in grams, number of hours per day of slow wave sleep, number of hours per day of
paradoxical sleep, the total time of sleep per day, the maximum life span in years, gestation time
in days, predation index, sleep exposure index, and overall danger index. The predation index
can have values between one and five, with a value of one denoting that the mammal is least
likely to be preyed upon and a value of five denoting that the mammal is most likely to be preyed
upon. The sleep exposure index can have values between one and five, with a value of one
denoting that the mammal sleeps in a well-protected area and a value of five denoting that the
mammal sleeps in an exposed area. The overall danger index can have values between one and
five, with a value of one denoting that the mammal is in a low amount of danger from other
animals, and a value of five denoting that the mammal is in a high amount of danger from other
animals. The variables and their descriptions are summarized in Table 1.
The response variable of interest is the total hours of sleep per day for mammalian species. All
other variables are used as predictor variables to model the total hours of sleep per day. Sleep
plays a significant role in the lives of mammalian species. Though sleep is regarded as an
absolutely necessary function, the reason why mammals need sleep is not well understood.
Currently, there are four theories of why mammals sleep: the inactivity theory, the energy
conservation theory, the restorative theory, and the brain plasticity theory.[1] This data set applies
directly to the inactivity theory. The inactivity theory suggests that inactivity at night is an
adaptation that served a survival function by keeping organisms out of harm's way at times when
they would be particularly vulnerable.[1] Therefore, animals that were able to be still and quiet
during hours of vulnerability would have a higher chance of survival compared to animals that
remained active during hours of vulnerability. Animals that remained inactive would then be less
subject to predation, and through the process of natural selection, the behavioral strategy of
inactivity evolved into the function of sleeping.[1] Analysis of the provided data set can show
which factors are associated with total sleep, and therefore whether the observed patterns are
consistent with the inactivity theory.
Variable      Description
BodyWt        Body weight (kg)
BrainWt       Brain weight (g)
NonDreaming   Slow wave sleep (hrs/day)
Dreaming      Paradoxical sleep (hrs/day)
TotalSleep    Total hours of sleep (hrs/day)
LifeSpan      Maximum life span (years)
Gestation     Gestation time (days)
Predation     Predation index (1-5)
              1 := least likely to be preyed upon
              5 := most likely to be preyed upon
Exposure      Sleep exposure index (1-5)
              1 := least exposed sleeping environment
              5 := most exposed sleeping environment
Danger        Overall danger index (1-5)
              1 := least danger from other animals
              5 := most danger from other animals
Table 1. Description of the data set.
Data:
The data was obtained from a research article observing the ecological and constitutional
correlates involved in sleep in mammals.[2] The data was downloaded as a text file from a
website. However, the original electronic data file provided on the website was obtained from the
Carnegie Mellon University statistics library database.[3] Unfortunately, there are several missing
values in the data due to gaps in knowledge at the time the observations were made.
Therefore, conclusions drawn from modeling this data set should be approached
with caution, and further evidence or more current data would be needed to support the results of this
model.
Regression Analysis:
Exploratory Data Analysis:
Initial scatterplots of the explanatory variables were created in order to perform exploratory
data analysis. Because there are more than three explanatory variables, scatterplots were
created only for the most relevant variables. Linear regression lines were added to the
scatterplots in red in order to show possible trends.
The first scatterplot shows the relationship between the total hours of sleep per day in
mammals and the overall danger index (Figure 1). This explanatory variable was
chosen because the overall danger index takes into account the predation
index and exposure index; thus, this scatterplot is a quasi-representation of how the predation
index, exposure index, and danger index relate to the total hours
of sleep. The scatterplot shows that mammals
with a lower score on the danger index tend to sleep for greater amounts of time.
This observation makes sense because a mammal that is less likely to be in danger
can safely dedicate more of the day to sleep.
The second scatterplot shows the relationship between the total hours of sleep per day in
mammals and the maximum lifespan of the mammal (Figure 2). This explanatory variable was
chosen because, according to the inactivity theory, sleep is crucial to an animal's survival.
Therefore, an animal that sleeps longer would be expected to have a longer maximum lifespan.
The scatterplot, however, shows that as the maximum lifespan of
a mammal increases, the total hours of sleep per day decrease. This would seemingly
contradict the inactivity theory; however, the inactivity theory only applies during periods
of time when the animal is most vulnerable. Essentially, an animal that is inactive only
during periods of time when it is most vulnerable will survive more often.
Sleeping throughout the entire day would leave an animal exposed to predation because it
is unable to protect itself from predators while it sleeps. Thus, the
observed trend does make sense.
The third scatterplot shows the relationship between the total hours of sleep per day in
mammals and the gestation time of the mammalian species (Figure 3). This
explanatory variable was chosen because pregnant mothers are expected to require
longer periods of inactivity in order to allow optimal gestation of offspring. The
scatterplot shows that as the number of days
required for gestation increases, the total hours of sleep per day decrease.
This observation makes sense because mammals that require fewer days for gestation
would need to sleep more to allow the offspring to develop faster and be born
sooner.
A correlation analysis was then performed on the explanatory variables in order to check whether
multicollinearity is present amongst the variables (Table 2). The most highly correlated
explanatory variables are body weight and brain weight with a correlation of 0.93416384,
followed by the correlation of 0.9160424 between predation index and danger index, and
finally the correlation of 0.7872031 between exposure index and danger index. The
correlations observed do make sense because brain weight is a part of body weight, and
the overall danger index value is evaluated taking into account the exposure index and
predation index. To test whether there is a multicollinearity problem, the variance inflation
factors (VIF) were calculated. The VIF for body weight is 25.83416 and the VIF for
overall danger index is 25.64432. These VIFs are greater than the rule-of-thumb
threshold of 10; therefore, there is a severe multicollinearity problem. In order to correct
this, a principal components model will be used in the linear regression analysis because
principal components are uncorrelated with one another, which eliminates multicollinearity.
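
As a rough check on the figures above, the VIF of predictor j is 1 / (1 - Rj²), where Rj² comes from regressing that predictor on all of the other predictors. The sketch below is written in Python purely for illustration (the analysis in this report was run in R), and the function name is hypothetical. It covers only the special case of a single other predictor, where Rj² is the squared pairwise correlation. That value is a lower bound on the full VIF, because adding more predictors to the auxiliary regression can only raise Rj².

```python
def pairwise_vif(r: float) -> float:
    """VIF of a predictor when it is regressed on one other predictor.

    With a single other predictor, R_j^2 equals the squared Pearson
    correlation r^2, so VIF = 1 / (1 - r^2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# Correlation between body weight and brain weight reported in the text.
r_body_brain = 0.93416384
vif_lower_bound = pairwise_vif(r_body_brain)
print(f"pairwise VIF (lower bound): {vif_lower_bound:.2f}")
```

The pairwise value comes out at roughly 7.85, below the reported VIF of 25.83416 for body weight, which is consistent with the remaining predictors contributing additional collinearity.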
Figure 1. Scatterplot of total sleep (hours/day) vs. overall danger index (1-5) in mammalian species
Figure 2. Scatterplot of total sleep (hours/day) vs. maximum lifespan (years) of mammalian species
Figure 3. Scatterplot of total sleep (hours/day) vs. gestation time (days) in mammalian species
Table 2. Correlation amongst predictor variables.
Linear Regression Analysis:

The R² and adjusted R² (Ra²) for the saturated model are both one. A perfect fit is expected here,
presumably because total sleep is the sum of slow wave (NonDreaming) and paradoxical (Dreaming)
sleep, so a model that contains both of these predictors reproduces the response essentially exactly. A single
analysis of variance F-test was used in order to test for the removal of a subset of
variables from the model. The reduced model has NonDreaming, Dreaming, LifeSpan,
Predation, Exposure, and Danger as the predictor variables. When testing the utility of the
model using the analysis of variance F-test, the null hypothesis is that none of the
predictor variables are related to total hours of sleep, and the alternative hypothesis is that
at least one of the predictor variables is related to total hours of sleep. Using the reduced
model, the calculated F-statistic was reported as infinity in R; thus, the calculated p-value
for this test was zero. This means that we can reject the null hypothesis and conclude
that at least one of the predictor variables is related to total hours of sleep and cannot be removed.
Further F-tests were not used because repeated F-tests inflate the type I error rate.
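
To see why an F-statistic of infinity goes hand in hand with a perfect fit, the overall F-statistic can be written in terms of R². The Python sketch below is only an illustration with a hypothetical function name; the sample size and number of predictors used in the example calls are assumptions, not values taken from the R output.

```python
import math


def overall_f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of a linear model computed from its R^2.

    F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), with p predictors and
    n observations. A perfect fit (R^2 = 1) makes the denominator zero,
    which is reported here as infinity.
    """
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_model <= 0 or df_resid <= 0:
        raise ValueError("need at least one predictor and n > p + 1")
    if r_squared >= 1.0:
        return math.inf
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)


# Hypothetical sizes: 42 observations and 6 predictors.
print(overall_f_statistic(1.0, n_obs=42, n_predictors=6))  # perfect fit
print(overall_f_statistic(0.5, n_obs=42, n_predictors=6))  # ordinary fit
```

When the residual sum of squares is zero, the test is driven entirely by the exact relationship between the response and its components rather than by the remaining predictors.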
In order to optimize variable selection, the principal components data reduction method
was applied to the model. The principal components method was used because the
predictor variables had been shown to exhibit multicollinearity. The principal component
analysis is shown in Table 3 and the corresponding scree plot is shown in Figure 4.
In the scree plot, the most pronounced elbow occurs at PC3.
However, there is still a notable reduction in variance when moving from PC3 to PC4.
Thus, the optimal number of PCs to include is four. However, due to missing values, the
model with these principal components cannot be fit directly because
the response and the principal components would have different lengths. In order to account for this, a new vector of
TotalSleep was created and indexed such that its values aligned with the observations retained in the
PC analysis. The R² for this optimal model is 0.9903 and the Ra² is 0.9893, both of which
are only slightly lower than the values of the saturated model.
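
The re-indexing step described above amounts to keeping only the complete cases and carrying the same row selection over to the response. A minimal Python sketch of that idea follows; the function name and the example values are made up for illustration and are not the code or data used in this analysis.

```python
def complete_cases(predictor_rows, response):
    """Keep only rows where every predictor and the response are present.

    Returns the filtered predictor rows, the aligned response values, and
    the original row indices that were kept.
    """
    if len(predictor_rows) != len(response):
        raise ValueError("predictors and response must have the same length")
    kept_rows, kept_response, kept_index = [], [], []
    for i, (row, y) in enumerate(zip(predictor_rows, response)):
        if y is not None and all(value is not None for value in row):
            kept_rows.append(list(row))
            kept_response.append(y)
            kept_index.append(i)
    return kept_rows, kept_response, kept_index


# Made-up values for illustration only (None marks a missing entry).
predictors = [[1.0, 6.6], [2.5, None], [0.4, 3.3], [5.2, 12.1]]
total_sleep = [12.5, 16.5, None, 3.9]

rows, new_total_sleep, index = complete_cases(predictors, total_sleep)
print(index)            # rows 0 and 3 survive
print(new_total_sleep)
```

Returning the kept indices makes it possible to map an observation number in the reduced model back to the original species list.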
To test for influential observations and outliers in this new model, standardized residuals
and Cook's distance values were calculated. In this model, the 22nd observation proved to
be an outlier with a standardized residual of -3.01249654. The 50th percentile of the
F-distribution corresponding to the four predictor variables and forty-two observations of the
newTotalSleep response variable is 0.8863808. Any observation with a Cook's distance
greater than this value is considered influential. The 22nd observation proved to be
influential as well, with a Cook's distance of 0.9390591. Thus, the model
needs to be refit without the influential observation. The summary of this final model is
displayed in Figure 5.
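
For reference, Cook's distance can be computed from a standardized residual and a leverage value, and the influence check is a simple comparison against the cutoff. The Python sketch below uses hypothetical function names. The leverage of 0.30 and the count of five coefficients (four PCs plus an intercept) are assumptions for illustration; the actual leverage of observation 22 is not given in the text.

```python
def cooks_distance(std_residual: float, leverage: float, n_coefficients: int) -> float:
    """Cook's distance from a standardized residual and a leverage value.

    D_i = (r_i^2 / k) * h_i / (1 - h_i), where r_i is the internally
    standardized residual, h_i the leverage, and k the number of
    estimated coefficients (intercept included).
    """
    if not 0.0 <= leverage < 1.0:
        raise ValueError("leverage must be in [0, 1)")
    if n_coefficients <= 0:
        raise ValueError("n_coefficients must be positive")
    return (std_residual ** 2 / n_coefficients) * leverage / (1.0 - leverage)


def is_influential(distance: float, cutoff: float) -> bool:
    """Flag an observation whose Cook's distance exceeds the cutoff."""
    return distance > cutoff


# Cutoff and Cook's distance for observation 22 as reported in the text.
cutoff = 0.8863808
print(is_influential(0.9390591, cutoff))

# Hypothetical leverage of 0.30 combined with the reported standardized
# residual and an assumed k = 5 coefficients.
print(cooks_distance(-3.01249654, leverage=0.30, n_coefficients=5))
```

The formula shows that an observation needs both a large residual and appreciable leverage to be influential; a large residual alone is not enough.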
The final model used for the data set is the principal component model with the 22nd
observation removed. This is the best model for the data set because the principal
component analysis removed the multicollinearity that was present amongst the predictor
variables in the original saturated model. Furthermore, the observation identified as both an
outlier and an influential point has been removed, so it can no longer skew a prediction made
by the model. This model is shown to have an excellent fit with an R² of 0.9926 and an Ra²
of 0.9918 (Figure 5).
A plot of residuals versus fitted values is shown in Figure 6. The points on this plot show no
observable pattern and are scattered randomly around zero, so there is no
evidence of a relationship between the residuals and the fitted values. A normal
plot of the residuals is shown in Figure 7. In this plot, there are no
major departures from linearity; therefore, the residuals can be treated as approximately normal. Further
investigation led to plotting a histogram of the residuals (Figure 8). This
histogram does show light right-skewness, which is barely visible in the light tailing
present in Figure 7. However, the skew is not major enough to require that a different model
be constructed.
In order to validate the model, the bootstrap validation method was used. The
bootstrap evaluation R² was calculated to be 0.9905545 from 100 bootstrap samples. This
evaluation R² is slightly lower than the final model R² of 0.9926. If the final model were
overfit, the models built on the bootstrap samples would perform worse on the original data set, and
the evaluation R² would be lower than the final model R². This is the case for the bootstrap
analysis run, so some overfitting is present. However, the percent difference between the
final model R² and the evaluation R² is 0.21%; thus, the overfit of the final model is
minute and would not warrant creating a new model.
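
The bootstrap validation idea can be sketched in a few lines: refit the model on each resample, score it on the original data, and average the resulting R² values. The Python code below is a simplified illustration with a single predictor and synthetic data, not the procedure or data used for the final model; all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """R^2 of a given line evaluated on the data (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


def bootstrap_evaluation_r2(xs, ys, n_boot=100, seed=0):
    """Average R^2 on the original data of models fit to resamples."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    while len(scores) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample, cannot fit a line
            continue
        intercept, slope = fit_line(bx, by)
        scores.append(r_squared(xs, ys, intercept, slope))
    return sum(scores) / len(scores)


# Synthetic data for illustration only (not the sleep data).
rng = random.Random(1)
x = [float(i) for i in range(30)]
y = [20.0 - 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

apparent_r2 = r_squared(x, y, *fit_line(x, y))
evaluation_r2 = bootstrap_evaluation_r2(x, y, n_boot=100)
percent_difference = 100.0 * (apparent_r2 - evaluation_r2) / apparent_r2
print(apparent_r2, evaluation_r2, percent_difference)

# The same percent-difference arithmetic with the values from the text.
reported_percent_difference = 100.0 * (0.9926 - 0.9905545) / 0.9926
print(round(reported_percent_difference, 2))
```

Because least squares minimizes the residual sum of squares on the original data, a line fit to a resample can never score a higher R² on the original data than the original fit does. The size of the gap, rather than its sign, is therefore what indicates the degree of overfitting.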
Table 3. Principal Component Analysis
Figure 4. Scree plot of the principal component analysis (variances of components 1 through 9)
Figure 5. Final Model for Data Set
Figure 6. Residual values vs. fitted values
Figure 7. Normal Q-Q plot of the residuals (sample quantiles vs. theoretical quantiles)
Figure 8. Histogram of the final model residuals
Results:
The confidence intervals for the fitted values are displayed in Table 4 and the prediction intervals
for the fitted values are displayed in Table 5. The confidence intervals for the slope parameters
are displayed in Figure 9. Table 4 gives the 95% confidence intervals for the mean total sleep
in mammals at the observed principal component values. This table is useful for examining the expected
total hours of sleep of a mammalian species with those values. For example, the first fitted value
is 8.553641. There is 95% confidence that the true mean total
hours of sleep for mammalian species with these principal component values lies in the interval
[8.298168, 8.809114].
Table 5 gives the 95% prediction intervals for individual mammalian species with given
principal component values. Using the fifth predicted value of 5.285691, there is 95% confidence
that an individual mammalian species with these values will sleep a total number of hours in the
interval [4.383291, 6.188091]. Figure 9 gives the 95% confidence intervals for the slope
parameters of the final model. Using the first slope parameter estimate of -2.07377, there is 95%
confidence that the true slope parameter lies in the interval [-2.144123, -2.003424]. These
inferences are important to the research question, regarding total hours of sleep required for
mammalian species, because the model only yields sample estimates, while the confidence intervals
give a range that is likely to contain the true values of the population parameters. Due to
the randomness of sampling, intervals constructed in this way will contain the true value of the
population parameter for about 95% of samples.
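
Each interval quoted above has the form estimate ± margin, which can be confirmed directly from the reported endpoints. The short Python sketch below does only that arithmetic; the function name is hypothetical, and the numbers are the ones quoted in the text.

```python
def center_and_margin(lower: float, upper: float):
    """Midpoint and half-width of a symmetric interval."""
    return (lower + upper) / 2.0, (upper - lower) / 2.0


# Point estimates and intervals quoted in the text.
reported = {
    "mean response, fitted value 1": (8.553641, (8.298168, 8.809114)),
    "prediction, fitted value 5": (5.285691, (4.383291, 6.188091)),
    "first slope parameter": (-2.07377, (-2.144123, -2.003424)),
}

for label, (estimate, (lower, upper)) in reported.items():
    center, margin = center_and_margin(lower, upper)
    # Each interval should be centered on its point estimate.
    print(f"{label}: estimate={estimate}, center={center:.6f}, margin={margin:.6f}")
```

In general, a prediction interval is wider than the confidence interval for the mean at the same predictor values, because it also accounts for the scatter of individual observations around that mean.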
Table 4. Confidence Intervals of fitted values.
Table 5. Prediction Intervals of fitted values.
Figure 9. Confidence intervals of slope parameters
Conclusion:
The principal component model satisfactorily describes the total hours of sleep for mammalian
species. The model has a high R² value of 0.9926 and shows very little overfitting
under bootstrap validation. Furthermore, the outlier and influential observation identified
during model checking was removed. Lastly, the final model avoids multicollinearity because it
was constructed using the principal components method. Using this model, there seems to
be some support for the inactivity theory because of the goodness of fit; however, there may be
other factors that are not being considered. For example, one counterargument to the inactivity
theory is that an animal that remains awake and conscious would likely have a greater survival rate
because it is able to react to an attacking predator. Further data will need to be collected to
improve the model.
Though the model did show a good fit, it can be improved. There were
missing data values for some mammalian species; this was partly accounted for
by re-indexing the response so that the final model excludes mammals with missing values.
Furthermore, the data set was collected in 1976; thus, there is sure to be more data available on
the topic to further add to the completeness of the model. A possible additional predictor variable
that could improve the fit of the model is one that
measures the number of predators in a given species' environment. This would likely play a large
role in the inactivity theory because if an environment has more predators, the rate of survival of
an animal lower on the food chain would decrease significantly. A further question regarding the
model is: how would data taken today differ from the data taken in 1976? There have
undoubtedly been shifts in environmental climates, mammalian habitats, and ecological
interactions between species in the past 40 years. Thus, would new data show similar trends to
the data used for this model?
References
1. Why Do We Sleep, Anyway? | Healthy Sleep. Accessed December 7, 2016.
http://healthysleep.med.harvard.edu/healthy/matters/benefits-of-sleep/why-do-we-sleep.
2. Allison, T., and D. V. Cicchetti. Sleep in Mammals: Ecological and Constitutional
Correlates. Science 194, no. 4266 (November 12, 1976): 732-34.
doi:10.1126/science.982039.
3. OzDASL: Sleep in Mammals. Accessed December 8, 2016.
http://www.statsci.org/data/general/sleep.html.