
ACMS30600 Final Project Francisco Huizar

Introduction:

The data set contains sleep-related measurements for sixty-two mammalian species. For each
species it records body weight in kilograms, brain weight in grams, hours per day of slow wave
sleep, hours per day of paradoxical sleep, total hours of sleep per day, maximum life span in
years, gestation time in days, a predation index, a sleep exposure index, and an overall danger
index. Each index takes values from one to five. For the predation index, one denotes the
mammals least likely to be preyed upon and five the most likely. For the sleep exposure index,
one denotes a well-protected sleeping area and five an exposed one. For the overall danger
index, one denotes little danger from other animals and five a high degree of danger. The
variables and their descriptions are summarized in Table 1.

The response variable of interest is the total hours of sleep per day. All other variables are
used as predictors of total sleep. Sleep plays a significant role in the lives of mammals, yet
although it is regarded as an absolutely necessary function, the reason mammals need it is not
well understood. Currently, there are four theories of why mammals sleep: the inactivity theory,
the energy conservation theory, the restorative theory, and the brain plasticity theory.[1] This
data set applies directly to the inactivity theory, which suggests that inactivity at night is
an adaptation that served a survival function by keeping organisms out of harm's way at times
when they would be particularly vulnerable.[1] Animals that were able to stay still and quiet
during hours of vulnerability would therefore have had a higher chance of survival than animals
that remained active. Inactive animals would be less exposed to predation and, through natural
selection, the behavioral strategy of inactivity evolved into sleep.[1] Analysis of this data
set gives a clearer picture of the factors involved in the inactivity theory and, in turn, a
way to assess the theory's validity.

Variable      Description
BodyWt        Body weight (kg)
BrainWt       Brain weight (g)
NonDreaming   Slow wave sleep (hrs/day)
Dreaming      Paradoxical sleep (hrs/day)
TotalSleep    Total hours of sleep (hrs/day)
LifeSpan      Maximum life span (years)
Gestation     Gestation time (days)
Predation     Predation index (1-5)
              1 := Least likely to be preyed upon
              5 := Most likely to be preyed upon
Exposure      Sleep exposure index (1-5)
              1 := Least exposed sleeping environment
              5 := Most exposed sleeping environment
Danger        Overall danger index (1-5)
              1 := Least danger from other animals
              5 := Most danger from other animals

Table 1. Description of the data set.

Data:

The data were obtained from a research article on the ecological and constitutional correlates
of sleep in mammals.[2] The data were downloaded as a text file from the OzDASL website, whose
electronic data file was in turn obtained from the Carnegie Mellon University statistics
library database.[3] Unfortunately, several values are missing because of gaps in knowledge at
the time the observations were made. Conclusions drawn from modeling this data set should
therefore be approached with caution, and further evidence or more current data would be needed
to strengthen the results of this model.

Regression Analysis:

Exploratory Data Analysis:

Initial scatterplots of total sleep against the explanatory variables were created for
exploratory data analysis. Because there are more than three explanatory variables, scatterplots
were created only for the most relevant ones. A fitted linear regression line was drawn in red
on each scatterplot to highlight possible trends.

The first scatterplot shows total hours of sleep per day against the overall danger index
(Figure 1). This explanatory variable was chosen because the overall danger index takes the
predation index and the exposure index into account; the plot therefore serves as a rough
summary of how all three indices relate to total sleep. The scatterplot shows a clear trend:
mammals with a lower danger index sleep longer. This makes sense because a mammal that is less
likely to be in danger has more time available during the day to dedicate to sleep.

The second scatterplot shows total hours of sleep per day against the maximum lifespan of the
mammal (Figure 2). This explanatory variable was chosen because, according to the inactivity
theory, sleep is crucial to an animal's survival, so an animal that sleeps longer would be
expected to have a longer maximum lifespan. The scatterplot instead shows that as maximum
lifespan increases, total hours of sleep per day decrease. This seemingly contradicts the
inactivity theory; however, the theory applies only to the periods when the animal is most
vulnerable. An animal that is inactive only during those periods will survive more often,
whereas an animal that sleeps for a large part of the day is exposed to predation because it
cannot protect itself while asleep. Thus, the observed trend does make sense.

The third scatterplot shows total hours of sleep per day against the gestation time of the
species (Figure 3). This explanatory variable was chosen because pregnant mothers are expected
to require longer periods of inactivity to allow optimal gestation of offspring. The scatterplot
shows that as the number of days required for gestation increases, total hours of sleep per day
decrease. One possible explanation is that mammals with shorter gestation times need to sleep
more so that the offspring can develop faster and be born sooner.

A correlation analysis was then performed on the explanatory variables to check for
multicollinearity (Table 2). The most highly correlated explanatory variables are body weight
and brain weight, with a correlation of 0.93416384, followed by predation index and danger index
(0.9160424) and exposure index and danger index (0.7872031). These correlations make sense
because brain weight is a part of body weight, and the overall danger index is evaluated taking
the exposure index and predation index into account. To test whether multicollinearity is a
problem, the variance inflation factors (VIF) were calculated. The VIF for body weight is
25.83416 and the VIF for the overall danger index is 25.64432. Both exceed the rule-of-thumb
threshold of 10, which indicates severe multicollinearity. To correct this, a principal
components model is used in the linear regression analysis, because principal components are
mutually uncorrelated and therefore eliminate multicollinearity.
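
The VIF calculation described above can be illustrated with a short sketch. This is not the code used for this report: it assumes only two predictors, in which case the R² from regressing one predictor on the other equals the squared pairwise correlation, so VIF = 1 / (1 - r²). The function name `vif_from_r_squared` is hypothetical. With all nine predictors the auxiliary R² is larger, which is consistent with the reported VIFs of about 25.8 exceeding the two-variable value computed here.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """Variance inflation factor for a predictor whose regression on the
    other predictors has coefficient of determination r_squared."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Pairwise correlation between body weight and brain weight reported in the text.
r_body_brain = 0.93416384

# Two-predictor assumption: the auxiliary R^2 is just the squared correlation.
vif_pairwise = vif_from_r_squared(r_body_brain ** 2)

# Inverting the formula: the auxiliary R^2 implied by the reported VIF of 25.83416.
implied_r_squared = 1.0 - 1.0 / 25.83416

print(f"VIF from the pairwise correlation alone: {vif_pairwise:.2f}")
print(f"Auxiliary R^2 implied by VIF = 25.83416: {implied_r_squared:.4f}")
```

The rule-of-thumb threshold of 10 corresponds to an auxiliary R² of 0.9, meaning that 90% of the variance of one predictor is explained by the others.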

Figure 1. Scatterplot of Total Sleep vs. Overall Danger Index in mammalian species (x-axis: Danger Index (1-5); y-axis: Total hours of sleep/day)

Figure 2. Scatterplot of Total Sleep vs. Maximum lifespan of mammalian species (x-axis: Maximum lifespan (years); y-axis: Total hours of sleep/day)

Figure 3. Scatterplot of Total Sleep vs. Gestation Time in mammalian species (x-axis: Gestation Time (days); y-axis: Total hours of sleep/day)


Table 2. Correlation amongst predictor variables.

Linear Regression Analysis:


The R² and adjusted R² (Ra²) for the saturated model are both one. This value is expected
because the saturated model used all explanatory variables, including NonDreaming and Dreaming,
whose sum is the total hours of sleep. A single analysis of variance F-test was used to test
whether a subset of variables could be removed from the model. The reduced model has
NonDreaming, Dreaming, LifeSpan, Predation, Exposure, and Danger as the predictor variables.
When testing the utility of the model with the analysis of variance F-test, the null hypothesis
is that none of the predictor variables is related to total hours of sleep, and the alternative
hypothesis is that at least one of them is. For the reduced model, R reported the F-statistic
as infinite, so the p-value for this test was zero. We can therefore reject the null hypothesis
and conclude that at least one of the predictor variables cannot be removed. Further F-tests
were not performed because repeated F-tests inflate the type I error rate.
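
The F-test described above compares the residual sum of squares of a reduced model with that of the full model it is nested in; the overall utility test is the special case in which the reduced model contains only an intercept. The sketch below is illustrative only: the function name `partial_f_statistic` and the sums of squares are made up and are not taken from this data set. It also shows why a model that fits perfectly (zero residual sum of squares) produces an infinite F-statistic, as reported by R.

```python
import math


def partial_f_statistic(sse_reduced: float, sse_full: float,
                        df_reduced: int, df_full: int) -> float:
    """F statistic for comparing a reduced model with the full model it is nested in.

    sse_* are residual sums of squares; df_* are residual degrees of freedom
    (number of observations minus number of estimated coefficients).
    """
    if df_reduced <= df_full or df_full <= 0:
        raise ValueError("the reduced model must have more residual df than the full model")
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    if sse_full == 0.0:
        # A perfect fit leaves no residual variance to divide by.
        return math.inf if numerator > 0 else math.nan
    return numerator / (sse_full / df_full)


# Made-up sums of squares, for illustration only (not values from this data set).
f_ordinary = partial_f_statistic(sse_reduced=120.0, sse_full=80.0, df_reduced=38, df_full=32)
f_perfect = partial_f_statistic(sse_reduced=120.0, sse_full=0.0, df_reduced=38, df_full=32)

print(f"ordinary case: F = {f_ordinary:.4f}")
print(f"perfect full-model fit: F = {f_perfect}")
```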

To optimize variable selection, the principal components data reduction method was applied to
the model. Principal components were used because the predictor variables had been shown to be
multicollinear. The principal component analysis is shown in Table 3 and the corresponding
scree plot in Figure 4. In the scree plot, the most pronounced elbow occurs at PC3. However,
there is still a notable reduction in variance when moving from PC3 to PC4, so the optimal
number of PCs to include is four. Because of missing values, the model cannot be fitted
directly on these components: the response and the component scores would have different
lengths. To account for this, a new TotalSleep vector was created and indexed so that its
values align with the observations retained in the principal component scores. The R² for this
optimal model is 0.9903 and the Ra² is 0.9893, both only slightly lower than the values for the
saturated model.
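
The re-indexing step described above amounts to keeping only the rows with no missing values, so that the response has the same length as the component scores. The following is a minimal sketch under stated assumptions: the miniature data set, the column layout, and the helper name `complete_case_indices` are hypothetical, and missing values are represented by `None`.

```python
from typing import List, Optional, Sequence


def complete_case_indices(rows: Sequence[Sequence[Optional[float]]]) -> List[int]:
    """Return the indices of rows that contain no missing (None) entries."""
    return [i for i, row in enumerate(rows) if all(value is not None for value in row)]


# Hypothetical miniature data set: each row is (TotalSleep, LifeSpan, Gestation).
rows = [
    (12.5, 14.0, 60.0),
    (8.3, None, 42.0),    # missing LifeSpan
    (3.9, 69.0, 624.0),
    (None, 27.0, 180.0),  # missing TotalSleep
    (9.8, 19.0, 148.0),
]

keep = complete_case_indices(rows)
predictors = [rows[i][1:] for i in keep]      # rows that would be passed to the PCA
new_total_sleep = [rows[i][0] for i in keep]  # response aligned with those rows

print(keep)
print(new_total_sleep)
```

Because the same index list is used for both the predictors and the response, the two are guaranteed to line up row by row.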

To test for outliers and influential observations in this new model, standardized residuals
and Cook's distances were calculated. The 22nd observation proved to be an outlier, with a
standardized residual of -3.01249654. The 50th percentile of the F-distribution for four
predictor variables and forty-two observations of the newTotalSleep response variable is
0.8863808; any observation with a Cook's distance greater than this value is considered
influential. The 22nd observation is also influential, with a Cook's distance of 0.9390591.
Thus, the model needs to be refit without this observation. The summary of the final model is
displayed in Figure 5.
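
The two diagnostics used above can be sketched for the simplest case of a single predictor. This is an illustration on synthetic data, not the report's four-component model: the function name `regression_diagnostics` is hypothetical, the residuals are internally studentized (one common definition of a standardized residual), and the comparison against the F-distribution median is left out because the Python standard library has no F quantile function.

```python
import math
from typing import List, Tuple


def regression_diagnostics(x: List[float], y: List[float]) -> Tuple[List[float], List[float]]:
    """Fit y = b0 + b1*x by least squares and return
    (internally studentized residuals, Cook's distances)."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each observation in simple linear regression.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    studentized = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(residuals, leverage)]
    cooks = [(e ** 2 / (p * mse)) * h / (1.0 - h) ** 2 for e, h in zip(residuals, leverage)]
    return studentized, cooks


# Synthetic data near the line y = 2x + 1; the last point is shifted upward by 8.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[-1] += 8.0

studentized, cooks = regression_diagnostics(x, y)
most_influential = max(range(len(cooks)), key=cooks.__getitem__)
print(f"Largest Cook's distance: observation {most_influential + 1}, "
      f"D = {cooks[most_influential]:.3f}")
```

The shifted point combines a large residual with high leverage, which is the combination that Cook's distance is designed to flag.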

The final model used for the data set is the principal component model with the 22nd
observation removed. This is the best model for the data set because the principal component
analysis removed the multicollinearity that was present among the predictor variables in the
original saturated model. Furthermore, this final model is free of influential observations and
outliers that could skew a prediction made by the model. The model has an excellent fit, with
an R² of 0.9926 and an Ra² of 0.9918 (Figure 5).


A plot of the residuals against the fitted values is shown in Figure 6. The plot shows no
observable pattern between the residuals and the fitted values: the points are scattered
randomly around zero. A normal plot of the residuals is shown in Figure 7. There are no major
departures from linearity, so the residuals can be considered approximately normal. As a
further check, a histogram of the residuals was plotted (Figure 8). The histogram shows slight
right-skewness, which is barely visible as light tailing in Figure 7. However, the skew is not
severe enough to require constructing a different model.

To validate the model, the bootstrap validation method was used. The bootstrap evaluation R²
was calculated to be 0.9905545 from 100 bootstrap samples, slightly lower than the final model
R² of 0.9926. If the final model were overfit, the models fitted to the bootstrap samples would
perform poorly on the original data set, and the evaluation R² would fall below the final model
R². That is the case here, so some overfitting is present. However, the percent difference
between the final model R² and the evaluation R² is only 0.21%; the overfitting is therefore
minute and does not warrant constructing a new model.
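
One common form of bootstrap validation can be sketched as follows: fit the model on each bootstrap resample, score it on the original data, and average the scores. This is an illustration with a single predictor and synthetic data, not the report's code; the function names, the seed, and the data are assumptions. The last lines repeat the percent-difference arithmetic using the two R² values reported above.

```python
import random
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = b0 + b1*x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x: List[float], y: List[float], b0: float, b1: float) -> float:
    """R^2 of the line (b0, b1) evaluated on the data (x, y)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


def bootstrap_evaluation_r2(x: List[float], y: List[float],
                            n_boot: int = 100, seed: int = 0) -> float:
    """Average R^2 on the original data of models fitted to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    while len(scores) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        xb = [x[i] for i in idx]
        if len(set(xb)) < 2:
            continue  # degenerate resample: slope undefined, draw again
        yb = [y[i] for i in idx]
        b0, b1 = fit_line(xb, yb)
        scores.append(r_squared(x, y, b0, b1))
    return sum(scores) / len(scores)


# Synthetic data for illustration only.
rng = random.Random(42)
x = [float(i) for i in range(30)]
y = [3.0 - 0.2 * xi + rng.gauss(0.0, 0.5) for xi in x]

apparent = r_squared(x, y, *fit_line(x, y))
evaluation = bootstrap_evaluation_r2(x, y)
print(f"apparent R^2 = {apparent:.4f}, evaluation R^2 = {evaluation:.4f}")

# The same percent-difference arithmetic applied to the values reported in the text.
reported = 100.0 * (0.9926 - 0.9905545) / 0.9926
print(f"reported percent difference = {reported:.2f}%")
```

Because least squares minimizes the residual sum of squares on the original data, no bootstrap-fitted line can score higher there than the original fit, so the evaluation R² in this scheme is never above the apparent R²; the size of the gap is what indicates the degree of overfitting.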

Table 3. Principal Component Analysis

Figure 4. Scree plot of Principal Component Analysis (x-axis: principal components 1-9; y-axis: Variances)


Figure 5. Final Model for Data Set

Figure 6. Residual Values vs. Fitted Values (x-axis: Fitted Values; y-axis: Residuals)

Figure 7. Normal Q-Q plot of the residuals (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)

Figure 8. Histogram of the final model residuals (x-axis: residuals; y-axis: Frequency)

Results:

The confidence intervals for the fitted values are displayed in Table 4 and the prediction
intervals for the fitted values in Table 5. The confidence intervals for the slope parameters
are displayed in Figure 9. Table 4 gives the 95% confidence intervals for the mean total sleep
in mammals at the observed principal component values. This table is useful for estimating the
expected total hours of sleep of a mammalian species. For example, the first fitted value is
8.553641, and there is 95% confidence that the true mean total hours of sleep for species with
those principal component values lies in the interval [8.298168, 8.809114]. Table 5 gives the
95% prediction intervals for individual mammalian species with given principal component
values. For the fifth predicted value of 5.285691, there is 95% confidence that an individual
species with those values will sleep a total of between 4.383291 and 6.188091 hours per day.
Figure 9 gives the 95% confidence intervals for the slope parameters of the final model. For
the first slope parameter, estimated as -2.07377, there is 95% confidence that the true slope
lies in the interval [-2.144123, -2.003424]. These inferences matter for the research question,
the total hours of sleep required by mammalian species, because the model yields only sample
estimates, whereas the confidence intervals are likely to contain the true population
parameters. Under repeated sampling, intervals constructed in this way would contain the true
parameter value about 95% of the time.
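
As a quick arithmetic check on the intervals quoted above: a t-based interval is symmetric, so its midpoint should reproduce the point estimate and its half-width gives the margin of error. The sketch below only rearranges numbers already stated in the text; the helper name `midpoint_and_half_width` is hypothetical.

```python
from typing import Tuple


def midpoint_and_half_width(lower: float, upper: float) -> Tuple[float, float]:
    """Return the centre and half-width of the interval [lower, upper]."""
    return (lower + upper) / 2.0, (upper - lower) / 2.0


# Intervals quoted in the text.
intervals = {
    "mean response, first fitted value (CI)": (8.298168, 8.809114),
    "individual response, fifth fitted value (PI)": (4.383291, 6.188091),
    "first slope parameter (CI)": (-2.144123, -2.003424),
}

results = {}
for label, (lower, upper) in intervals.items():
    results[label] = midpoint_and_half_width(lower, upper)
    centre, half_width = results[label]
    print(f"{label}: centre = {centre:.6f}, half-width = {half_width:.6f}")
```

The midpoints agree with the quoted estimates (8.553641, 5.285691, and -2.07377), which confirms that each interval is centred on its point estimate.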


Table 4. Confidence Intervals of fitted values.


Table 5. Prediction Intervals of fitted values.


Figure 9. Confidence intervals of slope parameters

Conclusion:

The principal component model satisfactorily describes the total hours of sleep for mammalian
species. The model has a high R² of 0.9926 and shows very little overfitting under bootstrap
validation. Furthermore, the observation identified as both an outlier and influential was
removed. Lastly, the final model is free of multicollinearity because it was constructed using
the principal components method. The goodness of fit lends some support to the inactivity
theory; however, there may be other factors that are not being considered. For example, as a
counterargument to the inactivity theory, an animal that remains awake and conscious is likely
to have a greater chance of survival because it can react to an attacking predator. Further
data will need to be collected to improve the model.

Although the model fits well, it can be improved. Some mammalian species had missing data
values; this was partly accounted for by re-indexing the final model so that mammals with
missing values were excluded. Furthermore, the data set was collected in 1976, so more data are
surely available now to make the model more complete. A useful additional predictor variable
would be the number of predators in each species' environment. This would bear directly on the
inactivity theory because, in an environment with more predators, the survival rate of an
animal lower on the food chain would decrease significantly. A further question is how data
taken today would differ from the data taken in 1976. There have undoubtedly been shifts in
climate, mammalian habitats, and ecological interactions between species in the past 40 years.
Would new data show trends similar to those in the data used for this model?


References

1. "Why Do We Sleep, Anyway?" Healthy Sleep. Accessed December 7, 2016.
   http://healthysleep.med.harvard.edu/healthy/matters/benefits-of-sleep/why-do-we-sleep.

2. Allison, T., and D. V. Cicchetti. "Sleep in Mammals: Ecological and Constitutional
   Correlates." Science 194, no. 4266 (November 12, 1976): 732-34.
   doi:10.1126/science.982039.

3. "OzDASL: Sleep in Mammals." Accessed December 8, 2016.
   http://www.statsci.org/data/general/sleep.html.
