
LINEAR STATISTICAL MODELS Paul Nguyen s4561445

Statistical analysis of Sugar Cane Yield.


Sugar Cane Yields in the Mulgrave Area.

Introduction:

The Climate Background:

The North Queensland climate is made up of two seasons: winter brings warm temperatures
and low rainfall, while summer sees higher rainfall and warmer, balmy temperatures.
Because of its low rainfall, winter is more commonly known as the 'dry' season; it runs
from May to October with low humidity and plenty of sunshine. Summer is correspondingly
known as the 'wet' season and experiences tropical downpours with the occasional
electrical storm from November to April. Most of Australia's produce is grown in the
winter season in the temperate regions of the west, south and east. The timing of growing
seasons varies from region to region, and planting and harvesting dates can vary from
year to year depending on the rainfall.

States typically commence seeding in mid-April if there is sufficient moisture, but
farmers usually wait until late April, when daytime temperatures start falling to more
optimal levels. Seeding is usually finished in most regions by mid-June, but some wetter
areas can seed opportunistically well into July if required.

The harvest commences early to mid-October in Queensland, with the lower rainfall
farms generally starting earlier than the higher rainfall farms. Sometimes the harvest in
Queensland can be earlier, but generally only in a poor season.

To summarise, the sowing period within Queensland runs from April to June and the
harvesting period from September to December.

This information is especially relevant because we would expect rainfall in the months of
April through July to be especially important for the sugar yield. Furthermore, we would
expect rainfall in any other month to be less important.

Data analysis:
The data analysis will be an additive process whereby variables are added according to
their relevance to the overall model. First, we will make assumptions as to which
variables are important in predicting the Sugar Cane Yield.

Initially, a model was produced that contains the hypothesized relevant factors for the
dependent response variable Sugar * Tonn.Hect.

- Notice that the wet months of Nov 1996-Dec 1996 and Nov 1997-Dec 1997 are
not included. This is because the cutting season falls before the wet months, so that
rain would have minimal effect on the Sugar Cane Yield.

- Age was used instead of ratoon, as ratoon and age are the same thing (this avoids
collinearity).

- Area would most likely correlate with the tonnage.

- Variety may play a role in the Sugar Content yield.

- District was included for simplicity's sake, as a general variable encompassing the
district group and position.

- SoilID was included only to offer more specific information on the soil type. The
name of the soil type is one and the same as the SoilID and is thus not included.

- Harvest Month would be intuitively relevant as its time would have an effect on
how long the plants have left to grow.

- The wet months of Jan to Feb will be added later, as will the sowing-season months
from April to June (see the research introduction).

Notice that the inclusion of many independent variables will potentially produce
overfitting, so this model will most likely be simplified.

Results- First Model:

first=lm(I(Tonn.Hect*Sugar)~District+fs+Area+Variety+Age+HarvestMonth,
    data=training41)

Analysis of Variance Table

Response: I(Tonn.Hect * Sugar)
               Df     Sum Sq    Mean Sq   F value    Pr(>F)
District       14 8.0842e+08 5.7744e+07   38.5894 < 2.2e-16 ***
fs            153 1.4653e+09 9.5768e+06    6.4000 < 2.2e-16 ***
Area            1 1.4670e+10 1.4670e+10 9803.5583 < 2.2e-16 ***
Variety        27 1.5962e+08 5.9117e+06    3.9507 3.582e-11 ***
Age             1 2.0168e+08 2.0168e+08  134.7802 < 2.2e-16 ***
HarvestMonth    1 1.7222e+06 1.7222e+06    1.1509    0.2835
Residuals    2843 4.2542e+09 1.4964e+06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1223 on 2843 degrees of freedom
Multiple R-squared: 0.8027, Adjusted R-squared: 0.789
F-statistic: 58.71 on 197 and 2843 DF, p-value: < 2.2e-16

Observe that 80.27% of the variation in the data can be explained by the model
Tonn.Hect*Sugar~District+fs+Area+Variety+Age+HarvestMonth (fs is the factor version of
SoilID). This, along with the adjusted R-squared of 0.789, suggests that the model fits
well, but one must be wary of overfitting given the large number of terms. The global
utility F-test p-value is virtually zero, suggesting that the model as a whole is
significant. However, the F-test p-value of 0.2835 for HarvestMonth gives no evidence
that HarvestMonth is significant.

A Nested F-test was done to study the significance of
HarvestMonth:

anova(first,reducedfirst)
Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ District + fs + Area + Variety + Age +
    HarvestMonth
Model 2: I(Tonn.Hect * Sugar) ~ District + fs + Area + Variety + Age
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   2843 4254186944
2   2844 4255909096 -1  -1722153 1.1509 0.2835

Thus we see that the Pr(>F) value is 0.2835, which is well above the 0.05 significance
level. We therefore cannot reject the null hypothesis that the HarvestMonth coefficient
is equal to zero, and conclude that HarvestMonth is not significant.
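
The nested F statistic can be recomputed by hand from the two residual sums of squares
in the table above. The following is a minimal sketch in base R; the helper name
nested_f is hypothetical (it is not part of the original analysis), and the inputs are
copied from the anova output.

```r
# Hypothetical helper: nested (partial) F-test from the residual sums of
# squares and residual degrees of freedom of a full and a reduced model.
nested_f <- function(rss_full, df_full, rss_reduced, df_reduced) {
  q <- df_reduced - df_full                      # number of coefficients dropped
  f <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
  p <- pf(f, q, df_full, lower.tail = FALSE)     # upper-tail p-value
  c(F = f, p = p)
}

# Values copied from anova(first, reducedfirst)
res <- nested_f(rss_full = 4254186944, df_full = 2843,
                rss_reduced = 4255909096, df_reduced = 2844)
print(round(res, 4))
```

The result should agree with the F value and Pr(>F) printed by anova(). The same
arithmetic applies to any other pair of nested models, given their RSS and residual
degrees of freedom.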

Second Model (without HarvestMonth):

reducedfirst=lm(I(Tonn.Hect*Sugar)~District+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)

Residual standard error: 1223 on 2844 degrees of freedom
Multiple R-squared: 0.8026, Adjusted R-squared: 0.789
F-statistic: 59 on 196 and 2844 DF, p-value: < 2.2e-16

Analysis of Variance Table

Response: I(Tonn.Hect * Sugar)
            Df     Sum Sq    Mean Sq   F value    Pr(>F)
District    14 8.0842e+08 5.7744e+07   38.5874 < 2.2e-16 ***
fs         153 1.4653e+09 9.5768e+06    6.3997 < 2.2e-16 ***
Area         1 1.4670e+10 1.4670e+10 9803.0382 < 2.2e-16 ***
Variety     27 1.5962e+08 5.9117e+06    3.9505 3.589e-11 ***
Age          1 2.0168e+08 2.0168e+08  134.7730 < 2.2e-16 ***
Residuals 2844 4.2559e+09 1.4965e+06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Graph of Residuals:

A plot of the residuals of the current model.

Notice that there is heteroskedastic behaviour, meaning that the spread of the residuals
changes as the fitted values move from small to large (or from large to small).

This suggests one of two problems, either:

- The model is missing variables (which it most likely is)

- The model requires a log transformation.
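
To illustrate why a log transformation is a natural candidate, the sketch below uses
simulated data (not the sugar cane data) in which the errors are multiplicative. All
variable names and parameter values are assumptions chosen purely for illustration.

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 4)
# Simulated response with multiplicative noise, so the spread grows with the mean
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.4))

raw_fit <- lm(y ~ x)
log_fit <- lm(log(y) ~ x)

# Crude check: correlation between |residual| and fitted value.
# A value well above zero indicates residuals that fan out.
spread_raw <- cor(abs(resid(raw_fit)), fitted(raw_fit))
spread_log <- cor(abs(resid(log_fit)), fitted(log_fit))
print(c(raw = spread_raw, log = spread_log))

# plot(fitted(raw_fit), resid(raw_fit))  # fan shape expected
# plot(fitted(log_fit), resid(log_fit))  # roughly even band expected
```

On data of this kind the raw-scale fit shows a clear relationship between residual size
and fitted value, while the log-scale fit does not. Whether the same holds for the
sugar cane data has to be judged from its own residual plot.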

Other models:

Now we will observe other models.


. . . . . . . . . . . . . . . . . . . .
Varietymxd    -215.362   1148.433  -0.188 0.851261
VarietyMXD    -367.877    653.036  -0.563 0.573253
VarietyRAG   -2217.284   1387.659  -1.598 0.110185
Age           -178.453     15.372 -11.609  < 2e-16 ***
Jun.97              NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1223 on 2844 degrees of freedom
Multiple R-squared: 0.8026, Adjusted R-squared: 0.789
F-statistic: 59 on 196 and 2844 DF, p-value: < 2.2e-16

It was observed that when District was removed from the model, the rainfall values were
finally able to play a role in the model. The independent variable DistrictGroup was
added in its place to retain some information about where the sugar cane was produced.
Swapping District for DistrictGroup also reduces the complexity of the model.
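
One plausible reason the rainfall terms could not be estimated alongside District is
that rainfall is recorded once per district, which would make each rainfall column an
exact linear combination of the District dummy variables. This is an assumption about
the data, not something the output confirms. The sketch below demonstrates the effect
with a small made-up data set; the district labels, rainfall figures and response are
all hypothetical.

```r
set.seed(2)
district <- factor(rep(c("A", "B", "C"), each = 10))
# Hypothetical: one rainfall reading per district, repeated for every paddock
rain <- unname(c(A = 100, B = 150, C = 90)[as.character(district)])
y <- 50 + 0.2 * rain + rnorm(30)

with_district <- lm(y ~ district + rain)
without_district <- lm(y ~ rain)

# rain is perfectly collinear with the district dummies, so lm() reports NA for it
print(coef(with_district))
# once the district factor is dropped, the rain coefficient can be estimated
print(coef(without_district))
```

In base R, alias(with_district) lists the exact linear dependency behind an NA
coefficient, which can help decide which term to drop.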

rainfall1=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)

Anova:
Residual standard error: 1227 on 2854 degrees of freedom
Multiple R-squared: 0.8006, Adjusted R-squared: 0.7876
F-statistic: 61.6 on 186 and 2854 DF, p-value: < 2.2e-16

Notice how the model is still very good despite the lack of District factor.

Now we add the months:
We will now observe the influence of rainfall on the models, in particular the wet
months Jan.97 and Feb.97.

rainfall2=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+Jan.97+Feb.97,
    data=training41,na.action = na.exclude)

anova(rainfall2,rainfall1)
Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jan.97 + Feb.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   2852 4299129892
2   2854 4299704285 -2   -574393 0.1905 0.8265

The high Pr(>F) suggests that we cannot reject the null hypothesis, implying that
rainfall in the wet months Jan.97 and Feb.97 is not relevant to the model.

Rainfall over cutting season:

rainfall3=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age
    +Jul.96+Aug.96+Sep.96+Oct.96+Jul.97+Aug.97+Sep.97+Oct.97,
    data=training41,na.action = na.exclude)

rainfall1=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)

anova(rainfall3,rainfall1)
Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Aug.96 + Sep.96 + Oct.96 + Jul.97 + Aug.97 +
    Sep.97 + Oct.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age
  Res.Df        RSS Df Sum of Sq      F    Pr(>F)
1   2849 4263674046
2   2854 4299704285 -5 -36030239 4.8151 0.0002184 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The nested F-test suggests that the months of rainfall over the cutting season are
jointly significant, as the p-value is very small.

rainfall4=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age
    +Jul.96+Aug.96+Sep.96+Oct.96+Jul.97,data=training41,na.action = na.exclude)

When Oct.97 was included, the model rainfall4 gave NA values for Jun.97, suggesting
collinearity. It was therefore wise to remove either Jul or Oct; Oct was removed.

anova(rainfall4)
Analysis of Variance Table

Response: I(Tonn.Hect * Sugar)
                Df     Sum Sq    Mean Sq   F value    Pr(>F)
DistrictGroup    4 1.7923e+08 4.4807e+07   29.9398 < 2.2e-16 ***
fs             153 1.7937e+09 1.1724e+07    7.8337 < 2.2e-16 ***
Area             1 1.4927e+10 1.4927e+10 9974.4655 < 2.2e-16 ***
Variety         27 1.5903e+08 5.8901e+06    3.9358 4.155e-11 ***
Age              1 2.0168e+08 2.0168e+08  134.7655 < 2.2e-16 ***
Jul.96           1 1.4376e+07 1.4376e+07    9.6064  0.001958 **
Aug.96           1 1.2673e+06 1.2673e+06    0.8468  0.357539
Sep.96           1 2.6004e+06 2.6004e+06    1.7376  0.187547
Oct.96           1 4.9418e+06 4.9418e+06    3.3021  0.069296 .
Jul.97           1 1.2844e+07 1.2844e+07    8.5826  0.003421 **
Residuals     2849 4.2637e+09 1.4966e+06

Notice how the months of July appear to be significant. This suggests testing whether
the July months alone are sufficient.

anova(rainfall4,rainfall5)
Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Aug.96 + Sep.96 + Oct.96 + Jul.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97
  Res.Df        RSS Df Sum of Sq      F  Pr(>F)
1   2849 4263674046
2   2852 4276681856 -3 -13007810 2.8973 0.03386 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

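The p-value of 0.03386 is below 0.05, so the null hypothesis that Aug.96, Sep.96 and
Oct.96 jointly add nothing is rejected at the 5% level, although only marginally. Since
only the July months are individually significant in the anova table for rainfall4, the
simpler model containing just Jul.96 and Jul.97 is carried forward.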

Our new model is now the following:

rainfall5=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+Jul.96+Jul.97,
    data=training41,na.action = na.exclude)

anova(rainfall5)
Analysis of Variance Table

Response: I(Tonn.Hect * Sugar)
                Df     Sum Sq    Mean Sq   F value    Pr(>F)
DistrictGroup    4 1.7923e+08 4.4807e+07   29.8802 < 2.2e-16 ***
fs             153 1.7937e+09 1.1724e+07    7.8181 < 2.2e-16 ***
Area             1 1.4927e+10 1.4927e+10 9954.5987 < 2.2e-16 ***
Variety         27 1.5903e+08 5.8901e+06    3.9279 4.492e-11 ***
Age              1 2.0168e+08 2.0168e+08  134.4971 < 2.2e-16 ***
Jul.96           1 1.4376e+07 1.4376e+07    9.5872  0.001978 **
Jul.97           1 8.6460e+06 8.6460e+06    5.7658  0.016405 *
Residuals     2852 4.2767e+09 1.4995e+06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1225 on 2852 degrees of freedom
Multiple R-squared: 0.8016, Adjusted R-squared: 0.7886
F-statistic: 61.31 on 188 and 2852 DF, p-value: < 2.2e-16

As seen, the model is highly significant, with a global utility F-test p-value of
virtually zero and variables that are mostly significant.

Now a stepwise procedure will be used to determine if a better model can be derived.

step(rainfall5,direction = "both")
Start: AIC=43427.92
I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97

                 Df  Sum of Sq        RSS   AIC
<none>                         4.2767e+09 43428
- DistrictGroup   3 9.1712e+06 4.2859e+09 43428
- Jul.97          1 8.6460e+06 4.2853e+09 43432
- fs            153 4.6947e+08 4.7462e+09 43439
- Jul.96          1 2.2257e+07 4.2989e+09 43442
- Variety        27 1.0467e+08 4.3814e+09 43447
- Age             1 2.0240e+08 4.4791e+09 43567
- Area            1 1.4291e+10 1.8568e+10 47891

Call:
lm(formula = I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area +
Variety + Age + Jul.96 + Jul.97, data = training41, na.action =
na.exclude)

The stepwise procedure could not find a model with a lower AIC, so no better model is
available at this stage.
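
For reference, the AIC that step() prints for a linear model can be reproduced from the
residual sum of squares as n*log(RSS/n) + 2*p, where p is the number of estimated
coefficients. The sketch below plugs in the values reported for rainfall5. The sample
size n = 3041 is not stated in the original; it is inferred here from the residual and
model degrees of freedom (2852 + 188 + 1).

```r
rss <- 4276681856      # residual sum of squares of rainfall5
df_resid <- 2852       # residual degrees of freedom
p <- 188 + 1           # 188 model degrees of freedom plus the intercept
n <- df_resid + p      # inferred number of observations used in the fit

# AIC on the scale used by step() / extractAIC() for lm objects
aic <- n * log(rss / n) + 2 * p
print(round(aic, 2))
```

This should agree with the starting value AIC=43427.92 shown in the step() output,
which is a useful check that the inferred sample size is right.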

Heteroskedastic behaviour is still present in the residuals, indicating, as noted
earlier, that the model may be missing variables or that the response may require a log
transformation.

Rainfall over off-season:

rainfall6=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age
    +Jul.96+Jul.97+Mar.97+Apr.97+May.97+Jun.97,
    data=training41,na.action = na.exclude)

anova(rainfall6,rainfall5)
Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97 + Mar.97 + Apr.97 + May.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97
  Res.Df        RSS Df Sum of Sq      F  Pr(>F)
1   2849 4263674046
2   2852 4276681856 -3 -13007810 2.8973 0.03386 *

This suggests that there are months in the off-season that are significant to the model.

anova(rainfall6)
Analysis of Variance Table

Response: I(Tonn.Hect * Sugar)
                Df     Sum Sq    Mean Sq   F value    Pr(>F)
DistrictGroup    4 1.7923e+08 4.4807e+07   29.9398 < 2.2e-16 ***
fs             153 1.7937e+09 1.1724e+07    7.8337 < 2.2e-16 ***
Area             1 1.4927e+10 1.4927e+10 9974.4655 < 2.2e-16 ***
Variety         27 1.5903e+08 5.8901e+06    3.9358 4.155e-11 ***
Age              1 2.0168e+08 2.0168e+08  134.7655 < 2.2e-16 ***
Jul.96           1 1.4376e+07 1.4376e+07    9.6064  0.001958 **
Jul.97           1 8.6460e+06 8.6460e+06    5.7773  0.016298 *
Mar.97           1 7.4303e+04 7.4303e+04    0.0496  0.823690
Apr.97           1 1.1709e+07 1.1709e+07    7.8243  0.005189 **
May.97           1 1.2241e+06 1.2241e+06    0.8179  0.365866
Residuals     2849 4.2637e+09 1.4966e+06

The large p-values suggest that May.97 and Mar.97 are insignificant. We can test this
with a nested F-test.

Analysis of Variance Table

Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97 + Mar.97 + Apr.97 + May.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97 + Apr.97
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   2849 4263674046
2   2851 4269651799 -2  -5977753 1.9972 0.1359

This suggests that we cannot reject the null hypothesis that the Mar.97 and May.97
coefficients are zero, so these two months are not significant to the model.

Final models:
Hence the final models would look like this:

rainfalllog=lm(log(I(Tonn.Hect*Sugar))~DistrictGroup+SoilID+I(Area^0.5)+Area
    +Variety+Age+Jul.96+Jul.97+Apr.97,data=training41,na.action = na.exclude)

Analysis of Variance Table

Response: log(I(Tonn.Hect * Sugar))
                Df  Sum Sq Mean Sq   F value    Pr(>F)
DistrictGroup    4   17.83    4.46   21.2698 < 2.2e-16 ***
SoilID         153  284.23    1.86    8.8646 < 2.2e-16 ***
I(Area^0.5)      1 2026.01 2026.01 9667.7672 < 2.2e-16 ***
Area             1  125.48  125.48  598.7854 < 2.2e-16 ***
Variety         27   26.87    1.00    4.7491 9.929e-15 ***
Age              1   17.50   17.50   83.5171 < 2.2e-16 ***
Jul.96           1    2.49    2.49   11.8771 0.0005765 ***
Jul.97           1    0.38    0.38    1.8184 0.1776089
Apr.97           1    0.06    0.06    0.2642 0.6073098
Residuals     2850  597.25    0.21

Residual standard error: 0.4578 on 2850 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7944
F-statistic: 62.81 on 190 and 2850 DF, p-value: < 2.2e-16

The only issue with this model is that the coefficient for one SoilID level, 838,
appears as NA; all other SoilID levels are estimated without problems. The model has a
very good adjusted R-squared value of 0.7944 and a global utility F-test p-value of
virtually zero. These factors suggest that the Tonn.Hect*Sugar yield can be modelled on
the log scale, with the expected log value given by:

Log((Tonn.Hect*Sugar))~DistrictGroup+SoilID+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97+Apr.97

Also worth noting is that the QQ plot is approximately linear, suggesting the residuals
are normally distributed. Furthermore, the residuals vs fitted graph shows the residuals
scattered randomly and evenly, which is consistent with independent residuals of
constant variance.

In other words, the sugar cane yield under the final model could be written as follows:

(Tonn.Hect*Sugar)~Exp(DistrictGroup+SoilID+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97+Apr.97)

Alternatively, the model could be simplified by removing SoilID and the months Jul.97
and Apr.97. The months could be removed because their large p-values in the anova table
show a lack of significance. The issue with this is that it leaves a model without
rainfall, which is not entirely realistic as rainfall is key to plant growth.

Another model that would be appropriate is therefore the following:

(Tonn.Hect*Sugar)~Exp(DistrictGroup+Variety+I(Area^0.5)+Area+Age+Jul.96+Jul.97+Apr.97)

Because the SoilID variable gave an NA value for the specific level 838, we could not
make full use of it, and hence we have removed it.

rainfalllog=lm(log(I(Tonn.Hect*Sugar))~DistrictGroup+I(Area^0.5)+Area
    +Variety+Age+Jul.96+Jul.97+Apr.97,data=training41,na.action = na.exclude)

Residual standard error: 0.4724 on 3003 degrees of freedom
Multiple R-squared: 0.7837, Adjusted R-squared: 0.7811
F-statistic: 294.1 on 37 and 3003 DF, p-value: < 2.2e-16

Analysis of Variance Table

Response: log(I(Tonn.Hect * Sugar))
                Df  Sum Sq Mean Sq   F value    Pr(>F)
DistrictGroup    4   17.83    4.46   19.9773 2.992e-16 ***
I(Area^0.5)      1 2230.61 2230.61 9997.2664 < 2.2e-16 ***
Area             1  122.68  122.68  549.8412 < 2.2e-16 ***
Variety         27   31.40    1.16    5.2121 < 2.2e-16 ***
Age              1   22.40   22.40  100.3715 < 2.2e-16 ***
Jul.96           1    0.32    0.32    1.4457 0.2293090
Jul.97           1    2.75    2.75   12.3385 0.0004503 ***
Apr.97           1    0.07    0.07    0.3234 0.5696294
Residuals     3003  670.04    0.22

Again the QQ plot is moderately linear, suggesting the residuals are approximately
normally distributed. The residuals vs fitted graph also shows the residuals scattered
randomly and evenly, which is consistent with independent residuals of constant
variance. Although this model explains only 78.37% of the variation in the data, it is
likely the better alternative: removing SoilID reduced R-squared by only about 2
percentage points (from 0.8072 to 0.7837), and a model with fewer terms is often
preferable as it is less likely to overfit. The global utility F-test p-value is
virtually zero, and hence the model is significant.
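
The trade-off between the two final models can be checked by recomputing adjusted
R-squared from the reported R-squared and degrees of freedom. A small sketch follows,
using a hypothetical helper adj_r2 and the figures quoted in the two model summaries.

```r
# Hypothetical helper: adjusted R-squared from R-squared and degrees of freedom
adj_r2 <- function(r2, df_model, df_resid) {
  n <- df_model + df_resid + 1          # observations (the +1 is the intercept)
  1 - (1 - r2) * (n - 1) / df_resid
}

with_soil <- adj_r2(0.8072, df_model = 190, df_resid = 2850)
without_soil <- adj_r2(0.7837, df_model = 37, df_resid = 3003)
print(round(c(with_soil = with_soil, without_soil = without_soil), 4))
```

The results should match the reported adjusted values to within rounding of the inputs.
On these figures, dropping the 153 SoilID degrees of freedom costs a little over 0.01
in adjusted R-squared.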

The test data output for the model is as follows (these are the predictions on the log
scale):

       1        2        3        4        5        6        7        8        9
6.628094 7.605034 5.694790 6.908057 6.953370 7.881200 6.910592 7.608660 6.757070

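The following values were reported as the un-logged data. Note that they are the natural
log of the predictions above (for example, log(6.628094) = 1.891317); returning to the
original scale requires exp() instead:

1.891317 2.02881 1.739552 1.932688 1.939226 2.06448 1.933055 2.029287 1.910589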
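
Because the model was fitted to log(Tonn.Hect*Sugar), predictions are returned to the
original scale with exp(), the inverse of log(). The sketch below applies it to the nine
predicted values listed above and also checks how the second list of values relates to
the first. The vector names are hypothetical.

```r
log_pred <- c(6.628094, 7.605034, 5.694790, 6.908057, 6.953370,
              7.881200, 6.910592, 7.608660, 6.757070)
second_list <- c(1.891317, 2.02881, 1.739552, 1.932688, 1.939226,
                 2.06448, 1.933055, 2.029287, 1.910589)

# Back-transform to the original Tonn.Hect*Sugar scale
yield_pred <- exp(log_pred)
print(round(yield_pred, 1))

# Largest gap between the second list and log(log_pred); a value near zero
# means the second list is a second log of the predictions, not exp()
print(max(abs(log(log_pred) - second_list)))
```

Note that exp() of a log-scale prediction estimates the median of the response rather
than its mean when the log-scale errors are normal, so it tends to sit slightly below
the average yield.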

The following is a short summary of how the expected log Tonn.Hect*Sugar changes with
each variable, holding the others fixed (variety/district was not included due to the
large number of levels):

Age: 0.0565101 unit decrease in expected log Tonn.Hect*Sugar per year of age increase.

Jul.96: 0.0004569 unit increase in expected log Tonn.Hect*Sugar per unit increase in
July 96 rainfall.

Jul.97: 0.0017393 unit decrease in expected log Tonn.Hect*Sugar per unit increase in
July 97 rainfall.

Apr.97: 0.0001665 unit increase in expected log Tonn.Hect*Sugar per unit increase in
April 97 rainfall.

I(Area^0.5): 2.4300759
Area: -0.3384773

=> on average a 2.091599 unit increase of expected log Tonn.Hect*Sugar when there is a
unit increase in both Area^0.5 and Area (2.4300759 - 0.3384773 = 2.091599).
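
Coefficients of a log-response model are often easier to read as approximate percentage
changes on the original scale, 100*(exp(b) - 1). The sketch below does this for Age and
also evaluates the combined Area effect, whose slope on the log scale is
b1/(2*sqrt(Area)) + b2 and therefore depends on the value of Area. The coefficient
values are those quoted above; the function and variable names are hypothetical.

```r
b_age <- -0.0565101
b_sqrt_area <- 2.4300759
b_area <- -0.3384773

# Approximate percentage change in Tonn.Hect*Sugar per extra year of age
pct_age <- 100 * (exp(b_age) - 1)
print(round(pct_age, 2))

# Slope of the log response with respect to Area, evaluated at a given Area
area_slope <- function(area) b_sqrt_area / (2 * sqrt(area)) + b_area
print(area_slope(c(1, 4, 9)))

# Area at which the fitted log response stops increasing (slope equals zero)
area_peak <- (b_sqrt_area / (2 * abs(b_area)))^2
print(round(area_peak, 2))
```

Because Area^0.5 and Area cannot change independently, the slope function gives a more
faithful picture of the Area effect than adding the two coefficients together.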

Conclusion:

In conclusion, we can say that the model:

log(I(Tonn.Hect*Sugar))~DistrictGroup+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97+Apr.97

is a very good candidate, as it contains all of the terms deemed relevant to the sugar
cane yield. As mentioned, the winter months are of primary interest because of their
rainfall, and through our analysis they were found to be relevant. The area was also
very important: as implied by the model, more area means more yield. Age had a negative
relation to the sugar cane yield, as older plants would be expected to decline and
generate less yield. Variety was also shown to be important, as the type of sugar cane
would have an effect on the sugar content.
