
DATA ANALYTICS

ASSIGNMENT
NE-3002 Energy in Buildings

Class: MEngSc Sustainable Energy


Student name: Yoga Bagus Wicaksono
Student Number: 116221387
Course Lecturer: Dominic O'Sullivan
Lab Instructor: Kevin Leahy
Lab Date: 3rd of October 2016
1. Introduction
Energy consumption in a building is affected by several variables. The relationship between those variables and
the energy consumption can be studied and predicted using data analytics. In this report, the Boole Library
electricity consumption data set for 2012 is used to predict the electricity consumption in 2013.
The data analytics in this report is carried out in Python, using Jupyter notebooks. Python is used to build a
model from the existing data that best represents the relationship between each variable and the electricity
consumption. The model can then predict the electricity consumption for any new inputs of those
variables. This process is known as machine learning.
2. Methodology
The engines of this data analytics are Python libraries such as numpy, seaborn, matplotlib, pandas, and
statsmodels. The first step is to load the data set that serves as the reference for the model. The data set used here is
Boole Library Electric Consumptions 2012. This data set consists of one year of daily data on the Boole Library's opening
hours, foot fall, heating degree days, daylight, and electricity consumption.
The next step is to examine the correlation between each pair of variables using the correlation coefficient. The
correlation coefficient measures the strength and direction of the linear relationship between two variables. It is usually
denoted R. The mathematical formula for computing R is:
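\[ R = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \qquad \text{(eq.1)} \]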
where x is the data series of the first variable, y is the data series of the second variable, and n is the number of data points
in each series. In Python this calculation can be done with the correlation coefficient function in the numpy
library. The result of this calculation for the Boole Library Electric Consumptions 2012 data set is shown in
Figure 1.
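As a minimal sketch of how eq.1 maps to code, the function below computes R term by term and compares it with `numpy.corrcoef`. The input values are hypothetical and are not taken from the Boole Library data set; the function name `pearson_r` is also only illustrative.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient R computed term by term from eq.1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x**2) - np.sum(x) ** 2)
        * (n * np.sum(y**2) - np.sum(y) ** 2)
    )
    return numerator / denominator

# Hypothetical values for illustration only (not the report's data).
opening_hours = [8, 10, 12, 14, 16]
electricity = [4100, 4900, 5600, 6500, 7000]

r_manual = pearson_r(opening_hours, electricity)
# np.corrcoef returns the full correlation matrix; [0, 1] is R between x and y.
r_numpy = np.corrcoef(opening_hours, electricity)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

Both values should agree, which suggests the library function implements the same formula. For a whole table of variables, a matrix like the one in Figure 1 could be obtained by passing all columns at once.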
The next step is to build the prediction model using linear regression. Based on the reference data set, an
equation can be derived that generalizes the behaviour of the target variable under the influence of the other
variables. Each influencing variable has its own weight, which represents the strength of that variable's effect on the target
variable. The value of the target variable is the sum of the weighted variables (eq.2):
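\[ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m \qquad \text{(eq.2)} \]
Where:
$\hat{y}$ is the prediction/estimate of the linear model for a particular sample (target variable)
$x_1, x_2, \dots, x_m$ are the m features of a single sample (influencing variables)
$w_0, w_1, \dots, w_m$ are the weights: $w_0$ is the constant term and $w_1, \dots, w_m$ are the m associated weights applied to each feature/variable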
In Python, this process can be done with the statsmodels library.
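The least-squares fit behind eq.2 can be sketched as follows. To keep the example runnable with NumPy alone, it uses `numpy.linalg.lstsq` instead of statsmodels, which is a substitution and not the report's actual code. The data are synthetic, and the names `true_weights` and `X_design` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: 200 samples, m = 2 features.
n_samples = 200
X = rng.uniform(0, 10, size=(n_samples, 2))
true_weights = np.array([5.0, 2.0, -1.5])  # constant term, then one weight per feature
noise = rng.normal(0, 0.1, size=n_samples)
y = true_weights[0] + X @ true_weights[1:] + noise

# Prepend a column of ones so that the first fitted weight is the constant term.
X_design = np.column_stack([np.ones(n_samples), X])

# Ordinary least squares: choose the weights that minimise the sum of squared residuals.
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(weights, 2))
```

The fitted weights should land close to the values used to generate the data. With statsmodels, the equivalent fit would be along the lines of `sm.OLS(y, sm.add_constant(X)).fit()`, which additionally reports the summary statistics.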
However, before the linear regression model is built, the data set needs to be split into two sets: the training
set and the test set. The training set is used to build the model, and the test set is used
to test the accuracy of the model. The test set is not included when the linear regression
model is fitted, so it does not influence the model's weights. When the influencing variables of the
test set are fed into the model equation, the predictions should be close to the actual target values of the
test set. The training set takes 80% of the original data set, and the remaining 20% is the test set.
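A minimal sketch of an 80/20 split is shown below. It assumes the rows are shuffled before splitting and that the data set has one row per day of 2012 (a leap year, so 366 rows); the report does not state either detail, and the function name is hypothetical.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled (train_idx, test_idx) row-index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Assumed size: one row per day of 2012.
train_idx, test_idx = train_test_split_indices(366)
print(len(train_idx), len(test_idx))  # 293 73
```

Fixing the seed keeps the split reproducible, so the reported accuracy does not change from run to run.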
After running Ordinary Least Squares in Python, we obtain the weight of each variable
and other statistical measures (see Table 1). After the model is created from the training data set, its
accuracy needs to be checked. One way to do this is to run the model on the test data set and
calculate the model's coefficient of determination (R²). The R² value shows the goodness of fit of the model: it is a
statistical measure of how well the model's predictions fit the actual values from the test data set. The closer the value
is to 1, the better the model. However, this is not the final step. The plot of the predictions against the test values
also needs to be checked (Fig.2). It shows how well the R² value represents the fit of the model.
Finally, a prediction of the Boole Library electricity consumption in 2013 can be produced using the linear
regression model.
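The coefficient of determination described above can be computed directly from the actual and predicted values. The sketch below uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares; the numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_actual - y_predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical test-set values in kWh (not the report's data).
actual = np.array([4000.0, 5000.0, 6000.0, 7000.0])
predicted = np.array([4200.0, 4900.0, 6100.0, 6800.0])

residuals = actual - predicted  # these are what a residual plot would show
print(r_squared(actual, predicted))
```

Here the residual sum of squares is 100000 and the total sum of squares is 5000000, so R² comes out at 0.98.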
3. Result and Discussion
a. Correlation Coefficient
The Boole Library Electric Consumptions 2012 data set consists of five variables: Opening Hours,
Foot Fall, Heating Degree Days, Day Light, and Electricity. The first step in analysing this data set is to
examine the relationship between each pair of variables using the correlation coefficient. Figure 1 shows the
correlation coefficient between each pair of variables. A value of 1 indicates a perfect linear relationship in the same
direction: when variable 1 goes up, variable 2 goes up too, and vice versa. A value of
-1 indicates an equally strong relationship in the opposite direction: variable 1 goes down when variable
2 goes up, and vice versa. A value of 0 means there is no linear relationship between the two variables:
changes in variable 1 are not reflected in variable 2.

Figure 1. Matrix of Correlation Coefficients for the Boole Library Electric Consumptions 2012 Dataset
In this data analysis, the target variable is electricity consumption. Therefore the main focus of this
discussion is the correlation of Opening Hours, Foot Fall, Heating Degree Days, and Day Light with
Electricity Consumption.
The opening hours of the library have the strongest positive correlation with electricity consumption. This
means that electricity usage rises roughly linearly with opening hours. This happens because electricity is in use
while the library is open, and usage drops to a minimum when the library closes.
Foot fall also has a strong positive correlation with electricity use. Foot fall is the number of people
who come into the library each day. The correlation is strong because the more people there are
in the library, the more of its facilities are in use. For example, more people in the library means
more reading lamps and computers in use, more people charging their devices, and more use of the elevator.
Heating degree days have a positive but low correlation with electricity usage. Heating degree days
(HDD) indicate by how many degrees the temperature needs to be raised to reach a comfortable room temperature
in one day. HDD is strongly related to heating systems. However, the correlation with the library's electricity
consumption is low because the Boole Library uses gas for its heating system, not electricity. There may still
be an indirect link, which would explain the small but non-zero correlation coefficient.
Daylight has a negative but low correlation with electricity usage. Daylight does not seem to have a
direct impact on electricity usage in the library, since the majority of the lamps are on all day during opening
hours. However, there may be an indirect negative correlation between daylight and the library's electricity
consumption.
b. Linear Regression
The Boole Library Electric Consumptions 2012 data set is used as the input to develop the linear regression
model of the Boole Library electricity consumption. Table 1 shows the estimated weight/coefficient of each
variable in the linear regression equation.

Table 1. Result of Ordinary Least Squares from Python for the Boole Library Electricity Consumption linear regression model
Using the weight/coefficient of each variable, equation 2 can be rewritten as:

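\[ \hat{y} = 2411.67 + 242.62\,(\mathrm{OH}) + 0.25\,(\mathrm{F}) - 12.61\,(\mathrm{HDD}) + 8.21\,(\mathrm{DL}) \]

Where:
$\hat{y}$ is the predicted/estimated electricity consumption (kWh)
$\mathrm{OH}$ is the Opening Hours (hours)
$\mathrm{F}$ is the Foot Fall (persons)
$\mathrm{HDD}$ is the Heating Degree Days (°C)
$\mathrm{DL}$ is the Day Light (hours)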
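As a worked example of the fitted equation, the sketch below evaluates it for one hypothetical day. The coefficients are those reported in the equation above, with the heating degree days term taken as negative, as the conclusion of this report states. The input values and the function name are assumptions for illustration.

```python
def predict_electricity_kwh(opening_hours, foot_fall, hdd, daylight_hours):
    """Evaluate the fitted linear model with the coefficients reported in Table 1."""
    return (
        2411.67
        + 242.62 * opening_hours
        + 0.25 * foot_fall
        - 12.61 * hdd
        + 8.21 * daylight_hours
    )

# Hypothetical day: 14 opening hours, 3000 visitors, HDD of 5, 10 hours of daylight.
estimate = predict_electricity_kwh(14, 3000, 5, 10)
print(round(estimate, 2))  # 6577.4
```

The terms are 2411.67 + 3396.68 + 750.00 - 63.05 + 82.10 = 6577.40 kWh, which shows that opening hours dominate the estimate while HDD and daylight contribute comparatively little.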
Table 1 shows that the R² value of the model is 0.94. This R² value is calculated on the training
data set. It means that 94% of the variance in the actual electricity consumption
can be explained by the four variables (OH, F, HDD, DL) in the model,
while the other 6% cannot be explained by the linear regression model. However, when the
test data set is used to predict the electricity consumption, the R² obtained is 0.89.

The test R² is lower than the training R² because the training R² is calculated on the same data from which
the model was developed, whereas the test data set played no part in fitting the model. Nevertheless, the
test R² is high: it means 89% of the variance in the test-set electricity consumption is explained by the linear
regression model.
A high R² value does not mean the model is necessarily accurate. The data may contain behaviour that cannot be
captured by linear regression (for example, a sinusoidal pattern), which could lead to a wrong interpretation of
R². Therefore the predictions need to be compared with the actual electricity consumption values in the
test data set (Fig.2).
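The point that a high R² can hide unexplained structure can be illustrated with a small synthetic sketch. A straight line is fitted to data that contain a linear trend plus a sinusoidal component; the series and all names are made up for illustration and do not come from the library data.

```python
import numpy as np

# Synthetic series: a linear trend plus a sinusoidal component.
x = np.linspace(0, 20, 200)
y = 3.0 * x + 4.0 * np.sin(x)

# Fit a straight line by least squares (polyfit returns slope first, then intercept).
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Lag-1 autocorrelation of the residuals: near 0 for random scatter,
# near 1 when neighbouring residuals follow a systematic pattern.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(round(r2, 3), round(lag1, 3))
```

In this sketch R² stays high because the trend dominates the variance, yet the residuals are strongly autocorrelated: they trace the sinusoid that the linear model cannot represent. This is why the predictions are inspected against the actual values in addition to reading R².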

Figure 2. Comparison of the Estimated Electricity Consumption and the Actual Electricity Consumption of the Test Dataset
Figure 2 shows that there is some deviation between the two series, but the predicted electricity consumption
follows a similar trend to the actual values in the test set. The model therefore has reasonably good accuracy.
c. Boole Library Electricity Consumption in 2013

[Chart: Boole Library Electricity Consumption. Vertical axis: Electricity Consumptions (kWh), 0 to 10000. Horizontal axis: date, 01-Jan to 27-Oct. Series: 2013 Electricity (estimate) and 2012 Electricity.]
Figure 3. Boole Library Electricity Consumptions 2013 (estimated)
and Boole Library Electricity Consumptions 2012

The electricity consumption in 2013 is estimated using the linear regression model developed
in the previous step. The input is the Boole Library Electric Consumptions 2013 data set, which consists
of Opening Hours, Foot Fall, Heating Degree Days, and Day Light. Figure 3 shows the estimated
electricity consumption in 2013 alongside the electricity consumption in 2012. The graph shows that
electricity consumption repeatedly drops to a low point and rises again. These low points occur at weekends,
when the library's opening hours are shorter. However, consumption never reaches 0, which
means the library always consumes some electricity. This is consistent with the constant term of the model
(2411.67 kWh), which represents the minimum electricity consumption of the library.
The graph also shows a significant decrease in electricity usage during May-August,
which corresponds to the holiday period of the academic year. It also shows different minimum electricity usage
during February-May and May-November. This difference might be caused by changes in the library's
opening-hours policy between the first and second semesters, or by differences in the peak times at which students
come to the library.
4. Conclusion
Boole Library electricity consumption is related to several factors, such as opening hours and foot fall. This
conclusion is the result of analysing the Boole Library Electric Consumption 2012 data set. The
methods used in this exercise are the correlation coefficient and linear regression. The correlation
coefficients also show that Heating Degree Days and Daylight may not have a large impact on electricity
consumption.
The linear regression model is developed from the Boole Library Electric Consumption 2012 data set using
Ordinary Least Squares in Python. It calculates the weight (coefficient) of each variable used to estimate the
electricity consumption. The resulting model has fairly high accuracy. The coefficient of determination on the
training data (training R²) is 94%, and the coefficient of determination on the test data set (test R²) is 89%. The plot of
estimated against actual electricity consumption also shows relatively small residuals. However, the weights for HDD
and Daylight seem implausible. A high HDD value means more heating is needed, so energy consumption should rise,
but the HDD weight is negative, which suggests the opposite. Similarly, a high daylight value
means less need for artificial light, but the daylight coefficient is positive, which implies that daylight and electricity
consumption move in the same direction. Based on this reasoning and the correlation coefficients, it is concluded that
HDD and daylight are not meaningfully related to electricity consumption. These two variables may introduce bias into
the linear regression model. If these two variables (HDD and daylight) are removed from the inputs, the linear regression
model might improve.
The electricity consumption can be estimated using the developed linear regression model. The
graph of Boole Library electricity consumption reveals patterns and variations in the library's
electricity usage.
