
# UECM2263 Applied Statistical Models, January 2016

Assignment Report 1
UTAR, Malaysia

## PREDICTION ON HOUSE ASKING PRICE IN JINJANG

Sutha Kathiravan¹, Keoy Eng Chiew², Chin Yan³, Sathya Seelan¹, Liaw Zhen Liang¹

¹ AM, Y2S2, Department of Mathematics and Actuarial Sciences, UTAR

sutha_sky@yahoo.com

## Abstract

Statistical research was carried out on houses pending sale in Jinjang, Kuala Lumpur, to assess housing affordability and other information relevant to potential buyers. Our statistical findings are intended to help house hunters make good decisions when choosing a house.

Keywords: statistics

## I. INTRODUCTION

This project investigates the factors affecting the asking price of houses up for sale in Jinjang. Recent reports indicate that, although the Malaysian economy is slowing and the ringgit is depreciating, economists expect house prices to continue to rise, making it difficult for Malaysians to own a house in the future [2].

For this project we identified 13 predictor variables: Area, Housing Type, Land Area, Built-up Area, Tenure Type, Number of Bedrooms, Number of Bathrooms, Furnished, Distance to nearest LRT/KTM/Monorail station, Distance to nearest primary school, Distance to nearest secondary school, Distance to nearest shopping mall, and Distance to nearest mosque, together with one response variable (asking price of the property). We collected the data from a website (iProperty) and carried out the statistical analysis in R. In this project, we do not use Area and Land Area as predictor variables because they are not related to the response variable. Distance to nearest mosque is also excluded because it has too many missing values.
The variables used are defined as follows:

* Y is the Asking Price of the property.
* X1 is the Housing Type.
* X2 is the Built-up Area.
* X3 is the Tenure Type.
* X4 is the Number of Bedrooms.
* X5 is the Number of Bathrooms.
* X6 is the Distance to the nearest LRT/KTM/Monorail station.
* X7 is the Distance to the nearest primary school.
* X8 is the Distance to the nearest secondary school.
* X9 is the Distance to the nearest shopping mall/convenience store.
* X10 is Furnished (whether the property is furnished).
The significance level used throughout this project is α = 0.05.
## II. Methodology

We carried out our project on houses that were up for sale in Jinjang, Kuala Lumpur, as listed on iProperty. Our population is the houses for sale in Jinjang, and our sample consists of 65 observations.

First, we propose a multiple linear regression model to explain the relationship between the response variable Y and the predictor variables. Under this hypothesis,

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8 + \beta_9 X_9 + \beta_{10} X_{10} + \varepsilon$$

where ε is the error term.
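As an illustration of how a model of this form can be fitted in R, the following is a minimal sketch. It assumes a data frame named `model.dat` with a response `Y`, as in the report's own code, but the data here are synthetic and every numeric value used to generate them is made up. Only three predictors are included to keep the example short.

```r
# Synthetic stand-in for the iProperty sample (all values are assumptions).
set.seed(1)
n <- 65                                   # same sample size as the report
model.dat <- data.frame(
  X2 = runif(n, 800, 3000),               # built-up area (hypothetical range)
  X4 = sample(2:6, n, replace = TRUE),    # number of bedrooms
  X5 = sample(1:5, n, replace = TRUE)     # number of bathrooms
)

# Hypothetical "true" relationship, used only to generate Y.
model.dat$Y <- 50000 + 200 * model.dat$X2 + 80000 * model.dat$X4 +
  100000 * model.dat$X5 + rnorm(n, sd = 200000)

# Fit the multiple linear regression by least squares.
model.reg <- lm(Y ~ X2 + X4 + X5, data = model.dat)

# Estimated coefficients (intercept and one slope per predictor).
coef(model.reg)

# Adjusted R-squared of the fitted model.
summary(model.reg)$adj.r.squared
```

With the real data, the formula would list all ten predictors, for example `Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10`.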
## III.

The full model with all ten predictors gave the following summary output.

```
Residuals:
     Min       1Q
 -532028  -101390

Residual standard error: 230400 on 54 degrees of freedom
Adjusted R-squared: 0.7522
F-statistic: 20.43 on 10 and 54 DF, p-value: 5.681e-15
```

After analysing the data to obtain the summary for each variable, we found that the fitted regression equation is

$$\hat{Y} = 79525.93 + 99512.34 X_1 + 220.92 X_2 - 262283.09 X_3 + 83774.21 X_4 + 109262.42 X_5 - 67520.15 X_6 - 130055.64 X_7 + 31032.78 X_8 + 50432.54 X_9 + 106441.45 X_{10}$$

Assumptions:

1. The relationship between the response Y and the predictor variables is linear.
2. The error term ε has zero mean.
3. The error term has constant variance.
4. The errors are normally distributed.
5. The errors are uncorrelated.
6. There are no outliers.

The validity of these assumptions should always be questioned, and analysis should be conducted to examine the adequacy of the model. A plot of the residuals against a predictor x examines the linearity of the model, while a plot of the residuals against the predicted values checks the constancy of the error variance. A normal probability plot checks whether the errors are normally distributed. In this assignment we check X2 (Built-up Area) and X4 (Number of Bedrooms), as the other variables are not related.

## Non-linearity of Regression Model

Residual vs Built-Up Area

Input

```r
# Simple regression of asking price on built-up area
model.reg <- lm(Y ~ X2, data = model.dat)

# Residuals against the predictor, with a reference line at zero
plot(x = model.dat$X2, y = model.reg$residuals,
     xlab = "Built-Up Area", ylab = "Residuals",
     main = "Residuals vs. Built-Up Area",
     col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
```

Output

Figure 1: The residuals fall within a horizontal band centred around 0, displaying no systematic tendency to be positive or negative. Therefore, a linear regression model is appropriate.
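The three graphical checks described in this section (residuals against a predictor, residuals against predicted values, and a normal probability plot) can be run together on one fitted model. The sketch below is illustrative only: it uses synthetic data in place of the iProperty sample, the values used to generate the data are assumptions, and the Shapiro-Wilk test at the end is an addition that the report does not use, included as a numerical complement to the normal probability plot.

```r
# Synthetic data standing in for the real sample (values are assumptions).
set.seed(42)
n <- 65
model.dat <- data.frame(X2 = runif(n, 800, 3000))
model.dat$Y <- 100000 + 220 * model.dat$X2 + rnorm(n, sd = 150000)

model.reg <- lm(Y ~ X2, data = model.dat)
res <- residuals(model.reg)

# Linearity: residuals against the predictor.
plot(model.dat$X2, res, xlab = "Built-Up Area", ylab = "Residuals")
abline(h = 0, col = "blue")

# Constant variance: residuals against the fitted (predicted) values.
plot(fitted(model.reg), res, xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "blue")

# Normality: normal probability plot with a reference line.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality (an addition, not used in the report).
sw <- shapiro.test(res)
sw$p.value
```

A p-value above the chosen significance level of 0.05 would give no evidence against normality, which is the same conclusion the plots are meant to support visually.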
Lecturer: Dr. Chang Yun Fah

Residual vs Bedrooms

Input

```r
# Simple regression of asking price on number of bedrooms
model.reg <- lm(Y ~ X4, data = model.dat)

# Residuals against the predictor, with a reference line at zero
plot(x = model.dat$X4, y = model.reg$residuals,
     xlab = "Bedrooms", ylab = "Residuals",
     main = "Residuals vs. Bedrooms",
     col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
```

Output

Figure 2: The residuals fall within a horizontal band centred around 0, displaying no systematic tendency to be positive or negative. Therefore, a linear regression model is appropriate.

## Non-constancy of Error Variance

Residual vs Built-Up Area

Input

```r
model.reg <- lm(Y ~ X2, data = model.dat)

# Residuals against the predicted values (the x-axis is the fitted value)
plot(x = model.reg$fitted.values, y = model.reg$residuals,
     xlab = "Predicted Values", ylab = "Residuals",
     main = "Residuals vs. Predicted Values",
     col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
```

Output

Figure 3: The graph shows that the points are randomly scattered within a horizontal band centred around 0, and no funnel shape is observed. Hence, the constant variance assumption appears to be fulfilled.

Residual vs Bedrooms

Input

```r
model.reg <- lm(Y ~ X4, data = model.dat)

# Residuals against the predicted values (the x-axis is the fitted value)
plot(x = model.reg$fitted.values, y = model.reg$residuals,
     xlab = "Predicted Values", ylab = "Residuals",
     main = "Residuals vs. Predicted Values",
     col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
```

Output

Figure 4: The graph shows that the points are randomly scattered within a horizontal band centred around 0, and no funnel shape is observed. Hence, the constant variance assumption appears to be fulfilled.

## Normal Probability Plot

Price vs Built-Up Area

Input

```r
model.reg <- lm(Y ~ X2, data = model.dat)

# qqnorm() draws the plot and returns the plotted coordinates
qqplot <- qqnorm(model.reg$residuals, main = "Normal Probability Plot",
                 xlab = "Theoretical Quantiles",
                 ylab = "Sample Quantiles of Residuals",
                 plot.it = TRUE, col = "blue", pch = 19, cex = 1.5,
                 panel.first = grid(col = "gray", lty = "dotted"))

# Reference line fitted through the plotted points
abline(lm(qqplot$y ~ qqplot$x))
```

Output

Figure 5: From the graph above, the error terms do not depart substantially from normality, suggesting that the errors conform to the normality assumption initially made. No violation of normality is detected.

Price vs Bedrooms

Input

```r
model.reg <- lm(Y ~ X4, data = model.dat)

# qqnorm() draws the plot and returns the plotted coordinates
qqplot <- qqnorm(model.reg$residuals, main = "Normal Probability Plot",
                 xlab = "Theoretical Quantiles",
                 ylab = "Sample Quantiles of Residuals",
                 plot.it = TRUE, col = "blue", pch = 19, cex = 1.5,
                 panel.first = grid(col = "gray", lty = "dotted"))

# Reference line fitted through the plotted points
abline(lm(qqplot$y ~ qqplot$x))
```

Output

Figure 6: From the graph above, the error terms do not depart substantially from normality, suggesting that the errors conform to the normality assumption initially made. No violation of normality is detected.

Referring to [1], research has shown that the effect of built-up area on price depends on location. In rural areas, housing prices tend to be lower, whereas in a city, dwellers favour houses with a large built-up area.

According to [3], interior features also affect the price of a house, through the number of bathrooms, the number of bedrooms and the housing type, apart from its geographical location.

## Coefficient of Determination

| Model | Predictors | R.sq | Adjusted R.sq |
|---|---|---|---|
| 1 | x1 | 0.1331626 | 0.1194033 |
| 2 | x1, x2 | 0.5943963 | 0.5813123 |
| 3 | x1, x2, x3 | 0.6297972 | 0.6115905 |
| 4 | x1, x2, x3, x4 | 0.7343859 | 0.7166783 |
| 5 | x1, x2, x3, x4, x5 | 0.7551079 | 0.7343543 |
| 6 | x1, x2, x3, x4, x5, x6 | 0.7769258 | 0.7538492 |
| 7 | x1, x2, x3, x4, x5, x6, x7 | 0.7831864 | 0.7565602 |
| 8 | x1, x2, x3, x4, x5, x6, x7, x8 | 0.7832118 | 0.7522420 |
| 9 | x1, x2, x3, x4, x5, x6, x7, x8, x9 | 0.7883409 | 0.7537057 |
| 10 | x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 | 0.7909363 | 0.7522208 |

The summary output for model 7 is as follows.

```
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = model.dat)

Residuals:
    Min      1Q  Median      3Q     Max
-534969  -81367   13282  156080  410508

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 139279.00  138479.11   1.006 0.318774

Residual standard error: 228400 on 57 degrees of freedom
Adjusted R-squared: 0.7566
F-statistic: 29.41 on 7 and 57 DF, p-value: < 2.2e-16
```
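The adjusted R-squared values in the coefficient of determination table can be reproduced from the R-squared values using the standard formula $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where n = 65 is the sample size and p is the number of predictors. The sketch below is a simple check under that formula. The helper name `adj_r2` is hypothetical, and the R.sq values are copied from the table.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 65
p <- 1:10   # model k uses the predictors x1, ..., xk

# R.sq values for models 1 to 10, copied from the table.
r2 <- c(0.1331626, 0.5943963, 0.6297972, 0.7343859, 0.7551079,
        0.7769258, 0.7831864, 0.7832118, 0.7883409, 0.7909363)

adj <- adj_r2(r2, n, p)
round(adj, 7)

# Index of the model with the largest adjusted R-squared.
best <- which.max(adj)
best
```

R-squared never decreases as predictors are added, so it cannot be used alone to compare models of different sizes. The adjusted version penalises each extra predictor, which is why it peaks at an intermediate model rather than at the full model.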
## IV. Conclusion

The best model is model 7, because it has the highest adjusted R-squared. Therefore, the best model for estimating the price of a house is

$$\hat{Y} = 139279 + 102055.71 X_1 + 218.51 X_2 - 247709.72 X_3 + 81540.08 X_4 + 119661.49 X_5 - 77243.87 X_6 - 92439.26 X_7$$

The challenge that we faced throughout this project was matching certain predictors with the response variable. While collecting and compiling the data, we found that the variable for the distance to the nearest mosque had many missing values, which made it difficult to produce the scatterplot. Hence, this predictor had to be dropped.

In future, a project should be carried out in which all the predictors have complete values.
## V. Reference

[1] Gallent, N., Shucksmith, M., & Tewdwr-Jones, M. (2003). Housing in the European Countryside: Rural Pressure and Policy in Western Europe. Architecture. 35-36.

[2] Malaysia's property market slowing sharply. (2016, January 4). Global Property Guide. Retrieved March 23, 2016, from http://www.globalpropertyguide.com/Asia/malaysia/Price-History

[3] Positive and negative impacts on house prices. Rightmove. Retrieved from http://www.rightmove.co.uk/what-affects-houseprices.html
## VI. Overall

Overall, in this project we stated the assumptions of a multiple linear regression model and tested their validity using plots. Residual plots (Residual vs. Built-Up Area and Residual vs. Bedrooms) were used to check for non-linearity of the regression model and non-constancy of the error variance, and normal probability plots (Price vs. Built-Up Area and Price vs. Bedrooms) were used to check normality. The coefficient of determination, in the form of R squared and adjusted R squared, was then obtained so that the best model could be selected.