
CASE BASED REASONING AS A TOOL "IN THE SEARCH FOR KNOWLEDGE
IN DATABASES": PROSPECTS OF A "SUPPORT SYSTEM" FOR RESIDENTIAL
VALUATION AND THE CONSTRUCTION OF RESIDENTIAL HOUSING.
Ilesh Dattani & Max Bramer1
"I have but one lamp by which my feet are guided, and that is the lamp of experience. I know
no way of judging the future but by the past." Patrick Henry (Speech in Virginia
Convention, Richmond, March 23, 1775).
"Computers have promised us a fountain of wisdom but delivered a flood of data"
A frustrated MIS executive

Introduction
What are we supposed to do with these 'floods of data'? Only a very small proportion of the
data will ever be seen by human eyes, and even less will be analysed and understood. Data
that is intelligently analysed and presented is a valuable resource and can be used
commercially for competitive advantage. The widespread exploitation of knowledge discovery
has gone hand in hand with this realisation. Case-based reasoning represents part of
Artificial Intelligence's scientific ambitions within problem solving and the search for
efficient methods to define descriptive patterns and explanations within such 'a flood of
data'. In principle it can be used when working with large amounts of historical data and in
situations where there is a need to extract order from complex data.
At a simple level, case-based reasoning is the ability to solve a given problem by
remembering a previous similar situation and reusing information and knowledge from that
situation. This approach is supported by empirical evidence and results from cognitive
psychological research, and part of the foundations of the case-based approach lies within
these findings.
Essentially the roots of case-based reasoning in AI are found in the work of Roger Schank2
on dynamic memory and the fundamental role that a reminding of earlier situations has in
problem solving and learning.
For a fuller overview of the theoretical principles see [Dattani & Bramer.,95. Aamodt and
Plaza.,94. Watson & Marir.,94].
Prospects for Applications
The work in progress within the domain of property valuation and construction involves the
use of two datasets at present. The first of these was collated from the 1970 US Census.3

1 Dept of Information Science, AI Research Group, University of Portsmouth, Milton Site, Portsmouth.
PO4 8JF
Email: DATTANII@sis.port.ac.uk & BRAMERMA@cv.port.ac.uk
2 Schank, R (1982): Dynamic memory; a theory of reminding and learning in computers and people.
Cambridge University Press.
3 Original dataset held in the UCI Repository Of Machine Learning Databases and Domain Theories


The data concerns housing values in the suburbs of Boston. There are 506 instances, all with
14 attributes: 13 continuous attributes (including the "class" attribute "MEDV") and one
binary-valued attribute. The attributes are:

CRIM      per capita crime rate by town
ZN        proportion of land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS4     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 (Bk represents proportion of blacks by town)
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000's

There are no missing attribute values and all the data is in numerical format.
The analysis of house price data to establish the effect of variations in locational and
physical attributes has been attempted with statistical techniques alone: the aim is usually
to establish which attributes can then be used to synthesise valuations of a range of
different properties. On their own, statistical techniques appear to have achieved only a
limited degree of success despite the relatively complex calculations involved. The results
can be open to different interpretations, and any additional 'knowledge or information' used
by the expert in making the valuations is not readily apparent.
We are considering whether numerical analysis and CBR can complement one another within a
"hybrid system". Within such a system, statistical techniques can be used to perform
Exploratory Data Analysis (EDA) on large datasets, after which CBR can be used. Information
derived from the initial statistical tests can act as input to the CBR system, particularly
when applying appropriate weightings and developing qualitative models.
Statistical Techniques
By using correlational methods to identify relationships between the attributes one can
make more informed judgements about developing qualitative models and determining the
respective weightings for attributes that might together be related in order to build
'virtual q-nodes'. A 'virtual q-node' is used at one level to summarise case data into groups
which then become new attributes within the 'case representation'. The virtual q-nodes
represent causal relationships between the attributes that have been combined to form the
'virtual q-node'. Being able to provide precise information about these relationships is
not, however, a prerequisite. At the 'top level' the qualitative model is used to represent
known causal relationships between case features that might affect a solution or outcome
(Barletta.,93).
4 CHAS represents a binary attribute, all the rest being continuous.

The Correlation Coefficient is generally used when we are concerned with relationships;
however, the independent variable (X) usually has many quantitative levels (i.e. X1, X2, ...,
Xi) and the experimenter is interested in showing that the dependent variable is some
function of the independent variables (Howell.,87).
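
As a minimal sketch of the correlational step, the Pearson coefficient between one attribute and the outcome can be computed as below. The function name `pearson_r` and the toy values are hypothetical and are not taken from the Boston dataset; the paper does not state which software was used for this step.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical toy values (illustration only, not real census data).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
value = [12.0, 17.0, 21.0, 30.0, 35.0]
r_rooms = pearson_r(rooms, value)
```

A coefficient near +1 or -1 marks an attribute as a strong candidate match field; the sign indicates the direction of the relationship.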

In defining the respective weightings of the 'match fields', regression analysis is a method we
intend to apply in order to estimate how good a predictor X1 is of Y in comparison to Xi
(where i = 2, 3, ..., n).
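
One way to compare predictors, sketched here under the assumption that each Xi is fitted separately against Y by ordinary least squares, is to rank them by the proportion of variance explained (R squared). The helper names `simple_regression` and `rank_predictors` and the toy columns are hypothetical; the paper itself does not specify the regression procedure.

```python
def simple_regression(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1.0 - ss_res / syy    # R^2 = 1 - residual SS / total SS

def rank_predictors(columns, ys):
    """Sort predictor names by the R^2 of their one-variable fit, best first."""
    scores = {name: simple_regression(xs, ys)[2] for name, xs in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy columns (illustration only).
columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.0, 1.0, 4.0, 3.0, 5.0],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ranking = rank_predictors(columns, y)
```

Here `ranking` lists X1 ahead of X2, which is the kind of ordering that could inform the relative weightings of match fields.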
The technique of Principal Component Analysis (Pearson.,1901. Hotelling.,1933) attempts to
achieve some degree of economy, in that within any given CBR model 20 or 30 original
variables might be adequately represented by a significantly smaller number of principal
components, at a given level of statistical significance.
The steps in a principal component analysis can be stated as:
[1] Make sure that the assumptions of 'a normal distribution' and 'homogeneity of variance'
can be applied to the dataset.
[2] Calculate the covariance matrix

$$\mathrm{cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

This would be a correlation matrix if the assumptions for step 1 can be met.
[3] Find the eigenvalues λ1, λ2, ..., λp and the corresponding eigenvectors a1, a2, ..., ap. The
coefficients of the ith principal component are then given by ai while λi is its variance.
[4] Discard any components that only account for a small proportion of the variation in the
data.
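
Steps [2] to [4] can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the procedure used in the project: eigenpairs are found by power iteration with deflation (a swapped-in numerical method chosen to avoid external libraries), the 5% cut-off `min_share` is an arbitrary hypothetical threshold, and the toy rows are invented. Step [1], checking the distributional assumptions, is not covered.

```python
from math import sqrt

def covariance_matrix(rows):
    """Step [2]: sample covariance matrix, using the N - 1 divisor."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]

def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    p = len(matrix)
    # Uneven start vector; assumed not to be orthogonal to the dominant eigenvector.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(v[i] * mv[i] for i in range(p)), v  # Rayleigh quotient

def principal_components(rows, min_share=0.05):
    """Steps [3] and [4]: (variance, coefficients) pairs, largest first."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))  # total variance = trace
    kept = []
    for _ in range(p):
        lam, a = dominant_eigenpair(cov)
        if total == 0 or lam / total < min_share:
            break  # step [4]: discard components explaining little variation
        kept.append((lam, a))
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - lam * a[i] * a[j] for j in range(p)] for i in range(p)]
    return kept

# Hypothetical toy observations: two closely related variables, one nearly constant.
rows = [(1.0, 2.0, 0.5), (2.0, 4.1, 0.4), (3.0, 5.9, 0.5), (4.0, 8.0, 0.6), (5.0, 10.1, 0.5)]
components = principal_components(rows)
```

With these toy rows a single component is retained, which illustrates the economy described above: three original variables are summarised by one.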
Modelling in Remind5
The correlation matrix shows simple positive and negative relationships for variables
X1, ..., X13 in relation to MEDV (Y1). This allows one to make some initial decisions about
appropriate match fields for the outcome field MEDV. Weightings can also be applied based
on the strength of the respective positive or negative relationships that have been identified.
Within qualitative models the correlation matrix results can again be used to determine
positive (+ve) and negative (-ve) effects on the virtual nodes and ultimately on the outcome
field.
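
A minimal sketch of turning correlations into weightings is given below. The five coefficients are those reported for ZN, RAD, DIS, LSTAT and CHAS in this paper; the scheme itself (absolute value, scaled so the weights sum to 1) is an assumption made for illustration and is not necessarily how weightings were set in Remind.

```python
def weights_from_correlations(correlations):
    """Scale absolute correlations so that the resulting weights sum to 1."""
    total = sum(abs(r) for r in correlations.values())
    if total == 0:
        raise ValueError("all correlations are zero")
    return {name: abs(r) / total for name, r in correlations.items()}

def direction(r):
    """Sign of the relationship, as used in a qualitative model."""
    return "+ve" if r > 0 else "-ve" if r < 0 else "none"

# Correlations with MEDV as reported in the paper.
correlations = {"ZN": 0.36, "RAD": -0.38, "DIS": 0.30, "LSTAT": -0.74, "CHAS": 0.18}
weights = weights_from_correlations(correlations)
signs = {name: direction(r) for name, r in correlations.items()}
```

Under this scheme LSTAT receives the largest weight and a negative direction, while CHAS receives the smallest weight.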
Simple models using this data have been implemented in Remind as a 'test bed'. One of the
next stages is to implement a Multiple Regression model on the above data. In this way a
more robust and reliable set of indicators and/or predictors would be available when
implementing the underlying CART algorithm, accompanied by the appropriate qualitative
models and symbol hierarchies (Breiman et al.,84).
Symbol hierarchies represent data that can be classified and ranked. Graphically, a symbol
hierarchy is a branching structure of "parents" and "children" representing generalisations
and specialisations. Through such a mechanism the system is provided with knowledge about
the data within the domain (Barletta.,93).
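
The parent and child structure can be sketched as a simple child-to-parent mapping. The property-type symbols and the helper names below are hypothetical, chosen only to illustrate generalisation and specialisation; Remind's internal representation of symbol hierarchies is not described here.

```python
# Hypothetical hierarchy: each child symbol maps to its parent (generalisation).
PARENT = {
    "detached": "house",
    "semi-detached": "house",
    "terraced": "house",
    "house": "dwelling",
    "flat": "dwelling",
    "maisonette": "flat",
}

def ancestors(symbol):
    """Chain from a symbol up to the root, with the symbol itself first."""
    chain = [symbol]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def is_a(symbol, general):
    """True if `general` is the symbol itself or one of its generalisations."""
    return general in ancestors(symbol)

def common_parent(a, b):
    """Most specific generalisation shared by both symbols, or None."""
    seen = set(ancestors(a))
    for symbol in ancestors(b):
        if symbol in seen:
            return symbol
    return None
```

For example, `common_parent("detached", "terraced")` gives `"house"`, so two cases with these values could be treated as a partial match rather than a mismatch.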

5 Remind: Solutions from prior experience: A case-based reasoning development shell. Copyright ©
1992, Cognitive Systems, Inc.

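[Table: correlation coefficients of the independent variables with MEDV; (-) marks a negative
correlation. Surviving entries, in source order: 0.70; TAX; 0.30, 0.46(-), 0.50(-); PTRATIO;
0.30(-); LSTAT, CRIM, INDUS; 0.74(-); CHAS; 0.18; 0.38(-), 0.48(-); 0.36.]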

Nearest Neighbour retrievals involve the use of the importance editor within which weight
vectors can be specified for the assigned 'match fields'. (Watson.,94)
[Formula residue: only the divisor, the sum of the weights Σ Wi, survives]
Fig 3. A Nearest Neighbour Algorithm
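
As a minimal sketch of nearest-neighbour retrieval with a weight vector over the match fields, the code below scores each stored case by a weighted mean of per-field similarities, dividing by the sum of the weights. This is a common textbook form and is an assumption here, not Remind's own implementation; the range-scaled field similarity, the function names and the three toy cases are hypothetical. The two weights reuse the magnitudes of the LSTAT and RAD correlations reported in this paper.

```python
def field_similarity(a, b, value_range):
    """Similarity in [0, 1] for one numeric field: 1 when equal, 0 at the full range."""
    if value_range == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / value_range)

def case_similarity(target, case, weights, ranges):
    """Weighted mean of the field similarities over the match fields."""
    total_weight = sum(weights.values())
    score = sum(w * field_similarity(target[f], case[f], ranges[f])
                for f, w in weights.items())
    return score / total_weight

def retrieve(target, case_base, weights, k=1):
    """Return the k stored cases most similar to the target."""
    ranges = {f: max(c[f] for c in case_base) - min(c[f] for c in case_base)
              for f in weights}
    ranked = sorted(case_base,
                    key=lambda c: case_similarity(target, c, weights, ranges),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy cases (illustration only, not rows from the dataset).
case_base = [
    {"LSTAT": 5.0, "RAD": 2, "MEDV": 34.0},
    {"LSTAT": 12.0, "RAD": 5, "MEDV": 22.0},
    {"LSTAT": 25.0, "RAD": 24, "MEDV": 12.0},
]
weights = {"LSTAT": 0.74, "RAD": 0.38}
target = {"LSTAT": 6.0, "RAD": 3}
best = retrieve(target, case_base, weights, k=1)[0]
```

The outcome field of the retrieved case (`best["MEDV"]`) would then serve as the starting valuation, to be adjusted for the remaining differences.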

Neighbourhood Aesthetic (11%)
ZN = 0.36

Accessibility (5%)
RAD = -0.38
DIS = 0.30

LSTAT = -0.74

CHAS = 0.18

These tables show the weightings and the 'virtual q-nodes'. Although this is work in
progress, accuracy appears to improve when the weightings for the respective virtual q-nodes
are increased. Retrieval and indexing techniques can be incorporated into the system at
varying levels of complexity through the use of qualitative models, virtual q-nodes and
Prototypes, all of which can be used to represent 'domain specific knowledge' and to
accommodate knowledge-guided induction within the retrieval process. This data has
previously been used in the area of 'Combining Instance-Based and Model-Based Learning'
(Quinlan.,93a).
The second dataset is a library of 34 cases from the "Inland Revenue Valuation Office
(Southern Region)". The variables in the dataset include:
[1] Location, [2] VO Code, [3] Type, [4] Format, [5] Constructed, [6] Reduced Covered
Area, [7] Central Heating, [8] Garage, [9] Car space and the outcome variable [10] Value.
This data has been used for valuation systems using 'artificial neural networks' (Evans.,92.
Tay.,92). We are using this dataset within Remind and C4.5 (Quinlan.,93b) to further assess
the effectiveness of Case-based Reasoning for a 'decision support system' within the
respective domain.

Summary
The main purpose of this project is the evaluation of a 'hybrid system' that would involve the
use of a mathematical model, namely principal component analysis (PCA), with the results
being applied to a CBR system incorporating CART, ID3, C4.5 (and their respective
derivatives).
Steps [1] and [2] of PCA have been carried out and the results have, to some extent, been
applied to a CBR tool, Remind, which incorporates CART as an underlying algorithm in its
'Inductive Retrieval Engine'.
Within the domain of residential valuation the tentative results to date indicate that this
approach could be used in the development of a 'decision support system' for applications to
determine taxation valuation, particularly the new Council Tax, or for loan security
purposes. It could be used as an additional tool in the valuation process, within which the
system could gather comparables and adjust for differences relating to specific indicators.
Such a system might also identify patterns based on similarities, interdependencies and
relationships between predetermined identifiers within the data. This might be useful for
preliminary valuation prior to inspection. It would highlight non-conforming figures for
further investigation and in some cases suggest 'a figure on which to work'. This might be
suitable for applications where bulk valuations are required.

References
[1] Agnar Aamodt and Enric Plaza., Case-Based Reasoning: Foundational Issues,
Methodological Variations, and System Approaches. AICOM Vol.7 Nr.1. 39-59. March
1994.
[2] Barletta. R, et al., ReMind: Developer's Reference Manual. Cognitive Systems Inc. 1993.
[3] Breiman. L, et al., Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] Dattani. I and Bramer. M.A., Case-Based Reasoning: Theoretical Principles, Development
tools and the prospects for applications. Artificial Intelligence Research Group, University of
Portsmouth: Technical Report, 1995.
[5] Evans. A, James. H and Collins. A., Artificial Neural Networks: an application to
Residential Valuation in the UK. Journal of Property Valuation and Investment: 11, 195-204,
Computer Briefing 1992.
[6] Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J.
Environ. Economics & Management: Vol.5, 81-102, 1978.
[7] Hotelling, H., Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology 24, 417-441, 498-520, 1933.
[8] Howell, D.C., Statistical Methods for Psychology, 2nd Edition. PWS-Kent Publishing
Company, 1987.
[9] Pearson, K., On lines and planes of closest fit to a system of points in space. Philosophical
Magazine 2, 557-72, 1901.
[10] Quinlan, R., Combining Instance-Based and Model-Based Learning. Proceedings of the
Tenth International Conference on Machine Learning. Morgan Kaufmann Pub Inc, 1993a,
236-243.
[11] Quinlan, R., C4.5: Programs for Machine Learning. Morgan Kaufmann Pub Inc, 1993b.
[12] Schank, R., Dynamic memory; a theory of reminding and learning in computers and
people. Cambridge University Press, 1982.
[13] Tay, D.P.H. and Ho, D.K.K., Artificial Intelligence and the Mass Appraisal of Residential
Apartments. Journal of Property Valuation and Investment: 10, 2, 1991/2, 525-540.
[14] Watson, I., The Case for Case-Based Reasoning in Engineering Decision Support.
Proceedings of Information Technology Awareness in Engineering: Informing Technologies
to Support Engineering Decision Making. (Edited by James A. Powell), 55-64. Institute of
Civil Engineers, London. November 1994.
[15] Watson, I and Marir, F., Case-Based Reasoning: A Review. The Knowledge Engineering
Review: Vol.9, No.4, 1994.

© 1995 The Institution of Electrical Engineers.
Printed and published by the IEE, Savoy Place, London WC2R 0BL, UK.
