
CHAPTER 1: INTRODUCTION

With the demand for water increasing day by day, proper development and management of available water resources becomes necessary. For this, the estimation of dependable yield is a prerequisite. But the lack of long flow records for most of the river basins in India makes it difficult to estimate the dependable yield of a watercourse. To overcome such difficulties, multivariate regression analyses are often employed to relate the annual runoff volume to the rainfall and measurable morphological characteristics of the basin. These derived relationships can be used for estimating the annual runoff of a basin, provided the basins used in deriving the relationships are representative of the selected basin. In addition, meteorological homogeneity is often assumed whenever data are lacking in a river basin, and analysis is carried out using data from neighbouring basins. However, the neighbouring basins need not necessarily be homogeneous with the basin under consideration. Identifying the homogeneity of basins manually demands a high degree of subjective judgment. Although an expert may integrate multivariate, nonlinear and unquantifiable factors quite well based on subjective judgment, different experts may not produce the same result. Hence there is a need for a rational procedure for grouping basins based on hydrological homogeneity. In this study, the clusterings produced by various algorithms based on data homogeneity are studied and compared with that produced using an Artificial Neural Network (ANN; Thandaveswara and Sajikumar, 2000). The river basin attributes used are independent of the basins' geographical location. The geographic distribution of the clusters thus obtained is also discussed. The clustering algorithms used are Expectation-Maximization (EM), Farthest First, Hierarchical Clusterer, Simple K-Means and X-Means.


CHAPTER 2: LITERATURE REVIEW

2.1. RIVER BASIN

A river basin (also called a drainage basin) is the portion of land drained by a river and its tributaries. It encompasses the entire land surface dissected and drained by the many streams and creeks that flow downhill into one another, and eventually into one river. The final destination is an estuary or an ocean. It sends all the water falling on the surrounding land into a central river and out to the sea. Everyone lives in a river basin. Even if we don't live near the water, we live on land that drains to a river, estuary or lake, and our actions on that land affect water quality and quantity far downstream.

2.1.1. River basin and watershed

Both river basins and watersheds are areas of land that drain to a particular water body, such as a lake, stream, river or estuary. In a river basin, all the water drains to a large river. The term watershed is used to describe a smaller area of land that drains to a smaller stream, lake or wetland. There are many smaller watersheds within a river basin. A river basin or a watershed is also referred to as a catchment basin.

2.1.2. Importance of river basins

River basins are quite important in many ways and in many fields, as follows:


Hydrology

In hydrology, the river basin is a logical unit of focus for studying the movement of water within the hydrological cycle, because the majority of water that discharges from the basin outlet originated as precipitation falling on the basin. A portion of the water that enters the groundwater system beneath the drainage basin may flow towards the outlet of another drainage basin, because groundwater flow directions do not always match those of the overlying drainage network. Measurement of the discharge of water from a basin may be made by a stream gauge located at the basin's outlet.

Geomorphology

River basins are the principal hydrologic unit considered in fluvial geomorphology, that is, the study of river-related landforms. A river basin is the source for the water and sediment that move through the river system and reshape the channel.

Ecology

River basins are also important elements to consider in ecology. As water flows over the ground and along rivers, it can pick up nutrients, sediment and pollutants. Like the water, they get transported towards the outlet of the basin and can affect the ecological processes along the way, as well as in the receiving water body. Modern use of artificial fertilizers, containing nitrogen, phosphorus and potassium, has affected the mouths of watersheds. The minerals are carried by the watershed to the mouth and accumulate there, disturbing the natural mineral balance. This can cause eutrophication, where plant growth is accelerated by the additional material.

Catchment factors

The catchment is the most significant factor determining the amount or likelihood of flooding. Catchment factors are topography, shape, size, soil type and land use.


Catchment topography and shape determine the time taken for rain to reach the river, while catchment size, soil type and development determine the amount of water reaching the river. Topography determines the speed with which runoff reaches the river: clearly, rain that falls in steep mountainous areas will reach the river faster than rain on flat or gently sloping areas. Shape contributes to the speed with which runoff reaches the river: a long, thin catchment will take longer to drain than a circular catchment. Size helps determine the amount of water reaching the river: the larger the catchment, the greater the potential for flooding. Soil type helps determine how much water reaches the river. Certain soil types, such as sandy soils, are very free-draining, and rainfall on sandy soil is likely to be absorbed by the ground. However, soils containing clay can be almost impermeable, and therefore rainfall on clay soils will run off and contribute to flood volumes. After prolonged rainfall, even free-draining soils can become saturated, meaning that any further rainfall will reach the river rather than being absorbed by the ground. Land use can contribute to the volume of water reaching the river, in a similar way to clay soils. For example, rainfall on roofs, pavements and roads will be collected by rivers with almost no absorption into the groundwater.

2.1.3. Need to classify river basins

River basin classification has an important role in fluvial management and conservation. It serves a wide range of purposes, including scientific research, river management, and river restoration and conservation. Estimation of annual runoff is needed for the planning of water resources projects. There is a lack of long-period runoff records in a large number of catchments in India. In the absence of detailed data, regional relationships are used for hydrologic planning. The mean annual rainfall, mean annual temperature and vegetal (forest) cover factor are the most important variables influencing the mean annual runoff. The forest cover is positively correlated with mean annual runoff from large catchments. A relationship for the coefficient of variation of annual runoff has also been proposed.


Making use of the fact that annual runoffs follow the normal frequency distribution, one can generate an annual runoff series for the required purpose. But for all this, we first need to identify different classes of river basins according to data homogeneity. The results of the present study can be used for water resources planning.

2.2. CLASSIFICATION AND CLUSTERING

Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value. Other examples are regression, which assigns a real-valued output to each input; sequence labelling, which assigns a class to each member of a sequence of values (for example, part-of-speech tagging, which assigns a part of speech to each word in an input sentence); parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence; etc. The piece of input data is formally termed an instance, and the categories are termed classes. The instance is formally described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical, i.e. consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"; ordinal, consisting of one of a set of ordered items, e.g. "large", "medium" or "small"; integer-valued, e.g. a count of the number of occurrences of a particular word in an email; or real-valued, e.g. a measurement of blood pressure. Often, categorical and ordinal data are grouped together; likewise for integer-valued and real-valued data. Furthermore, many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10). Classification normally refers to a supervised procedure, i.e. a procedure that learns to classify new instances based on learning from a training set of instances that have been properly labelled by hand with the correct classes. The corresponding unsupervised procedure is known as clustering, and involves grouping data into classes based on some measure of inherent similarity.
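To make the discretization just mentioned concrete, the following short Python sketch maps a real-valued feature onto the three groups used in the example above (the bin edges 5 and 10 are taken directly from that example):

def discretize(value, low=5, high=10):
    # Map a numeric value onto the three categories used in the text:
    # "less than 5", "between 5 and 10", "greater than 10".
    if value < low:
        return "less than 5"
    elif value <= high:
        return "between 5 and 10"
    else:
        return "greater than 10"

# A real-valued feature becomes a categorical one:
print([discretize(v) for v in [2.5, 7, 13]])
# ['less than 5', 'between 5 and 10', 'greater than 10']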


2.2.1. Classification (supervised learning)

Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regression function (if the output is continuous). The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. The parallel task in human and animal psychology is often referred to as concept learning.

2.2.2. Clustering (unsupervised learning)

In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning. Clustering allows a user to make groups of data and determine patterns in the data. Clustering has its advantages when the data set is defined and a general pattern needs to be determined from the data. We can create a specific number of groups, depending on our needs. One defining benefit of clustering over classification is that every attribute in the data set is used to analyze the data. A major disadvantage of clustering is that the user is required to know ahead of time how many groups he wants to create. For a user without any real knowledge of his data, this might be difficult. It might take several steps of trial and error to determine the ideal number of groups to create. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.


Many methods employed in unsupervised learning are based on data mining methods used to preprocess data.

2.3. DATA

The hydrologic data compiled for this study include annual rainfall, annual temperature, and catchment characteristics represented by average slope, drainage density and land uses. These data were compiled for 55 catchments spread over a large part of India, as shown in Fig. 2.3. Data for these 55 catchments were available from the journal paper by U. C. Kothyari and R. J. Garde, "Annual Runoff Estimation for Catchments in India". Of these 55 catchments, the data for 26 were already available from earlier studies (Kothyari et al. 1985; Garde and Kothyari 1987; Garde and Kothyari 1990). For the remaining catchments the data were determined as described herein.

Figure 2.3 Geographical locations of the different catchments


The annual runoff was obtained by summing the daily discharges for the year. Since all the runoff data used in this study have already been analyzed by various investigators and published elsewhere, the need to re-examine the data for likely errors was not felt. Keeping in view the fact that the density of raingauge stations on the catchments varied widely, it was thought adequate to take an arithmetic average of individual station records to obtain the annual rainfall over a catchment. Annual temperatures were obtained by taking the arithmetic mean of the annual maximum and annual minimum temperatures in the catchment. Arithmetic means of the annual series were taken to get the mean values. The length of the data sets varied between 15 and 25 years. Drainage density was obtained from maps given in the Indian Government Irrigation Atlas of India (1972) at a scale of 1:1,000,000. Catchment length was also determined from these maps. The land slope was obtained from maps given in the Indian Government National Atlas of India (1979) at a scale of 1:6,000,000. The vegetal cover factor for a basin is computed as the weighted average of the land uses. The weighting factors were determined by selecting the set of factors that gives maximum correlation between the average vegetal cover and the mean annual runoff. Four types of land use [forest (FF), grass and scrub (FG), arable (FA) and waste land (FW)] were obtained for each catchment from the maps given in the Indian Government Agricultural Atlas of India (1980) at a scale of 1:1,000,000. It is realized that land use may have changed during the intervening period. However, in the absence of any detailed information, the information available in the Indian Government Agricultural Atlas of India has been used as the average. Of these data, only those variables that are not directly dependent on the extent of the basin or on its geographical location were used, i.e. mean annual rainfall, mean annual temperature, coefficient of variation of rainfall, vegetal cover factor, drainage density and land slope.
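The derivations described above can be summarized in a short Python sketch. This is illustrative only: the land-use weights shown are hypothetical placeholders, whereas the study selected the weight set giving maximum correlation between the average vegetal cover and the mean annual runoff.

def annual_runoff(daily_discharges):
    # Annual runoff obtained by summing the daily discharges for the year
    return sum(daily_discharges)

def annual_rainfall(station_totals):
    # Arithmetic average of the individual raingauge station records
    return sum(station_totals) / len(station_totals)

def annual_temperature(t_max, t_min):
    # Mean of the annual maximum and annual minimum temperatures
    return (t_max + t_min) / 2.0

def vegetal_cover_factor(ff, fg, fa, fw, weights=(1.0, 0.5, 0.3, 0.1)):
    # Weighted average of the four land-use fractions: forest (FF),
    # grass and scrub (FG), arable (FA) and waste land (FW).
    # The weights here are made-up illustrations, not the calibrated ones.
    num = ff * weights[0] + fg * weights[1] + fa * weights[2] + fw * weights[3]
    return num / sum(weights)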


Variable (1)                                  Notation (2)   Range (3)
Mean annual rainfall                          Pm             59.1 cm - 455.6 cm
Mean annual temperature                       Tm             17.5 °C - 31 °C
Coefficient of variation of annual rainfall   CP             0.12 - 0.40
Vegetal cover factor                          Fv             4.0 - 70.8
Drainage density                              Dd             0.03 - 0.27
Land slope                                    S              0.004 - 0.55

Table 2.3(a) Ranges of Data

Table 2.3(b) Data: values of mean annual rainfall Pm (cm), drainage density Dd, land slope S, mean annual temperature Tm (°C), vegetal cover factor Fv and coefficient of variation of annual rainfall CP for each of the 55 catchments, from Manair above dam site (catchment 1) to Sabarmati catchment (catchment 55); the full catchment list appears in Table 3.1(a), and the ranges of these variables are given in Table 2.3(a).

2.4. SOFTWARE (WEKA)

Waikato Environment for Knowledge Analysis (WEKA) is machine learning/data mining software written in Java (distributed under the GNU General Public License) and used for research, education and applications. The main features of this software are a comprehensive set of data pre-processing tools, learning algorithms and evaluation methods, and graphical user interfaces (including data visualization) with an environment for comparing learning algorithms. It was developed by Waikato University, New Zealand. Data can be imported into WEKA from a file in various formats, i.e. ARFF, CSV, C4.5 or binary. Data can also be read from a URL or from an SQL database (using JDBC). Pre-processing tools in WEKA are called filters. WEKA contains filters for discretization, normalization, resampling, attribute selection, and transforming and combining attributes. Cobweb, DBScan, EM, FarthestFirst, FilteredClusterer, HierarchicalClusterer, MakeDensityBasedClusterer, OPTICS, sIB, SimpleKMeans and XMeans are the clustering algorithms provided by WEKA.

2.4.1. Data retrieval and preparation

There are three ways of loading data into the Explorer: loading from a file, a database connection, and finally getting a file from a web server. We will be loading the data from a locally stored file. WEKA supports four different file formats, namely CSV, C4.5, flat binary files and the native ARFF format. In general, before applying filters to a dataset, one must first carefully observe the data and use tools to help in visualizing it. WEKA allows a quick way to visualize all the attributes that are given. Filters can be used to prepare the data for processing. The different attributes and data types in the data provided for the present study are represented as follows:

@attribute Rm(cm) numeric
@attribute Pm(cm) numeric
@attribute A(km) numeric
@attribute Dd(km) numeric
@attribute S numeric
@attribute L(km) numeric
@attribute Tm(C) numeric
@attribute Fu numeric
@attribute CR numeric
@attribute CP numeric

@data
25.09,81.84,2140,0.35,0.259,84,29.2,4,0.45,0.21
38.57,131.56,4582.4,0.35,0.152,128,23.6,50.5,0.35,0.19
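Within WEKA itself this file is simply loaded through the Explorer, but for completeness the ARFF layout above can also be read outside WEKA. The following is a minimal Python sketch, assuming the file has been saved under the hypothetical name basins.arff:

def load_arff(path):
    # Minimal reader for the numeric-only ARFF layout shown above
    attributes, rows = [], []
    with open(path) as f:
        in_data = False
        for line in f:
            line = line.strip()
            if not line or line.startswith('%'):
                continue                    # skip blanks and comments
            low = line.lower()
            if low.startswith('@attribute'):
                attributes.append(line.split()[1])
            elif low.startswith('@data'):
                in_data = True
            elif in_data:
                rows.append([float(v) for v in line.split(',')])
    return attributes, rows

attributes, rows = load_arff("basins.arff")
print(attributes)   # ['Rm(cm)', 'Pm(cm)', ...]
print(rows[0])      # [25.09, 81.84, 2140.0, 0.35, 0.259, 84.0, 29.2, 4.0, 0.45, 0.21]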

2.4.2. Clustering

For the average user, clustering can be the most useful data mining method available. It can quickly take an entire set of data and turn it into groups, from which one can quickly draw some conclusions. The math behind the method is somewhat complex and involved, which is why we take full advantage of WEKA. What follows is a quick, non-detailed overview of the math and algorithm used in the clustering method:

1. Every attribute in the data set should be normalized, whereby each value is divided by the difference between the high value and the low value in the data set for that attribute. For example, if the attribute is age, the highest value is 72 and the lowest value is 16, then an age of 32 would be normalized to 32 / (72 - 16) = 0.5714.

2. Given the number of desired clusters, randomly select that number of samples from the data set to serve as the initial test cluster centers. For example, for three clusters, randomly select three rows of data from the data set.

3. Compute the distance from each data sample to each cluster center (the randomly selected data rows), using the least-squares method of distance calculation.

4. Assign each data row to a cluster, based on the minimum distance to each cluster center.

5. Compute the centroid, which is the average of each column of data, using only the members of each cluster.

6. Calculate the distance from each data sample to the centroids just created. If the clusters and cluster members don't change, the clustering is complete and the clusters are created. If they change, start over by going back to step 3, and continue again and again until the clusters don't change.

Obviously, doing this by hand doesn't look like fun at all. With a data set of 10 rows and three clusters, it could take 30 minutes to work out using a spreadsheet. Imagine how long it would take by hand with 100,000 rows of data and 10 clusters. Luckily, a computer can do this kind of computation in a few seconds.
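The six steps above translate almost line for line into code. The following is a minimal Python sketch of the procedure (illustrative only; WEKA's SimpleKMeans is the implementation actually used in this study). Note that the normalization follows the text's convention of dividing by the attribute range, rather than the more common (value - min)/(max - min) form, and that the default seed of 10 mirrors the seed used in section 3.1:

import math
import random

def normalize(data):
    # Step 1: divide each value by (max - min) of its attribute, exactly as in
    # the text's example: age 32 with range 16..72 -> 32 / (72 - 16) = 0.5714.
    cols = list(zip(*data))
    spans = [max(c) - min(c) or 1.0 for c in cols]
    return [[v / s for v, s in zip(row, spans)] for row in data]

def distance(a, b):
    # Step 3: least-squares (Euclidean) distance between a row and a centre
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(data, k, seed=10):
    random.seed(seed)
    data = normalize(data)
    centres = random.sample(data, k)        # step 2: random initial centres
    while True:
        # Step 4: assign each row to the nearest centre
        labels = [min(range(k), key=lambda j: distance(row, centres[j]))
                  for row in data]
        # Step 5: recompute each centroid as the column-wise average
        new = []
        for j in range(k):
            members = [row for row, l in zip(data, labels) if l == j]
            new.append([sum(c) / len(c) for c in zip(*members)] if members
                       else centres[j])
        # Step 6: stop once the clusters no longer change
        if new == centres:
            return labels, centres
        centres = new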

2.5. ALGORITHMS USED

The clustering algorithms used for the present study are:

Expectation-Maximization
FarthestFirst
HierarchicalClusterer
SimpleKMeans
XMeans

2.5.1. Expectation-Maximization (EM)

The expectation-maximization (EM) algorithm is a method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. EM is an iterative method which alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the latent variables, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

Description


Given a statistical model consisting of a set $\mathbf{X}$ of observed data, a set $\mathbf{Z}$ of unobserved latent data or missing values, and a vector of unknown parameters $\boldsymbol\theta$, along with a likelihood function $L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data:

$L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X} \mid \boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)$

However, this quantity is often intractable. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

Expectation step (E-step): Calculate the expected value of the log-likelihood function, with respect to the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ under the current estimate of the parameters $\boldsymbol\theta^{(t)}$:

$Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) = \mathrm{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}}\left[\log L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z})\right]$

Maximization step (M-step): Find the parameter that maximizes this quantity:

$\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\arg\max}\; Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$
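To make the two steps concrete, the following is a minimal Python sketch of soft EM for a one-dimensional mixture of two Gaussians (an illustration only; WEKA's EM clusterer handles the general multivariate case). The E-step computes the posterior responsibility of component 1 for each point, and the M-step re-estimates the weight, means and variances from these soft counts:

import math

def gaussian(x, mu, var):
    # Density of a normal distribution with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=100):
    # Crude initialization of the mixing weight, means and variances
    w, mu1, mu2, v1, v2 = 0.5, min(xs), max(xs), 1.0, 1.0
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 1
        r = [w * gaussian(x, mu1, v1) /
             (w * gaussian(x, mu1, v1) + (1 - w) * gaussian(x, mu2, v2))
             for x in xs]
        # M-step: re-estimate the parameters from the soft counts
        n1 = sum(r)
        n2 = len(xs) - n1
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        v1 = sum(ri * (x - mu1) ** 2 for ri, x in zip(r, xs)) / n1 or 1e-6
        v2 = sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, xs)) / n2 or 1e-6
        w = n1 / len(xs)
    return w, (mu1, v1), (mu2, v2)

print(em_two_gaussians([1.0, 1.2, 0.9, 5.0, 5.3, 4.8]))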

Note that in typical models to which EM is applied:

1. The observed data points $\mathbf{X}$ may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.

2. The missing values (aka latent variables) $\mathbf{Z}$ are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.

3. The parameters are continuous, and are of two kinds: parameters that are associated with all data points, and parameters associated with a particular value of a latent variable (i.e. associated with all data points whose corresponding latent variable has that value).

However, it is possible to apply EM to other sorts of models. The motivation is as follows. If we know the value of the parameters $\boldsymbol\theta$, we can usually find the value of the latent variables $\mathbf{Z}$ by maximizing the log-likelihood over all possible values of $\mathbf{Z}$, either simply by iterating over $\mathbf{Z}$ or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables $\mathbf{Z}$, we can find an estimate of the parameters $\boldsymbol\theta$ fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both $\boldsymbol\theta$ and $\mathbf{Z}$ are unknown:

1. First, initialize the parameters $\boldsymbol\theta$ to some random values.
2. Compute the best value for $\mathbf{Z}$ given these parameter values.
3. Then, use the just-computed values of $\mathbf{Z}$ to compute a better estimate for the parameters $\boldsymbol\theta$. Parameters associated with a particular value of $\mathbf{Z}$ will use only those data points whose associated latent variable has that value.
4. Finally, iterate until convergence.

The algorithm as just described will in fact work, and is commonly called hard EM. The K-means algorithm is an example of this class of algorithms. However, we can do somewhat better: rather than making a hard choice for $\mathbf{Z}$ given the current parameter values and averaging only over the set of data points associated with a particular value of $\mathbf{Z}$, we can instead determine the probability of each possible value of $\mathbf{Z}$ for each data point, and then use these probabilities to compute a weighted average over the entire set of data points. The resulting algorithm is commonly called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as K-means). The probabilities computed for $\mathbf{Z}$ are posterior probabilities and are what is computed in the E-step. The soft counts used to compute new parameter values are what is computed in the M-step.

Applications

EM is frequently used for data clustering in machine learning and computer vision. In natural language processing, two prominent instances of the algorithm are the Baum-Welch algorithm (also known as forward-backward) and the inside-outside algorithm for unsupervised induction of probabilistic context-free grammars. In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models. With the ability to deal with missing data and observe unidentified variables, EM is becoming a useful tool to price and manage the risk of a portfolio. The EM algorithm (and its faster variant, ordered subset expectation maximization) is also widely used in medical image reconstruction, especially in positron emission tomography and single photon emission computed tomography.

2.5.2. Hierarchical clustering

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms are usually either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.


Working (agglomerative)

Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.

3. Compute the distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
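A minimal Python sketch of this agglomerative procedure, using single linkage (the distance between two clusters taken as the smallest distance between any pair of their members; the linkage choice is an assumption, as the text does not fix one):

def agglomerate(dist, n, target=1):
    # dist: dict mapping frozenset({i, j}) -> distance between items i and j
    clusters = [{i} for i in range(n)]        # step 1: one item per cluster
    def linkage(a, b):
        return min(dist[frozenset({i, j})] for i in a for j in b)
    while len(clusters) > target:
        # Step 2: find and merge the closest pair of clusters
        a, b = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda p: linkage(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)                # steps 3-4: repeat until done
    return clusters

d = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 4.0, frozenset({1, 2}): 3.0}
print(agglomerate(d, 3, target=2))            # [{2}, {0, 1}]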

2.5.3. Simple k-means

In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data, as well as in the iterative refinement approach employed by both algorithms.

Description

Given a set of observations $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets ($k \le n$) $\mathbf{S} = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS):

$\underset{\mathbf{S}}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \left\lVert \mathbf{x}_j - \boldsymbol\mu_i \right\rVert^2$

where $\boldsymbol\mu_i$ is the mean of the points in $S_i$.

Demonstration of the standard algorithm

Figure 2.5.3(a) Initial centroids

1) k initial "means" (in this case k = 3) are randomly selected from the data set.

Figure 2.5.3(b) Associated clusters

2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

Figure 2.5.3(c) New centroids

3) The centroid of each of the k clusters becomes the new mean.


Figure 2.5.3(d) New clusters

4) Steps 2 and 3 are repeated until convergence is reached.

As k-means is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: in particular, it has been shown that there exist certain point sets, even in two dimensions, on which k-means takes exponential time, that is $2^{\Omega(n)}$, to converge. These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial. The "assignment" step is also referred to as the expectation step, and the "update" step as the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.

2.5.4. X-means

Despite its popularity for general clustering, k-means suffers from three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. X-means provides k-means extended by an Improve-Structure part and automatically determines the number of clusters. X-means gives solutions for the first two problems, and a partial remedy for the third. Building on prior work on algorithmic acceleration that is not based on approximation, it introduces a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new, very efficient test that in one k-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. A Bayesian Information Criterion (BIC) test is used to determine whether the structure of an isolated set is better described by the original single centroid or by the two newly generated centroids. It can be seen, therefore, that the number of clusters can rapidly converge to the true number of clusters, since at most 2k clusters can be generated per iteration of the algorithm.
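The BIC split test described above can be sketched as follows, under a simplified identical-spherical-Gaussian assumption (a simplification of the published criterion). The two_means argument stands for any routine that runs k-means with k = 2 on the points of one cluster, for example the k-means sketch in section 2.4.2:

import math

def bic(points, centres, labels):
    # Simplified BIC for a k-means model under identical spherical Gaussians:
    # log-likelihood minus (free parameters / 2) * log(n).
    n, d = len(points), len(points[0])
    k = len(centres)
    rss = sum(sum((x - c) ** 2 for x, c in zip(points[i], centres[labels[i]]))
              for i in range(n))
    var = max(rss / max(n - k, 1), 1e-12)      # pooled variance estimate
    loglik = -0.5 * n * d * (math.log(2 * math.pi * var) + 1)
    params = k * d + 1                         # k centroids plus a shared variance
    return loglik - 0.5 * params * math.log(n)

def should_split(points, centre, two_means):
    # Keep a 2-means split of this cluster only if it improves BIC over
    # keeping the original single centroid.
    parent_bic = bic(points, [centre], [0] * len(points))
    labels, centres = two_means(points)
    return bic(points, centres, labels) > parent_bic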

2.5.5. Farthest first

Farthest first is a variant of k-means that places each cluster centre in turn at the point farthest from the existing cluster centres. This point must lie within the data area. This greatly speeds up the clustering in most cases, since less reassignment and adjustment is needed.
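A minimal Python sketch of farthest-first centre selection (the first centre is taken here as an arbitrary data point; implementations may instead pick it at random):

def farthest_first_centres(data, k, dist):
    # Each new centre is the data point farthest from its nearest
    # already-chosen centre.
    centres = [data[0]]                        # first centre: an arbitrary point
    while len(centres) < k:
        centres.append(max(data,
                           key=lambda p: min(dist(p, c) for c in centres)))
    return centres

# Example with one-dimensional points and absolute distance:
print(farthest_first_centres([1, 2, 8, 9, 20], 3, lambda a, b: abs(a - b)))
# [1, 20, 9]: successive centres spread out across the data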


CHAPTER 3: RESULT AND DISCUSSION

3.1. RESULT

Using each of the above algorithms, the 55 river basins were grouped into four clusters. For those algorithms which require a seed value, 10 was provided as the seed, as obtained from an optimization procedure: clustering was done using different seed values, and the seed value corresponding to the best clustering was determined using a training and testing set in a Support Vector Machine tool. The resulting cluster assignments (cluster numbers 0-3) for each algorithm are as follows:

No.  River catchment                      EM  Farthest  Hierarchical  Simple   X-Means
                                              First                   K-Means
1    Manair above dam site                2   2         0             1        0
2    Bhawani above Kodivery               2   0         0             1        2
3    Pennar above Sangam anicut           0   0         0             0        3
4    Purna above Ghodesgoan               0   0         0             0        3
5    Vaghur above Jalaon                  2   2         0             1        0
6    Bori above Amalner                   2   2         0             1        0
7    Aner above Velode                    2   2         0             1        0
8    Hiran above Pondi                    2   2         0             1        0
9    Narmada above Jamtara                3   0         0             2        2
10   Watrak above Ratanpur                2   2         0             1        0
11   Tel above Sonepur                    3   0         0             2        3
12   Mahanadi above Saradih               3   0         0             2        2
13   Ib above Deogaon                     3   0         0             2        2
14   Maganadi above Sambalpur             3   0         0             2        2
15   Mahanadi above Naraj                 3   0         0             2        2
16   Tapi and Kathore                     0   0         0             2        2
17   Girna above Jamda                    0   0         0             2        2
18   Sabarmati above Dharoi               0   0         0             0        3
19   Sabarmati above Ahmedabad            0   0         0             0        3
20   Hathmati above Himatnagar            0   0         0             0        3
21   Koyna above dam site                 1   1         1             3        1
22   Damodar above Rohandia               3   0         0             2        2
23   Ghatprabha above Hadlga              1   1         1             3        1
24   Kansawati above Midnapur             3   0         0             2        2
25   Machkund above Jalput                3   0         0             2        2
26   Betwa above Dhukwan                  3   0         0             2        2
27   Manjira above Nizamsagar             0   0         0             2        2
28   Yamuna above Tajewala                2   3         2             3        1
29   Mayurakshi above dam site            3   0         0             0        3
30   Subranarekha catchment               3   0         0             2        2
31   Burhabalang up to Kuliana dam site   3   0         0             2        2
32   Brahmani catchment                   3   0         0             2        2
33   Baitarni catchment                   3   0         0             2        2
34   Bahuda catchment                     2   1         3             1        1
35   Nagavali catchment                   3   0         0             2        2
36   Rushikulya catchment                 3   0         0             2        2
37   Vamasadhara catchment                3   0         0             2        2
38   Godavari above Dowleiswaram          0   0         0             0        3
39   Krishna catchment                    0   0         0             0        3
40   Pennar above Nellore                 0   0         0             0        3
41   Chambal catchment                    0   0         0             0        3
42   Yamuna catchment                     0   0         0             2        2
43   Ramganga catchment                   3   0         0             2        2
44   Son catchment                        3   0         0             2        2
45   Gantak catchment                     3   0         0             2        2
46   Bagmati catchment                    3   0         0             2        2
47   Brahputra catchment                  1   1         1             3        1
48   Barak above Laphirpur                1   1         1             3        1
49   Barak above Badarpurghat             1   1         1             3        1
50   Sonai catchment                      2   1         0             3        1
51   Kathakal catchment                   2   3         0             3        1
52   Tapi and Kathore catchment           0   0         0             0        3
53   Narmada catchment                    3   0         0             0        3
54   Mahi catchment                       0   0         0             0        3
55   Sabarmati catchment                  0   0         0             0        3

Table 3.1(a) Clustering obtained

The number and percentage of river basins falling in each cluster under the different clustering techniques is as follows:

Cluster   EM         Farthest First   Hierarchical   Simple K-Means   X-Means
0         16 (29%)   40 (73%)         48 (87%)       14 (25%)         6 (11%)
1         5 (9%)     7 (13%)          5 (9%)         8 (15%)          9 (16%)
2         11 (20%)   6 (11%)          1 (2%)         25 (45%)         25 (45%)
3         23 (42%)   2 (4%)           1 (2%)         8 (15%)          15 (27%)

Table 3.1(b) Clustering pattern

3.2. DISCUSSION

The clusters obtained using the five algorithms were plotted on the map. They were compared with the geographical representation of the clusters obtained using the Artificial Neural Network (ANN), and the best algorithm for clustering was determined. In the ANN clustering, the data used were independent of the geographic extent of the river basins, but on plotting the clusters on the map, a geographic grouping of the various clusters was obtained. To check the validity of the clusters obtained using the various algorithms, the river basins were mapped cluster-wise and an attempt was made to group them geographically.


3.2.1. ARTIFICIAL NEURAL NETWORK (ANN)

The geographic mapping of the river basins clustered using ANN is shown below:

Figure 3.2.1 Geographic representation of ANN clustering


3.2.2. EXPECTATION-MAXIMIZATION

Geographic representation of clustering using expectation-maximization:

Figure 3.2.2 Geographic representation of EM clustering

Clusters formed using the expectation-maximization algorithm can be geographically grouped to some extent. Hence it is a good clustering algorithm.

3.2.3. HIERARCHICAL

Geographic representation of clustering using hierarchical clustering:

Figure 3.2.3 Geographic representation of hierarchical clustering


Hierarchical clustering is not suitable for the given data, as all but a few of the river basins were grouped into a single cluster.

3.2.4. SIMPLE K-MEANS

Geographic representation of clustering using simple k-means:

Figure 3.2.4 Geographic representation of k-means clustering


For simple k-means too, geographic grouping of the clusters is possible to some extent.

3.2.5. X-MEANS

Geographic representation of clustering using X-means:


Figure 3.2.5 Geographic representation of X-means clustering

X-means provides a geographic grouping similar to that of simple k-means.

3.2.6. FARTHEST FIRST

Geographic representation of clustering using farthest first:


Figure 3.2.6 Geographic representation of farthest first clustering

The farthest first algorithm is also not very suitable, but it is better than hierarchical clustering.

CHAPTER 4: CONCLUSION

Data for 55 river basins in India were taken. These river basins were grouped into four clusters using five different algorithms, viz. Expectation-Maximization, Farthest First, Hierarchical Clusterer, Simple K-Means and X-Means. The reliability of the clusters formed was checked by plotting their geographic representation and comparing it with the existing geographic representation of the clusters obtained by an Artificial Neural Network (ANN). The expectation-maximization algorithm gave the best result. The clustering done by simple k-means and X-means gave satisfactory results, whereas the farthest first and hierarchical clustering algorithms were found to be unsuitable for this data.


CHAPTER 5: REFERENCES

1) D. Han, L. Chan and N. Zhu, "Flood forecasting using support vector machines", Journal of Hydroinformatics.
2) B.S. Thandaveswara and N. Sajikumar, "Classification of river basins using Artificial Neural Network", Journal of Hydrologic Engineering, July 2000.
3) Varun Kumar and Nisha Rathee, "Knowledge discovery from database using an integration of clustering and classification", International Journal of Advanced Computer Science and Applications.
4) www.cs.waikato.ac.nz/ml/weka
5) Raza Ali, Usman Ghani and Aasim Saeed, "Data clustering and its applications", http://members.tripod.com/asim_saeed/paper.htm
6) I.K. Ravichandra Rao, "Data Mining and Clustering Techniques", presented at the DRTC Workshop on Semantic Web, Bangalore.
7) Leo Wanner, "Introduction to Clustering Techniques".
8) Wikipedia, "Cluster analysis".
