
July 2011 - Master of Computer Application (MCA), Semester 6 - MC0088 Data Mining (4 Credits)

(Book ID: B1009)

Assignment Set 1 (60 Marks)


Answer all questions. Each question carries fifteen marks.

1. Describe the following with respect to Cluster Analysis: A. Cluster Analysis B. Clustering Methods C. Clustering and Segmentation Software

A. Cluster Analysis
Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery that involves trial and error. It will often be necessary to modify preprocessing and parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek botrys, "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining the resulting groups are the matter of interest, in automatic classification it is primarily their discriminative power that is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

B. Clustering Methods
The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. Such grouping is pervasive in the way humans process information, and one of the motivations for using clustering algorithms is to provide automated tools to help in
constructing categories or taxonomies [Jardine and Sibson, 1971, Sneath and Sokal, 1973]. The methods may also be used to minimize the effects of human factors in the process. Clustering methods [Anderberg, 1973, Hartigan, 1975, Jain and Dubes, 1988, Jardine and Sibson, 1971, Sneath and Sokal, 1973, Tryon and Bailey, 1973] can be divided into two basic types: hierarchical and partitional clustering. Within each of the types there exists a wealth of subtypes and different algorithms for finding the clusters.

Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The clustering methods differ in the rule by which it is decided which two small clusters are merged or which large cluster is split. The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

Partitional clustering, on the other hand, attempts to directly decompose the data set into a set of disjoint clusters. The criterion function that the clustering algorithm tries to minimize may emphasize the local structure of the data, for example by assigning clusters to peaks in the probability density function, or the global structure. Typically the global criteria involve minimizing some measure of dissimilarity between the samples within each cluster, while maximizing the dissimilarity of different clusters.

A commonly used partitional clustering method, K-means clustering [MacQueen, 1967], will be discussed in some detail since it is closely related to the SOM algorithm. In K-means clustering the criterion function is the average squared distance of the data items x_k from their nearest cluster centroids,

E = \frac{1}{N} \sum_{k=1}^{N} \left\| x_k - m_{c(k)} \right\|^2 ,   (1)

where c(k) is the index of the centroid that is closest to x_k. One possible algorithm for minimizing the cost function E begins by initializing a set of K cluster centroids denoted by m_i, i = 1, ..., K. The positions of the m_i are then adjusted iteratively by first assigning the data samples to the nearest clusters and then recomputing the centroids. The iteration is stopped when E does not change markedly any more. In an alternative algorithm each randomly chosen sample is considered in succession, and the nearest centroid is updated.

Equation 1 is also used to describe the objective of a related method, vector quantization [Gersho, 1979, Gray, 1984, Makhoul et al., 1985]. In vector quantization the goal is to minimize the average (squared) quantization error, the distance between a sample x and its representation m_{c(x)}. The algorithm for minimizing Equation 1 that was described above is actually a straightforward generalization of the algorithm proposed by Lloyd (1957) for minimizing the average quantization error in a one-dimensional setting.

A problem with the clustering methods is that the interpretation of the clusters may be difficult. Most clustering algorithms prefer certain cluster shapes, and the algorithms will always assign the data to clusters of such shapes even if there were no clusters in the data. Therefore, if the goal is not just to compress the data set but also to make inferences about its cluster structure, it is essential to analyze whether the data set exhibits a clustering tendency. The results of the cluster analysis need to be validated as well. Jain and Dubes (1988) present methods for both purposes.

Another potential problem is that the choice of the number of clusters may be critical: quite different kinds of clusters may emerge when K is changed. Good initialization of the cluster centroids may also be crucial; some clusters may even be left empty if their centroids lie initially far from the distribution of data.


Clustering can be used to reduce the amount of data and to induce a categorization. In exploratory data analysis, however, the categories have only limited value as such. The clusters should be illustrated somehow to aid in understanding what they are like. For example, in the case of the K-means algorithm the centroids that represent the clusters are still high-dimensional, and some additional illustration methods are needed for visualizing them.
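
To make the assignment/update iteration concrete, here is a minimal Python sketch of the K-means procedure described above. The synthetic data, the choice K = 2, the random initialization, and the convergence tolerance are all illustrative assumptions, not part of the original text.

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    """Plain K-means: assign samples to nearest centroids, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    prev_cost = np.inf
    for _ in range(max_iter):
        # Assignment step: distance from every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples.
        for j in range(k):
            if np.any(labels == j):            # leave empty clusters unchanged
                centroids[j] = X[labels == j].mean(axis=0)
        # Cost E: average squared distance to the nearest centroid (Equation 1).
        cost = np.mean(dists[np.arange(len(X)), labels] ** 2)
        if abs(prev_cost - cost) < tol:        # stop when E no longer changes markedly
            break
        prev_cost = cost
    return labels, centroids, cost

# Usage on two synthetic, well-separated blobs with K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids, cost = kmeans(X, k=2)
print(centroids, cost)
```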

C. Clustering and Segmentation Software


Segmentation is the process that groups similar objects together and forms clusters; thus it is often referred to as clustering. Clustered groups are homogeneous within and desirably heterogeneous in between. The rationale of intra-group homogeneity is that objects with similar attributes are likely to respond somewhat similarly to a given action. This property has various uses both in business and in scientific research. Most clustering techniques were developed for simple, laboratory-generated data consisting of a few to several numerical variables. Applying these techniques to business data, which consist of many complex categorical variables, suffers from various limitations, as described in the following:

Numerical variables and normalization


Most clustering techniques are based on distance calculations, and distance is very sensitive to the ranges of the variables. For example, "age" normally ranges from 0 to 100, while "salary" can spread from 0 to 100,000. When both variables are used together, the distance contribution from salary can overwhelm that from age, so values have to be normalized. However, normalization is a rather subjective operation; there is no way to transform the data without introducing some bias.
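
As a small illustration of the range problem, assuming a hypothetical table with an age column and a salary column, the following sketch applies min-max normalization so both attributes contribute comparably to a Euclidean distance:

```python
import numpy as np

# Hypothetical customer records: columns are age (0-100) and salary (0-100,000).
customers = np.array([[25, 30_000],
                      [40, 90_000],
                      [60, 35_000]], dtype=float)

mins, maxs = customers.min(axis=0), customers.max(axis=0)
normalized = (customers - mins) / (maxs - mins)       # every column now lies in [0, 1]

print(np.linalg.norm(customers[0] - customers[1]))    # ~60,000: dominated by salary
print(np.linalg.norm(normalized[0] - normalized[1]))  # ~1.1: both attributes now matter
```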

Outliers and numerical variables


Related to numerical variables, outliers also create problems in data mining, especially for clustering based on distance calculations. In such systems, outliers should be identified and removed before mining. (It is noted that removing outliers is recommended in all data mining techniques!)
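
A simple way to screen such outliers before distance-based clustering is an interquartile-range fence; the sketch below uses made-up salary values and the common 1.5 x IQR rule, which is one convention among several:

```python
import numpy as np

# Hypothetical salary values with one extreme outlier.
salaries = np.array([30_000, 35_000, 32_000, 31_000, 900_000], dtype=float)

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
mask = (salaries >= q1 - 1.5 * iqr) & (salaries <= q3 + 1.5 * iqr)

print(salaries[mask])   # the 900,000 value falls outside the fence and is removed
```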

Categorical variables and binary variable encoding


Dealing with categorical variables (non-numeric data, non-numeric variables, categorical data, nominal data, or nominal variables) is much more problematic. Normally, we use "one-of-N" or "thermometer" encoding. This can introduce extra biases due to the number of values in each categorical variable. Note that one-of-N and thermometer encoding transform each categorical value into a true/false binary variable. This can significantly increase the total number of variables, which in turn decreases the effectiveness of many clustering techniques. For more, read the section "Why k-means clustering does not work well with business data?".
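
The following sketch shows one-of-N encoding with pandas for a hypothetical 'region' attribute; note how a single categorical column expands into one binary column per category value:

```python
import pandas as pd

# Hypothetical categorical attribute with four distinct values.
df = pd.DataFrame({"region": ["north", "south", "east", "south", "west"]})

encoded = pd.get_dummies(df, columns=["region"])   # one-of-N (one-hot) encoding
print(encoded)
# One categorical column has become four binary columns
# (region_east, region_north, region_south, region_west),
# illustrating how the variable count grows with the number of categories.
```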

Clustering variable selections and weighting


Clustering variable selection is another problem. The selection of variables will largely influence the clustering results. A commonly used method is to assign different weights to variables and categorical values. However, this introduces another problematic process: when many variables and categorical values are involved, it is rarely possible to obtain the best-quality clustering. For clustering variable selection methods, read Variable & value link analysis.
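
To illustrate the weighting idea, here is a small sketch of a weighted Euclidean distance; the weights are arbitrary values chosen for the example, not recommendations from the text:

```python
import numpy as np

def weighted_euclidean(a, b, w):
    """Euclidean distance where each variable's squared difference is scaled by a weight."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

a, b = np.array([0.2, 0.9]), np.array([0.4, 0.1])
print(weighted_euclidean(a, b, w=np.array([1.0, 1.0])))  # equal weighting
print(weighted_euclidean(a, b, w=np.array([5.0, 0.5])))  # first variable emphasized
```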

Behavioral modeling on time-variant variables


Capturing the patterns (or behaviors) hidden inside time-varying variables and modeling them is another difficult problem. In database marketing, it is desirable to segment customers based on previous marketing campaigns, as predictive models, and then to execute marketing campaigns based on current customer information (using the same models). Most clustering techniques do not possess this predictive modeling capability.

2. Describe the following with respect to Web Mining: a. Categories of Web Mining b. Applications of Web Mining c. Pruning

A. Categories of Web Mining


Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis targets, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.

Web usage mining


Web usage mining is the process of extracting useful information from server logs, i.e. users' browsing history. It is the process of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data.
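
As a toy illustration of web usage mining, the sketch below counts page requests per URL from a few server-log lines; the Common Log Format lines shown are invented examples:

```python
from collections import Counter
import re

# Invented access-log lines in Common Log Format.
log_lines = [
    '10.0.0.1 - - [01/Jul/2011:10:00:01] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Jul/2011:10:00:05] "GET /products.html HTTP/1.1" 200 2301',
    '10.0.0.1 - - [01/Jul/2011:10:00:09] "GET /products.html HTTP/1.1" 200 2301',
]

pattern = re.compile(r'"GET (\S+) HTTP')   # capture the requested URL
hits = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match:
        hits[match.group(1)] += 1

print(hits.most_common())   # [('/products.html', 2), ('/index.html', 1)]
```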

Web structure mining


Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:
1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.
2. Mining the document structure: analysis of the tree-like structure of a page to describe HTML or XML tag usage.
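
A minimal sketch of the first kind of web structure mining, extracting the hyperlink structure from a page, using Python's standard html.parser; the HTML snippet is a made-up example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags, i.e. the page's outgoing links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                                   # hyperlinks are <a href="..."> tags
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="/about.html">About</a> <a href="http://example.org">Ext</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)   # ['/about.html', 'http://example.org']
```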

C. Pruning
Pruning is a horticultural practice involving the selective removal of parts of a plant, such as branches, buds, or roots. Reasons to prune plants include deadwood removal, shaping (by controlling or directing growth), improving or maintaining health, reducing risk from falling branches, preparing nursery specimens for transplanting, and both harvesting and increasing the yield or quality of flowers and fruits. The practice entails targeted removal of diseased, damaged, dead, non-productive, structurally unsound, or otherwise unwanted tissue from crop and landscape plants.

Specialized pruning practices may be applied to certain plants, such as roses, fruit trees, and grapevines. Different pruning techniques may be deployed on herbaceous plants than those used on perennial woody plants. Hedges, by design, are usually (but not exclusively) maintained by hedge trimming, rather than by pruning. Arborists, orchardists, and gardeners use various garden tools and tree-cutting tools designed for the purpose, such as hand pruners, loppers, or chainsaws. In nature, meteorological conditions such as wind, ice and snow, and salt can cause plants to self-prune. This natural shedding is called abscission.

In general, the smaller the branch that is cut, the easier it is for a woody plant to compartmentalize the wound and thus limit the potential for pathogen intrusion and decay. It is therefore preferable to make any necessary formative structural pruning cuts to young plants, when possible, rather than removing large, poorly placed branches from mature plants.


3. Explain: A) Business Intelligence Tools B) Business Intelligence VS Data Warehouse

A) Business Intelligence Tools

Business intelligence (BI) mainly refers to computer-based techniques used in identifying, extracting, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes.[1] BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics. Business intelligence aims to support better business decision-making. Thus a BI system can be called a decision support system (DSS).[2]

Though the term business intelligence is sometimes used as a synonym for competitive intelligence, because they both support decision making, BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. Business intelligence, understood broadly, can include the subset of competitive intelligence.[3]

B) Business Intelligence VS Data Warehouse


A data warehouse (DW) is a way of storing data and creating information by leveraging data marts. Data marts (DMs) are segments or categories of information and/or data that are grouped together to provide 'information' into that segment or category. A DW does not require BI to work; reporting tools can generate reports from the DW.

BI is the leveraging of the DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions, along with statistical analysis tools and data mining tools. You will find that BI is much like ERP in that it can be extremely expensive and invasive to your firm, and there is a wide range between the offerings - low end to high end - which is reflected in the pricing. There is a long list of tools to select from. There are also services that provide this as an outsourced offering. Some of these services will allow you to eventually 'own' the solution and in-source it at a future date. Like anything else, this comes at a price. This scenario works well for those who do not have a high-caliber IT staff and would like to get results with a short ramp-up time, basically because the system is already built; your rules and reports just have to be generated. That is a bit oversimplified, but you get the picture.

4. Describe the following: A) Binning B) Data Transformation C) Data reduction


A) Binning

Data binning is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization.

Introduction
In the context of image processing, binning is the procedure of combining a cluster of pixels into a single pixel. As such, in 2x2 binning, an array of 4 pixels becomes a single larger pixel[1], reducing the overall number of pixels. This aggregation reduces the impact of read noise on the processed image at the cost of a lower resolution.

Example
For example, data binning may be used when small instrumental shifts in the spectral dimension from MS or NMR experiments would otherwise be falsely interpreted as representing different components, when a collection of data profiles is subjected to pattern recognition analysis. A straightforward way to cope with this problem is to use binning techniques in which the spectrum is reduced in resolution to a sufficient degree to ensure that a given peak remains in its bin despite small spectral shifts between analyses. For example, in NMR the chemical shift axis may be discretized and coarsely binned, and in MS the spectral accuracies may be rounded to integer atomic mass unit values. Also, several digital camera systems incorporate an automatic pixel binning function to allow the display of a brighter preview image[2].
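
A minimal NumPy sketch of the 2x2 pixel binning mentioned in the introduction: every 2x2 block of pixels is summed into one larger pixel, trading resolution for reduced read-noise impact. The 4x4 "image" is a toy array used purely for illustration.

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 "image"
# Group pixels into 2x2 blocks and sum within each block.
binned = image.reshape(2, 2, 2, 2).sum(axis=(1, 3))    # result is a 2x2 image
print(binned)
# [[10. 18.]
#  [42. 50.]]
```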

B) Data Transformation
In statistics, data transformation refers to the application of a deterministic mathematical function to each point in a data set; that is, each data point z_i is replaced with the transformed value y_i = f(z_i), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs. Nearly always, the function that is used to transform the data is invertible, and generally is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on people's incomes in some currency unit, it would be common to transform each person's income value by the logarithm function.

Reasons for transforming data


Guidance for how data should be transformed, or whether a transform should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution, and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations, the sample mean does vary normally if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data. Data can also be transformed to make them easier to visualize. For example, suppose we have a scatterplot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data (e.g. square kilometers for area and the number of people for population), most of the countries would be plotted in a tight
cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph's area. Simply rescaling units (e.g. to thousand square kilometers, or to millions of people) will not change this. However, following logarithmic transformations of both area and population, the points will be spread more uniformly in the graph. A final reason that data can be transformed is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as "kilometers per liter" or "miles per gallon." However if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by the reciprocal function, yielding liters per kilometer, or gallons per mile.
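
A small sketch of the area/population example, using three hypothetical countries rather than real figures, to show how a log transform spreads values that span several orders of magnitude:

```python
import math

# Hypothetical (land area in km^2, population) pairs spanning several orders of magnitude.
countries = {
    "Country A": (2,         40_000),
    "Country B": (100_000,   350_000),
    "Country C": (3_000_000, 1_400_000_000),
}

for name, (area, pop) in countries.items():
    print(f"{name}: raw=({area}, {pop}), "
          f"log10=({math.log10(area):.2f}, {math.log10(pop):.2f})")
# On the raw scale, Countries A and B would be crushed into one corner of a
# scatterplot next to Country C; on the log10 scale the three points are
# spread far more evenly.
```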

Data transformation in regression


Linear regression is a statistical technique for relating a dependent variable Y to one or more independent variables X. The simplest regression models capture a linear relationship between the expected value of Y and each independent variable (when the other independent variables are held fixed). If linearity fails to hold, even approximately, it is sometimes possible to transform either the independent or dependent variables in the regression model to improve the linearity. Another assumption of linear regression is that the variance be the same for each possible expected value (this is known as homoskedasticity). Univariate normality is not needed for least squares estimates of the regression parameters to be meaningful (see the Gauss-Markov theorem). However, confidence intervals and hypothesis tests will have better statistical properties if the variables exhibit multivariate normality. This can be assessed empirically by plotting the fitted values against the residuals, and by inspecting the normal quantile plot of the residuals. Note that it is not relevant whether the dependent variable Y is marginally normally distributed.
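
As a sketch of transforming the dependent variable to restore linearity (one option among those discussed above), the following example fits a straight line to log(Y); the exponential data-generating process and noise level are assumptions made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
# Data generated from an exponential relationship with multiplicative noise:
# on the original scale the X-Y relationship is curved, on the log scale it is linear.
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(sigma=0.1, size=x.size)

slope, intercept = np.polyfit(x, np.log(y), deg=1)   # linear fit on the log scale
print(slope, intercept)   # roughly 0.5 and log(2), recovering the generating parameters
```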

Examples of logarithmic transformations


Equation: Y = a + bX Meaning: A unit increase in X is associated with an average of b units increase in Y. Equation: log(Y) = a + bX (From taking the log of both sides of the equation: Y = eaebX)

Meaning: A unit increase in X is associated with an average of 100b% increase in Y. Equation: Y = a + blog(X) Meaning: A 1% increase in X is associated with an average b/100 units increase in Y. Equation: log(Y) = a + blog(X) (From taking the log of both sides of the equation: Y = eaXb)

Meaning: A 1% increase in X is associated with a b% increase in Y.
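
A brief sketch, using standard calculus rather than anything stated in the original, of why the log(Y) = a + bX case is read as "about 100b% per unit of X":

```latex
% Differentiating log(Y) = a + bX gives the relative-change interpretation:
\[
  \log Y = a + bX
  \quad\Longrightarrow\quad
  \frac{dY}{Y} = b\,dX
  \quad\Longrightarrow\quad
  \frac{\Delta Y}{Y} \approx b\,\Delta X .
\]
% For a unit increase (\Delta X = 1) the relative change in Y is about b,
% i.e. roughly 100b%; the exact multiplicative change is e^b, a 100(e^b - 1)% change.
```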

Common transformations
The logarithm and square root transformations are commonly used for positive data, and the multiplicative inverse (reciprocal) transformation can be used for non-zero data. The power transform is a family of transformations parametrized by a non-negative value λ that includes the logarithm, square root, and multiplicative inverse as special cases. To approach data transformation systematically, it is possible to use statistical estimation techniques to estimate the parameter λ in the power transform, thereby identifying the transform that is approximately the most appropriate in a given setting. Since the power transform family also includes the identity transform, this approach
can also indicate whether it would be best to analyze the data without a transformation. In regression analysis, this approach is known as the Box-Cox technique.

The reciprocal and some power transformations can be meaningfully applied to data that include both positive and negative values (the power transform is invertible over all real numbers if λ is an odd integer). However, when both negative and positive values are observed, it is more common to begin by adding a constant to all values, producing a set of non-negative data to which any power transform can be applied.

A common situation where a data transformation is applied is when a value of interest ranges over several orders of magnitude. Many physical and social phenomena exhibit such behavior: incomes, species populations, galaxy sizes, and rainfall volumes, to name a few. Power transforms, and in particular the logarithm, can often be used to induce symmetry in such data. The logarithm is often favored because it is easy to interpret its result in terms of "fold changes."

The logarithm also has a useful effect on ratios. If we are comparing positive quantities X and Y using the ratio X / Y, then if X < Y, the ratio is in the unit interval (0, 1), whereas if X > Y, the ratio is in the half-line (1, ∞), where a ratio of 1 corresponds to equality. In an analysis where X and Y are treated symmetrically, the log-ratio log(X / Y) is zero in the case of equality, and it has the property that if X is K times greater than Y, the log-ratio is equidistant from zero in the same way as when Y is K times greater than X (the log-ratios are log(K) and −log(K) in these two situations). If values are naturally restricted to be in the range 0 to 1, not including the end-points, then a logit transformation may be appropriate: this yields values in the range (−∞, ∞).
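
A short sketch of estimating the power-transform parameter with SciPy's Box-Cox routine; the right-skewed sample is synthetic, and the near-zero estimated lambda simply indicates that a log transform is roughly appropriate for such data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # positive, strongly skewed data

transformed, lam = stats.boxcox(x)                 # lam is the estimated lambda
print(f"estimated lambda: {lam:.2f}")              # near 0 => log transform is appropriate
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(transformed):.2f}")
```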

Transforming to normality
It is not always necessary or desirable to transform a data set to resemble a normal distribution. However, if symmetry or normality are desired, they can often be induced through one of the power transformations. To assess whether normality has been achieved, a graphical approach is usually more informative than a formal statistical test. A normal quantile plot is commonly used to assess the fit of a data set to a normal population. Alternatively, rules of thumb based on the sample skewness and kurtosis have also been proposed, such as having skewness in the range of −0.8 to 0.8 and kurtosis in the range of −3.0 to 3.0.

Transforming to a uniform distribution


If we observe a set of n values X1, ..., Xn with no ties (i.e. there are n distinct values), we can replace Xi with the transformed value Yi = k, where k is defined such that Xi is the kth largest among all the X values. This is called the rank transform, and it creates data with a perfect fit to a (discrete) uniform distribution. This approach has a population analogue. If X is any random variable, and F is the cumulative distribution function of X, then as long as F is invertible, the random variable U = F(X) follows a uniform distribution on the unit interval [0, 1]. From a uniform distribution, we can transform to any distribution with an invertible cumulative distribution function: if G is an invertible cumulative distribution function, and U is a uniformly distributed random variable, then the random variable G⁻¹(U) has G as its cumulative distribution function.
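
A minimal sketch of the rank transform: each observed value is replaced by its rank, which by construction fits a (discrete) uniform distribution exactly; the five sample values are arbitrary:

```python
import numpy as np

x = np.array([3.2, 150.0, 0.7, 42.0, 7.9])
ranks = x.argsort().argsort() + 1      # 1 = smallest value, n = largest
print(ranks)                           # [2 5 1 4 3]
```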

C) Data reduction


Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.

Columns and rows are moved around until a diagonal pattern appears, thereby making it easy to see patterns in the data.

When the information is derived from instrument readings then there may also be a transformation from analog to digital form. When the data are already in digital form the 'reduction' of the data typically involves some editing, scaling, coding, sorting, collating, and producing tabular summaries, and when the observations are discrete but the underlying phenomenon is continuous then smoothing and interpolation are likely to be needed. Often the data reduction is undertaken in the presence of reading or measurement errors. Some idea of the nature of these errors is needed before the most likely value may be determined.

Data reduction rules which have been suggested include:
1. Order by some aspect of size.
2. 'Diagonalize' tables, so that unordered categories are re-arranged to make patterns easier to see. [1]
3. Use averages to provide a visual focus as well as a summary.
4. Use layout and labeling to guide the eye.
5. Remove chartjunk, such as pictures and lines.
6. Give a brief verbal summary. [2]

