
DATA MINING ASSIGNMENT

1. With a neat sketch explain the architecture of a data warehouse.
Ans.

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
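As a rough illustration of the back-end extract-clean-load flow described above, here is a minimal Python sketch using the standard-library sqlite3 module. The in-memory databases, the customers table, and the stg_customers staging table are hypothetical names invented for the example, not part of any particular warehouse product.

import sqlite3

# Stand-ins for an operational source and a warehouse staging area (in-memory for the sketch).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Pretend the operational system already holds some customer rows.
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, " Ann ", "toronto"), (2, "Bob", " CHICAGO"), (3, None, None)])

# Extract: pull raw rows from the operational system through a gateway-style API.
rows = source.execute("SELECT id, name, city FROM customers").fetchall()

# Clean and transform: drop incomplete rows, trim whitespace, normalize city names.
cleaned = [(cid, name.strip(), city.strip().title())
           for cid, name, city in rows
           if name is not None and city is not None]

# Load: write the transformed rows into a warehouse staging table.
warehouse.execute("CREATE TABLE stg_customers (id INTEGER, name TEXT, city TEXT)")
warehouse.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", cleaned)
warehouse.commit()
print(warehouse.execute("SELECT * FROM stg_customers").fetchall())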

2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

2. Discuss the typical OLAP operations with an example.
Ans.
Roll-up: Generalises one or a few dimensions and performs the appropriate aggregations on the corresponding measures. For non-spatial measures, aggregation is implemented in the same way as in non-spatial data cubes. For spatial measures, the aggregation takes a collection of spatial pointers; it is used for map overlay and performs spatial aggregation operations such as region merging.
Drill-down: Specialises one or a few dimensions and presents low-level data. It can be viewed as the reverse of roll-up and can be implemented by saving a low-level cube and performing a generalisation on it when necessary.
Slicing and dicing: Selects a portion of the cube based on constants in one or a few dimensions. This can be done with regular queries.
Pivoting: Presents the measures in different cross-tabular layouts. It can be implemented in a similar way as in non-spatial cubes.

3. Discuss how computations can be performed efficiently on data cubes.
Ans.
General strategies for cube computation:
1. Sorting, hashing, and grouping.
2. Simultaneous aggregation and caching of intermediate results.
3. Aggregation from the smallest child, when there exist multiple child cuboids.
4. The Apriori pruning method can be explored to compute iceberg cubes efficiently.
Cube computation algorithms:
Bottom-up: first compute the base cuboid, then work up the lattice to the apex cuboid, e.g., the MultiWay array aggregation algorithm.
Top-down: start with the apex cuboid and work down to the base cuboid, e.g., the BUC algorithm (BUC stands for Bottom-Up Construction; its authors draw the lattice with the apex cuboid at the bottom, so the same exploration is bottom-up in their orientation).
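The OLAP operations of question 2 can be illustrated with a small, hedged sketch, assuming the pandas library is available; the sales table, its dimensions, and its values are made up for the example.

import pandas as pd

# Illustrative sales fact table with location and time dimensions (data is made up).
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [100, 150, 200, 250],
})

# Roll-up: climb the location hierarchy from city to country by aggregating the measure.
rollup = sales.groupby("country")["amount"].sum()

# Drill-down is the reverse: group by the more specific level again.
drilldown = sales.groupby(["country", "city"])["amount"].sum()

# Slice: fix one dimension to a constant (here quarter = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot: present the same measures in a different cross-tabular layout.
pivot = sales.pivot_table(index="country", columns="quarter",
                          values="amount", aggfunc="sum")
print(rollup, drilldown, slice_q1, pivot, sep="\n\n")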

4. Write short notes on data warehouse metadata.
Ans. Metadata in a data warehouse contains the answers to questions about the data in the data warehouse. Here is a sample list of definitions:
Data about the data
Table of contents for the data
Catalog for the data
Data warehouse atlas
Data warehouse roadmap
Data warehouse directory
Glue that holds the data warehouse contents together
Tongs to handle the data
The nerve center
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep the information about the logical data structures, the information about the files and addresses, the information about the indexes, and so on. The data dictionary contains data about the data in the database. Similarly, the metadata component is the data about the data in the data warehouse. This is the commonly used definition: metadata in a data warehouse is similar to a data dictionary, but much more than a data dictionary.
Types of Metadata
Metadata in a data warehouse falls into three major categories:
1. Operational Metadata
2. Extraction and Transformation Metadata
3. End-User Metadata

5. Explain various methods of data cleaning in detail.
Ans.

Parsing: Parsing in data cleansing is performed for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.
Data transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
Duplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
Statistical methods: By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values, usually obtained by extensive data augmentation algorithms.
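A minimal sketch of these cleaning steps, assuming the pandas library; the column names and the dirty values are invented for illustration.

import pandas as pd

# Illustrative dirty data: a malformed age value, a duplicate row, and a missing value.
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara"],
    "age":      ["34", "34", "abc", None],
})

# Parsing: reject values that do not conform to the expected numeric format.
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # "abc" and None become NaN

# Duplicate elimination: rows representing the same entity are collapsed.
df = df.drop_duplicates()

# Statistical methods: replace missing/erroneous values with a plausible statistic (the mean).
df["age"] = df["age"].fillna(df["age"].mean())

print(df)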

6. Give an account on data mining query language.
Ans. DMQL has been designed at Simon Fraser University, Canada. It has been designed to support various rule mining extractions (e.g., classification rules, comparison rules, association rules). In this language, an association rule is a relation between the values of two sets of predicates that are evaluated on the relations of a database. These predicates are of the form P(X, c), where P is a predicate taking the name of an attribute of a relation, X is a variable, and c is a value in the domain of the attribute. A typical example of an association rule that can be extracted by DMQL is buy(X, milk) ∧ town(X, Berlin) → buy(X, beer). An important possibility in DMQL is the definition of meta-patterns, i.e., a powerful way to restrict the syntactic aspect of the extracted rules (expressive syntactic constraints). For instance, the meta-pattern buy+(X, Y) ∧ town(X, Berlin) → buy(X, Z) restricts the search to association rules concerning implications between bought products for customers living in Berlin. The symbol + denotes that the predicate buy can appear several times in the left part of the rule. Moreover, beside the classical frequency and confidence, DMQL also enables thresholds to be defined on the noise or novelty of extracted rules. Finally, DMQL enables a hierarchy to be defined on attributes so that generalized association rules can be extracted. The general syntax of DMQL for the extraction of association rules is the following:
use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}
mine associations [as <pattern_name>]
[matching <metapattern>]
from <relation(s)>
[where <condition>]
[order by <order_list>]
[group by <grouping_list>]
[having <condition>]
with <interest_measure> threshold = <value>

7. How is Attribute-Oriented Induction implemented? Explain in detail.
Ans. The attribute-oriented induction method has been implemented in a data mining system prototype called DBMiner (previously called DBLearn) and has been tested successfully against large relational databases and data warehouses for multidimensional purposes. In this implementation, attribute-oriented induction follows an architecture in which characteristic rules and classification rules can be learned directly from a transactional database (OLTP) or a data warehouse (OLAP) with the help of concept hierarchies for knowledge generalization. A concept hierarchy can be created from the OLTP database as a direct resource. To simplify the implementation, only non-rule-based concept hierarchies are used, and learning is restricted to characteristic rules and classification/discriminant rules.
1) A characteristic rule is an assertion that characterizes the concepts satisfied by all of the data stored in the database. It provides generalized concepts about a property which can help people

recognize the common features of the data in a class, for example, the symptoms of a specific disease. 2) A classification/discriminant rule is an assertion that discriminates the concepts of one class from other classes. It gives a discriminant criterion which can be used to predict the class membership of new data. For example, to distinguish one disease from others, a classification rule should summarize the symptoms that discriminate this disease from the others.

8. Write and explain the algorithm for mining frequent item sets without candidate generation.
Ans. FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as null. For each transaction Trans in DB do the following: select the frequent items in Trans and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T).
The function insert_tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
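Below is a minimal Python sketch of the FP-tree construction step described above; the node-link/header-table structure and the FP-growth mining phase are omitted for brevity, and the class, function, and data names are illustrative rather than taken from any library.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}          # item -> FPNode

def build_fp_tree(db, min_support):
    # Pass 1: count item supports and keep only frequent items, in descending order (FList).
    counts = Counter(item for trans in db for item in trans)
    flist = [i for i, c in counts.most_common() if c >= min_support]
    order = {item: rank for rank, item in enumerate(flist)}

    root = FPNode(None, None)       # the null root of the FP-tree
    # Pass 2: insert each transaction's frequent items in FList order.
    for trans in db:
        items = sorted((i for i in trans if i in order), key=order.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:       # create a new node with count 1
                child = FPNode(item, node)
                node.children[item] = child
            else:                   # shared prefix: just increment the count
                child.count += 1
            node = child
    return root, flist

# Example usage with a toy transaction database (illustrative data).
db = [["a", "b"], ["b", "c", "d"], ["a", "b", "d"], ["a", "b", "c"]]
tree, flist = build_fp_tree(db, min_support=2)
print(flist)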

9. Discuss the approaches for mining multi-level association rules from transactional databases. Give a relevant example.
Ans. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working towards the lower, more specific concept levels, until no more frequent itemsets can be found. That is, once all frequent itemsets at concept level 1 are found, the frequent itemsets at level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations. For example, with an item hierarchy over purchased goods, rules are first mined at the level of "computer", and only then at the more specific level of "laptop computer" or "desktop computer".
Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants; the search avoids examining itemsets containing any item whose ancestors do not have minimum support.
The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels. This provides the motivation for the

following approach.
Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The lower the abstraction level, the smaller the corresponding threshold. For mining multiple-level associations with reduced support, there are a number of alternative search strategies:
Level-by-level independent: This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.
Level-cross-filtering by single item: An item at the ith level is examined if and only if its parent node at the (i-1)th level is frequent. In other words, we investigate a more specific association from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.
Level-cross-filtering by k-itemset: A k-itemset at the ith level is examined if and only if its corresponding parent k-itemset at the (i-1)th level is frequent.

10. Explain the algorithm for constructing a decision tree from training samples.
Ans. Top-Down Decision Tree Induction Schema:
BuildTree(Node n, data partition D, algorithm CL)
1. Apply CL to D to find crit(n)
2. Let k be the number of children of n
3. If (k > 0)
4.   Create k children c1, ..., ck of n
5.   Use the best split to partition D into D1, ..., Dk
6.   For (i = 1; i <= k; i++)
7.     BuildTree(ci, Di)
8.   endFor
9. endIf
RainForest Refinement:
1. for each predictor attribute p
2.   Call CL.find_best_partition(AVC-set of p)
3. endfor
4. k = CL.decide_splitting_criterion();
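A compact sketch of the BuildTree schema, assuming an ID3-style choice of CL (entropy-based information gain) over categorical attributes; the RainForest AVC-set optimization is not shown, and the weather data below is invented for illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # Stop when the partition is pure or no attributes remain (crit(n) yields no split).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    # CL step: pick the attribute with the largest information gain.
    def gain(a):
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        return entropy(labels) - sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    best = max(attributes, key=gain)
    node = {"split_on": best, "children": {}}
    # Partition D into D1, ..., Dk, one child per value of the splitting attribute.
    for value in set(row[best] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels),
                                             [a for a in attributes if a != best])
    return node

# Toy training samples (illustrative): decide whether to play based on weather attributes.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}, {"outlook": "rain",  "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]
print(build_tree(rows, labels, ["outlook", "windy"]))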

11. Explain Bayes' theorem.
Ans. Bayes' theorem gives the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most common form, it is:

P(A|B) = P(B|A) P(A) / P(B), provided P(B) ≠ 0.

The meaning of this statement depends on the interpretation of probability ascribed to the terms, as discussed below.

When applied, the probabilities involved in Bayes' theorem may have any of a number of probability interpretations. In one of these interpretations, the theorem is used directly as part of a particular approach to statistical inference. In particular, with the Bayesian interpretation of probability, the theorem expresses how a subjective degree of belief should rationally change to account for evidence: this is Bayesian inference, which is fundamental to Bayesian statistics. However, Bayes' theorem has applications in a wide range of calculations involving probabilities, not just in Bayesian inference. Bayes' theorem is to the theory of probability what Pythagoras's theorem is to geometry. In the Bayesian (or epistemological) interpretation, probability measures a degree of belief. Bayes' theorem then links the degree of belief in a proposition before and after accounting for evidence. For example, suppose somebody proposes that a biased coin is twice as likely to land heads as tails. The degree of belief in this might initially be 50%. The coin is then flipped a number of times to collect evidence. Belief may rise to 70% if the evidence supports the proposition. For proposition A and evidence B,

P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
The quotient P(B|A)/P(B) represents the support B provides for A.
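A small numeric sketch of the biased-coin example above in Python; the observed evidence (8 heads in 10 flips) is an assumed figure chosen for illustration, not taken from the text.

from math import comb

# Proposition A: the coin is biased so that heads is twice as likely as tails (P(heads) = 2/3).
# Evidence B: we observe 8 heads in 10 flips (assumed numbers for illustration).
prior = 0.5                      # initial degree of belief in A
heads, flips = 8, 10

def likelihood(p_heads):
    # Binomial probability of the observed evidence given a heads probability.
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

p_b_given_a     = likelihood(2 / 3)     # P(B | A), coin biased
p_b_given_not_a = likelihood(1 / 2)     # P(B | not A), coin fair
p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)   # total probability of B

posterior = p_b_given_a * prior / p_b   # Bayes' theorem: P(A | B)
print(round(posterior, 3))              # belief in the biased-coin hypothesis rises above 0.5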

12. Explain the following clustering methods in detail: (i) BIRCH
Ans. BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multidimensional metric data points in an attempt to produce the best-quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH requires only a single scan of the database. In addition, BIRCH is recognized as the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".
Advantages of BIRCH:
It is local, in that each clustering decision is made without scanning all data points and currently existing clusters.
It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important.
It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
It is also an incremental method that does not require the whole data set in advance.
(ii) CURE
Ans. CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large databases that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. To avoid the problems with non-uniformly sized or shaped clusters, CURE employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes. In CURE, a constant number c of well-scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers. The running time of the algorithm is O(n^2 log n) and the space complexity is O(n). The algorithm cannot be directly applied to large databases, so the following enhancements are used:
Random sampling
Partitioning for speed-up
Labeling data on disk
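A usage sketch for BIRCH, assuming the scikit-learn implementation (sklearn.cluster.Birch) is available; the data and parameter values are illustrative, not prescribed by the algorithm.

import numpy as np
from sklearn.cluster import Birch

# Toy 2-D data: two well-separated blobs (values are made up).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# BIRCH builds a CF-tree in a single scan of the data; threshold bounds the sub-cluster
# radius, branching_factor bounds the tree fan-out, n_clusters sets the final global clustering.
model = Birch(threshold=0.8, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:3], labels[-3:])   # points from the two blobs end up in different clusters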

13. What is a multimedia database? Explain the methods of mining multimedia databases.
Ans. Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video tapes, audio records, etc.), so that they can be efficiently accessed and searched by as many applications as needed. The stored objects include images, video and audio recordings, signals, and similar data that are digitized and stored, and the system provides basic services such as a user/program interface for accessing them.
Methods of mining multimedia databases:
Classification models: Machine learning (ML) and meaningful information extraction can only be realized when some objects have been identified and recognized by the machine. The object recognition problem can be regarded as a supervised labeling problem. Starting with the supervised models, we mention decision trees; an overview of existing work on decision trees is available in the literature. Decision trees can be translated into a set of rules by creating a separate rule for each path from the root to a leaf in the tree. However, rules can also be induced directly from training data using a variety of rule-based algorithms.
Clustering models: In unsupervised classification, the problem is to group a given collection of unlabeled multimedia files into meaningful clusters according to the multimedia content, without a priori knowledge. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Association rules: Most association rule studies have focused on corporate data, typically in alphanumeric databases. There are three measures of association: support, confidence, and interest. The support factor indicates the relative occurrence of both X and Y within the overall data set of transactions; it is defined as the ratio of the number of instances satisfying both X and Y over the total number of instances. The confidence factor is the probability of Y given X and is defined as the ratio of the number of instances satisfying both X and Y over the number of instances satisfying X.

The support factor indicates the frequency of the occurring patterns in the rule, and the confidence factor denotes the strength of implication of the rule.

14. (i) Discuss the social impacts of data mining.
Ans. Profiling information is collected every time you use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above; every time you surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form; and every time you pay for prescription drugs or present your medical care number when visiting the doctor. While the collection of personal data may be beneficial for companies and consumers, there is also potential for its misuse.
(ii) Discuss spatial data mining.
Ans. Spatial data mining is the process of discovering interesting, useful, non-trivial patterns from large spatial datasets. Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.
Challenges in spatial mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.

15. Write the Apriori algorithm for discovering frequent item sets for mining single-dimensional Boolean association rules and discuss various approaches to improve its efficiency.
Ans. The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori candidate generation function. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction T.
L1 := frequent 1-itemsets;
k := 2; // k represents the pass number
while (Lk-1 != {}) do begin
  Ck := new candidates of size k generated from Lk-1;
  for all transactions T ∈ D do begin
    Increment the count of all candidates in Ck that are contained in T.
  end
  Lk := all candidates in Ck with minimum support;
  k := k + 1;
end
Answer := ∪k Lk;
Approaches to improve Apriori's efficiency include hash-based candidate counting, transaction reduction, partitioning the data, sampling, and dynamic itemset counting.
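A minimal, unoptimized Python sketch of the level-wise Apriori loop above; real implementations use hash trees or similar structures for the support-counting pass instead of rescanning the database for every candidate, and the toy transactions are invented.

from itertools import combinations

def apriori(db, min_support):
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}

    def support(s):
        # Count how many transactions contain itemset s (one scan per call in this sketch).
        return sum(1 for t in db if s <= t)

    # Pass 1: frequent 1-itemsets.
    L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # Candidate generation: join L(k-1) with itself, keep only size-k unions whose
        # (k-1)-subsets are all frequent (the Apriori pruning step).
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support-counting pass over the database.
        L.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return set().union(*L[:-1])   # drop the final empty level

# Toy transaction database (illustrative data).
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
print(apriori(db, min_support=3))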

16. Explain the different categories of clustering methods.
Ans.
1. Partitioning methods: The partitioning methods generally result in a set of M clusters, each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in the cluster. The precise form of this description will depend on the type of object being clustered.
2. Hierarchical agglomerative methods: The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm.
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
3. The single link method (SLINK): The single link method is probably the best known of the hierarchical methods and operates by joining, at each step, the two most similar objects which are not yet in the same cluster. The name single link thus refers to the joining of pairs of clusters by the single shortest link between them.
4. The complete link method (CLINK): The complete link method is similar to the single link method except that it uses the least similar pair between two clusters to determine the inter-cluster similarity (so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster). This method is characterized by small, tightly bound clusters.
5. The group average method: The group average method relies on the average value of the pairwise similarities within a cluster, rather than the maximum or minimum similarity as with the single link or the complete link methods. Since all objects in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other member of its own cluster than the objects in any other cluster.
6. Text-based documents: For text-based documents, clusters may be formed by considering similarity in terms of key words that occur at least a minimum number of times in a document. When a query arrives for a particular word, instead of checking the entire database, only the cluster that has that word in its list of key words is scanned and the result is returned. The order of the documents in the result depends on the number of times the key word appears in each document.
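The single link, complete link, and group average criteria above can be compared side by side with a short sketch, assuming SciPy is available; the points are made up for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two obvious groups (values are made up).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Hierarchical agglomerative clustering with the three linkage criteria discussed above.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
    print(method, labels)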

17. Explain the back propagation algorithm for neural network-based classification of data.
Ans. The backpropagation algorithm performs learning on a multilayer feed-forward neural network. The inputs correspond to the attributes measured for each training sample. The inputs are fed simultaneously into a layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of neuron-like units, known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given samples. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. Multilayer feed-forward networks of linear threshold functions, given enough hidden units, can closely approximate any function.
Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the backwards direction, that is, from the output layer through each hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized below.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it; the biases are similarly initialized to small random numbers.
Each training sample X is then processed by the following steps: propagate the inputs forward, then backpropagate the error.
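A minimal numpy sketch of backpropagation for a single-hidden-layer feed-forward network trained on the XOR problem; the layer sizes, learning rate, and epoch count are arbitrary choices for illustration, not values prescribed by the algorithm.

import numpy as np

# Training samples for XOR (illustrative two-class problem).
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize the weights and biases to small random numbers (e.g., in [-0.5, 0.5]).
W1, b1 = rng.uniform(-0.5, 0.5, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.uniform(-0.5, 0.5, (4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # Propagate the inputs forward through the hidden layer to the output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagate the error from the output layer toward the first hidden layer.
    err_out = (out - y) * out * (1 - out)          # squared-error gradient times sigmoid'
    err_hidden = (err_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the direction that reduces the mean squared error.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ err_hidden;  b1 -= lr * err_hidden.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # predictions should move toward [0, 1, 1, 0] (depends on the random start)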
