
PRIST University

II M.Tech CSE

III SEM

12250E33A - DATA WAREHOUSING AND DATA MINING

UNIT-1

1. Define data mining.
Data mining refers to the extraction or "mining" of knowledge from large databases. Data mining and knowledge discovery in databases form a new interdisciplinary field, merging ideas from statistics, machine learning, databases and parallel computing. Data mining is the non-trivial extraction of implicit, previously unknown and potentially useful information from data.

2. What is KDD?
Knowledge Discovery in Databases (KDD) is the process of identifying a valid, potentially useful and ultimately understandable structure in data. This process involves selecting or sampling data from a data warehouse, cleaning or preprocessing it, transforming or reducing it, applying a data mining component to produce a structure, and then evaluating the derived structure.

3. What are the steps involved in the KDD process?
The steps involved in the KDD process are:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

4. What is the use of the knowledge base?
The knowledge base is the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction. Knowledge such as user beliefs, thresholds and metadata can be used to assess a pattern's interestingness.

5. Draw the architecture of a typical data mining system.
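The KDD steps in question 3 can be illustrated as a toy pipeline. This is a hedged sketch, not a library API: the stage functions (clean, select, transform, mine) and the sample records are illustrative assumptions.

```python
# Illustrative KDD pipeline: each stage is a plain function (assumed names).
def clean(records):
    # Data cleaning: drop tuples with missing values.
    return [r for r in records if None not in r.values()]

def select(records, attrs):
    # Data selection: keep only task-relevant attributes.
    return [{a: r[a] for a in attrs} for r in records]

def transform(records):
    # Data transformation: normalize 'income' to thousands.
    return [{**r, "income": r["income"] / 1000} for r in records]

def mine(records):
    # Data mining: a trivial "pattern" - the average income.
    return sum(r["income"] for r in records) / len(records)

raw = [{"age": 30, "income": 42000}, {"age": 35, "income": None}]
pattern = mine(transform(select(clean(raw), ["age", "income"])))
print(pattern)  # → 42.0
```

In a real KDD process each stage would of course be far richer; the point is only the ordering of the steps.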


6. Mention some of the data mining techniques.
The most commonly used techniques in data mining are:
Neural networks
Decision trees
Cluster analysis
Rule induction
Genetic algorithms
Data visualization

7. What is the purpose of data mining techniques?
The main purpose of data mining techniques is to analyze the data and provide further knowledge about the domain. Their main purpose is to search for the relationships and global patterns that exist in large databases but are hidden among vast amounts of data.

8. What is a data cube?
A data warehouse is modeled by data cubes. Each dimension is an attribute and each cell represents an aggregate measure. A data warehouse collects information about subjects that span an entire organization, whereas a data mart focuses on selected subjects. The multidimensional data view makes online analytical processing easier.

9. Define association analysis. Give an example.
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. An association rule is an expression of the form X => Y, where X and Y are sets of items. This rule should satisfy two measures: support and confidence.
Example: buys(X, computer) => buys(X, software)

9. Define classification.
Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set, constructs a model based on the class labels, and aims to assign a class label to future unlabeled records.

10. What is outlier mining?
Data objects which differ significantly from the remaining data objects are referred to as outliers. Normally outliers do not comply with the general behavior or model of the data; hence most data mining methods discard outliers as noise or exceptions.
Outlier mining is used for identifying exceptions or rare events, which can often lead to the discovery of interesting and unexpected knowledge in areas such as credit card fraud detection, cellular phone cloning fraud and detection of suspicious activities.
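A minimal sketch of the outlier idea above: flag values that lie far from the mean. The spend figures and the 2-standard-deviation threshold are illustrative assumptions, not part of any standard method described here.

```python
import statistics

def outliers(values, k=2.0):
    # Flag values more than k standard deviations from the mean.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

# Daily card spend; the 5000 transaction stands out as a possible fraud case.
spend = [120, 95, 130, 110, 100, 5000]
print(outliers(spend))  # → [5000]
```

Real fraud-detection systems use far more robust techniques (the mean itself is distorted by the outlier), but the sketch shows why outliers should not simply be discarded as noise.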

11. List out the data mining functionalities.
The various data mining functionalities are:
Concept/class description
Association analysis
Classification and prediction
Cluster analysis
Outlier analysis
Evolution analysis

12. Define cluster analysis.
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering is used:
To uncover natural groupings
To initiate hypotheses about the data
To find a consistent and valid organization of the data

13. Write short notes on the classification of data mining systems.
Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization and information science. Data mining systems can be classified according to the kinds of databases mined, the types of knowledge mined, the techniques utilized, the functionalities applied, etc.:
1. Classification according to data models
2. Classification according to special types of data handled
3. Classification according to the kind of knowledge mined
4. Classification according to the kind of techniques utilized

14. Describe the challenges to data mining regarding data mining methodology and user interaction issues.
The challenges to data mining regarding data mining methodology and user interaction issues are:
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem

15. Describe challenges to data mining regarding performance issues.
1. Efficiency and scalability of data mining algorithms
2. Parallel, distributed, and incremental mining algorithms

16. Describe issues relating to the diversity of database types.
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems

17. How is a data warehouse different from a database?

DATABASE: A database is a collection of data; it contains information about one particular enterprise. A database management system consists of a collection of interrelated data and a set of programs to access the data.

DATA WAREHOUSE: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. The data warehouse is an informational environment that provides an integrated and total view of the enterprise.

UNIT-2

1. Define data preprocessing. (or) Why is it necessary to preprocess the data?
Real-world databases are highly susceptible to noisy, missing and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results, so data preprocessing is needed. Incomplete data can occur for a number of reasons:
Attributes of interest may not always be available
Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions
Data that were inconsistent with other recorded data may have been deleted

2. What are the methods used for cleaning data?
The following methods are useful for handling missing values over several attributes:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
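Methods 1 and 4 from the list above can be sketched in a few lines of Python. The tuple layout (a list of dicts) and the sample ages are illustrative assumptions.

```python
# Sketch of missing-value methods 1 and 4 above; toy row format assumed.
def ignore_tuples(rows, attr):
    # Method 1: ignore (drop) tuples whose value is missing.
    return [r for r in rows if r[attr] is not None]

def fill_with_mean(rows, attr):
    # Method 4: replace missing values with the attribute mean.
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]

rows = [{"age": 20}, {"age": None}, {"age": 40}]
print(fill_with_mean(rows, "age"))  # → [{'age': 20}, {'age': 30.0}, {'age': 40}]
```

Method 4 is cheap but biases the data toward the mean, which is why methods 5 and 6 (class-conditional mean, most probable value) are usually preferred.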

3. Problems using data cleaning techniques.
Nested discrepancies
Lack of interactivity
Increased interactivity

4. What are the reasons to have inconsistent data?
Inconsistencies exist in the data stored in transactions. Inconsistencies occur due to errors during data entry, functional dependencies between attributes, and missing values. The inconsistencies can be detected and corrected either manually or by knowledge engineering tools.

5. What are the methods available to perform data transformation?
In data transformation, the data are transformed or consolidated into forms appropriate for mining:
Smoothing
Aggregation
Generalization of the data
Normalization
Attribute construction

6. Define min-max normalization.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Min-max normalization preserves the relationships among the original data values.

7. What is numerosity reduction?
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller forms of data representation. Techniques:
a) Parametric
b) Non-parametric

8. What is regression?
Regression is a parametric technique in which only the data parameters need to be stored, instead of the actual data:
Linear regression
Multiple linear regression

9. Define clustering.
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are similar to one another and dissimilar to objects in

other clusters. Similarity is defined in terms of how close the objects are in space, based on a distance function.

10. What is concept description?
Concept description describes a given set of task-relevant data in a concise and summarative manner, presenting interesting general properties of the data. Concept description consists of characterization and comparison.

11. Define association rule mining.
Association rule mining aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items or objects in transaction databases, relational databases or other data repositories. Association rules are widely used in various areas such as telecommunication networks, market and risk management, etc.

12. What are the steps involved in association rule mining?
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

13. Define support and confidence in association rule mining.
Support: the percentage of transactions in D that contain A ∪ B.
Confidence: the percentage of transactions in D containing A that also contain B.
support(A => B) = P(A ∪ B)
confidence(A => B) = P(B|A)
Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong.

14. Define join and prune.
The Apriori algorithm follows a two-step process:
Join step: C_k is generated by joining L_(k-1) with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

15. What is the Apriori property?
The Apriori principle states that any subset of a frequent itemset must be frequent. It includes a two-step process:
1. Join step
2. Prune step
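The support and confidence measures from question 13 can be checked on a toy transaction database. The transactions and item names below are illustrative assumptions echoing the buys(X, computer) => buys(X, software) example.

```python
# Toy transaction database D; item names are illustrative.
D = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(a, b):
    # confidence(A => B) = P(B|A) = support(A u B) / support(A)
    return support(a | b) / support(a)

print(support({"computer", "software"}))       # → 0.5
print(confidence({"computer"}, {"software"}))  # → 0.666...
```

With min_sup = 50% and min_conf = 60%, the rule computer => software would therefore count as strong in this toy database.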


16. Define the anti-monotone property.
The Apriori principle holds due to the following property of the support measure:
for all itemsets X and Y: if X ⊆ Y, then s(X) ≥ s(Y)
The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

17. How to generate association rules from frequent itemsets?
Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them (where a strong association rule satisfies both minimum support and minimum confidence):
confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

18. Give a few techniques to improve the efficiency of the Apriori algorithm.
The techniques to improve the efficiency of the Apriori algorithm:
1. Hash-based itemset counting
2. Transaction reduction
3. Partitioning
4. Sampling
5. Dynamic itemset counting

19. Define iceberg query.
An iceberg query computes aggregates over one attribute or a set of attributes only for those groups whose aggregate value is above a certain threshold.
Example:
SELECT P.custID, P.itemID, SUM(P.qty)
FROM Purchase P
GROUP BY P.custID, P.itemID
HAVING SUM(P.qty) >= 10

20. Mention a few approaches to mining multilevel association rules.
Uniform support: the same minimum support threshold is used for all levels.
Reduced support: each level of abstraction has its own minimum support threshold.

21. What are multidimensional association rules?
Multidimensional association rules reference two or more dimensions.
Example: age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "high resolution TV")
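The join and prune steps from question 14 and the anti-monotone property from question 16 fit together in a short level-wise sketch. This is a minimal, unoptimized illustration (none of the efficiency techniques from question 18), with an assumed toy transaction list.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # Level-wise search: L1 from single items, then repeated join + prune.
    def frequent(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) >= min_sup}

    items = {frozenset([i]) for t in transactions for i in t}
    L, k, result = frequent(items), 1, set()
    while L:
        result |= L
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step (anti-monotone): every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L, k = frequent(candidates), k + 1
    return result

T = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(apriori(T, 2))
```

With min_sup = 2, all six itemsets of size 1 and 2 are frequent, while {a, b, c} appears only once and is eliminated, exactly as the anti-monotone property predicts.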

22. Define constraint-based association mining.
Constraint-based association rule mining aims to develop a systematic method by which the user can find important associations among items in a database of transactions. By doing so, the user can then figure out how the presence of some interesting items (i.e., items that are interesting to the user) implies the presence of other interesting items in a transaction.

UNIT III

1. Define the concept of classification.
Classification predicts categorical class labels (discrete or nominal). It classifies the data (constructs a model) based on the training set and the values (class labels) of the classifying attribute, and uses the model to classify new data.

2. Compare supervised and unsupervised learning.
Supervised learning: the training data are accompanied by labels indicating the class of the observations. New data are classified based on the training set.
Unsupervised learning: the class labels of the training data are unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

3. What are the methods used to improve the classification?
Data cleaning
Relevance analysis
Data transformation and reduction
Generalization

4. Compare classification and prediction methods.
Classification and prediction methods can be compared and evaluated according to the following criteria:
1. Accuracy: this refers to the ability of the classifier or predictor to correctly predict the class label of new or previously unseen data.
2. Speed: this refers to the computational costs involved in generating and using the given classifier or predictor.
3. Robustness: this is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
4. Scalability: this refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
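The accuracy criterion in question 4 can be computed directly as the fraction of correctly classified tuples. The class labels below are illustrative assumptions.

```python
def accuracy(true_labels, predicted):
    # Fraction of tuples whose predicted class matches the true class.
    hits = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return hits / len(true_labels)

# One of four test tuples is misclassified.
print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))  # → 0.75
```

In practice accuracy is estimated on a held-out test set, never on the training tuples themselves, to avoid an overly optimistic estimate.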


5. What is a decision tree?
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node.

6. What is an attribute selection measure?
An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition D of class-labeled training tuples into individual classes. It includes:
Information gain
Gain ratio

7. Describe tree pruning methods.
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches.

8. Define pre-pruning and post-pruning.
Pre-pruning: a tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples. If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted.
Post-pruning: this approach removes subtrees from a fully grown tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.

9. What is a Bayesian classifier?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

10. State Bayes' theorem.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Let X be a data tuple whose class label is unknown. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. Then the classification problem is determined by P(H|X), the probability that the hypothesis H holds given the evidence or observed data tuple X. By Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X).

11. How does the naive Bayesian classifier work?
The naive Bayes model is a simple and well-known method for performing supervised learning of a classification problem. The naive Bayesian classifier makes the assumption of class

conditional independence, i.e., given the class label of a tuple, the values of the attributes are assumed to be conditionally independent of one another.

12. Define genetic algorithm.
A genetic algorithm is an optimization search-type algorithm based on natural evolution and survival of the fittest. It represents an initial feasible solution and iteratively creates new, better solutions. It represents a solution as an individual, i.e., a string I = I1, I2, ..., In, where each Ij is called a gene. A set of individuals is called a population.

13. What is the k-nearest neighbor classifier?
k-nearest neighbor is a supervised learning algorithm where the result of a new instance query is classified based on the majority category of its k nearest neighbors. The purpose of this algorithm is to classify a new object based on attributes and training samples. The classifier does not fit any model and is based only on memory.

14. Define prediction.
Prediction models continuous-valued functions, i.e., it predicts unknown or missing values. It includes:
1. Linear regression
2. Non-linear regression
3. Other regression: generalized linear models, log-linear models

14. What is linear regression and non-linear regression?
Linear regression: linear regression is the simplest form of regression, in which the data are modeled using a straight line, Y = α + βX.
Non-linear regression: non-linear regression does not assume a linear relationship between the response and predictor variables; the response is modeled as a polynomial function of the predictors. A non-linear model can often be converted to a linear model by applying transformations to the variables.

15. Define clustering. (or) What do you mean by cluster analysis?
Clustering is finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

16. What are the requirements of clustering in data mining?
The requirements of clustering in data mining are:
1. Qualities of good clustering
2. Measuring the quality of clustering
3. Desirable properties of a clustering algorithm
4. Cluster analysis applications
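The straight-line model Y = α + βX from question 14 is fitted with the least-squares estimates for α and β. A minimal sketch with assumed toy data (the ys follow y = 2x + 1 exactly, so the fit is exact):

```python
def fit_line(xs, ys):
    # Least-squares estimates for Y = alpha + beta * X.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    return my - beta * mx, beta  # (alpha, beta)

alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # → 1.0 2.0
```

With noisy data the same estimates give the line minimizing the sum of squared residuals; multiple linear regression generalizes this to several predictor variables.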

17. What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are:
1. Interval-scaled variables
2. Binary variables
3. Nominal or categorical variables
4. Ordinal variables
5. Ratio-scaled variables

18. What are interval-scaled variables?
Interval-scaled variables use distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan and Minkowski distances.

19. Define binary variables. What are the two types of binary variables?
Binary variables are one type of data in cluster analysis. They include two types:
1. Symmetric binary variables
2. Asymmetric binary variables

20. Define nominal, ordinal and ratio-scaled variables.
Nominal: a generalization of the binary variable in that it can take more than two states.
Method 1: simple matching, d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables.
Method 2: use a large number of binary variables.
Ordinal variables: an ordinal variable can be discrete or continuous; order is important.
Ratio-scaled variables: a positive measurement on a non-linear scale, approximately at exponential scale, such as A exp(t) or A exp(-t).

21. Define partitioning methods.
Partitioning algorithms construct partitions of a database of N objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. In general, partitioning algorithms are non-hierarchical: each instance is placed in exactly one of k non-overlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters k.

22. What are the two types of hierarchical clustering methods?
Two basic varieties are commonly used to represent the process of hierarchical clustering:
1. Agglomerative hierarchical methods
2. Divisive hierarchical methods
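The Euclidean and Manhattan distances named in question 18 are both special cases of the Minkowski distance. A small sketch with assumed sample points:

```python
def minkowski(p, q, h):
    # Minkowski distance; h=1 gives Manhattan, h=2 gives Euclidean.
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

p, q = (1, 2), (4, 6)
print(minkowski(p, q, 1))  # Manhattan: |1-4| + |2-6| → 7.0
print(minkowski(p, q, 2))  # Euclidean: sqrt(9 + 16)  → 5.0
```

Partitioning methods such as k-means rely on exactly such a distance function to decide which cluster center each object is closest to.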


UNIT IV
PART-A

1. Define OLTP and OLAP systems.
OLTP (Online Transaction Processing):
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (Online Analytical Processing):
Major task of data warehouse systems
Data analysis and decision making

2. List the four views of data warehouse design.
The four views of data warehouse design are:
1. Top-down view
2. Data source view
3. Data warehouse view
4. Business query view

3. What is a data warehouse? How does it differ from a database? (or) Define data warehouse.
A data warehouse is an informational environment that provides an integrated and total view of the enterprise. Data warehousing is the process of constructing and using data warehouses. A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.

4. What is a data cube?
A data warehouse is based on a multidimensional data model which views data in the form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.

5. What are facts and dimensions? (or) What are dimension tables and fact tables?
Dimensions are perspectives or entities with respect to which an organization wants to keep records, such as time, item, branch and location.
Dimension tables, such as item(item_name, brand, type) or time(day, week, month, quarter, year), give further descriptions of the dimensions.
Fact tables contain measures (such as dollars_sold) and keys to each of the related dimension tables.

6. List out the various forms of dimensional models.
1. Star schema
2. Snowflake schema
3. Fact constellation

7. List out the components of the star schema with an example.
Star schema: a fact table in the middle connected to a set of dimension tables.

8. Explain the snowflake schema with an example.
Snowflake schema: a refinement of the star schema where some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

9. Explain the fact constellation with an example.
The dimensions in this schema are segregated into independent dimensions based on the levels of the hierarchy. For example, if geography has five levels of hierarchy, such as territory, region, country, state and city, the constellation schema would have five dimensions instead of one.
Multiple fact tables share dimension tables; this can be viewed as a collection of stars.
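The fact-table/dimension-table layout described above can be mimicked with plain dictionaries. All table contents and key values here are illustrative assumptions, not a real schema.

```python
# Toy star schema: one fact table joined to dimension tables by surrogate keys.
item_dim = {1: {"item_name": "laptop", "brand": "A", "type": "computer"}}
time_dim = {10: {"day": 1, "month": "Jan", "year": 2024}}

fact_sales = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 1200},
    {"item_key": 1, "time_key": 10, "dollars_sold": 800},
]

# Join facts to the item dimension and aggregate the measure per brand.
totals = {}
for row in fact_sales:
    brand = item_dim[row["item_key"]]["brand"]
    totals[brand] = totals.get(brand, 0) + row["dollars_sold"]
print(totals)  # → {'A': 2000}
```

In a snowflake schema the brand attribute would itself live in a separate, normalized table keyed from item_dim, at the cost of an extra join.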


10. Point out the major differences between the star schema and the snowflake schema.

STAR SCHEMA: A multidimensional data warehouse model implemented within a relational database. The model consists of a fact table and one or more dimension tables.

SNOWFLAKE SCHEMA: A variation of the star schema where some of the dimension tables linked directly to the fact table are further subdivided. This permits the dimension tables to be normalized, which in turn means less total storage.

11. How can we organize data cube measures?
Measures can be organized into three categories:
1. Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all of the data without partitioning. E.g., count(), sum().
2. Algebraic: if it can be computed by an algebraic function with M arguments, each of which is obtained by applying a distributive aggregate function. E.g., avg(), standard_deviation().
3. Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode().

12. What is a concept hierarchy?
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
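The three measure categories in question 11 can be demonstrated on two assumed data partitions: sum combines directly, avg combines via (sum, count), and median does not combine from per-partition medians.

```python
import statistics

# Distributive: sum over partitions equals sum over all the data.
part1, part2 = [1, 2], [3, 9, 10]
assert sum(part1) + sum(part2) == sum(part1 + part2)

# Algebraic: avg() is computable from two distributive values (sum, count).
def avg(sums, counts):
    return sum(sums) / sum(counts)
assert avg([sum(part1), sum(part2)], [len(part1), len(part2)]) == 5.0

# Holistic: the overall median cannot be combined from per-partition medians.
print(statistics.median(part1 + part2))                      # → 3
print([statistics.median(part1), statistics.median(part2)])  # → [1.5, 9]
```

This is why distributive and algebraic measures are cheap to precompute in a data cube, while holistic measures like median() need the full data (or an approximation).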


13. What are roll-up and drill-down?
Roll-up (drill-up): summarize data, by climbing up a hierarchy or by dimension reduction.
Drill-down (roll-down): the reverse of roll-up, moving from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.

14. Describe the function of slice and dice.
Slice and dice are main OLAP operations on multidimensional data. They mainly refer to the project and select operations.

15. Define rotate. (or) What is a pivot operation?
Reorienting the cube for visualization, e.g., from 3D to a series of 2D planes, is the main operation involved in rotate (pivot).

16. List out the steps of the warehouse design process.
The data warehouse design process:
a) Top-down, bottom-up approaches or a combination of both
b) From a software engineering point of view
c) Typical data warehouse design process

17. What is metadata?
Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse.

18. What are the contents of the metadata repository?
Metadata is the data defining warehouse objects. It contains:
i) Description of the structure of the warehouse
ii) Operational metadata
iii) The algorithms used for summarization
iv) The mapping from the operational environment to the data warehouse
v) Data related to system performance
vi) Business data

19. List out the functions of the back-end tools and utilities of a data warehouse.
The functions of the back-end tools and utilities of a data warehouse are:
a) Data extraction
b) Data cleaning
c) Data transformation
d) Load
e) Refresh
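The roll-up operation from question 13 can be sketched by climbing a location hierarchy from city to country and re-aggregating. The cube contents below are illustrative assumptions in the spirit of the classic sales example.

```python
# Roll-up: climb the concept hierarchy city -> country, re-aggregating sales.
sales = {("Chicago", "USA"): 854, ("New York", "USA"): 1087,
         ("Toronto", "Canada"): 818, ("Vancouver", "Canada"): 605}

def roll_up(cube):
    # Summarize by dropping the lower (city) level of the location dimension.
    out = {}
    for (city, country), amount in cube.items():
        out[country] = out.get(country, 0) + amount
    return out

print(roll_up(sales))  # → {'USA': 1941, 'Canada': 1423}
```

Drill-down is the reverse direction and cannot be computed from the summary alone; it requires the detailed (city-level) data to still be available.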


20. What are the usages of a data warehouse?
Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess closed-loop feedback system for enterprise management. Initially, the data warehouse is mainly used for generating reports and answering predefined queries.

21. Write short notes on the multidimensional data model.
A database designed for online analytical processing, structured as a multidimensional hypercube with one axis per dimension.

22. What are the types of OLAP?
Multidimensional OLAP (MOLAP) and relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

UNIT V
PART-A

1. What are the classifications of tools for data mining?
Classifications of tools for data mining:
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification by association rules

2. What are the areas in which data warehouses are used at present and in the future?
Enterprise warehouse: covers all areas of interest for an organization.
Data mart: covers a subset of corporate-wide data that is of interest for a specific user group (e.g., marketing).
Virtual warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed).

3. What are the other areas for data warehousing and data mining?
Games
Business
Science and engineering
Human rights
Medical data mining

Spatial data mining
Sensor data mining

4. Describe the use and application of DBMiner.
DBMiner has been developed for the interactive mining of multiple-level knowledge in large databases. The system implements a wide spectrum of data mining functionalities, including generalization, characterization, association, classification and prediction. The system provides a user-friendly, interactive data mining environment with good performance.

5. Give some data mining tools.
Spatial data mining
Multimedia data mining
Time-series data mining
Text database mining
World Wide Web mining

6. Mention some of the application areas of data mining.
Data mining has many and varied fields of application:
Data mining for financial data analysis
Data mining for the retail industry
Data mining for the telecommunication industry
Data mining for biological data analysis
Data mining in other scientific applications

7. Define text mining.
Text mining is the extraction of knowledge from large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, web pages, library databases, etc.

8. Define spatial mining.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.

9. Why do we need web mining?
To target potential customers for electronic commerce
To enhance the quality and delivery of Internet information services to end users
To improve web server system performance
To identify potential prime advertisement locations
To improve site design

10. Explain multimedia data mining.
A multimedia database system stores and manages a large collection of multimedia data such as audio, video, image, graphics, speech, text, document and hypertext data, which contain text, text markups and linkages. Extraction of information from multimedia databases is known as multimedia data mining.

11. Write some commercial data mining tools.
1. SAS Enterprise Miner
2. SPSS
3. IBM Intelligent Miner
4. Microsoft SQL Server 2005
5. Oracle Data Mining
6. STUDIO
7. KXEN

12. List out some of the challenges of the WWW.
Creating knowledge from the information available
Personalization of the information
Learning about customers/individual users
Finding the relevant information
Searching for web access patterns

