
Jayamatha Engineering College, Aralvaimozhi
CS1004 - Data Warehousing and Mining
UNIT-II Two Mark Questions

1. Why data preprocessing?

To identify incomplete, noisy, and inconsistent data and then clean them from the database or data warehouse using the data cleaning, integration, transformation, and reduction techniques of preprocessing. Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

2. What do you mean by data cleaning?
Data cleaning is a data preprocessing technique that fills in missing values, smooths out noise while identifying outliers, and corrects inconsistencies in the data.

3. What do you mean by data reduction?
Data reduction is a data preprocessing technique applied to obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.

4. What do you mean by data integration?
Data integration is a data preprocessing technique that merges data from multiple sources into a coherent data store.

5. What is meant by missing values?
Missing values are attribute values that are lacking for a tuple, or values absent for certain attributes of interest.

6. Define noisy data and give an example.
Noise is a random error or variance in a measured variable. For example, a recorded salary of -100 is noisy data.

7. How is data transformation done?
The data are transformed into forms appropriate for mining. Data transformation can involve the following processes:
- Smoothing
- Aggregation
- Generalization
- Normalization
- Attribute construction

8. Define min-max normalization.
Min-max normalization performs a linear transformation on the original data.
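Min-max normalization can be sketched as a short function; a minimal illustration (the income example values below are made up for demonstration, not from the question bank):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Illustrative: normalize an income of 73600 from [12000, 98000] onto [0, 1].
print(min_max_normalize(73600, 12000, 98000))  # about 0.716
```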

It maps a value v of attribute A to a new value v' as follows:

    v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

where minA and maxA are the minimum and maximum values of attribute A, and [new_minA, new_maxA] is the new range.

9. What is dimensionality reduction?
Dimensionality reduction reduces the data size by removing irrelevant or redundant attributes. It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

10. What is numerosity reduction?
Numerosity reduction reduces the data volume by choosing alternative, smaller forms of data representation. The techniques may be parametric or non-parametric. In parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data. Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.

11. What do you mean by entropy?
Entropy is one of the most commonly used discretization measures; it measures the impurity (information content) of a set of values and is used to choose good interval split points.

12. What are the various data mining primitives?
A data mining task can be specified in the form of a data mining query, which is the input to the data mining system. A data mining query is defined in terms of the following primitives:
- Task-relevant data
- The kinds of knowledge to be mined
- Background knowledge
- Interestingness measures
- Presentation and visualization of discovered patterns

13. What is meant by minable view?
The set of task-relevant data for data mining is called the minable view.

14. List out various interestingness measures.
- Simplicity (e.g., rule length)
- Certainty
- Utility

15. What are the various couplings available?
- No coupling
- Loose coupling
- Semi-tight coupling
- Tight coupling

16. Define concept description.
Concept description describes a given set of task-relevant data in a concise and summarative manner, presenting interesting general properties of the data. It consists of characterization and comparison.

17. Distinguish between concept description and online analytical processing (OLAP).
Concept description:
1) The measures can be applied to numeric, non-numeric, spatial, text, and other data.
2) Aggregation of attributes can include sophisticated data types.
3) It is a more automated process that helps the user find suitable dimensions.
OLAP:
1) The measures can be applied only to numeric data.
2) Aggregation of attributes is only for numeric data.
3) It is a purely user-controlled process.
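The entropy measure from question 11 can be computed directly from the class-label frequencies of a data set; a minimal sketch (the label values are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0 (maximally mixed two-class set)
```

Entropy-based discretization would evaluate candidate split points by how much they lower this value on the resulting partitions.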

18. What do you mean by a quantitative characteristic rule?
A generalized relation may be represented in the form of logic rules. Quantitative information, such as the percentage of data tuples that satisfy the left- and right-hand sides of the rule, should be associated with each rule. A logic rule that is associated with quantitative information is called a quantitative rule.

19. Define: Mean, Median, and Mode.
The mean is the measure of the center of a set of data: mean = (x1 + x2 + ... + xn) / n. The median is the middle value of the sorted set when the number of values n is odd; otherwise it is the average of the two middle values. The mode is the value that occurs most frequently in the set.

20. What do you mean by attribute-oriented induction?
It is a relational database query-oriented, generalization-based, online data analysis technique. It first collects the task-relevant data using a relational database query and then performs generalization based on the examination of the number of distinct values of

each attribute in the relevant set of data. The generalization is performed by attribute removal or attribute generalization.

21. What are the different graph displays?
- Histogram
- Quantile plot
- Quantile-quantile (q-q) plot
- Loess curve
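The three measures of center from question 19 are available in Python's standard statistics module; a quick illustration with made-up values:

```python
from statistics import mean, median, mode

data = [30, 36, 47, 50, 52, 52, 56]  # illustrative values, sorted, odd n

print(mean(data))    # (30 + 36 + ... + 56) / 7, about 46.14
print(median(data))  # 50, the middle (4th) of the 7 sorted values
print(mode(data))    # 52, the most frequent value
```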

22. What is the need for discretization in data mining?
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. This leads to a concise, easy-to-use, knowledge-level representation of mining results.

23. Define smoothing.
Smoothing is a data transformation technique that works to remove noise from the data.
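Equal-width discretization, one way to form the intervals described in question 22, can be sketched as follows (the function and age values are illustrative, not from the question bank):

```python
def equal_width_discretize(values, k):
    """Assign each value the index (0..k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # int() truncates to the interval index; the maximum is clamped into the last interval.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 46, 52, 70]
print(equal_width_discretize(ages, 3))  # intervals [13,32), [32,51), [51,70]
```

Interval labels such as "young", "middle-aged", and "senior" could then replace the interval indices, giving the concise, knowledge-level representation the answer describes.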