
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
School of Computing
Department of CSE
Academic Year: 2019-2020

Course: 1151CS114-Data warehousing and Data mining Sem: Summer Semester

Slot: S4&S5

Faculty: Florence S

Assignment – II

Course Outcomes (CO Nos., outcome, level of learning domain based on revised Bloom’s taxonomy)
CO3  Explain the concept of Data mining system and apply the various preprocessing techniques on large dataset.  K2
CO4  Apply Association rule mining, classification and clustering techniques to discover various mining patterns.  K3
CO5  Apply clustering techniques in various data mining applications.  K3

1. Diabetic prediction CO4 K3

Consider the diabetic dataset available at the following URL


https://archive.ics.uci.edu/ml/datasets/diabetes

Attribute Information:
Diabetes files consist of four fields per record. Each field is separated by a tab and each record is
separated by a newline.

File Names and format:


(1) Date in MM-DD-YYYY format
(2) Time in XX:YY format
(3) Code
(4) Value(class label)
• Apply the apriori algorithm to find the frequent itemsets in the above dataset
• Also apply the FP growth algorithm to find the frequent itemsets in the above dataset
• Compare the results of both algorithms
Use any one of the following languages for implementation
C, C++, JAVA, PYTHON, R
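
As a starting point for the Apriori part of this task, the following Python sketch groups the tab-separated records into transactions and mines them level by level. It assumes one transaction per date with the recorded codes as items, which is only one possible encoding. The records in `sample_records` are made up for illustration, and `records_to_transactions` and `apriori` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations


def records_to_transactions(lines):
    """Group records (date, time, code, value) into one set of codes per date.

    Assumption: a "transaction" is the set of codes recorded on one date.
    """
    by_date = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed records
        date, _time, code, _value = fields
        by_date[date].add(code)
    return list(by_date.values())


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets seen in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        # Prune step: keep a candidate only if every (k-1)-subset is frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count step: one pass over the transactions per level.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Made-up records in the four-field, tab-separated layout described above.
sample_records = [
    "04-21-1991\t09:09\t58\t100",
    "04-21-1991\t09:09\t33\t9",
    "04-21-1991\t09:09\t34\t13",
    "04-22-1991\t07:35\t58\t216",
    "04-22-1991\t07:35\t33\t10",
    "04-23-1991\t07:25\t58\t257",
    "04-23-1991\t07:25\t34\t13",
]
transactions = records_to_transactions(sample_records)
frequent_sets = apriori(transactions, min_support=2)
for itemset, count in sorted(frequent_sets.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(count, sorted(itemset))
```

An FP growth implementation run on the same transactions with the same minimum support should return the same itemsets and counts, so the comparison can concentrate on running time and memory use.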

2. Diabetic prediction CO5 K3

Consider the diabetic dataset available at the following URL


https://archive.ics.uci.edu/ml/datasets/diabetes

Attribute Information:
Diabetes files consist of four fields per record. Each field is separated by a tab and each record is
separated by a newline.

File Names and format:


(1) Date in MM-DD-YYYY format
(2) Time in XX:YY format
(3) Code
(4) Value(class label)
Download ORANGE data mining tool from the following link
https://orange.biolab.si/
Complete the following tasks using ORANGE TOOL
• Apply K-MEANS clustering
• Apply KNN algorithm
• Show frequent itemsets using apriori algorithm
3. LUNG CANCER PREDICTION CO3&CO4 K3
Download the lung cancer dataset available at the following URL
https://archive.ics.uci.edu/ml/datasets/Lung+Cancer
Download RAPID MINER data mining tool from the following link
https://rapidminer.com/get-started/
Complete the following tasks using RAPID MINER TOOL
• Preprocess the dataset to remove missing values and noise
• Apply the KNN algorithm to classify the data. Calculate the accuracy
• Classify the data in the dataset using a decision tree
• Apply back propagation to classify the data (Neural Network)

4. APPLICATION FOR APRIORI ALGORITHM CO4 K3

Create an application for generating frequent itemsets using the apriori algorithm. Your application
should work for different datasets.
Consider the following steps while creating the application
• Load the dataset files as text/csv files using input buttons (first page)
• Accept the minimum support as an input as well
• Your application should return frequent itemsets and association rules
• It should also calculate the support and confidence
Example dataset: (figure not reproduced)

Sample output format: (figure not reproduced)
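
A minimal sketch of the processing behind such an application, without the user interface, might look like the following. It assumes each CSV row is one transaction, that the minimum support is given as a transaction count, and that a minimum confidence is also supplied. For brevity, exhaustive level-by-level counting stands in for Apriori's candidate generation. All function names and the sample rows are hypothetical.

```python
import csv
from itertools import combinations


def read_transactions(lines):
    """Parse CSV text (a file object or any iterable of lines) into item sets."""
    transactions = []
    for row in csv.reader(lines):
        items = {cell.strip() for cell in row if cell.strip()}
        if items:
            transactions.append(frozenset(items))
    return transactions


def frequent_itemsets(transactions, min_count):
    """Count every itemset level by level; stop at the first level with no frequent set.

    Stopping early is safe because a frequent k-itemset needs frequent (k-1)-subsets.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                frequent[itemset] = count
                found = True
        if not found:
            break
    return frequent


def association_rules(frequent, n_transactions, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples.

    support    = count(itemset) / number of transactions
    confidence = count(itemset) / count(antecedent)
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, count / n_transactions, confidence)
                    )
    return rules


# Made-up rows; in the application these would come from the uploaded file.
rows = ["bread,milk", "bread,butter", "bread,milk,butter", "milk"]
transactions = read_transactions(rows)
frequent = frequent_itemsets(transactions, min_count=2)
rules = association_rules(frequent, len(transactions), min_confidence=0.8)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

Because `read_transactions` accepts any iterable of lines, the same function works on an opened file, for example `with open(path, newline="") as f: read_transactions(f)`, where `path` is whatever file the user selects.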


5. PREDICTION OF STUDENTS’ ACADEMIC PERFORMANCE CO4 K3
Download the STUDENT PERFORMANCE dataset available at the following URL
https://archive.ics.uci.edu/ml/datasets/student+performance
This data addresses student achievement in secondary education at two Portuguese schools. The
data attributes include student grades and demographic, social and school-related features, and the data was
collected by using school reports and questionnaires. Two datasets are provided regarding the
performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In
[Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification
and regression tasks. Important note: the target attribute G3 has a strong correlation with
attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period),
while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3
without G2 and G1, but such prediction is much more useful (see paper source for more details).
Consider the attributes available in the dataset and analyze the dataset for student
performance using the following operations
Download ORANGE data mining tool from the following link
https://orange.biolab.si/
• Predict the student academic performance with the naive Bayes classifier of the Orange tool
• Use a neural network to predict the class label

6. Use the ID3 algorithm for the following question. CO4 K3


• Build a decision tree from the given tennis dataset. You should build a tree to predict
PlayTennis, based on the other attributes (but do not use the Day attribute in your tree).
Show all of your work, calculations, and decisions as you build the tree. What is the
classification accuracy?
• Is it possible to produce some set of correct training examples that will get the algorithm
to include the attribute Temperature in the learned tree, even though the true target
concept is independent of Temperature? If no, explain. If yes, give such a set.
• Now, build a tree using only examples D1–D7. What is the classification accuracy for the
training set? What is the accuracy for the test set (examples D8–D14)? Explain why you
think these are the results.
• In this case, and others, there are only a few labelled examples available for training (that
is, no additional data is available for testing or validation). Suggest a concrete pruning
strategy, that can be readily embedded in the algorithm, to avoid overfitting. Explain why
you think this strategy should work.
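
ID3 chooses each split by information gain, where Entropy(S) = -sum_i p_i log2 p_i and Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v). A minimal Python sketch of these two calculations follows. The 14 rows are the widely used PlayTennis table from Mitchell's Machine Learning textbook, which is an assumption because the assignment's own table is not reproduced here; check the rows against the given dataset before relying on the numbers.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"

# Assumed rows D1..D14 (Outlook, Temperature, Humidity, Wind, PlayTennis).
RAW = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
rows = [dict(zip(ATTRIBUTES + [TARGET], values)) for values in RAW]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target=TARGET):
    """Entropy of the whole set minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {a: information_gain(rows, a) for a in ATTRIBUTES}
for attribute, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attribute}: {gain:.3f}")
```

ID3 places the attribute with the largest gain at the root and then repeats the same calculation on each branch's subset of rows. Passing `rows[:7]` instead of `rows` gives the gains for the D1–D7 training split used in the third bullet.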

7. CO4 K3
This question aims to give you a better understanding of frequent pattern mining and
closed/maximal pattern mining.
• Implement a frequent pattern mining algorithm (e.g., the Apriori algorithm or FP-Growth)
to mine the frequent patterns from a transaction dataset.

Sample Input 0
2
B A C E D
A C
C B D
Sample Output 0
3 [C]
2 [A]
2 [A C]
2 [B]
2 [B C]
2 [B C D]
2 [B D]
2 [C D]
2 [D]

Sample Input 1
2
data mining
frequent pattern mining
mining frequent patterns from the transaction dataset
closed and maximal pattern mining

Sample Output 1
4 [mining]
2 [frequent]
2 [frequent mining]
2 [mining pattern]
2 [pattern]
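
One way to reproduce the sample outputs is sketched below in Python. It reads the samples as follows, which is inferred rather than stated: the first input line is the minimum support count, every later line is one transaction of space-separated items, and the output is ordered by descending support and then alphabetically. The search itself is an Eclat-style depth-first intersection of transaction-id sets, used here as a compact alternative to Apriori or FP-Growth. The function names are hypothetical.

```python
def mine_frequent_patterns(lines):
    """lines[0] is the minimum support count; each later line is one transaction."""
    min_support = int(lines[0])
    transactions = [set(line.split()) for line in lines[1:] if line.strip()]

    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        """Grow prefix by each candidate item, keeping only frequent extensions."""
        for i, (item, tids) in enumerate(candidates):
            pattern = prefix + (item,)
            frequent[pattern] = len(tids)
            narrowed = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # transactions containing pattern + other
                if len(shared) >= min_support:
                    narrowed.append((other, shared))
            extend(pattern, narrowed)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent


def format_patterns(frequent):
    """Render patterns as 'support [items]', highest support first, then alphabetical."""
    ordered = sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{count} [{' '.join(pattern)}]" for pattern, count in ordered]


sample_input_0 = ["2", "B A C E D", "A C", "C B D"]
for line in format_patterns(mine_frequent_patterns(sample_input_0)):
    print(line)
```

Items are visited in alphabetical order, so every pattern comes out as an already sorted tuple and no itemset is generated twice.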

8. LUNG CANCER PREDICTION USING WEKA CO3&CO4 K3


Download the lung cancer dataset available at the following URL
https://archive.ics.uci.edu/ml/datasets/Lung+Cancer
Download WEKA data mining tool from the following link
https://weka.informer.com/3.9/
Complete the following tasks using WEKA TOOL
• Preprocess the dataset to remove missing values and noise
• Apply the KNN algorithm to classify the data. Calculate the accuracy
• Classify the data in the dataset using a decision tree
• Apply back propagation to classify the data (Neural Network)
9. Diabetic data classification

Consider the diabetic dataset available at the following URL


https://archive.ics.uci.edu/ml/datasets/diabetes

Diabetes files consist of four fields per record. Each field is separated by a tab and each record is
separated by a newline.

File Names and format:


(1) Date in MM-DD-YYYY format
(2) Time in XX:YY format
(3) Code
(4) Value(class label)
Download RAPID MINER data mining tool from the following link
https://rapidminer.com/get-started/
Complete the following tasks using RAPID MINER TOOL
• Apply K-MEANS clustering
• Apply KNN algorithm
• Show frequent itemsets using apriori algorithm
10. PREDICTION OF STUDENTS’ ACADEMIC PERFORMANCE CO4 K3
Download the STUDENT PERFORMANCE dataset available at the following URL
https://archive.ics.uci.edu/ml/datasets/student+performance
This data addresses student achievement in secondary education at two Portuguese schools. The
data attributes include student grades and demographic, social and school-related features, and the data was
collected by using school reports and questionnaires. Two datasets are provided regarding the
performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In
[Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification
and regression tasks. Important note: the target attribute G3 has a strong correlation with
attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period),
while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3
without G2 and G1, but such prediction is much more useful (see paper source for more details).
Consider the attributes available in the dataset and analyze the dataset for student
performance using the following operations
Download TANAGRA data mining tool from the following link
http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
• Predict the student academic performance with the naive Bayes classifier of the TANAGRA tool
• Use a neural network to predict the class label

11. HEART DISEASE PREDICTION CO3&CO4 K3

Consider the heart disease data set which is available in the following link
https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
It contains 14 attributes. The attribute num indicates the class label.
Download “Rapid miner” tool using the following link
https://rapidminer.com/get-started/
Use “RAPID MINER” data mining tool to complete the following tasks
1. Preprocess your dataset.
2. Generate association rules using the apriori algorithm
3. Mine the frequent patterns using FP growth.
4. Identify the class labels using decision tree induction
5. Identify the class labels with a multilayer feed-forward neural network

12. Apply different normalization techniques to the following datasets CO3 K2

• Apply min-max normalization to salary.
• Apply z-score normalization to salary.
• Implement min-max normalization using any one familiar language such as
C, C++, JAVA, PYTHON.
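
Both normalizations reduce to one formula each: min-max maps v to (v - min) / (max - min) * (new_max - new_min) + new_min, and z-score maps v to (v - mean) / standard deviation. A minimal Python sketch follows. The salary values are made up because the assignment's table is not reproduced here, and the z-score uses the population standard deviation, which is an assumption; use `statistics.stdev` instead if the sample standard deviation is expected.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Centre values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical salary column.
salaries = [20000, 30000, 40000, 60000, 100000]
min_max_salaries = min_max_normalize(salaries)
z_score_salaries = z_score_normalize(salaries)
for salary, mm, z in zip(salaries, min_max_salaries, z_score_salaries):
    print(salary, round(mm, 3), round(z, 3))
```

Min-max keeps every value inside a fixed range but is sensitive to outliers, while z-score has no fixed range and expresses each value as a number of standard deviations from the mean.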
