
Departamento de Electrónica, Telecomunicações e Informática, Universidade de Aveiro

EXPLORAÇÃO DE DADOS & DATA MINING

PROJECT ON CLASSIFIERS

The purpose of this study is to compare the performance of two classifiers using RapidMiner. The case studies are:

1. Wine Quality Data Set [1]: two data sets related to the red and white variants of the Portuguese "Vinho Verde" wine. Each data set has 12 attributes, and the output variable is a quality score between 0 and 10. The aim in this project is to discriminate good-quality wines (score >= 6) from the remaining wines (score < 6).

2. Parkinson [2]: a data set composed of a range of biomedical voice measurements from 31 people, 23 of them with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of the 195 voice recordings from these individuals ("name" column). The main aim is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

3. Semeion Handwritten Digit Data Set: 1593 handwritten digits from around 80 persons were scanned and stretched into a 16x16 rectangular box in a gray scale of 256 values. Each pixel of each image was then scaled to a Boolean (1/0) value using a fixed threshold. Each person wrote all the digits from 0 to 9 on paper, twice: the first time in the normal way (trying to write each digit accurately) and the second time quickly (with no attention to accuracy). The aim in this project is to discriminate the digit 5 from the other digits.

4. Blood Transfusion Service Center Data Set [3]: a set of 748 donors selected at random from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The attributes are R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

5. Mammographic Mass Data Set: this data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. The aim is to discriminate benign from malignant cases, assuming that all cases with a BI-RADS assessment greater than or equal to a given value (varying from 1 to 5) are malignant and the other cases are benign.

Please refer to the UCI Machine Learning Repository (URL below) for more information about the case studies listed above.

URL: http://archive.ics.uci.edu/ml/.
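Several of the case studies above turn a numeric or multi-class attribute into a binary target before classification. The project itself is to be carried out in RapidMiner; the Python sketch below is only an illustration of those labelling rules. The function names and the sample values are hypothetical, and only the thresholds (score >= 6, digit 5, a BI-RADS cutoff between 1 and 5) come from the descriptions above.

```python
def wine_label(quality, threshold=6):
    """Return 1 for a good-quality wine (score >= threshold), else 0."""
    return 1 if quality >= threshold else 0


def digit_label(digit, target=5):
    """Return 1 when the handwritten digit is the target digit, else 0."""
    return 1 if digit == target else 0


def birads_label(assessment, cutoff):
    """Return 1 (malignant) when the BI-RADS assessment is >= cutoff, else 0."""
    return 1 if assessment >= cutoff else 0


# Made-up sample values, used only to show the rules in action.
wine_scores = [3, 5, 6, 7, 8]
wine_labels = [wine_label(s) for s in wine_scores]

digits = [0, 5, 9, 5]
digit_labels = [digit_label(d) for d in digits]

# The same assessments labelled under two different cutoffs.
assessments = [2, 4, 5]
labels_cutoff_4 = [birads_label(a, 4) for a in assessments]
labels_cutoff_5 = [birads_label(a, 5) for a in assessments]

print(wine_labels)
print(digit_labels)
print(labels_cutoff_4, labels_cutoff_5)
```

For the mammographic case, repeating the labelling for each cutoff from 1 to 5 gives one binary problem per cutoff, which is one way to read "a given value (varying from 1 to 5)".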

The work will be developed in groups of two or three members and will consist of the following tasks, using one of the data sets listed above.

Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses about hidden information.

Data Preparation
The data preparation phase covers all activities needed to construct the final data set (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks may include record and attribute selection, as well as transformation and cleaning of data for the modeling tools.

Modeling
In this phase, various modeling (classification) techniques are selected and applied, and their parameters are calibrated to optimal values.

Evaluation
Analysis, interpretation and presentation of the main results in order to compare the performance of the classifiers adopted.

It is highly recommended to read the paper [4] for useful insights on the data understanding and data preparation tasks.

Deliverables
The projects must be presented on April 30. The duration of each presentation is 10 minutes. The slides prepared for the presentation must be sent by email (ana@ua.pt and jose.moreira@ua.pt) by April 30, 12 AM.

Project assignments

Option  No. of members  Case study id  Classifiers
1       2               1              Naive Bayes and KNN
2       2               1              Naive Bayes and SVM
3       3               1              Naive Bayes, decision trees and neural networks
4       2               2              Naive Bayes and KNN
5       2               2              Naive Bayes and SVM
6       3               2              Naive Bayes, decision trees and neural networks
7       2               3              Naive Bayes and KNN
8       2               3              Naive Bayes and SVM
9       3               3              Naive Bayes, decision trees and neural networks
10      2               4              Naive Bayes and KNN
11      2               4              Naive Bayes and SVM
12      3               4              Naive Bayes, decision trees and neural networks
13      2               5              Naive Bayes and KNN
14      2               5              Naive Bayes and SVM
15      2               5              Naive Bayes and neural networks
16      3               5              Naive Bayes, decision trees and neural networks

For more information on these steps refer to the CRISP-DM Process Model: http://www.crispdm.org/Process/index.htm.
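As a rough illustration of the Evaluation task described above, the following Python sketch compares two classifiers on the same test labels using counts from a confusion matrix. It is not part of the assignment, which is to be done in RapidMiner. The function names, the choice of accuracy, precision and recall as measures, and the label vectors are all assumptions made for the example, not results from any of the case studies.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted):
    """Return accuracy, precision and recall for the positive class."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test labels and predictions from two classifiers.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 0, 0, 0, 1, 1, 0]
pred_b = [1, 0, 0, 0, 0, 0, 1, 0]

report_a = summarize(actual, pred_a)
report_b = summarize(actual, pred_b)
print("classifier A:", report_a)
print("classifier B:", report_b)
```

In this made-up example both classifiers reach the same accuracy but differ in precision and recall, which suggests that a comparison based on more than one measure may be more informative than accuracy alone.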

References

[1] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47, 4 (November 2009), 547-553.
http://dx.doi.org/10.1016/j.dss.2009.05.016.

[2] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease, IEEE Transactions on Biomedical Engineering, vol. 56, April 2009.
http://precedings.nature.com/documents/2298/version/1/files/npre20082298-1.pdf or http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04636708

[3] I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. Knowledge discovery on RFM model using Bernoulli sequence. Expert Syst. Appl. 36, 3 (April, 2009), 5866-5871.
http://dx.doi.org/10.1016/j.eswa.2008.07.018.

[4] Jianchao Han, Juan C. Rodriguez, Mohsen Beheshti. Diabetes Data Analysis and Prediction Model Discovery Using RapidMiner, Future Generation Communication and Networking, pp. 96-99, 2008 Second International Conference on Future Generation Communication and Networking, 2008.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4734287.
