
Introduction

What is Text Categorization? "Text Categorization (TC) is a supervised learning task, defined as assigning category labels to new documents based on the likelihood suggested by a training set of labelled documents" (Yang and Liu).

Examples:

- News stories are typically organized by subject categories or topics; they are one of the most commonly investigated application domains in the TC literature.
- Spam filtering, where email messages are classified into the two categories of spam and non-spam.

Approaches

An increasing number of learning approaches have been applied, including:

- regression models
- nearest neighbor classification
- Bayesian probabilistic approaches
- decision trees
- inductive rule learning
- neural networks
- on-line learning
- support vector machines

Issues

While the rich literature on individual methods provides valuable information, clear conclusions about cross-method comparison have been difficult to draw, because the published results are often not directly comparable. For example, one cannot tell whether the performance of neural networks (NNet) by Wiener et al. is statistically better or worse than the performance of Support Vector Machines (SVM) by Joachims, because different data collections were used to evaluate those methods.

In this paper, the authors address these issues by conducting a controlled study on five well-known text categorization methods:

- Neural Network (NNet)
- Support Vector Machines (SVM)
- Naive Bayes (NB)
- k-Nearest Neighbor (kNN)
- Linear Least-Squares Fit (LLSF)

Specifically, this paper contains the following new contributions:

- Provides directly comparable results of the five methods on the benchmark corpus Reuters-21578.
- Proposes a variety of statistical significance tests for different standard performance measures, and suggests a way to jointly use these tests for cross-method comparison.
- Observes the performance of each classifier as a function of the training-set category frequency.

To make their evaluation results comparable to most published results in TC evaluations, the authors chose newswire story categorization as the task and Reuters-21578 as the corpus. They use the ApteMod version of Reuters-21578, which was obtained by eliminating unlabeled documents and selecting the categories that appear in both the training and test sets. Note: Reuters-21578 is the most widely used test collection for text categorization research.

This process resulted in 90 categories in both the training and test sets. After eliminating documents that do not belong to any of those 90 categories, they obtained a training set of 7769 documents, a test set of 3019 documents, and a vocabulary of 24240 unique words. The number of categories per document is 1.3 on average. The category distribution is skewed: the most common category has 2877 training instances, but 82% of the categories have fewer than 100 instances and 32% have fewer than 10 instances.

For evaluating the effectiveness of category assignments by classifiers to documents, they use the standard recall, precision, and F1 measures. Recall is the number of correct assignments by the system divided by the total number of correct assignments. Precision is the number of correct assignments by the system divided by the total number of system assignments. The F1 measure combines recall (r) and precision (p) with equal weight:

F1 = 2rp / (r + p)

These scores can be computed for the binary decisions on each individual category first and then averaged over categories; this is called macro-averaging. Alternatively, they can be computed globally over all n x m binary decisions, where n is the number of test documents and m is the number of categories in consideration; this is called micro-averaging. They also use error as an additional measure: the ratio of wrong assignments by the system to the total number of system assignments (n x m).
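As an illustration of the difference between the two averaging schemes, here is a minimal sketch in Python (the per-category confusion counts are invented for the example):

```python
# Micro- vs. macro-averaged F1 from per-category (tp, fp, fn) counts.
# The counts below are invented: one common category, one rare one.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

def macro_f1(counts):
    # Average the per-category F1 scores.
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)

def micro_f1(counts):
    # Pool the binary decisions over all categories first.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

counts = [(90, 10, 10), (1, 4, 4)]     # (tp, fp, fn) per category
print(round(micro_f1(counts), 3))      # -> 0.867
print(round(macro_f1(counts), 3))      # -> 0.55
```

The common category dominates the micro average, while the rare category drags the macro average down, which is exactly why the paper reports both and studies performance as a function of category frequency.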

Classifiers

SVM

Introduced by Vapnik in 1995 for solving two-class pattern recognition problems. The method is defined over a vector space where the problem is to find a decision surface that "best" separates the data points into two classes. In order to define the "best" separation the authors introduce the "margin" between two classes as shown in the next figures:

SVM (continued)

The SVM problem is to find the decision surface that maximizes the margin between the data points in a training set. For a linearly separable space, the decision surface found by SVM is a hyperplane, which can be written as:

w . x + b = 0

where x is an arbitrary data point, and the vector w and constant b are learned from a training set of linearly separable data.

SVM (continued)

Letting D = {(yi, xi)} denote the training set, with yi ∈ {±1} the classification of xi, the SVM problem is to find the w and b that satisfy the constraints

yi (w . xi + b) >= 1 for all i

while minimizing the 2-norm of w. This can be solved using quadratic programming techniques.
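The paper solves this quadratic program exactly; as a rough illustration of the same objective only, the sketch below minimizes the soft-margin hinge loss by subgradient descent on a toy 2-D set. The data, step size, and regularization constant are all invented, and this is not the authors' solver:

```python
# Illustrative sketch only -- not the exact QP solver from the paper.
# Subgradient descent on lam/2 * ||w||^2 + hinge loss finds a
# separating hyperplane w.x + b = 0 for toy 2-D data.

def train_linear_svm(data, lam=0.01, eta=0.1, epochs=200):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:
                # Subgradient of lam/2 * ||w||^2 + max(0, 1 - margin)
                w = [wi - eta * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += eta * y
            else:
                # Only the regularizer contributes here.
                w = [wi - eta * lam * wi for wi in w]
    return w, b

# Linearly separable toy set with labels yi in {+1, -1}
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1),
        ((-2.0, -2.0), -1), ((-3.0, -1.0), -1)]
w, b = train_linear_svm(data)
print(all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in data))
```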

kNN

kNN stands for k-nearest neighbor classification. The algorithm is simple: given a test document, the system finds the k nearest neighbors among the training documents and uses the categories of those k neighbors to weight the category candidates. The rule can be written as:

y(x, cj) = Σ_{di ∈ kNN} sim(x, di) y(di, cj) - bj

where y(di, cj) ∈ {0, 1} is the classification of training document di with respect to category cj; sim(x, di) is the similarity between the test document x and training document di; and bj is the category-specific threshold for binary decisions.
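A minimal sketch of this weighted-vote rule, assuming cosine similarity for sim and a zero threshold bj (the toy document vectors and category names are invented):

```python
# Sketch of the kNN category-scoring rule with cosine similarity.
# Threshold bj defaults to 0 just to illustrate the weighted vote.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_score(x, training, category, k=3, threshold=0.0):
    # training: list of (vector, set_of_categories) pairs
    neighbors = sorted(training, key=lambda d: cosine(x, d[0]),
                       reverse=True)[:k]
    # Sum the similarities of neighbors labeled with the category.
    score = sum(cosine(x, d) for d, cats in neighbors if category in cats)
    return score - threshold

docs = [([1.0, 0.0, 1.0], {"grain"}),
        ([1.0, 1.0, 0.0], {"grain"}),
        ([0.0, 1.0, 1.0], {"trade"})]
x = [1.0, 0.0, 0.9]
print(knn_score(x, docs, "grain") > knn_score(x, docs, "trade"))  # -> True
```

A positive score (above the tuned threshold bj) yields a YES decision for that category, so one test document can receive several categories, matching the multi-label nature of Reuters.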

LLSF

LLSF stands for Linear Least Squares Fit. A multivariate regression model is automatically learned from a training set of documents and their categories: the input vector is a document in the conventional vector space model, and the output vector consists of the categories of the corresponding document.

LLSF (continued)

By solving a linear least-squares fit on the training pairs of vectors, one can obtain a matrix of word-category regression coefficients:

F_LS = arg min_F ||FA - B||^2

where matrices A and B represent the training data (the input and output vectors, respectively), and the solution matrix F_LS defines a mapping from an arbitrary document to a vector of weighted categories.

NNet

NNet techniques have been intensively studied in artificial intelligence. In this experiment, a practical consideration is the training cost: training an NNet is usually much more time-consuming than training the other classifiers. It would be too costly to train one NNet per category, so the authors trained a single NNet for all 90 categories of Reuters.

NB

Naive Bayes (NB) probabilistic classifiers are commonly studied in machine learning. The basic idea in NB approaches is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document.
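A minimal multinomial Naive Bayes sketch along these lines, with Laplace smoothing (the toy documents and categories are invented; the paper's exact probability model may differ in details such as feature weighting):

```python
# Minimal multinomial Naive Bayes with Laplace (add-one) smoothing.
# Toy documents and category names are invented for illustration.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list_of_words, category) pairs
    cat_docs = Counter(cat for _, cat in docs)
    word_counts = defaultdict(Counter)
    for words, cat in docs:
        word_counts[cat].update(words)
    vocab = {w for words, _ in docs for w in words}
    return cat_docs, word_counts, vocab, len(docs)

def classify(words, model):
    cat_docs, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for cat in cat_docs:
        lp = math.log(cat_docs[cat] / n)          # log prior P(c)
        total = sum(word_counts[cat].values())
        for w in words:
            # Smoothed estimate of P(w | c)
            lp += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

docs = [(["wheat", "grain", "export"], "grain"),
        (["grain", "price"], "grain"),
        (["oil", "price", "crude"], "crude")]
model = train_nb(docs)
print(classify(["grain", "export"], model))   # -> grain
```

Working in log space avoids underflow when documents contain many words, which is the standard implementation trick for NB text classifiers.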

Significance Tests

Micro sign test (s-test): a sign test designed to compare two systems, A and B, based on their binary decisions on all the document/category pairs. They use the following notation: n = the number of document/category pairs on which A and B make different decisions, and k = the number of those pairs on which A is correct while B is not.

Null hypothesis: k = 0.5n, i.e., k has a binomial distribution Bin(n, p) with p = 0.5. Alternative hypothesis: k has a binomial distribution Bin(n, p) with p > 0.5, meaning that system A is better than B.

If n <= 12 and k >= 0.5n, the P-value is computed exactly using the binomial distribution under the null hypothesis:

P-value = sum for i = k to n of C(n, i) * 0.5^n

If n <= 12 and k < 0.5n, the P-value for the other extreme is computed using the formula:

P-value = sum for i = 0 to k of C(n, i) * 0.5^n

The P-value indicates the significance level of the observed evidence against the null hypothesis, i.e., that system A is better or worse than B.

If n > 12, the P-value can be computed using the standard normal approximation:

Z = (k - 0.5n) / (0.5 * sqrt(n))
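Both branches of the sign test can be computed directly; a sketch, assuming k counts the pairs where system A beats B out of the n pairs where they differ:

```python
# One-sided sign test P-value: exact binomial for small n,
# normal approximation for n > 12, as described above.
import math

def sign_test_pvalue(k, n):
    if n <= 12:
        if k >= 0.5 * n:
            # Exact upper tail under H0: Bin(n, 0.5)
            return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
        # Other extreme: exact lower tail
        return sum(math.comb(n, i) for i in range(0, k + 1)) / 2 ** n
    # Normal approximation: Z = (k - 0.5n) / (0.5 sqrt(n))
    z = (k - 0.5 * n) / (0.5 * math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)

print(round(sign_test_pvalue(9, 10), 4))    # -> 0.0107
print(round(sign_test_pvalue(60, 100), 4))  # -> 0.0228
```

For example, if A beats B on 9 of the 10 pairs where they differ, the exact P-value is 11/1024, which is significant at the 5% level.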

Macro sign test (S-test): a sign test designed to compare two systems, A and B, using the paired F1 values for individual categories. Here n = the number of categories on which the two systems' F1 scores differ, and k = the number of those on which system A's F1 score is higher.

The test hypothesis and the P-value computation are the same as those in the micro s-test.

Macro t-test (T-test): uses the same notation as defined for the S-test, with the following additional items:

Macro t-test after rank transformation (T'-test): compares two systems, A and B, based on the F1 values after rank transformation, in which the F1 values of the two systems on individual categories are pooled together and sorted, and then each value is replaced by its corresponding rank. The following notation is used:
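A sketch of the rank transformation itself, assuming the standard convention that tied values receive the average of their ranks:

```python
# Rank transformation for the T'-test: pool both systems' F1 scores,
# sort, and replace each value by its (tie-averaged, 1-based) rank.

def rank_transform(a, b):
    pooled = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        # Advance past the run of tied values starting at i.
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Mean of the ranks i+1 .. j for the tied run.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return [ranks[v] for v in a], [ranks[v] for v in b]

ra, rb = rank_transform([0.9, 0.5, 0.7], [0.6, 0.5, 0.8])
print(ra, rb)   # -> [6.0, 1.5, 4.0] [3.0, 1.5, 5.0]
```

An ordinary paired t-test on the ranked values is then less sensitive to outlier F1 scores than the plain T-test, which is the point of the transformation.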

Proportion test (p-test): for performance measures that are proportions, such as recall, precision, and error, the scores p_A and p_B of systems A and B are compared using:

Z = (p_A - p_B) / sqrt(2 p (1 - p) / n)

where p = (p_A + p_B) / 2 is the pooled estimate and n is the number of underlying binary decisions.

Again: the s-test and p-test are designed to evaluate performance at a micro level, based on the pooled decisions on individual document/category pairs, while the S-test, T-test and T'-test are designed to evaluate at a macro level, using the performance score on each category as the unit measure.

Evaluation

Experiment setup

- 1000 features were selected for NNet
- 2000 for NB
- 2415 for kNN and LLSF
- 10000 for SVM


Other settings were:

Results


- The micro-averaged F1 score for SVM (.8599) is slightly lower than Joachims's results (.860-.864).
- The micro-averaged F1 score for kNN (.8567) is higher than Joachims's (.823).
- The micro-averaged F1 score for NB (.7956) is higher than Joachims's (.720).

Cross-classifier Comparison


The micro-level analysis on pooled binary decisions suggests SVM > kNN >> {LLSF, NNet} >> NB, where classifiers whose performance differences are insignificant are grouped into one set. The macro-level analysis on the F1 scores suggests {SVM, kNN, LLSF} >> {NB, NNet}.


Conclusions

