
Spam Recognition using Linear Regression and Radial Basis Function Neural Network 513


Spam Recognition using Linear Regression and
Radial Basis Function Neural Network

Tich Phuoc Tran 1, Min Li 1, Dat Tran 2 and Dam Duong Ton 3

1 Centre for Quantum Computation and Intelligent Systems (QCIS)
University of Technology, Sydney, NSW 2007, Australia
{tiptran, minli}@it.uts.edu.au

2 Faculty of Information Sciences and Engineering
University of Canberra, ACT 2601, Australia
dat.tran@canberra.edu.au

3 Faculty of Computer Science
University of Information Technology, VNU-HCMC, Vietnam
damdt@uit.edu.vn

Keywords: Spam Recognition, Radial Basis Function, Linear Regression

Abstract:
Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages.
It is becoming a serious problem for organizations and individual email users due to the
growing popularity and low cost of electronic mail. Unlike other web threats such as
hacking and Internet worms, which directly damage our information assets, spam can
harm computer networks in an indirect way, ranging from network problems like
increased server load, decreased network performance and viruses to personnel issues like
lost employee time, phishing scams, and offensive content. Though a large amount of
research has been conducted in this area to prevent spamming from undermining the
usability of email, the performance of currently existing filtering methods still suffers from
extensive computation (with the large volume of emails received) and unreliable predictive
capability (due to the highly dynamic nature of emails). In this chapter, we discuss the
challenging problems of Spam Recognition and then propose an anti-spam filtering
framework, in which appropriate dimension reduction schemes and powerful classification
models are employed. In particular, Principal Component Analysis transforms data to a
lower dimensional space which is subsequently used to train an Artificial Neural Network
based classifier. A cost-sensitive empirical analysis with a publicly available email corpus,
namely Ling-Spam, suggests that our spam recognition framework outperforms other
state-of-the-art learning methods in terms of spam detection capability. In the case of extremely
Pattern Recognition 514

high misclassification cost, while the performance of other methods deteriorates significantly as
the cost factor increases, our model still maintains stable accuracy with low computation cost.

1. Introduction

Email is widely accepted by the business community as a low cost communication tool to
exchange information between business entities which are physically distant from one
another. It minimizes the cost of organizing an in-person meeting. According to a recent
survey by SurePayroll (SurePayroll, 2007), over 80% of small business owners believe email is a
key to the success of their business, and most people today spend between 20% and 50% of
their working time using email, including reading, sorting and writing emails. Due to the
very low cost of sending email, one could send thousands of malicious email messages each
day over an inexpensive Internet connection. These junk emails, referred to as spam, can
severely reduce staff productivity, consume significant network bandwidth and lead to
service outages. In many cases, such messages also cause exposure to viruses, spyware and
inappropriate content that can create legal/compliance issues and loss of personal information
and corporate assets. Therefore, it is important to accurately estimate the costs associated with
spam and evaluate the effectiveness of countermeasures such as spam-filtering tools.
Though such spam prevention capability is implemented in existing email clients, there are
some barriers that discourage users from utilizing this feature, including error-prone and
labor-intensive maintenance of filtering rules. Many researchers have developed different
automatic spam detection systems, but most of them suffer from low accuracy and a high false
alarm rate due to the huge volume of emails, the wide spectrum of spamming topics and
the rapidly changing content of these messages, especially in the case of high misclassification
cost (Bayler, 2008). To deal with such challenges, this chapter proposes an anti-spam
filtering framework using a highly performing Artificial Neural Network (ANN) based
classifier. ANN is widely considered a flexible "model-free" or "data-driven" learning
method that can fit training data very well and thus reduce learning bias (how well the
model fits the available sample data). However, ANNs are also susceptible to the overfitting
problem, which can increase generalization variance, i.e. make the predictive model
unstable for unseen instances. This limitation can be overcome by combining ANN with a
simple Linear Regression algorithm, which makes the resulting classification model a stable
semi-parametric classifier. Such model combination aims at stabilizing non-linear learning
techniques while retaining their data fitting capability. Empirical analysis with the Ling-Spam
benchmark confirms our superior spam detection accuracy and low computation cost
in comparison with other existing approaches.
This chapter is organized as follows. Firstly, an overview of the spam problem is presented
with its associated negative impacts and protection techniques. This is followed by the
application of Machine Learning (ML) to spam recognition, related works and details of the
Ling-Spam corpus. Next, a brief review of several commonly used classification models and
our proposed framework is given. The subsequent section compares the performance of our
method with other learning techniques using the benchmark corpus under different cost
scenarios. Finally, we provide some concluding remarks for this chapter and future research
directions.


2. Spam Recognition and Machine Learning techniques
2.1. Spam Recognition as a Challenging Task

2.1.1. Overview of Spamming
Spamming is one of the biggest challenges facing Internet consumers, corporations, and
service providers today. Email spamming, also known as Unsolicited Bulk Email (UBE) or
Unsolicited Commercial Email (UCE), is the practice of sending unwanted email messages,
frequently with commercial content, in large quantities to an indiscriminate set of recipients
(Schryen, 2007). Due to the increasing popularity and low cost of email, more and
more spam is circulating over the Internet. Spam emails were found to account for
approximately 10% of the incoming messages to corporate networks (L. F. Cranor &
LaMacchia, 1998) and currently cost businesses US$13 billion annually (Swartz, 2003). Besides
wasting time and consuming bandwidth, undesired emails are extremely annoying to
most users due to their unsuitable content, ranging from advertising vacations to
pornographic materials.
Spam is usually classified into two categories which have different effects on Internet users.
Cancellable Usenet spam is a single message sent to many Usenet newsgroups (Wolfe, Scott, &
Erwin, 2004). This spamming attack can overwhelm users with a barrage of advertising
or other irrelevant posts. It also subverts the ability of system administrators and group
owners to manage the topics they accept on their systems. The second type of spam is Email
spam, which targets individual users with direct mail messages (Bayler, 2008). Besides
causing loss of productivity for email users, it also costs money for Internet Service
Providers (ISP) and online services to transmit spam, and these costs are transferred directly
to other subscribers. Though there are different types of spam, they all share some common
properties. First, the sender's identity and address are concealed. Second, spam emails are
sent to a large number of recipients and in high quantities. In fact, spam is economically
viable to its senders not only because it is low cost to send an email, but also because
spammers have no operating costs beyond the management of their mailing lists. This
attracts numerous spammers, and the volume of unsolicited mail has become very high.
Finally, spam messages are unsolicited, that is, the individuals receiving spam would
otherwise not have opted to receive it.
To successfully send a spam message, spammers usually undertake two steps: (1) collecting
target email addresses and (2) bypassing anti-spam measures (Schryen, 2007). The latter task
involves cleverly disguising an unsolicited message as a non-spam message with
normal-looking subject lines and other ways of getting around anti-spam software. The first task
seems easier but in fact can be very challenging. To collect valid email addresses of
potential target victims, the following techniques can be deployed (Bayler, 2008):
• Buying lists of addresses from some companies or other spammers.
• Harvesting email addresses from web sites or UseNet News posts with automated
programs.
• Stealing users' address books on compromised computers.
• Collecting addresses via Internet Relay Chat (IRC) programs.
• Guessing email addresses, then sending email to see if it goes through (Directory
Harvest attacks).
• Using false reasons to trick a user into giving up their email address (Social Engineering
attacks).

Though spam emails are troublesome, most of them can be easily recognized by human
users due to their obvious signatures. For example, spam emails normally relate to specific
topics such as prescription drugs, get-rich-quick schemes, financial services, qualifications,
online gambling, and discounted or pirated software. However, with a huge volume of spam
messages received every day, it would not be practical for human users to detect spam by
reading all of them manually. Furthermore, spam sometimes comes disguised, with a
subject line that reads like a personal message or a non-delivery message. This makes highly
accurate spam detection software desirable for countering spam.

2.1.2 Impacts of Spamming and Preventive Techniques
Even though spam does not threaten our data in the same way that viruses do, it does cost
businesses billions of dollars worldwide. Several negative impacts of spam are listed as
follows (Schryen, 2007):
• Spam is regarded as an invasion of privacy because spammers illegally collect victims' email
addresses (considered personal information).
• Unsolicited emails irritate Internet users.
• Non-spam emails are missed and/or delayed. Sometimes, users may easily overlook or
delete critical emails, confusing them with spam.
• Spam wastes staff time and thereby significantly reduces enterprises' productivity.
• Spam uses a considerably large bandwidth and uses up database capacity. This causes
serious loss of Internet performance and bandwidth.
• Some spam contains offensive content.
• Spam messages can come attached with harmful code, including viruses and worms
which can install backdoors in receivers' systems.
• Spammers can hijack other people's computers to send unwanted emails. These
compromised machines are referred to as "zombie networks", networks of virus- or
worm-infected personal computers in homes and offices around the globe. This ensures
spammers' anonymity and massively increases the number of spam messages that can be
sent.
Various counter-measures to spam have been proposed to mitigate the impacts of
unsolicited emails, ranging from regulatory to technical approaches. Though anti-spam legal
measures are gradually being adopted, their effectiveness is still very limited. A more direct
counter-measure is software-based anti-spam filters, which attempt to separate spam from
legitimate mail automatically. Most existing email software packages are equipped
with some form of programmable spam filtering capability, typically in the form of
blacklists of known spammers (i.e. block emails that come from a black list, check whether
emails come from a genuine domain name or web address) and handcrafted rules (i.e. block
messages containing specific keywords and unnecessary embedded HTML code). Because
spammers normally use forged addresses, the blacklist approach is very ineffective.
Handcrafted rules are also limited due to their reliance on personal preferences, i.e. they
need to be tuned to the characteristics of messages received by a particular user or group of
users. This is a time-consuming task requiring resources and expertise, and it has to be
repeated periodically to account for the changing nature of spam messages (L. F. Cranor &
LaMacchia, 1998).
Spam detection is closely related to Text Categorization (TC) due to the text-based content
of emails and the similarity of the tasks. However, unlike most TC problems, it is the act of
blindly mass-mailing an unsolicited message that makes it spam, not its actual content (Schryen, 2007):
any otherwise legitimate message becomes spam if blindly mass-mailed. From this point of
view, spamming becomes a very challenging problem for the sustainability of the Internet,
given that the content of emails is the only foundation for spam recognition. Nevertheless, it
seems that the language of current spam messages constitutes a distinctive genre, and that
the topics of most current spam messages are rarely mentioned in legitimate messages,
making it possible to successfully train a text classifier for spam recognition.

2.2. Machine Learning for Spam Recognition
Recent advances of Machine Learning (ML) techniques in Text Classification (TC) have
attracted immense attention from researchers to explore the applicability of learning
algorithms in anti-spam filtering (Bayler, 2008). In particular, a collection of messages is
input to a learning algorithm which infers the underlying functional dependencies of relevant
features. The result of this process is a model that can, without human intervention, classify
a new incoming email as spam or legitimate according to the knowledge collected during the
training stage. Apart from automation, which frees organizations from the need to manually
classify a huge amount of messages, this model can be retrained to capture new
characteristics of spam emails. To be most useful in real world applications, anti-spam
filters need to have a good generalization capability, that is, they can detect malicious
messages which never occurred during the learning process. There has been a great deal of
research conducted in this area, ranging from simple methods such as the propositional learner
Ripper with keyword-spotting rules (Cohen, 1996) to more complicated approaches such
as Bayesian networks using a bag of words representation and binary coding (Sahami, 1998).
In (Androutsopoulos, Koutsias, Chandrinos, Paliouras, & Spyropoulos, 2000), a system
implementing Naive Bayes and a k-NN technique is reported to outperform the
keyword-based filter of Outlook 2000 on the Ling-Spam corpus. Ensemble methods have also
proved their usefulness in filtering spam. For example, stacked Naive Bayes and k-NN can
achieve good accuracy (Drucker, Wu, & Vapnik, 1999), and Boosted trees were shown to
have better performance than individual trees, Naive Bayes and k-NN alone (Carreras &
Marquez, 2001). A support vector machine (SVM) (Drucker et al., 1999) is also reported to
achieve a higher detection rate as well as a lower false alarm rate for spam recognition
compared with other discriminative classification methods. It is suggested that email
headers play a vital role in spam recognition, and that, to get better results, classifiers should be
trained on features of both email headers and email bodies (Androutsopoulos et al., 2000).

2.3 Ling-Spam Benchmark
The Ling-Spam corpus (Androutsopoulos et al., 2000) is used as a benchmark to evaluate
our proposed algorithm against other existing techniques on the anti-spam filtering task. Using
this publicly available dataset, we can conduct tractable experiments and also avoid the
complications of privacy issues. While spam messages do not pose this problem, as they are
blindly distributed to a large number of recipients, legitimate email messages may contain
personal information and cannot usually be released without violating the privacy of their
recipients and senders.

The corpus contains legitimate messages collected from a moderated mailing list on the
profession and science of linguistics, and spam messages collected from personal
mailboxes:
• 2412 legitimate messages, with text added by the list's server removed.
• 481 spam messages (duplicate spam messages received on the same day excluded).
The headers, HTML tags, and attachments of these messages are removed, leaving only the
subject line and body text. The distribution of the dataset (16.6% is spam) makes it easy to
identify legitimate emails because of the topic-specific nature of the legitimate mails. This
dataset is partitioned into 10 stratified subsets which maintain the same ratio of legitimate
and spam messages as in the entire dataset. Though some research (Androutsopoulos et al.,
2000; Carreras & Marquez, 2001; Hsu, Chang, & Lin, 2003; Sakkis et al., 2003) has been
conducted on this data showing comparative efficiency, most of these methods suffer from a
high false alarm rate, which results in degraded performance when the misclassification cost is
high. Overcoming this problem is our major objective in this chapter.

3. Spam Recognition Methods

This section discusses commonly used learning algorithms for spam recognition problems.

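3.1. Naive Bayes
Naive Bayes is a well-known probabilistic classification algorithm which has been used
widely for spam recognition (Androutsopoulos et al., 2000). According to Bayes' theorem,
we can compute the probability that a message with vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$
belongs to a class $c$:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c)\, P(\vec{X} = \vec{x} \mid C = c)}{\sum_{k} P(C = k)\, P(\vec{X} = \vec{x} \mid C = k)}$$

The calculation of $P(\vec{X} = \vec{x} \mid C = c)$ is problematic because almost all novel messages are
different from the training messages. Therefore, instead of calculating probabilities for messages
(a combination of words), we can consider their words separately. By making the
assumption that $X_1, \ldots, X_n$ are conditionally independent given the class $c$, we have:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k} P(C = k) \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}$$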
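As a minimal sketch of the independence assumption above, the following Python code estimates the class priors and per-word probabilities from binary feature vectors and then normalises over the classes. The function names, the toy data and the add-one (Laplace) smoothing are assumptions of this sketch, not details taken from the chapter.

```python
import math

def train_naive_bayes(vectors, labels):
    """Estimate P(C=c) and P(X_i=1 | C=c) from binary feature vectors.

    Add-one (Laplace) smoothing is an assumption of this sketch; it keeps
    every conditional probability strictly between 0 and 1.
    """
    n_features = len(vectors[0])
    priors, cond = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        priors[c] = len(rows) / len(vectors)
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n_features)]
    return priors, cond

def posterior(x, priors, cond):
    """Return P(C=c | X=x) for every class c under the independence assumption."""
    log_joint = {}
    for c, prior in priors.items():
        total = math.log(prior)
        for x_i, p in zip(x, cond[c]):
            total += math.log(p if x_i else 1.0 - p)
        log_joint[c] = total
    # Normalise over the classes (the denominator of Bayes' theorem).
    peak = max(log_joint.values())
    unnorm = {c: math.exp(v - peak) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Toy data (hypothetical): feature 0 = "free", feature 1 = "linguistics".
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
y = ["spam", "spam", "legit", "legit", "legit"]
priors, cond = train_naive_bayes(X, y)
probs = posterior([1, 0], priors, cond)
```

Working in log space avoids numerical underflow when the product runs over many words; the final normalisation plays the role of the denominator in the formula.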


3.2. Memory Based Learning
In (Androutsopoulos et al., 2000), an anti-spam filtering technique using Memory-Based
Learning (MBL) is described that simply stores the training messages. The test messages are then
classified by estimating their similarity to the stored examples based on the overlap metric,
which counts the attributes where the two messages have different values. Given two
instances $\vec{x}_i = \langle x_{i1}, \ldots, x_{in} \rangle$ and $\vec{x}_j = \langle x_{j1}, \ldots, x_{jn} \rangle$, their overlap distance is:

$$d(\vec{x}_i, \vec{x}_j) = \sum_{r=1}^{n} \delta(x_{ir}, x_{jr})$$

where $\delta(x, y) = 0$ if $x = y$, or 1 otherwise.

The confidence level that a message belongs to a class c is calculated based on the classes
of its neighbor instances:

MBL's performance can be significantly improved by introducing some weighting schemes.

3.3. Distance Weighting
Depending on how far a test instance is from its neighborhood, its confidence level is
estimated:


3.3.1 Attribute Weighting
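Unlike the basic k-neighborhood classifiers where all attributes are treated equally, MBL
assigns different weights to the attributes, depending on how well they discriminate
between the categories, and adjusts the distance metric accordingly. In particular, an attribute
$X$ has a weight $w_X$ equal to the difference between the entropy $H(C)$ (the uncertainty on the
category $C$ of a randomly selected instance) and the expected value of the entropy $H(C \mid X)$
(the uncertainty on the category $C$ given the value of attribute $X$). This means an attribute has a
higher weight if knowing its value reduces the uncertainty on category $C$.

$$w_X = H(C) - \sum_{x} P(X = x)\, H(C \mid X = x)$$

where

$$H(C) = -\sum_{c} P(C = c) \log_2 P(C = c), \qquad H(C \mid X = x) = -\sum_{c} P(C = c \mid X = x) \log_2 P(C = c \mid X = x)$$

The distance between two instances is recalculated as below, where $w_r$ is the weight of the
r-th attribute and $\delta$ is the overlap indicator defined for the overlap metric:

$$d(\vec{x}_i, \vec{x}_j) = \sum_{r=1}^{n} w_r\, \delta(x_{ir}, x_{jr})$$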
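The attribute weighting described above can be sketched as follows: each attribute is weighted by how much knowing its value reduces the entropy of the class, and the overlap distance then sums the weights of the attributes on which two instances differ. The function names and the toy data are hypothetical, and entropies are measured in bits as an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total)
                for k in Counter(labels).values())

def information_gain(column, labels):
    """H(C) minus the expected entropy of C once the attribute value is known."""
    total = len(labels)
    expected = 0.0
    for value in set(column):
        subset = [y for v, y in zip(column, labels) if v == value]
        expected += (len(subset) / total) * entropy(subset)
    return entropy(labels) - expected

def weighted_overlap(a, b, weights):
    """Sum the weights of the attributes on which the two instances differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

# Toy data (hypothetical): attribute 0 separates the classes, attribute 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = ["spam", "spam", "legit", "legit"]
weights = [information_gain([row[i] for row in X], y) for i in range(2)]
plain = weighted_overlap([1, 0], [0, 1], [1.0, 1.0])   # unweighted overlap metric
weighted = weighted_overlap([1, 0], [0, 1], weights)
```

In this toy case the uninformative attribute receives zero weight, so a mismatch on it no longer contributes to the distance.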


3.4. Boosted Decision Tree
Boosted Tree (BT) is a popular method implemented in many anti-spam filters with great
success (Carreras & Marquez, 2001). It uses the ADA-Boost algorithm (Schapire & Singer,
2000) to generate a number of Decision Tree classifiers which are trained on different
sample sets drawn from the original training set. Each of these classifiers produces a
hypothesis from which a learning error can be calculated. When this error exceeds a certain
level, the process is terminated. A final composite hypothesis is then created by combining
the individual hypotheses.

3.5. Support Vector Machine
Support Vector Machines (SVM) (Drucker et al., 1999) have become one of the popular
techniques for text categorization tasks due to their good generalization and their
ability to overcome the curse of dimensionality. SVM classifies data by a set of
representative support vectors. Assume that we want to find a discriminant function $f(\vec{x})$
such that $y_i = f(\vec{x}_i)$. A possible linear discriminant function can be presented as
$f(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x} + b)$, where $\vec{w} \cdot \vec{x} + b = 0$ is a separating hyperplane in the data space.
Consequently, choosing a discriminant function amounts to finding a hyperplane having the
maximum separating margin with respect to the two classes. An SVM model is constructed
by solving this optimization problem.

3.6. Artificial Neural Network
Artificial neural network (ANN) has gained strong interest from diverse communities due
to its ability to identify patterns that are not readily observable. Multi-Layer Perceptron
(MLP) is the most popular neural network architecture in use today. This network uses a
layered feed-forward topology in which the units each perform a biased weighted sum of
their inputs and pass this activation level through a transfer function to produce their
output (Rumelhart & McClelland, 1986). Though many applications have implemented MLP
for its superior learning capacity, its performance is unreliable when new data is encountered.
A recently emerging branch of ANN, the RBF networks, is also reported to have achieved great
success in diverse applications. In this chapter, MLP is implemented as a typical ANN
model for spam recognition.

4. Classification Framework for Spam Recognition

Although letting undetected spam pass through a filter is not as dangerous as blocking a
legitimate message, one can argue that among one million incoming emails, a few thousand
unsolicited messages that are misclassified as normal are still very costly. Hence, anti-spam
filters really need to be accurate, especially when they are used in large organizations.
Advanced ML techniques have been used to improve the performance of spam filtering
systems. Amongst those methods, Artificial Neural Network (ANN) has been gaining strong
interest from diverse communities due to its ability to identify patterns that are not
readily observable. Despite recent successes, ANN based applications still have some
disadvantages such as extensive computation and unreliable performance. In this study, we
use a Modified Probabilistic Neural Network (MPNN), which was developed by Zaknich
(Zaknich, 1998).
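If there exists a corresponding scalar output $y_i$ for each local region (cluster), which is
represented by a center vector $\vec{c}_i$, MPNN can be modeled as follows (Zaknich, 2003):

$$\hat{y}(\vec{x}) = \frac{\sum_{i=1}^{M} Z_i\, y_i\, f_i(\vec{x})}{\sum_{i=1}^{M} Z_i\, f_i(\vec{x})}$$

with the Gaussian function $f_i(\vec{x}) = \exp\left( -\frac{(\vec{x} - \vec{c}_i)^T (\vec{x} - \vec{c}_i)}{2\sigma^2} \right)$
where
$\vec{c}_i$ = center vector for cluster $i$ in the input space
$y_i$ = scalar output related to $\vec{c}_i$
$Z_i$ = number of input vectors within cluster $\vec{c}_i$
$\sigma$ = single smoothing parameter chosen during network training
$M$ = number of unique centers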
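A minimal sketch of a network of this kind is given below, assuming the usual MPNN form in which the prediction is an average of the cluster outputs weighted by the cluster sizes and by Gaussian kernels of the distance to each center. The function names, cluster centers, outputs and smoothing value are hypothetical.

```python
import math

def gaussian(x, center, sigma):
    """Radial basis value exp(-||x - c||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mpnn_predict(x, centers, outputs, counts, sigma):
    """Count-weighted, kernel-weighted average of the cluster outputs."""
    kernels = [z * gaussian(x, c, sigma) for c, z in zip(centers, counts)]
    denom = sum(kernels)
    if denom == 0.0:
        raise ValueError("input is too far from every center for this sigma")
    return sum(k * y for k, y in zip(kernels, outputs)) / denom

# Hypothetical clusters: one around legitimate mail (output 0), one around spam (output 1).
centers = [[0.0, 0.0], [1.0, 1.0]]
outputs = [0.0, 1.0]
counts = [3, 1]
near_spam = mpnn_predict([1.0, 1.0], centers, outputs, counts, sigma=0.3)
midpoint = mpnn_predict([0.5, 0.5], centers, outputs, counts, sigma=0.3)
```

At the midpoint both kernels are equal, so the prediction reduces to the count-weighted mean of the outputs (0.25 here). The explicit error for a zero denominator illustrates why such a model is unreliable for inputs far from all training clusters.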
Though MPNN is reported to provide acceptable accuracy and affordable computation, it,
just like other ANNs, cannot classify reliably when an unusual input which differs from its
training data emerges. As a result, it is essential that some degree of generalization capacity
be incorporated in MPNN based classifiers. A possible approach to this problem is
to combine MPNN with a linear model, which offers stability, analyzability and fast
adaptation (Hayes, 1996).

4.1 Description
Figure 1 shows the overall filtering framework proposed for the spam recognition problem.
There are 4 main phases.


Fig. 1. Proposed anti-spam filtering framework

4.1.1. Phase 1: Data Representation and Preprocessing
The purpose of data preprocessing is to transform messages in the mail corpus into a
uniform format that can be understood by the learning algorithms. Features found in mails
are normally transformed into a vector space in which each dimension of the space
corresponds to a given feature in the entire corpus. Each individual message can then be
viewed as a feature vector. This is referred to as the bag of words approach. There are two
methods to represent elements of the feature vector: (1) multi-variate representation assigns a
binary value to each element, showing whether or not the word occurs in the current mail, and (2)
multi-nomial representation represents each element as a number that shows the occurrence
frequency of that word in the current mail. A combination of bag of words and
multi-variate representation is used in our experiments (the order of the words is neglected). To
construct the feature vectors, the important words are selected according to their Mutual
Information (MI) (Sakkis et al., 2003):

$$MI(X; C) = \sum_{x \in \{0, 1\}} \sum_{c \in \{spam,\, legit\}} P(X = x, C = c) \log \frac{P(X = x, C = c)}{P(X = x)\, P(C = c)}$$

The words with the highest MI values are selected as the features. Assume that there are n
features to be chosen; each mail will then be represented by a feature vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$, where
$x_1, \ldots, x_n$ are the values of the binary attributes $X_1, \ldots, X_n$, indicating the presence or absence of
an attribute (word) in the current message.
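The selection step can be illustrated with a short sketch that scores each word by its empirical mutual information with the class and keeps the top n. The function names, vocabulary and labels are hypothetical, and the logarithm is taken to base 2 as an assumption (the base only rescales the scores and does not change the ranking).

```python
import math

def mutual_information(column, labels):
    """MI between a binary attribute and the class, from empirical frequencies."""
    total = len(labels)
    mi = 0.0
    for x in set(column):
        p_x = column.count(x) / total
        for c in set(labels):
            p_c = labels.count(c) / total
            joint = sum(1 for v, y in zip(column, labels)
                        if v == x and y == c) / total
            if joint > 0.0:                      # 0 * log(0) is taken as 0
                mi += joint * math.log2(joint / (p_x * p_c))
    return mi

def select_features(vectors, labels, vocabulary, n):
    """Return the n words with the highest MI scores."""
    scores = {word: mutual_information([row[i] for row in vectors], labels)
              for i, word in enumerate(vocabulary)}
    return sorted(vocabulary, key=lambda w: scores[w], reverse=True)[:n]

# Toy corpus (hypothetical words and labels).
vocabulary = ["free", "the", "syntax"]
X = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
y = ["spam", "spam", "legit", "legit"]
top = select_features(X, y, vocabulary, 1)
```

A word that occurs in every message (such as "the" in the toy corpus) has zero mutual information with the class and is never preferred over a discriminative word.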
Moreover, word stemming and stop-word removal are two important issues that need to be
considered in parsing emails. Word stemming refers to converting words to their
morphological base forms (e.g. "gone" and "went" are reduced to the root word "go").
Stop-word removal is a procedure to remove words that are found in a list of frequently used
words such as "and", "for", "a". The main advantages of applying the two techniques are the
reduction of the feature space dimension and a possible improvement in the classifier's prediction
accuracy by alleviating the data sparseness problem (Androutsopoulos et al., 2000). The
Ling-Spam corpus has four versions, which differ from each other in the usage of a
lemmatizer and a stoplist (which removes the 100 most frequently used words). We use the version
with the lemmatizer and stoplist enabled because it performs better when different cost
scenarios are considered (Androutsopoulos et al., 2000). Words that appear fewer than 4 times
or are longer than 20 characters are discarded.
Also, it has been found that phrasal and non-textual attributes may improve spam recognition
performance (Androutsopoulos et al., 2000). However, they introduce a manual
configuration phase. Because our target was to explore fully automatic anti-spam filtering,
we limited ourselves to word-only attributes.
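The word-level preprocessing described in this phase can be sketched as below. The stop list, the whitespace tokenizer and the function names are hypothetical simplifications, no lemmatization is performed, and counting occurrences over the whole corpus (rather than per message) is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical stop list; the corpus version used in the chapter removes
# the 100 most frequently used words instead.
STOP_WORDS = {"and", "for", "a", "the", "to", "of"}

def tokenize(message):
    """Lower-case the text and split on whitespace."""
    return message.lower().split()

def build_vocabulary(messages, min_count=4, max_length=20):
    """Keep words that are not stop words, not too rare and not too long."""
    counts = Counter(tok for m in messages for tok in tokenize(m))
    return sorted(w for w, k in counts.items()
                  if w not in STOP_WORDS
                  and k >= min_count
                  and len(w) <= max_length)

def to_binary_vector(message, vocabulary):
    """Multi-variate representation: 1 if the word occurs in the message, else 0."""
    present = set(tokenize(message))
    return [1 if w in present else 0 for w in vocabulary]

messages = ["free money for free", "free offer and free gift", "syntax of the clause"]
vocab = build_vocabulary(messages, min_count=2)   # lower threshold for the tiny example
vector = to_binary_vector("a free clause", vocab)
```

The default thresholds mirror the limits stated in the text (at least 4 occurrences, at most 20 characters); the example lowers the count threshold only because the toy corpus is tiny.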
Finally, some data cleaning techniques are required after converting raw data into the
appropriate format. In particular, to deal with missing values, the simplest approach is to
delete all instances where there is at least one missing value and use the remainder. This
strategy has the advantage of avoiding the introduction of any data errors. Its main problem is that
discarding data may damage the reliability of the resulting classifier. Moreover, the method
cannot be used when a high proportion of instances in the training set have missing values.
Together, these weaknesses are quite substantial. Although it may be worth trying when
there are few missing values in the dataset, this approach is generally not recommended.
Instead, we use an alternative strategy in which any missing values of a categorical attribute
are replaced by its most commonly occurring value in the training set. For continuous
attributes, missing values are replaced by the attribute's average value in the training set.
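The replacement strategy just described can be sketched as follows: the fill values are learned from the training set only and then applied to any rows. Representing a missing entry as None, along with the function names and toy rows, is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean

def fit_fill_values(rows, categorical):
    """Learn one replacement per column from the training rows (None = missing)."""
    fills = []
    for j in range(len(rows[0])):
        known = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fills.append(Counter(known).most_common(1)[0][0])  # most frequent value
        else:
            fills.append(mean(known))                          # average value
    return fills

def impute(rows, fills):
    """Return copies of the rows with every missing entry replaced."""
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]

# Hypothetical training rows: column 0 is categorical, column 1 is continuous.
train = [["yes", 2.0], ["yes", None], [None, 4.0], ["no", 6.0]]
fills = fit_fill_values(train, categorical={0})
cleaned = impute(train, fills)
```

Keeping the fitting and filling steps separate means the same training-set statistics can be reused for test messages, so no information from the test data leaks into the replacements.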

4.1.2. Phase 2: Feature Transformation
The tremendous growth in computing power and storage capacity has made today's
databases, especially for text categorization tasks, contain a very large number of attributes.
Although faster processing speeds and larger memories may make it possible to process
these attributes, this is inevitably a losing struggle in the long term. Besides degrading
performance, many irrelevant attributes also place an unnecessary computational
overhead on any data mining algorithm. There are several ways in which the number of
attributes can be reduced before a dataset is processed. In this research, a dimension reduction
(also called feature pruning or feature selection) scheme called Principal Component Analysis
(PCA) (Jolliffe, 2002) is performed on the data to select the most relevant features. This is
necessary given the very large size and correlated nature of the input vectors. PCA
eliminates highly correlated features and transforms the original data into lower
dimensional data with the most relevant features. From our observation, the selected features
are words that express the distinction between spam and non-spam groups, i.e. they are
common in either spam or legitimate messages, but not in both. Several punctuation marks and special
symbols (e.g. $, ) are also selected by PCA, and therefore they are not eliminated
during preprocessing.
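To illustrate how PCA collapses correlated features, the sketch below finds the leading principal component of a small dataset by power iteration on the sample covariance matrix and projects the data onto it. This is a simplified stand-in for a full PCA (which would normally use an eigenvalue or singular value decomposition); the function names and data are hypothetical, and the sketch assumes the data has non-zero variance.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix, via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0 / math.sqrt(d)] * d          # arbitrary unit-length starting vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return means, vec

def project(row, means, component):
    """Coordinate of a centered row along the component."""
    return sum((v - m) * c for v, m, c in zip(row, means, component))

# Two perfectly correlated features collapse onto a single direction.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, pc1 = first_principal_component(data)
scores = [project(r, means, pc1) for r in data]
```

Note that each projected coordinate is a linear combination of the original attributes, so the reduced representation mixes words rather than keeping a subset of them unchanged.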

4.1.3. Phase 3: Email Classification
The data, after being processed by the Feature Selection module, is used to train the
Classification Model. The resulting model is then used to label emails as either "legit" or
"spam", indicating whether a message is classified as a legitimate or a spam email. To
implement the Classification Model, we propose an intelligent way of combining the linear
part of the modeling with a simple non-linear model. In particular, MPNN is
adapted as the nonlinear compensator, which will only model higher ordered complexities,
while the linear model will dominate in the case of data far away from the training clusters. It is
described in the following equation.

Where
= Linear Regression Model
= Nonlinear Residual Compensator (MPNN)
= initial offset
= weights of the linear model
Compensation factor
= difference between the linear approximation and the training
output
= distance from the input vector to cluster i in the input space
The combination of the linear regression model and MPNN is referred to as Linear Regression -
Modified Probabilistic Neural Network (LR-MPNN). The piecewise linear regression model
is first approximated using all available training data in a simple regression fitting
analysis. The MPNN is then constructed to compensate for the higher ordered characteristics of
the problem. Depending on the portion of the training set and on how far the test data is
from the training data, the impact of the nonlinear residual function is adjusted such that
the overall Mean Square Error is minimized. This adjustment is formulated by the
compensation factor, which is computed based on how well the linear model
performs on the training examples and on the distances from a test vector to the clusters of training data.
Firstly, the goodness of the linear model on particular training data is measured by the
residual, which is defined as the difference between the linear approximation and the actual
output of the training data. A very small residual means that the linear model fits the data well in
this case, and therefore it should have higher priority, i.e. the impact of the nonlinear model
is minimized. In contrast, a large residual indicates that the nonlinear compensator should compensate
more for the degraded accuracy of the linear model.
Secondly, to determine how far a given test vector is from the available training data,
the distance from that vector to each training cluster is calculated. For any data which is
far away from the training set, i.e. whose distance to the clusters is large, the value of the
compensation factor will be minimized. As a result, the nonlinear compensator will have minimal
residual effect and the linear model will dominate. This is because the linear model
has more stable generalization than the nonlinear compensator for new instances.
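The behaviour described above can be illustrated with a rough one-dimensional sketch: a least-squares line is fitted first, a Gaussian-kernel average of the training residuals acts as the nonlinear compensator, and a damping term shrinks the compensation for inputs far from the training data. This is not the chapter's exact equation: the damping term (the largest kernel value), the use of individual training points instead of clusters, and all names and data are assumptions of this sketch.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w1 = sxy / sxx
    return my - w1 * mx, w1

def predict(x, xs, ys, w0, w1, sigma):
    """Linear estimate plus a damped, kernel-weighted average of training residuals."""
    linear = w0 + w1 * x
    residuals = [y - (w0 + w1 * xi) for xi, y in zip(xs, ys)]
    kernels = [math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) for xi in xs]
    total = sum(kernels)
    if total > 0.0:
        correction = sum(k * r for k, r in zip(kernels, residuals)) / total
    else:
        correction = 0.0                     # no training point is close enough
    # Hypothetical damping: close to 1 near the training data, close to 0 far from it.
    damping = max(kernels)
    return linear + damping * correction

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 1.0]          # not a straight line, so residuals are non-zero
w0, w1 = fit_line(xs, ys)
near = predict(1.0, xs, ys, w0, w1, sigma=0.2)
far = predict(50.0, xs, ys, w0, w1, sigma=0.2)
```

Near a training point the compensator restores the observed output, while far from the training data the prediction falls back to the linear model alone, which is the stabilising effect the text attributes to the combination.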

4.1.4. Phase 4: Evaluation
To evaluate the overall performance of the framework, the Cost-sensitive Evaluation
module computes several performance metrics and also takes into consideration different
cost scenarios.

4.2. Performance Evaluation
4.2.1. Performance Measures
To measure the performance of different learning algorithms, the following measures are
used:

$$SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}}, \qquad SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}}$$

where $n_{X \to Y}$ denotes the number of messages of class $X$ classified as class $Y$, with $L$
for legitimate and $S$ for spam.

From the above equations, Spam Recall (SR) is, in fact, the percentage of spam messages
($N_S = n_{S \to S} + n_{S \to L}$) that are correctly classified ($n_{S \to S}$), while Spam Precision (SP)
compares the number of correct spam classifications ($n_{S \to S}$) to the total number of messages
classified (correctly and incorrectly) as spam ($n_{S \to S} + n_{L \to S}$). As the False Alarm Rate (FAR)
increases, the number of misclassifications of legitimate emails increases, while as the Miss
Rate (MR) increases, the number of misclassifications of spam emails (passing through the
filter) increases. Therefore, both FAR and MR should be as small as possible for a filter to be
effective (they should be 0 for a perfect filter).
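
As a small illustration of the two measures described above, the sketch below computes SR and SP from raw counts. The argument names are hypothetical: `n_ss` is the number of spam messages classified as spam, `n_sl` the number of spam messages classified as legitimate, and `n_ls` the number of legitimate messages classified as spam. The counts in the example are invented and are not experimental results.

```python
def spam_recall(n_ss, n_sl):
    """Fraction of all spam messages that the filter catches."""
    total_spam = n_ss + n_sl
    return n_ss / total_spam if total_spam else 0.0


def spam_precision(n_ss, n_ls):
    """Fraction of blocked messages that are truly spam."""
    blocked = n_ss + n_ls
    return n_ss / blocked if blocked else 0.0


# Hypothetical confusion counts, not taken from the experiments.
print(spam_recall(n_ss=90, n_sl=10))
print(spam_precision(n_ss=90, n_ls=5))
```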

4.2.2. Cost-Sensitive Analysis
a) Cost Scenarios
Depending on what action is taken by a spam filter in response to a detected spam message,
there are three major misclassification cost scenarios. The no-cost case is when the filter
merely flags a detected spam message. This notification of spam does not risk losing any
legitimate mail due to misclassification error (no misclassification cost), but it still takes time
for the human users to check and delete the spam messages manually. To minimize the user
effort spent on eliminating spam, the filter can automatically detect and remove the suspicious
messages. However, the total cost of misclassification in this case can be extremely high due
to the seriousness of falsely discarding legitimate mails. This refers to the high-cost scenario.
Besides the above approaches, the filter may neither flag nor completely eliminate the
detected spam messages. Instead, it might resend the message to the sender. This approach,
referred to as moderate-cost, combats spamming by increasing its cost via Human Interactive
Proofs (HIP) (L. F. Cranor & LaMacchia, 1998). That is, the sender is required to give a proof
of humanity that matches a puzzle before his message is delivered. The puzzles could be, for
example, images containing some text that is difficult to automatically analyze by pattern
recognition software. Alternatively, for anti-spam programs, simple questions (e.g. what is
one plus one) can be used instead of graphical puzzles.
The concept of HIP has been implemented in many security related applications. For
example, certain web-based email systems use HIP to verify that password cracking
software is not systematically brute-forcing to guess a correct password for email accounts.
When a user types his password wrong three times, a distorted image is presented that
contains a word or numbers and the user must verify it before being allowed to continue. A
human can easily convert the image to text, but the same task is extremely difficult for a
computer. Some email client programs have anti-spam filtering heuristics using HIP
implemented. When such programs receive an email that is not in the white-list of the user,
they send the sender a password. A human sender can then resend the email containing the
received password. This system can effectively defeat spammers because spam is bulk,
meaning that the spammers do not bother to check replies manually or commonly use a
forged source email address. The cost of creating and verifying the proofs is small, but they
can be computationally impossible for automated mass-mailing tools to analyze. Though
spammers can still use human labor to manually read and provide the proofs and finally
have their spam message sent, HIP actually restricts the number of unsolicited messages
that the spammer can send for a certain period of time due to the inability to use cheap
automated tools (Carreras & Marquez, 2001). This barrier for spammers effectively
introduces additional cost to sending spam messages.
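In this chapter, spam recognition experiments are conducted in a cost-sensitive manner. As
emphasized previously, misclassifying a legitimate message as spam is generally more
severe than mistakenly recognizing a spam message as legitimate. Let $L \to S$ (legitimate
classified as spam) and $S \to L$ (spam classified as legitimate) denote the two types of error,
respectively. We invoke a decision-theoretic notion of cost, and assume that $L \to S$ is $\lambda$ times
more costly than $S \to L$. A mail is classified as spam if the following criterion is met
(Androutsopoulos et al., 2000):

$$\frac{P(C = spam \mid \vec{X} = \vec{x})}{P(C = legitimate \mid \vec{X} = \vec{x})} > \lambda$$

In the case of anti-spam filtering:

$$P(C = spam \mid \vec{X} = \vec{x}) = 1 - P(C = legitimate \mid \vec{X} = \vec{x})$$

The above criterion becomes:

$$P(C = spam \mid \vec{X} = \vec{x}) > t, \quad \text{with } t = \frac{\lambda}{1 + \lambda}, \; \lambda = \frac{t}{1 - t}$$

Depending on which cost scenarios are considered, the value of $\lambda$ is adjusted accordingly.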
• No-cost scenario (e.g. flagging spam messages): $\lambda = 1$
• Moderate-cost scenario (e.g. semi-automatic filter which notifies senders about blocked
messages): $\lambda = 9$
• High-cost scenario (e.g. automatically removing blocked messages): $\lambda = 999$
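
The effect of the cost factor on the decision rule can be sketched as follows. The sketch assumes the threshold form t = λ / (1 + λ) given by Androutsopoulos et al. (2000) and the cost factors 1, 9 and 999 used in the result tables of this chapter. The function names and the example probability are hypothetical.

```python
def spam_threshold(lam):
    """Probability threshold t = lam / (1 + lam) implied by cost factor lam."""
    return lam / (1.0 + lam)


def classify(p_spam, lam):
    """Label a message as spam only if its spam probability exceeds the threshold."""
    return "spam" if p_spam > spam_threshold(lam) else "legitimate"


# One hypothetical message with estimated spam probability 0.95.
for lam in (1, 9, 999):
    print(lam, spam_threshold(lam), classify(0.95, lam))
```

A message with an estimated spam probability of 0.95 is blocked in the no-cost and moderate-cost scenarios but is let through in the high-cost scenario, where the threshold is 0.999.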
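b) Total Cost Ratio
Accuracy and error rates assign equal weights to the two error types ($L \to S$, $S \to L$) and are
defined:

$$Acc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S}, \qquad Err = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}$$

where $n_{X \to Y}$ counts the messages of class $X$ classified as $Y$, and $N_L$ and $N_S$ are the total
numbers of legitimate and spam messages.
However, in cost-sensitive contexts, the accuracy and error rates should be made
sensitive to the cost difference, i.e. each legitimate message is counted $\lambda$ times. That is,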
when a legitimate message is misclassified, this counts as $\lambda$ errors, and when it passes the
filter, this counts as $\lambda$ successes. This leads to the definition of weighted accuracy and weighted
error (WAcc and WErr):

$$WAcc = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}, \qquad WErr = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S}$$

The values of performance measures (weighted or not) are often misleadingly high. To get a true
picture of the performance of a spam filter, its performance measures should be compared
against those of a baseline approach where no filter is used. Such a baseline filter never
blocks legitimate messages while spam emails always pass through the filter. The weighted
accuracy and error rates for the baseline are:

$$WAcc^{b} = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S}, \qquad WErr^{b} = \frac{N_S}{\lambda \cdot N_L + N_S}$$

Total cost ratio (TCR) is another measure, which compares the performance of a spam filter to that
of the baseline:

$$TCR = \frac{WErr^{b}}{WErr} = \frac{N_S}{\lambda \cdot n_{L \to S} + n_{S \to L}}$$

Greater TCR values indicate better performance. For TCR < 1, the baseline is better. If cost is
proportional to wasted time, TCR is intuitively equivalent to measuring how much time is
wasted to manually delete all spam messages when no filter is used ($N_S$) compared to the
time wasted to manually delete any spam messages that passed the filter ($n_{S \to L}$) plus the
time needed to recover from mistakenly blocked legitimate messages ($\lambda \cdot n_{L \to S}$).
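
The weighted error and TCR can be computed directly from the misclassification counts. The following is a minimal sketch assuming the definitions of Androutsopoulos et al. (2000): `n_ls` is the number of legitimate messages classified as spam, `n_sl` the number of spam messages classified as legitimate, and `n_legit` and `n_spam` are the class totals. All names and counts are hypothetical.

```python
def weighted_error(n_ls, n_sl, n_legit, n_spam, lam):
    """Weighted error rate: each legitimate message counts lam times."""
    return (lam * n_ls + n_sl) / (lam * n_legit + n_spam)


def baseline_weighted_error(n_legit, n_spam, lam):
    """Weighted error with no filter: every spam message gets through."""
    return n_spam / (lam * n_legit + n_spam)


def total_cost_ratio(n_ls, n_sl, n_spam, lam):
    """Baseline cost divided by filter cost; values above 1 favour the filter."""
    cost = lam * n_ls + n_sl
    return float("inf") if cost == 0 else n_spam / cost


# Hypothetical counts: 1 legitimate message blocked, 10 spam messages missed.
counts = dict(n_ls=1, n_sl=10, n_spam=100)
for lam in (1, 9, 999):
    print(lam, round(total_cost_ratio(lam=lam, **counts), 3))
```

With these invented counts, a single blocked legitimate message is enough to push TCR below 1 when the cost factor is 999, which illustrates why the high-cost scenario is so demanding.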

5. Experiment Results

5.1. Experiment Design
The proposed spam recognition framework is tested on the Ling-Spam corpus and compared
with other existing learning methods including Naïve Bayes (NB), Weighted Memory Based
Learning (WMBL), Boosted Trees (BT), Support Vector Machine (SVM) and Neural Network
models (Multilayer Perceptron - MLP). Unlike other text categorization tasks, filtering spam
messages is cost sensitive (Cohen, 1996), hence evaluation measures that account for
misclassification costs are used. In particular, we define a cost factor $\lambda$ with different values
corresponding to three cost scenarios: first, no cost considered ($\lambda = 1$), e.g. marking messages
as spam; second, semi-automatic filtering ($\lambda = 9$), e.g. issuing a notification about spam; and
third, fully automatic filtering ($\lambda = 999$), e.g. discarding the spam messages.
The rate at which a legitimate mail is misclassified as spam is given by the False Alarm Rate
(FAR), and it should be low for a filter to be useful. Spam Recall (SR) measures the
effectiveness of the filter, i.e. the percentage of spam messages correctly classified as spam, while
Spam Precision (SP) indicates the filter's safety, i.e. the degree to which the blocked messages
are truly spam. Because MR can be derived from SR (MR = 1 - SR), we will use SR, SP,
and Total Cost Ratio (TCR) for evaluation. Besides comparing how accurately the filters
perform, their computational cost is also measured using the computation time (in minutes)
required for each classifier. Particularly, the total computation time is a summation of the
time that a classifier needs to perform cross validation, to test on data and to calculate the
relevant performance metrics (e.g. misclassification rate, accuracy, ...).
Stratified tenfold cross validation is employed for all experiments. That is, the corpus is
partitioned into 10 stratified parts and each experiment was repeated 10 times, each time
reserving a different part as the testing set and using the remaining 9 parts as the training
set. Performance scores are then averaged over the 10 iterations.
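
The cross-validation protocol described above can be sketched in a few lines. This is an illustrative outline using only the Python standard library, not the tooling used in the experiments. The `evaluate` callback, which is assumed to train on one index list, test on the other and return a score, is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validate(labels, evaluate, k=10):
    """Average evaluate(train_idx, test_idx) over k stratified folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

Each fold serves once as the testing set while the other nine folds form the training set, and the returned value is the mean score over the ten iterations.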
In addition to the studies conducted by other researchers on the same Ling-Spam corpus
(NB (Androutsopoulos et al., 2000), WMBL (Sakkis et al., 2003), SVM (Hsu et al., 2003), BT
(Carreras & Marquez, 2001)), we also reproduced their experiments (based on the average
value of TCR of the three cost scenarios) to confirm and determine the parameter values that
give the best performance for the different learning methods. The optimal attribute size of these
methods can be found in Figure 2. An MLP with 15 neurons in the hidden layer is deployed
using the Matlab Neural Network toolbox.

5.2. Experiment Result

5.2.1. TCR and Attribute Selection
From Figure 2, for $\lambda = 1$ and $\lambda = 9$, most of the filters demonstrate a stable performance, with
TCR constantly greater than 1. These filters differ from one another in terms of their
sensitivity to attribute selection and the number of attributes which give maximum TCR.
Our LR-MPNN is found to be moderately sensitive to attribute selection and it obtains the
highest TCR for $\lambda = 1$ with 300 attributes selected. When $\lambda = 9$, LR-MPNN achieves very
competitive TCR compared to SVM but with a smaller number of attributes (200 attributes) and
hence involves less computation overhead.

Fig. 2. TCR score of spam recognition methods (panels a, b and c)

For $\lambda = 999$, all classifiers have their TCR reduced significantly due to the effect of the very high
misclassification cost. The difference between low and high values of misclassification cost
is the increased performance of the baseline filter when $\lambda$ increases. That is, without a filter
in use (baseline), all legitimate mails are retained, preventing the baseline from
misclassifying those legitimate mails as spam. Therefore, a large $\lambda$ benefits the baseline and
makes it hard to be defeated by other filters. Recall that TCR measures how much
a filter improves on the baseline case. As a result, TCR generally reduces when $\lambda$
increases. Another important observation is that the performance of most classifiers, except
for BT and LR-MPNN, falls below the base case (TCR < 1) for some numbers of selected
attributes. This is due to the relative insensitivity of BT and LR-MPNN to attribute selection.
In this case, the LR-MPNN is considered to be the best performing filter with the highest
TCR.
5.2.2. Spam Precision and Spam Recall
In this experiment, the classifiers are run iteratively by a tenfold cross-validation process.
The SP and SR rates are recorded in Table 1. We observe that, for the no-cost scenario
($\lambda = 1$), our method, LR-MPNN, is found to have the best SP while its SR (0.958) is very similar
to the highest SR of NB (0.959). For $\lambda = 9$, LR-MPNN obtains the highest SR (0.869) and second
highest SP (0.991) after the BT algorithm. Finally, in the case of extremely high misclassification
cost ($\lambda = 999$), LR-MPNN significantly outperforms other methods, with all evaluation
metrics having the highest values.
Method     $\lambda = 1$        $\lambda = 9$        $\lambda = 999$
           SR      SP       SR      SP       SR      SP
NB         0.959   0.973    0.861   0.975    0.790   0.984
WMBL       0.860   0.917    0.790   0.982    0.601   0.857
SVM        0.954   0.981    0.847   0.983    0.671   0.995
BT         0.957   0.980    0.864   0.993    0.768   0.996
MLP        0.852   0.975    0.782   0.977    0.623   0.979
LR-MPNN    0.958   -        0.869   0.991    -       -
Table 1. Precision/Recall evaluation on Ling-Spam data

5.2.3. Computational Efficiency
Apart from comparing precision, recall and TCR scores between classifiers, we also measure
their computational efficiency. Table 2 shows that WMBL had the minimum computation
time (2.5 mins), followed by NB, LR-MPNN, SVM, MLP and BT respectively. LR-MPNN can
achieve comparable spam precision and recall with a shorter computation time (3.5 mins)
compared with BT (11.5 mins) and SVM (5 mins). Moreover, considering TCR scores, the
models that require less time (WMBL, NB) than LR-MPNN do not perform as accurately as
LR-MPNN.
Method     Computation Time (mins)   TCR ($\lambda = 1$)   TCR ($\lambda = 9$)   TCR ($\lambda = 999$)
NB         3                         10.80           7.80            5.02
WMBL       2.5                       7.11            5.62            1.33
SVM        5                         20.47           9.11            3.42
BT         11.5                      21.18           7.35            4.91
MLP        7                         12.20           4.50            0.25
LR-MPNN    3.5                       24.17           8.99            5.25
Table 2. Computation Time, Memory size evaluation on Ling-Spam data

In summary, the most important finding in our experiment is that the proposed LR-MPNN
model can achieve very accurate classification (high TCR, SP, SR) compared to other
conventional learning methods. Such superior performance of LR-MPNN was observed
most clearly for $\lambda = 999$, though it also obtains the highest or near-highest TCR and very
competitive SP and SR rates for the other values of $\lambda$. Our algorithm also requires a relatively
small computation time to obtain comparable or even higher predictive accuracy than other methods.

6. Conclusions and Future Work

In this chapter, we proposed a novel anti-spam filtering framework in which appropriate
dimension reduction schemes and powerful classification models are employed.
Particularly, Principal Component Analysis transforms data to a lower dimensional space.
At the classification stage, we combine a simple linear regression model with a lightweight
nonlinear neural network in an adjustable way. This learning method, referred to as Linear
Regression Modified Probabilistic Neural Network (LR-MPNN), can take advantage of the
virtues of both. That is, the linear model provides reliable generalization capability while the
nonlinear model can compensate for higher order complexities of the data. A cost-sensitive
evaluation using a publicly available corpus, Ling-Spam, has shown that our LR-MPNN
classifier compares favorably to other state-of-the-art methods with superior accuracy,
affordable computation and high system robustness. Especially for extremely high
misclassification cost, while the performance of other methods deteriorates as $\lambda$ increases, the
LR-MPNN demonstrates a clearly superior outcome while retaining low computation cost.
LR-MPNN also has a significantly low computational requirement, i.e. its training time is shorter
than that of other algorithms with similar accuracy and cost. Though our proposed model achieves
good results in the conducted experiments, it is not necessarily the best solution for all
problems. However, comparatively high predictive accuracy along with low computational
complexity distinguish it from other state-of-the-art learning algorithms, and make it particularly
suitable for cost-sensitive spam detection applications.

7. References

Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., & Spyropoulos, C. D.
(2000). An Evaluation of Naive Bayesian Anti-Spam Filtering. Paper presented at
the Proc. of the ECML.
Bayler, G. (2008). Penetrating Bayesian Spam Filters: VDM Verlag Dr. Mueller e.K.
Carreras, X., & Marquez, L. (2001). Boosting Trees for Anti-Spam Email Filtering. Paper
presented at the RANLP, Tzigov Chark, Bulgaria.
Cohen, W. (1996). Learning rules that classify email. AAAI Symp. On Machine Learning in
Inf. Access, 18-25.
Cranor, L. F., & LaMacchia, B. A. (1998). Spam! Paper presented at the Communications of
ACM.
Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support Vector Machines for Spam
Categorization. IEEE Transactions On Neural Networks, 1048-1054.
Hayes, M. H. (1996). Statistical Digital Signal Processing and Modeling: John Wiley & Sons,
Inc.
Hsu, C. W., Chang, C. C., & Lin, J. C. (2003). LIBSVM: a library for support vector machines.
from http://www.csie.ntu.edu.tw/~cjlin/libsvm
Jolliffe, I. T. (2002). Principal Component Analysis (2 ed.). New York, USA: Springer.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in
the Microstructure of Cognition (Vol. 2): The MIT Press.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to
Filtering Junk E-Mail. Paper presented at the Learning for Text Categorization -
AAAI Technical Report WS-98-05.
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D., &
Stamatopoulos, P. (2003). A Memory-Based Approach to Anti-Spam Filtering for
Mailing Lists. Paper presented at the Information Retrieval.
Schapire, R. E., & Singer, Y. (2000). BoosTexter: a Boosting-Based System for Text
Categorization. Machine Learning, 39(2), 135-168.
Schryen, G. (2007). Anti-Spam Measures: Analysis and Design: Springer.
Surepayroll. (2007). More than 80 Percent of Small Business Owners Consider E-mail
Essential to Success, SurePayroll Insights Survey.
Swartz, N. (2003). Spam costs businesses $13 billion annually. Information Management
Journal, 37(2).
Wolfe, P., Scott, C., & Erwin, M. (2004). Anti-Spam Tool Kit: McGraw-Hill Osborne Media.
Zaknich, A. (1998). Introduction to the modified probabilistic neural network for general
signal processing applications. IEEE Transactions on Signal Processing, 46(7), 1980-
1990.
Zaknich, A. (2003). Neural Networks for Intelligent Signal Processing. Sydney: World
Scientific Publishing.
