
White Paper Series

THE HUNGARIAN LANGUAGE IN THE DIGITAL AGE
A MAGYAR NYELV A DIGITÁLIS KORBAN

Simon Eszter, MTA Nyelvtudományi Intézet
Lendvai Piroska, MTA Nyelvtudományi Intézet
Németh Géza, BME
Olaszy Gábor, BME
Vicsi Klára, BME

Georg Rehm, Hans Uszkoreit (editors)

PREFACE

This white paper is part of a series that promotes knowledge about language technology and its potential. It addresses journalists, politicians, language communities, educators and others. The availability and use of language technology in Europe vary between languages. Consequently, the actions required to further support research and development of language technologies also differ. The required actions depend on many factors, such as the complexity of a given language and the size of its community.

META-NET, a Network of Excellence funded by the European Commission, has conducted an analysis of current language resources and technologies in this white paper series (p. 73). The analysis focused on the 23 official European languages as well as other important national and regional languages in Europe. The results of this analysis suggest that there are tremendous deficits in technology support and significant research gaps for each language. The detailed expert analysis and assessment of the current situation given here will help maximise the impact of additional research.

As of November 2011, META-NET consists of 54 research centres from 33 European countries (p. 69). META-NET is working with stakeholders from the economy (software companies, technology providers, users), government agencies, research organisations, non-governmental organisations, language communities and European universities. Together with these communities, META-NET is creating a common technology vision and strategic research agenda for multilingual Europe 2020.


META-NET · office@meta-net.eu · http://www.meta-net.eu

The authors of this document are grateful to the authors of the white paper on German for permission to re-use selected language-independent materials from their document [1].

The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under the contracts T4ME (Grant Agreement 249 119), CESAR (Grant Agreement 271 022), METANET4U (Grant Agreement 270 893) and META-NORD (Grant Agreement 270 899).


TABLE OF CONTENTS

A MAGYAR NYELV A DIGITÁLIS KORBAN

1 Executive Summary
2 Languages at Risk: a Challenge for Language Technology
2.1 Language Borders Hold Back the European Information Society
2.2 Our Languages at Risk
2.3 Language Technology is a Key Enabling Technology
2.4 Opportunities for Language Technology
2.5 Challenges Facing Language Technology
2.6 Language Acquisition in Humans and Machines
3 The Hungarian Language in the European Information Society
3.1 General Facts
3.2 Particularities of the Hungarian Language
3.3 Recent Developments
3.4 Official Language Protection in Hungary
3.5 Language in Education
3.6 International Aspects
3.7 Hungarian on the Internet
4 Language Technology Support for Hungarian
4.1 Application Architectures
4.2 Core Application Areas
4.3 Other Application Areas
4.4 Educational Programmes
4.5 National Projects and Initiatives
4.6 Availability of Tools and Resources
4.7 Cross-language Comparison
4.8 Conclusions
5 About META-NET

THE HUNGARIAN LANGUAGE IN THE DIGITAL AGE

1 Executive Summary
2 Languages at Risk: a Challenge for Language Technology
2.1 Language Borders Hold Back the European Information Society
2.2 Our Languages at Risk
2.3 Language Technology is a Key Enabling Technology
2.4 Opportunities for Language Technology
2.5 Challenges Facing Language Technology
2.6 Language Acquisition in Humans and Machines
3 The Hungarian Language in the European Information Society
3.1 General Facts
3.2 Particularities of the Hungarian Language
3.3 Recent Developments
3.4 Official Language Protection in Hungary
3.5 Language in Education
3.6 International Aspects
3.7 Hungarian on the Internet
4 Language Technology Support for Hungarian
4.1 Application Architectures
4.2 Core Application Areas
4.3 Other Application Areas
4.4 Educational Programmes
4.5 National Projects and Initiatives
4.6 Availability of Tools and Resources
4.7 Cross-language Comparison
4.8 Conclusions
5 About META-NET

A References
B META-NET Members
C The META-NET White Paper Series

1 EXECUTIVE SUMMARY

Information technology has changed our everyday lives considerably. We typically use computers for writing, editing, calculating and searching for information, and increasingly for reading, listening to music, and viewing photos and films. We carry small computers in our pockets and use them to make phone calls, write e-mails, get information and entertain ourselves wherever we are. How does this massive digitisation of information, knowledge and everyday communication affect our language? Will our language change, or will it disappear?

Our computers are part of a large and powerful global network. The girl in Ipanema, the clerk in Budapest and the engineer in Delhi can all chat with their friends on Facebook, but they are unlikely ever to meet one another in online communities or forums. If they are worried about how to treat earache, they will all look it up on Wikipedia, but they will not read the same article. When European internet users discuss the effects of the Fukushima nuclear disaster on the European energy market in forums, they typically do so in linguistically separate communities. What the internet connects is still divided by the language barriers of its users. Will it always be like this?

Many of the world's 6,000 languages will not survive in a globalised digital information society. It is estimated that at least 2,000 languages are doomed to extinction in the coming centuries. Others will continue to play a role in families and among acquaintances, but not in the wider business and academic world. What are the Hungarian language's chances of survival?

According to estimates, Hungarian is spoken by a total of 13 million people, which ranks it 12th among the most widely spoken languages in Europe. It is the official language of the Republic of Hungary, where about 97% of the 10 million inhabitants are native speakers of Hungarian. Hungarian-speaking communities are also found in the seven neighbouring countries, the largest of them in Romania, with approximately one and a half million speakers. In addition, Hungarian is used by emigrant communities worldwide, primarily in the United States, Canada and Israel.

Hungarian forms an island in Europe: most European languages belong to the Indo-European language family, but Hungarian does not. Hungarian is a Finno-Ugric language; its relatives are Finnish, Estonian and a few other languages spoken by peoples living in Russia. Hungarian is the non-Indo-European language with the most speakers in Europe, but unlike world languages such as English or Chinese, and widely used European languages such as German or French, it does not play a prominent role at the international level.

Many people in Hungary complain about the ever-increasing use of anglicisms and fear that Hungarian will be flooded with English words and expressions. This view is misleading. Hungarian has already survived the influx of new words borrowed from various Turkic languages in the period before the Hungarian conquest, and it also survived strong Slavic influence in the Carpathian Basin. Later, the country was part of the Ottoman Empire for 150 years, and then, under the Habsburg Empire, the influence of Latin and German was very strong. A good antidote to losing our dear Hungarian words is to use them frequently and consciously. Linguistic polemics about foreign influence and government regulations do not help. We should not be worried about the gradual anglicisation of our language, but rather that our language may disappear from major areas of our personal lives. We do not mean science, air traffic control or the global business market, but those areas of life in which it is far more important for the language to be close to the country's inhabitants than to international partners, such as local customs, administrative procedures, legislation, culture and shopping.

The status of a language depends not only on the number of its speakers but also on its presence in the information space and in software applications. The existence of a fairly active Hungarian-language web community is shown by the fact that the Hungarian Wikipedia is the 19th largest, ahead of European languages such as Turkish and Romanian, which have more speakers, as well as Danish, and ahead of world languages such as Arabic or Korean. Some important international software products are also available in Hungarian versions, but the special characteristics of Hungarian make it difficult to adapt English-based applications. The development of costly Hungarian language technologies is also hampered by the fact that the Hungarian market is rather small.

As far as language technology is concerned, the situation in Hungary gives grounds for cautious optimism. Language technology research does exist in Hungary, supported largely by the state and, more recently, by European funding. Both of the two currently running European ICT projects coordinated by Hungary are on language technology topics. Numerous technologies and resources are available for Hungarian, though far fewer than for English, and they are not sufficient to meet the needs of a truly multilingual knowledge-based society.

Information and communication technology is already preparing for the next revolution. After personal computers, networks, miniaturisation, multimedia, mobile devices and cloud technology, the next generation will include software that understands not only spoken sounds or written letters but also words and sentences, and that supports users better by speaking and understanding their language. The forerunners of these developments are freely available online services such as Google Translate, which translates between 57 languages, or its European competitor itranslate4.eu (the product of a consortium led by Hungary), as well as IBM's supercomputer Watson, which beat the US champion of the game show Jeopardy, and Apple's mobile application Siri, which reacts to voice commands and answers in English, German, French and Japanese.

The next generation of information technology will have language capabilities at a level that allows human users to communicate in their own language using this technology. Devices will be able to find the most important news and information in the world's digital knowledge base automatically, all through easy-to-use voice control. Language technology will be able to translate automatically or assist interpreters, summarise conversations and documents, and support users in learning.

The next generation of information and communication technology will also reach industrial robots (currently under development in research laboratories), which will consequently understand what the user wants and even report on the results.

This level of performance means going beyond simple character sets, dictionaries, spell checkers and pronunciation rules. Technology must move beyond simplistic approaches and start building language models that take semantics into account along with syntax, in order to understand a series of questions and give relevant answers to them.

However, a huge technological gap yawns between English and Hungarian, and it is widening. The continuity of research and development funding is inadequate. Short-term programmes alternate with periods of low funding, and there are general shortcomings in the coordination of the programmes of EU countries and the European Commission. As a result, Hungary (and the EU in general) has lost several very promising innovations to the United States, where there is greater continuity in the planning of strategic research and where bringing new technologies to market receives more support. In the race for technological innovation, only an early start with a forward-looking concept can secure a competitive advantage, and of course only if it actually reaches its goal. Otherwise, all we achieve is an honourable mention in Wikipedia.

Despite all this, there is still great research potential in Hungary and in the EU. In addition to internationally recognised research centres and universities, there are a number of innovative language technology SMEs, which try to survive with great creativity and enormous effort, even in the absence of stable funding and venture capital.

Although Hungary has supported important developments in corpus building and the creation of language resources, Hungarian language technology resources and tools still do not reach the quality and coverage of their English counterparts, which represent the state of the art in almost every area of language technology. Every international language technology competition shows that results for the automatic analysis of English are much better than those for Hungarian. This holds for information extraction, language checking, machine translation and many other applications.

Many researchers attribute this to the fact that, over the past fifty years, the methods and algorithms of computational linguistics and the development of language technology applications have focused primarily on English. Other researchers believe that English is inherently better suited to computer processing, and that languages such as Spanish or French are also easier to handle with current methods than Hungarian. This means that we need to make consistent and sustainable research efforts if we want to use the next generation of information and communication technologies in Hungarian in both our private and our official lives.

In summary: despite prophecies about the decay of the Hungarian language, our language is not in danger, not even from the power of English. However, the situation could change dramatically once a new generation of technologies starts to handle human language really effectively. By perfecting machine translation, language technology helps to break down language barriers, but only for those languages that manage to survive in the digital world. If usable language technology exists, even languages with few speakers can secure their survival. If not, even larger languages may come under severe pressure.

The dentist's jocular advice goes: "Only brush the teeth you want to keep!" The saying also applies to research policy, with one proviso: you may learn every language under the sun, but develop costly technologies only for those you really want to keep alive.
2 LANGUAGES AT RISK: A CHALLENGE FOR LANGUAGE TECHNOLOGY

We are witnesses to a digital revolution that is dramatically affecting communication and society. The latest achievements in digital and network communication technology are often compared to Gutenberg's invention of the printing press. What does this analogy tell us about the future of the European information society and, in particular, of our languages?

After Gutenberg's invention, the next major breakthrough in communication and knowledge exchange was Luther's translation of the Bible. In the centuries that followed, the development of various techniques supported more efficient language processing and knowledge exchange:

- The orthographic and grammatical standardisation of major languages enabled the rapid dissemination of new scientific and intellectual ideas.
- The development of official languages made it possible for citizens to communicate across (often political) borders.
- The teaching and translation of languages facilitated exchange between languages.
- Journalistic and bibliographic guidelines ensured the quality and availability of printed material.
- The emerging types of media, such as newspapers, book publishing, radio and television, were able to satisfy different communication needs.

Over the past twenty years, information technology has helped to automate many processes and make them easier to carry out:

- Desktop publishing software has replaced typewriting and typesetting.
- Microsoft PowerPoint has replaced overhead projector transparencies.
- E-mail lets us send and receive documents faster than fax.
- Skype lets us make phone calls over the internet and hold virtual meetings.
- Audio and video encoding formats make it easy to exchange multimedia content.
- Search engines provide keyword-based search.
- Online translation services such as Google Translate produce quick rough translations.
- Social media platforms such as Facebook, Twitter and Google+ facilitate collaboration and information sharing.

Although these tools and applications are a great help, they are still not capable of sustaining a multilingual European information society in which information and goods can flow freely.

2.1 AZ EURPAI
INFORMCIS TRSADALOM
GTJAI: A NYELVI HATROK
Nem tudjuk pontosan, hogyan fog kinzni a jvbeli informcis trsadalom. Azonban igen valszn, hogy
a kommunikcis technolgia forradalma a klnbz
nyelveket beszl embereket sszehozza. Ez a folyamat

zsiai nyelv) online tartalom mennyisge robbansszeren megntt.


Eddig meglepen kevs gyelmet kapott a nyilvnos
vitkban a nyelvi hatrok miatti digitlis megosztottsg,
amely mindenhol jelentkezik; manapsg azonban felvetdik az az get krds, hogy mely eurpai nyelvek
fognak boldogulni s kitartani a tudsalap informcis
trsadalomban.

az embereket j nyelvek tanulsra, mg a fejlesztket j


alkalmazsok ltrehozsra kszteti, ami ersti a klcsns megrtst s elrhetv teszi a kzs tudst.

2.2 VESZLYBEN A NYELVEINK


A nyomtatott sajt megjelense pratlan mrtk informcicsert indtott el Eurpban, ez azonban sok

A globlis informcis s gazdasgi trben


tbb nyelvvel, kommunikcis partnerrel s
tartalommal kerlnk kapcsolatba.

eurpai nyelv pusztulst is elidzte. A regionlis s


kisebbsgi nyelvek alig kerltek nyomtatsba. Ennek
eredmnyeknt sok nyelv, mint pldul a dalmt vagy a
kelta, csak beszlt formban lt tovbb, s ez korltozta

A globlis informcis s gazdasgi trben tbb

tovbbi fejldsket s hasznlatukat. Vajon az inter-

nyelvvel, kommunikcis partnerrel s tartalommal

netnek is hasonl hatsa lesz a nyelveinkre?

kerlnk kapcsolatba, s mindez arra ksztet min-

Eurpa legfontosabb s leggazdagabb kulturlis rtkei

ket, hogy gyorsan hasznostsuk az j mdia tpusait.

kz tartozik a trsgben hasznlt csaknem 80 nyelv. Eu-

A kzssgi mdia (Wikipedia, Facebook, Twitter,

rpa nyelvi soksznsge is hozzjrul trsadalmi sike-

YouTube s legjabban Google+) jelenlegi npszersge

rhez [3]. Mg a npszer nyelvek, mint az angol vagy

csak a jghegy cscsa.

a spanyol biztosan megmaradnak a feltrekv digitlis

Manapsg tbb gigabjtnyi szveget tudunk tovbb-

trsadalomban s a piacon, sok ms eurpai nyelv el

tani a vilg krl pr msodpercen bell anlkl, hogy

fog tnni a digitlis kommunikcibl s az internetes

szrevennnk, hogy a szveg olyan nyelven van, amelyet

trsadalom ltkrbl. Ez biztosan nem jrhat t.

nem rtnk. Az Eurpai Bizottsg felkrsre ksztett

Egyrszt elveszne egy stratgiai lehetsg, s ez Eurpa

legutbbi jelentsbl kiderl, hogy az eurpai inter-

globlis helyzett gyengten. Msrszt az ehhez ha-

nethasznlk 57%-a nem a sajt anyanyelvn vsrol

sonl vltozsok szemben llnak azzal az elkpzelssel,

rukat s szolgltatsokat. (Az angol a leggyakoribb ide-

hogy az eurpai polgrok azonos mrtkben vehessenek

gen nyelv a francia, a nmet s a spanyol eltt.) A fel-

rszt az ket rint gyekben, nyelvtl fggetlenl.

hasznlk 55%-a olvas idegen nyelv szveget az interneten, mg csak 35%-uk hasznl ms nyelvet e-mailek
vagy egyb zenetek rshoz a weben [2]. Pr vvel
ezeltt mg az angol volt a lingua anca a weben

Eurpa legfontosabb s leggazdagabb kulturlis


rtkei kz tartozik nyelvi soksznsge.

az interneten megtallhat tartalom nagy rsze angolul


volt , a helyzet azonban mostanra jelentsen megvl-

Egy tbbnyelvsgrl szl UNESCO beszmol sze-

tozott. A nem angol nyelv (klnsen az arab s egyb

rint a nyelvek az alapvet jogok, mint pldul a politikai

nkifejezs, az oktats vagy a trsadalomban val rszv-

vonatkozan bemutassk, milyen kszltsgi llapotban

tel fontos kzvetti [4].

vannak az azokra kidolgozott alapvet technolgik.

2.3 NYELVTECHNOLGIA: EGY


KULCSFONTOSSG
TECHNOLGIA

A kzeljvben minden eurpai nyelvre


elrhet, robusztus s olcs nyelvtechnolgira
van szksgnk.

A mltban a nyelvvd beruhzsok fleg a nyelvok-

Ahhoz, hogy fenntartsuk pozcinkat a globlis innov-

tatsra s fordtsra fkuszltak.

Pldaknt: becs-

ci lvonalban, a kzeljvben minden eurpai nyelvre

lsek szerint Eurpnak 2008-ban 8,4 millird eu-

elrhet, robusztus, olcs s nagyobb szoverkrnyezet-

rs fordt, tolmcs, szoverlokalizcis s honlapglo-

be integrlhat nyelvtechnolgira van szksgnk. Az

balizcis piaca volt, s mindehhez mg vi 10%-os

interaktv, multimdis s tbbnyelv internethasznlat

nvekedst vrtak [5]. Azonban ez a kapacits mg

nyelvtechnolgia nlkl elkpzelhetetlen.

mindig nem elg ahhoz, hogy kielgtse a jelenlegi


s a jvbeli ignyeket. A minden terletet lefed
holnap Eurpjban a megfelel technolgia hasznlata,

2.4 A NYELVTECHNOLGIA
LEHETSGEI

hiszen pldul a szlltshoz, az energiaiparban vagy a

A nyomtats vilgban a szvegrl kszlt kpek gyors

fogyatkkal lk letnek megknnytshez szintn fej-

sokszorostsa jelentette a technolgiai ttrst. Em-

lett technolgit hasznlunk.

berek vgeztk az informcikeress s -feldolgozs, a

A nyelvtechnolgia (az rott szveg s a beszd minden

fordts s az sszefoglals kemny munkjt. A beszd

formjt lefedve) lehetv teszi az egyttmkdst, a ta-

rgztsre Edison tallmnyig kellett vrnunk, ami

nulst, az zletktst, a tudsmegosztst s a trsadalmi

megint csak analg msolatok ksztsre volt j.

s politikai vitkban val rszvtelt, szmtstechnikai

A digitlis nyelvtechnolgia segtsgvel elrhetv v-

tudstl s nyelvi hatroktl fggetlenl. Gyakran bo-

lik az automatikus fordts s tartalom-elllts, az

nyolult rendszerekbe beptve dolgozik a httrben,

informcifeldolgozs s a tudsmenedzsment min-

segtve minket, amikor pldul:

den eurpai nyelven.

nyelvhasznlatot biztost legizgalmasabb megolds a

informcit keresnk internetes keresvel;


helyesrst s nyelvtant ellenrznk szvegszerkesztben;

Emellett elsegti az intui-

tv, termszetesnyelv-alap interfszek fejlesztst a


hztartsi elektronika, a gpszet, a jrmgyrts, a
szmtstechnika s a robotika terletn is. Br mr
sok prototpus ltezik, a kereskedelmi s ipari alkalma-

termkajnlsokat olvasunk online vsrlskor;

zsok mg mindig a fejleszts kezdetleges fzisban van-

egy navigcis rendszer szbeli utastsait hallgatjuk;

nak. A kutatsban s fejlesztsben elrt eredmnyek

online szolgltatssal fordtunk weboldalakat.

lehetsgek egsz trhzt nyitottk meg. Pldul a


gpi fordts adott tmkon bell kell pontossggal

A nyelvtechnolgiai fejlesztsek tipikusan nagyobb

mkdik, a ksrleti alkalmazsok pedig szmos eur-

alkalmazsokban jelennek meg. A META-NET fe-

pai nyelven nyjtanak tbbnyelv informci- s tuds-

hr knyvek clja, hogy minden eurpai nyelvre

menedzsment szolgltatsokat.

Nyelvi alkalmazsokat, hangvezrelt felhasznli in-

A kutats aktv rsze a nyelvtechnolginak a

terfszeket s dialgusrendszereket ltalban specilis

katasztrfa sjtotta helyeken, mentsi munklatoknl

terleteken tallunk, m ezek gyakran korltozott tel-

val felhasznlsa is. Az ilyen, magas rizikfaktor kr-

jestmnyt mutatnak. Nagy piaci lehetsgek rejlenek

nyezetben a fordts pontossga let-hall krdse lehet,

a nyelvtechnolginak az oktatsba s a szrakoztat-

s az intelligens robotok nyelvi kpessgeikkel leteket

iparba, pldul jtkokba, oktatprogramokba, szimul-

menthetnek.

cis krnyezetekbe val integrlsban is. A mobilinformcis szolgltatsok, a szmtgppel tmogatott


a plgiumszr szoverek csak kiragadott pldk arra,

2.5 A NYELVTECHNOLGIA
KIHVSAI

hogy hny helyen jtszik fontos szerepet a nyelvtech-

Br a nyelvtechnolgia nagy fejldsen ment keresztl

nolgia. A kzssgi oldalak, mint a Twitter vagy a

az utbbi vekben, a termkinnovci s -fejleszts mg

Facebook npszersge szintn arra utal, hogy igny

mindig meglehetsen lassan halad elre. A szles krben

van a kinomultabb nyelvtechnolgiai alkalmazsokra,

hasznlt nyelvtechnolgiai alkalmazsok, mint pldul

amelyek gyelemmel kvetik a bejegyzseket, sszegzik

a szvegszerkesztk helyesrs-ellenrzi, tipikusan egy-

a vitkat, ajnlsokat tesznek, kiszrik az rzelmi tar-

nyelvek, s mindssze nhny nyelvre elrhetek.

nyelvtanuls, az e-learning, az nellenrz eszkzk s

talm vlaszokat, szerzi jogi szablytalansgokat vagy


visszalseket.
A nyelvtechnolgia hatalmas lehetsget jelent az Eurpai Uni szmra mind gazdasgi, mind kulturlis szempontbl. Eurpban trvnyszer a tbbnyelvsg; az
eurpai cgek, szervezetek s iskolk multinacionlisak
s sokflk. Az EU polgrait azonban mg ma is htrltatjk az Eurpai Kzs Piac nyelvi hatrai.

A technolgiai fejleszts mg mindig


meglehetsen lassan halad elre.
Az online gpi fordt szolgltatsok kitnen hasznlhatk arra, hogy nyersfordtst adjanak a dokumentum
tartalmrl, de nem alkalmasak pontos fordtsra. Az
emberi nyelv komplexitsnak ksznheten nyelveink
szmtgpes modellezse s a val vilgban val tesztelse id- s pnzignyes vllalkozs, ami hosszt-

A nyelvtechnolgia segthet a
nyelvi gtak ledntsben.

v tmogatst ignyel. Eurpnak fenn kell tartania


ttr szerept a tbbnyelv kzssgek ignyeinek
megfelel technolgik ellltsban j mdszerek

A nyelvtechnolgia segthet a nyelvi gtak ledn-

kifejlesztsvel, melyek Eurpa-szerte erstik a fejl-

tsben, tmogatva a szabad s nyilvnos nyelvhasznla-

dst. Ezek kz tartoznak a szmtgpes jtsok s

tot.

pldul a tvmunka lehetsge is.

In addition, innovative multilingual language technology helps in communicating with international partners and multilingual organisations. Language technology can be seen as a kind of assistive technology that helps to overcome the disadvantage arising from linguistic diversity and brings language communities closer to one another.

2.6 LANGUAGE ACQUISITION IN HUMANS AND MACHINES

To show how computers cope with language, and why language acquisition is so difficult, we first look briefly at how humans acquire their mother tongue and foreign languages, and then outline how language technology systems work.

A language can be acquired in two different ways: through examples or through rules. A child first learns to speak by listening to the speech going on around it. The concrete linguistic examples used by language users, that is, parents, siblings and other family members, help children to utter their first words and short sentences at around the age of two. This is due to a special, genetically given language faculty that makes it possible for us to acquire a language.

Acquiring a second language takes much more effort if it does not happen in a native-speaking environment. At school age, a foreign language is acquired by learning its grammatical structure, vocabulary and spelling from books and teaching materials that present the language through rules, tables and example texts. Learning a foreign language takes a lot of effort and time, and it becomes more difficult with age.

Language technology systems also fall into two main types, in a way that parallels human language acquisition. Systems that follow the statistical (or data-driven) approach obtain their linguistic knowledge from large amounts of text. While monolingual text is sufficient for training applications such as spell checkers, training a machine translation system requires parallel text in two or more languages. Machine learning algorithms then learn patterns from the text that show how words, short phrases and sentences are translated.

Statistical methods require huge amounts of text; their performance grows with the amount of text analysed. It is not unusual for such systems to be trained on several million sentences. This is one reason why search engine providers want to collect as much written material as possible. The spell checkers found in word processors, web search engines and machine translation services such as Google's search and translation all rely on the statistical approach. The great advantage of the statistical approach is that machines learn quickly, although the quality of the results varies considerably.

The other main type of language technology consists of rule-based systems. In this case linguists, computational linguists and computer scientists work out the grammar rules and build the lexicon. Creating a rule-based system is an extremely time-consuming and labour-intensive task that requires highly qualified experts. Some of the leading rule-based machine translation systems have been under development for more than twenty years. One advantage of rule-based systems, however, is that experts have more control over the language processing: they can correct systematic errors more easily and can give feedback to the user. The latter is particularly useful when the rule-based system is used for language learning. From a financial point of view, however, rule-based technology pays off only for the major languages.

Since the strengths and weaknesses of statistical and rule-based systems complement each other, current research concentrates on hybrid approaches that combine the two. These methods, however, are less successful in industrial settings than in the research laboratory.

As we have seen in this chapter, many of the applications we use in today's information society rely heavily on language technology. Owing to its multilingual community, this is especially true of the European economic and information space. Although language technology has developed strongly in the past few years, there is still great potential in improving the quality of language technology systems. In what follows we present the role of Hungarian in the European information society and assess the current state of Hungarian language technology.
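As an illustration of the statistical approach described in this section, the following minimal sketch learns word translations from a toy parallel corpus by counting which words occur together in aligned sentence pairs. The corpus, the function names and the choice of the Dice coefficient as the association score are assumptions made for this example; real systems are trained on millions of sentences and use far more elaborate models.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (hypothetical data): Hungarian sentence, English sentence.
PARALLEL = [
    ("a kutya alszik", "the dog sleeps"),
    ("a macska alszik", "the cat sleeps"),
    ("a kutya eszik", "the dog eats"),
    ("a macska eszik", "the cat eats"),
]

def train(pairs):
    """Count how often each source word, target word and word pair occurs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the same pair.
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count

def best_translation(word, model):
    """Return the target word with the highest Dice score for `word`."""
    src_count, tgt_count, pair_count = model

    def dice(tgt):
        return 2 * pair_count[(word, tgt)] / (src_count[word] + tgt_count[tgt])

    return max(tgt_count, key=dice)

model = train(PARALLEL)
print(best_translation("kutya", model))   # dog
print(best_translation("alszik", model))  # sleeps
```

The sketch contains no grammar rules at all: everything it knows comes from the examples, which is why more text generally means better results.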

3
THE HUNGARIAN LANGUAGE IN THE EUROPEAN INFORMATION SOCIETY

3.1 GENERAL FACTS

Hungarian is the non-Indo-European language with the most speakers in Europe. It is the state language of the Republic of Hungary, where about 97% of the population of 10 million are native speakers of Hungarian. Hungarian-speaking communities are also found in the seven neighbouring countries, the largest being in Romania, with approximately one and a half million speakers. According to estimates, Hungarian is spoken by 13 million people in total, which puts it in 12th place on the list of languages with the most speakers in Europe [6]. Hungarian is also an official language in Vojvodina and in three municipalities in Slovenia. It is spoken as a regional or minority language in Austria, Croatia, Ukraine, Slovakia and, as already mentioned, Romania. In addition, emigrant communities use it worldwide, primarily in the United States, Canada and Israel.

Interestingly, Hungarian has hardly any significant varieties: the dialects differ little from each other and from the standard language, and they rarely cause comprehension difficulties. This is perhaps due to the long coexistence with neighbouring peoples, in which constant contact with other languages may have driven speakers towards uniformity. According to the traditional classification, seven dialects of Hungarian are distinguished in the present-day territory of Hungary. In addition, there are two Hungarian dialects in Romania: Székely and Csángó.

There is likewise little difference between the Hungarian used in the Republic of Hungary and that used in the neighbouring countries; educated usage and spelling in particular are uniform. Small but characteristic differences do of course occur. While the Hungarian of Hungary developed predominantly under German influence, Hungarian in Romania lives rather under Romanian influence. The Csángó community lived relatively separately from other Hungarians, so they preserved a variety closer to medieval Hungarian.

3.2 PARTICULARITIES OF THE HUNGARIAN LANGUAGE

Most European languages belong to the Indo-European family, so Russian, Spanish, Greek, Norwegian, English and Albanian are all related to one another, but not to Hungarian! Hungarian originates from the Ural Mountains, the borderland of Europe and Asia. The Uralic language family has two branches: Samoyedic and Finno-Ugric. Hungarian belongs to the latter, which is why it is also called a Finno-Ugric language. Its relatives are Finnish, Estonian and a few other languages spoken by peoples living in Russia.

Some common, ancient features of the Uralic languages:

- There are no genders: the same word (ő) expresses both he and she.
- There are only two tenses, present and past; their nuances, as well as the future tense, can be expressed by periphrasis.
- The three-way direction system: there are 3x3 suffixes expressing place, as Table 1 shows using the word doboz 'box' (the article is invariable and does not agree with the noun).

                   Hova? (Where to?)            Hol? (Where?)                 Honnan? (Where from?)
belül (inside)     a dobozba (into the box)     a dobozban (inside the box)   a dobozból (out of the box)
rajta (on)         a dobozra (onto the box)     a dobozon (on the box)        a dobozról (off the box)
közelében (near)   a dobozhoz (to the box)      a doboznál (at the box)       a doboztól (from near the box)

Table 1: The three-way direction system illustrated with the word doboz

Hungarian is written with Latin letters, yet Hungarian text does not resemble any other European language. Here are two lines of a classic poem with a simple translation (from Kölcsey Ferenc's 1823 poem Hymnus, which is today the national anthem of the Hungarians):

Isten, áldd meg a magyart
Jókedvvel, bőséggel.

God, bless the Hungarian
With merriment and plenty.

Not a single word can be recognised on the basis of the average European vocabulary; Hungarians not only call God Isten, they do not call themselves Hungarian either, but magyar. But there is more to it than different words:

Isten   áldd    meg   a magyart
God     bless   ?     the Hungarian

The word marked with a question mark does not exist in most languages: it is called a verbal prefix (igekötő). It has many different functions; here it expresses completion. One of the beauties (and difficulties) of Hungarian lies precisely in the use of verbal prefixes. But let us look at the second line:

jókedv-vel       bőség-gel
merriment-with   plenty-with

Where English has the preposition with, Hungarian has endings. Hungarian has no prepositions; in our example the suffixes -vel and -gel express what with expresses in English.

One more important Hungarian peculiarity: possession is expressed the other way round compared with the Indo-European languages. For example, in the equivalent of Paul's radio, Hungarian attaches the suffix not to the possessor, Pál, but to the thing possessed, the rádió: Pál rádió-ja, which is as if one said Paul radio-his.

It is a matter of cultural history rather than linguistics that in Hungarian the family name comes first and the given name (Christian name) second, so Liszt Ferenc (= Franz Liszt), Bem József (= Józef Bem), Bartók Béla and Márai Sándor show the usual order.

Hungarian is a so-called synthetic language: it expresses grammatical elements mostly within a single word, by means of affixes, in contrast to analytic languages, which tend to use separate words (prepositions, pronouns, auxiliary verbs). For example, the equivalent of English can is the suffix -hat/-het.

Leó-val    a kocsi-ból    utaz-hat     jár-ogat
with Leo   from the car   can travel   usually goes

The endings must be attached to the word in a strict order, often several in a row, so words can grow quite long. This synthetic way of building words is called agglutination (that is, gluing words together). For example: bolondozhattunk 'we could fool [around]' (= fool-verb-can-past-we); ösztönözhettünk 'we could stimulate' (= stimulus-verb-can-past-we). The two words have the same structure; the apparent difference is caused by the vowels, owing to so-called vowel harmony. Vowels fall into two classes: deep (mély): a á o ó u ú, and high (magas): e é i í ö ő ü ű. In the endings the vowel appears in the form that matches the base word: bolond is deep, so the other vowels are deep too: o - o + o - a - u, while ösztön is high, so the other vowels are high too: ö - ö + ö - e - ü [6].

3.3 RECENT DEVELOPMENTS

In some respects Hungarian has always been a minority language, and it has continuously borrowed words from other languages. Although Hungarian was the language with the most speakers in the region, it never reached an absolute majority: taken together, there were always more speakers of other languages living in the Carpathian Basin: Slavic (primarily Slovak, Serbian and Croatian), and later German, Romanian, Jewish and Roma populations. Latin was used as the official language until the early 19th century; it was the language of administration and scholarship. The Hungarian parliament introduced sessions in Hungarian only from 1844; until then debates were held in Latin.

Hungarian has always been more of an importer than an exporter. Today's Hungarian vocabulary contains numerous words of Slavic, Latin, Romanian and Italian origin. The strongest influence was German, since Hungary was part of the Habsburg Empire for 400 years. There are a great many words of German origin, such as tánc 'dance' and hering 'herring'. Borrowing from other languages continues today: French fritőz 'deep fryer', bagett 'baguette'; Italian maffiózó 'mafioso', paparazzi; English fitnesz 'fitness', szerver 'server', and so on. Most of the recently borrowed words are anglicisms, owing to the strong influence of the American film industry, music and technology.

3.4 LANGUAGE CULTIVATION IN HUNGARY

There are two institutions in Hungary that play an active role in cultivating and promoting the Hungarian language. One is the Balassi Bálint Institute, founded by the Ministry of Education. The other is the Research Institute for Linguistics of the Hungarian Academy of Sciences.

The Balassi Institute is one of the centres of language cultivation in Hungary. It is responsible for presenting Hungarian culture from beyond the borders within Hungary and universal Hungarian culture abroad, much like the German Goethe-Institut or the British Council. It spreads and promotes a unified and universal Hungarian culture around the world while at the same time helping to make Hungarian traditions and culture that exist abroad or beyond the borders known in Hungary. The Balassi Institute also plays a central role in the learning and teaching of Hungarian and in establishing a methodological centre for training [7].

The leading centre of research on the Hungarian language in Hungary is the Research Institute for Linguistics of the Hungarian Academy of Sciences (MTA). The institute was established in 1949 under the supervision of the Ministry of Public Education and came under the supervision of the MTA in 1951. Its core tasks are to carry out scientific research in Hungarian linguistics, general and applied linguistics, Uralic linguistics and phonetics; to prepare the comprehensive dictionary of literary and standard Hungarian and to maintain its archive; and to study the different varieties of Hungarian and the minority languages spoken inside and outside the country, including language policy issues within European integration. As additional tasks, it creates language corpora and databases, develops the linguistic foundations of computer applications, provides a public advisory service and prepares expert opinions. It also takes part in higher education through the MTA-ELTE Theoretical Linguistics Group that operates there [8].

Hungarian spelling is subject to academic regulation: it is regulated by the Spelling Committee of the Research Institute for Linguistics through the publication of spelling rules. Applying the rules is not compulsory, but correct spelling carries prestige in Hungary.

Nowadays many enthusiastic traditionalists argue that neologisms, which come primarily from English, weaken rather than strengthen the Hungarian language. Owing to their language protection efforts, the so-called language law was introduced in 2002, which makes it compulsory to replace all English-language advertisements and slogans with Hungarian ones. Other measures of language cultivation and protection have also been taken: for example, the new media law, which came into force at the beginning of 2011, sets the proportion of Hungarian and foreign music broadcast on television and radio.

3.5 LANGUAGE IN EDUCATION

Hungarian became the official language of administration, scholarship and education in 1844; since then it has also been possible to study in Hungarian in primary schools. After the education reform of 1868, Hungarian became the language of higher-level educational institutions as well. Today a degree can be obtained in Hungarian at numerous higher education institutions in the Carpathian Basin, from Nyitra (Nitra, Slovakia) through the universities and colleges of Hungary to Újvidék (Novi Sad, Serbia) or Kolozsvár (Cluj-Napoca, Romania).

Since the 19th century Hungarian language and literature has played a defining role in education. Hungarian is a compulsory school subject from the age of 6 to 18. In the lower grades of primary school, between the ages of 6 and 10, the curriculum is divided into writing, reading and composition. After the age of 10, Hungarian grammar and literature are taught separately.

According to the 2009 PISA survey, which measured students' reading literacy, the average result of Hungarian students rose compared with 2000 and thus reached the OECD average. This put Hungary in the same group as countries such as France, Germany and the United Kingdom [9].

3.6 INTERNATIONAL ASPECTS

Hungary has given the world numerous world-famous physicists (Teller Ede, Wigner Jenő and Szilárd Leó, participants in the Manhattan Project), mathematicians (Rényi Alfréd, and Erdős Pál, after whom the Erdős number is named) and musicians (Liszt Ferenc, Bartók Béla). Hungarian scientists have won numerous Nobel Prizes in physics, chemistry and medicine.

Like everywhere else in the scientific world, Hungarian researchers face constant pressure to publish. Since a significant proportion of the leading international journals are in English, the role of English keeps growing. The situation is similar in the business world: at large multinational companies English has become the lingua franca in both spoken and written communication. However, according to a 2005 survey, the number of people in Hungary who speak a foreign language is still below the European average: only 35% of Hungarians speak at least one foreign language [10].

Language technology can address this challenge from a different angle, with services such as machine translation or cross-lingual information retrieval, thereby reducing the personal and economic disadvantages of those whose mother tongue is not English.

3.7 HUNGARIAN ON THE INTERNET

In 2009, 61.6% of the population of Hungary were internet users [11]. Among the young generation, aged 14-17, this proportion is higher. Internet penetration is below the European average but rising steadily. In January 2011 the number of domains delegated under the .hu public domains was close to 600,000 [12], and it is clearly growing. About 70,000 registered domains exist in the country outside the .hu system (most of them .com) [13].

According to a 2010 European survey, the use of social networking sites is above the European average, which may be due to the fact that a popular social networking site, iWiW, already existed in Hungary before Facebook appeared. The existence of a fairly active Hungarian-language web community is also shown by the fact that the Hungarian Wikipedia is the 19th largest, ahead of European languages such as Turkish, Romanian or Danish, some of which have more speakers, and of world languages such as Arabic or Korean.

For Hungarian language technology, the growing importance of the internet matters in two respects. On the one hand, the amount of digitally available language data is a rich source for the statistical analysis of language use. On the other hand, the internet is the primary place where language technology applications are used.

The most frequently used application is web search, which presupposes automatic processing of language at several levels, as we will see in detail in the second half of this white paper. Web search requires different, sophisticated language technology for each language. For Hungarian, for example, this includes having to find the forms of nouns, adjectives and verbs with different endings, as well as differing stem variants, as in the case of ló 'horse' and lovak 'horses'.

There is no official law in Hungary ensuring equal opportunities for people with disabilities, but the Public Foundation for Equal Opportunities of Persons with Disabilities (Fogyatékos Személyek Esélyegyenlőségéért Közalapítvány) has drawn up a recommendation on comprehensive accessibility. Among other things, it states that public institutions must make their websites and internet services accessible and usable for people with disabilities as well. User-friendly language technology tools can play a key role in meeting these requirements: for example, speech synthesis makes the content of web pages accessible to blind users.

Internet users and providers can also benefit from language technology in less obvious ways, for example when web content is translated from one language into another. Considering the high cost of human translation, in this case it is worth developing even language technology tools that perform less well than expected. The latter situation may arise because Hungarian is rather complex, and because developing a typical language technology application involves a large number of other technologies.

In the following chapters we give an introduction to language technology and its main application areas, and evaluate the current state of language technology in Hungary.
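The vowel harmony rule and the nine local case endings of Table 1 (Section 3.2) can be sketched in a few lines of code, which also hints at why search and spell checking for Hungarian need morphological knowledge. This is a deliberately simplified illustration: the function names are invented for the example, the rule "any deep vowel makes the stem deep" is an assumption that ignores exceptions, and stem changes (such as the lengthening of a final vowel) and three-way suffixes such as -hoz/-hez/-höz are not handled.

```python
BACK_VOWELS = set("aáoóuú")      # "deep" vowels in the terminology of Section 3.2
FRONT_VOWELS = set("eéiíöőüű")   # "high" vowels

def vowel_class(stem):
    """Classify a stem as 'back' if it contains any back vowel, else 'front'.

    Simplification: real Hungarian has exceptions that this rule ignores.
    """
    return "back" if any(ch in BACK_VOWELS for ch in stem.lower()) else "front"

# The nine local case endings of Table 1 as (back, front) variants.
LOCAL_CASES = {
    ("inside", "to"): ("ba", "be"),
    ("inside", "at"): ("ban", "ben"),
    ("inside", "from"): ("ból", "ből"),
    ("on", "to"): ("ra", "re"),
    ("on", "at"): ("on", "en"),
    ("on", "from"): ("ról", "ről"),
    ("near", "to"): ("hoz", "hez"),
    ("near", "at"): ("nál", "nél"),
    ("near", "from"): ("tól", "től"),
}

def inflect(stem, place, direction):
    """Attach the ending whose vowel matches the vowel class of the stem."""
    back, front = LOCAL_CASES[(place, direction)]
    return stem + (back if vowel_class(stem) == "back" else front)

# Reproduces the rows of Table 1 for doboz 'box'.
for place in ("inside", "on", "near"):
    print([inflect("doboz", place, d) for d in ("to", "at", "from")])

# kert 'garden' contains only front vowels, so it takes the front variant.
print(inflect("kert", "inside", "at"))
```

A search engine that only compares strings would treat doboz, dobozban and dobozból as unrelated words, which is why morphological analysis is needed to map such forms back to a common stem.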


4
LANGUAGE TECHNOLOGY FOR HUNGARIAN

Language technology systems are software specialised in processing natural human language. These technologies are therefore also collectively referred to as natural language processing. Human language occurs in spoken and written form. While speech is the oldest and most natural mode of human communication, complex information, and thus most of human knowledge, generally exists in written form. Speech technology and language technology process and produce these two different forms of human communication, and both use dictionaries, grammar rules and semantics. In other words, language technology uses various forms of knowledge representation, which can be independent of the medium that carries the language (speech or text). Figure 2 illustrates natural language processing as a whole.

Figure 2: Natural language processing (areas shown: speech technology; multimedia and multimodal technologies; language technology; knowledge representation)

In our communication we mix language with other modes and channels of communication. Speech is accompanied by gestures and facial expressions. Digital texts appear together with images and audio. Films present language in both spoken and written form. Speech and language technology thus overlap and cooperate with other technologies, which together support the processing of multimodal communication and multimedia content.

In what follows we discuss the main application areas of language technology: language checking, web search, speech technology and machine translation. These include applications and technologies such as

- spelling correction,
- authoring support systems,
- computer-assisted language learning,
- information retrieval,
- information extraction,
- text summarisation,
- question answering systems,
- speech recognition and
- speech synthesis.

Language technology has an extensive literature; we refer the interested reader to the following works: [14, 15, 16, 17].

Before discussing the application areas listed above, we briefly describe the architecture of a typical language technology system.

4.1 APPLICATION ARCHITECTURES

Typical language technology applications consist of several components that reflect the individual levels of language. Figure 3 shows the simplified architecture of a text processing system. The first three modules process the structure and meaning of the input text:

1. Pre-processing: data cleaning, removal of formatting, identification of the language of the input text, handling of special characters (e.g. Hungarian accented letters), etc.
2. Grammatical analysis: finding the verb and its arguments, uncovering the structure of the sentence.
3. Semantic analysis: disambiguation (what a given word means in the given context), resolving anaphora (who or what pronouns refer to), and representing the meaning of the sentence in some machine-readable form.

These are followed by various task-specific modules, such as automatic summarisation of the input text, database search and the like. All this is a simplified and idealised description of application architecture, which illustrates the complexity of language technology applications.

After presenting the most important application areas, we give a brief overview of the state of language technology research and education, with particular attention to completed and ongoing research programmes. At the end of the chapter we give an expert assessment of the most important language technology tools and resources along dimensions such as availability, maturity and quality. Table 9 on page 29 gives a good overview of the state of Hungarian language technology.

4.2 CORE APPLICATION AREAS

In this section we focus on the most important language technology tools and resources, and give an overview of language technology activities in Hungary.

4.2.1 Language checking

Anyone who has used a word processor such as Microsoft Word has come across a spell checker, which flags spelling errors and suggests corrections. The first spell checkers simply compared the words to be checked against a list of correctly spelled words. Today's tools are far more sophisticated. For text analysis they use language-dependent algorithms that can handle morphology (e.g. plural forms) and also detect sentence-level errors, for example when the finite verb is missing from the sentence, or when the verb and the subject do not agree in number and person (e.g. Én *írsz levelet, where a first-person subject is paired with a second-person verb form). However, most language checkers would find no error in the following text [18]:

I have a spelling checker,
It came with my PC.
It plane lee marks four my revue
Miss steaks aye can knot sea.
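The behaviour of the earliest spell checkers, which compared each word against a list of correctly spelled words, can be sketched as follows. The word list and the function name are made up for this illustration and stand in for a real dictionary; the point is only that such a checker accepts the poem above, because every word in it is a valid English word.

```python
import re

# A tiny word list standing in for a real dictionary (hypothetical data).
WORD_LIST = {
    "i", "have", "a", "spelling", "checker", "it", "came", "with", "my", "pc",
    "plane", "lee", "marks", "four", "revue", "miss", "steaks", "aye", "can",
    "knot", "sea", "plainly", "for", "review", "mistakes", "not", "see",
}

def unknown_words(text, word_list=WORD_LIST):
    """Return the words of `text` that do not appear in the word list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in word_list]

POEM = """I have a spelling checker,
It came with my PC.
It plane lee marks four my revue
Miss steaks aye can knot sea."""

print(unknown_words(POEM))                        # [] : nothing is flagged
print(unknown_words("I have a speling checker"))  # ['speling']
```

Because the check looks at one word at a time, it cannot notice that "plane lee" should have been "plainly"; detecting that requires looking at the context of each word.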

Figure 3: Architecture of a typical text processing application (components shown: input text; pre-processing; grammatical analysis; semantic analysis; task-specific modules; output)

Handling errors of this type requires, in most cases, an analysis of the context as well. Hungarian has inflected word forms that can carry different meanings: for example, várunk can be the first person plural form of the verb vár ('we wait'), or the noun vár with a possessive suffix ('our castle').

Handling this phenomenon requires either language-specific grammar rules, which means a high level of expert work, or statistical language models, with which the probability of a given word occurring in a given environment can be computed. For example, várunk is probably not a verb if the sentence already contains another finite verb. Statistical language models can be generated automatically from large text collections containing verified data, also known as corpora. This approach was developed primarily for English data, but it can be applied to Hungarian too. It must be borne in mind, however, that the methods cannot be carried over unchanged because of the agglutinative character and free word order of Hungarian.

Figure 4: Language checking (bottom: rule-based; top: statistical; components shown: statistical language models; input text; spelling check; grammar check; correction proposals)

The use of language checkers is not limited to word processors; they are also used in so-called authoring support systems, that is, software environments in which manuals and other documentation are written to special standards in fields such as information technology, health care and engineering. Fearing customer complaints about damage caused by incorrect or hard-to-understand manuals, companies place ever greater emphasis on the quality of technical documentation, in international contexts as well (translation, localisation). Advances in natural language processing have also brought progress in authoring support systems: the authors of technical documentation are assisted by dictionaries, terminology databases and syntactic rules that follow the requirements of the given field.

Given the strongly agglutinative character of Hungarian, a Hungarian spell checker must contain a morphological analyser component so that it can handle inflected and compound words. The first Hungarian spell checker was developed by MorphoLogic Kft. [19] in the 1980s, by combining a spell-checking module with a morphological model. The Helyes-e? software package can be used with Microsoft Office, QuarkXPress, Adobe InDesign and other word processors and desktop publishing programs. MorphoLogic has also developed grammar and style checkers, which recognise errors that word-level checkers cannot find because those examine the text word by word rather than in context. The program does not necessarily report errors; it issues warnings. Most of the warnings point to actual errors, while others only draw attention to a possible error. In the latter case the user has to decide whether there really is an error.

An open-source spell checker also exists for Hungarian. Hunspell [20] is based on MySpell and has been integrated into OpenOffice, Mozilla Firefox and Thunderbird, as well as Google Chrome.

Besides spell checking and authoring support, language checking also plays an important role in computer-assisted language learning, and it is used in web search engines to correct queries automatically, for example in Google's search suggestions.

4.2.2 Web search

Searching the web, intranets or digital libraries is probably the most widely used and yet least developed language technology application at present. The Google search engine started in 1998 and today handles 80% of all the world's queries [21]. The verb guglizni has become widespread in Hungarian, although it has not yet made it into printed dictionaries. Neither Google's query interface nor the presentation of the result list has changed significantly since the first version. The current version does, however, have a checker that corrects typing errors, and a basic semantic search has recently been built in, which increases accuracy by examining the search expression in context [22]. Google's success story shows that with a large amount of data and efficient indexing technology, the statistical approach can produce satisfactory results.

However, if we want to obtain more complex information, deeper linguistic knowledge is needed to interpret the text. Lexical resources such as machine-readable thesauri and WordNet-like ontologies improve the effectiveness of search by also taking into account the synonyms of the search expression (e.g. atomenergia, magenergia, nukleáris energia, all meaning 'nuclear energy') and related words.

The next generation of search engines will have to apply much more sophisticated language technology, especially where the query contains a question or another type of sentence rather than just a list of words. Imagine, for example, a query such as "List the companies that were acquired in the last five years!" To find the relevant answer, the sentence must be analysed at the syntactic and semantic level, and an index is needed that allows the relevant documents to be retrieved quickly. To give a satisfactory answer, a full syntactic analysis of the sentence must be carried out, and it must be established that the user is interested in companies that were acquired, not in companies that acquired others. In addition, the time expression must be processed to determine which years are meant. Finally, the processed query must be matched against a large amount of unstructured data to find the information the user is looking for. This, that is, searching and ranking the relevant hits, is called information retrieval. Moreover, to obtain a list of companies, we have to extract from the documents the information that a given sequence of words refers to a company. This kind of information extraction is carried out by automatic named entity recognisers.

Figure 5: Architecture of web search (components shown: web pages; pre-processing; semantic analysis; indexing; user query; pre-processing; query analysis; matching and relevance; search results)

Even more language technology is needed to find a search expression in documents written in other languages. For cross-lingual information retrieval, the search expression first has to be translated into all possible source languages, and the hits then have to be translated back into the target language.

The growing share of data in non-textual form has created a demand for multimedia information retrieval services, that is, for searching in images, audio and video. For audio and video files, a speech recognition module is also needed, which converts speech into text that can then be searched.

Since Hungarian does not have as fixed a word order as, for example, English, the development of Hungarian parsers cannot rely on the linear structure of the sentence alone. Case suffixes and postpositions provide a handle, however, since they determine the roles of the constituents. Verbs and their complements form the basis of sentence structure, which is why so-called verb frame lexicons are important. Such a database has been developed by staff of the Research Institute for Linguistics of the Hungarian Academy of Sciences, and it can be built into higher-level analysis applications, for example a rule-based syntactic parser. Several such parsers exist for Hungarian: one has been built into the Szeged Treebank, another into the rule-based machine translation system MetaMorpho.

The main research directions of the companies and research groups working on language technology include the development of trend and text analysis tools that integrate natural language processing applications in order to find relevant information in unstructured text. For this purpose, Hungarian morphological analysers and disambiguators as well as named entity recognisers are available, most of which are based on statistical learning algorithms.

There is a general-purpose Hungarian metasearch engine, PolyMeta [23], which makes it possible to query any number of databases and sources accessible over the internet at the same time. A joint list is produced from the results, in which the items are ordered by importance. The metasearch engine uses natural language processing and information retrieval algorithms to analyse the search expressions and to rank the hits.

It is not only small and medium-sized enterprises that develop information extraction tools in Hungary. Numerous projects are running at various universities and research institutes that aim to develop semantic search systems or to build Hungarian ontologies (e.g. Magyar WordNet, Magyar Egységes Ontológia).

4.2.3 Speech technology

Speech technology provides the basis for building interfaces that allow users to communicate with machines in natural human language rather than only through a graphical interface, keyboard or mouse. Today such speech interfaces are used for the partial or full automation of certain services. In business they are used primarily by banks and by companies dealing with logistics, transport and telecommunications. Speech technology is also used in car navigation systems and in smartphones as an alternative to the graphical interface.

Speech technology comprises the following four main technology areas:

1. Automatic speech recognition determines which words the user has uttered.
2. Syntactic analysis and semantic interpretation make it possible to analyse the syntactic structure of the user's utterance and to map its semantic interpretation according to the purposes of the given system.
3. Dialogue management carries out the appropriate system action, the database query, on the basis of the linguistic characteristics of the input and the individual settings of the given user and task.
4. Speech synthesis technology is used so that the machine produces the appropriate speech output.

One of the greatest challenges is automatic speech recognition, that is, having the system recognise the words uttered by the user as accurately as possible. This can be done in two ways: either the expressions available to the user are reduced to a limited set of keywords, or language models are produced that cover a larger share of natural language expressions. Using machine learning techniques, language models can be generated automatically from corpora, that is, from large collections of audio files and their text transcriptions. While the keyword method results in a rigid, hard-to-use speech interface and less acceptable output, producing and fine-tuning language models drives up costs considerably.

20

speech interfaces that use language models enjoy greater acceptance among users and are preferable to the less flexible, system-driven approach.

[Figure 6: Structure of a simple speech-based dialogue system. Components: speech input, signal processing, recognition, natural language understanding and dialogue, phonetic lookup and intonation planning, speech synthesis, speech output.]

As for the output side of the speech interface, companies increasingly use pre-recorded utterances. For static utterances, where the speech does not depend on a particular context or on the user's data, this method yields adequate user satisfaction. However, the more dynamic the content to be played, the worse the prosody of a sentence assembled from pieces becomes, because the audio files have to be spliced together, even though today's speech synthesis systems perform better and better thanks to increasingly natural prosody.

Important standardisation steps have been taken on the speech technology market over the past decades, both for the interfaces between the various technology components and for products built for particular applications. The market has consolidated intensively over the past ten years, mainly in speech recognition and synthesis. The national markets of the G20 countries are dominated by fewer than five companies, such as Nuance (USA) and Loquendo (Italy), to mention only the most prominent ones on the European market. In 2011 Nuance announced the acquisition of Loquendo, a further step towards market consolidation.

Because of the special character of Hungarian, methods that are widely used around the world can be adapted to Hungarian only with difficulty or not at all. On the other hand, methods developed specifically for Hungarian may be easy to apply to similarly agglutinative languages such as Finnish, Turkish or Arabic.

The Hungarian speech synthesis market is dominated by research groups working at the Budapest University of Technology and Economics (BME) [24]. The most widely used Hungarian speech synthesiser is Provox, which has been available since 2002 and has been built into several applications: SMS and e-mail reading software, car and mobile phone GPS systems, and a screen-reading service for e-books, all of which can help integrate visually impaired people into the information society. A highly controllable interactive development environment is also available to support special research purposes (e.g. psychological and prosodic studies). The software is based on text-to-speech (TTS) technology. It can be used to create experimental signals with specified acoustic and prosodic content under controlled conditions.

A Hungarian pronunciation dictionary containing 1.5 million word forms is also available. On this basis, a tool can be built that converts Hungarian text into (phonetic) symbols.

Alongside the above-mentioned and other university research groups (e.g. the Institute of Informatics at the University of Szeged), speech recognition in Hungary

is also pursued by smaller enterprises such as Alkalmazott Logikai Laboratórium, Aitia and Digital Natives Kft. Despite the linguistic difficulties mentioned above, several Hungarian speech recognition applications have been developed in recent years. The Department of Telecommunications and Media Informatics at BME has developed a statistical continuous speech recognition engine and development environment, as well as a word-boundary detection application based on fixed stress for Hungarian and Finnish, which was preceded by a cross-lingual study. As a result of joint work by the research groups mentioned above, various speech recognisers for medical reporting have also been built, which support medical examinations through direct speech-to-text conversion. There is also an application that helps children with speech impairments learn to speak and that can be used in several other European languages besides Hungarian. Speech recognition systems can support many further practical applications, such as telephone call handling or call-centre routing.

In the future, the spread of smartphones, alongside the traditional telephone, the internet and e-mail, will bring significant changes to customer relationship management, which will also affect speech technology. In the long run there will be fewer telephone-based user interfaces, and spoken language will play a much more central role as a user-friendly input. This is largely due to progress in the accuracy of speaker-independent speech recognition, made through dictation systems that are already built-in services for smartphone users.

4.2.4 Machine Translation

The idea of using digital computers to translate natural languages first arose in 1946. In the fifties the idea was followed by funding, which, however, resumed only in the eighties. Nevertheless, machine translation has still not lived up to the high initial hopes. At the most basic level, machine translation is simple substitution: a word of one natural language is replaced by a word of another. This procedure works only for texts with a very narrow vocabulary and a formalised language (e.g. weather reports).

Good translation of less constrained texts requires larger text units (phrases, sentences or whole paragraphs) to be matched to their counterparts in the other language. The main problem is that human language is often ambiguous, which poses challenges for language processing at every level. At the lexical level this calls for word sense disambiguation (e.g. nyúl can be both an action, 'reach', and an animal, 'rabbit'), while at the sentence level even noun phrases with an instrumental suffix can cause difficulties, as in these sentences:

A rendőr látta az embert a távcsővel. (The policeman saw the man with the telescope.)
A rendőr látta az embert a revolverrel. (The policeman saw the man with the revolver.)

One approach to the task is based on grammatical rules. For closely related languages, direct translation may be feasible even for the examples above.

In general, rule-based (or knowledge-driven) systems work by first analysing the input text, then building an intermediate, symbolic representation, and generating the target-language output from the latter. The performance of these systems depends heavily both on lexicons containing morphological, syntactic and semantic information and on grammar rules meticulously worked out by expert linguists, all of which take long and costly work to create.
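The lowest level of machine translation described above, replacing each word with a word of the other language, can be illustrated with a minimal Python sketch. The toy lexicon, the `substitute` function and the weather-report phrases are hypothetical examples introduced only for illustration; they are not taken from any system mentioned in this paper.

```python
# Toy word-for-word substitution. LEXICON and substitute() are illustrative
# assumptions, not part of any real translation system.
LEXICON = {
    "holnap": "tomorrow",
    "eső": "rain",
    "várható": "expected",
}


def substitute(sentence, lexicon):
    """Replace each token by its lexicon entry; mark unknown tokens with <...>."""
    translated = []
    for token in sentence.lower().split():
        translated.append(lexicon.get(token, f"<{token}>"))
    return " ".join(translated)


print(substitute("Holnap eső várható", LEXICON))
print(substitute("Holnap hó várható", LEXICON))
```

The second call illustrates the limitation noted above: a word outside the narrow vocabulary (here hó, 'snow') is left untranslated, and nothing in the procedure handles word order, inflection or ambiguity.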

[Figure 7: Machine translation (left: statistical; right: rule-based). Components: source text, text analysis (formatting, morphology, syntax, etc.), statistical machine translation, translation rules, text generation, target text.]

From the late eighties onwards, as computers became cheaper and more powerful, interest in statistical models for machine translation grew. The parameters of these statistical models can be computed from parallel corpora such as the Europarl parallel corpus, which contains the proceedings of the European Parliament in 21 European languages. Given enough data, statistical machine translation can give a fairly good estimate of the meaning of a foreign-language text. Unlike rule-based systems, however, statistical (also called data-driven) machine translation systems often produce ungrammatical output. On the other hand, data-driven systems have several advantages: besides requiring less human effort, they can handle peculiarities of language (for example idiomatic expressions) that rule-based systems cannot.

Because the strengths and weaknesses of rule-based and statistical methods complement each other, researchers nowadays tend to work on hybrid systems that combine the two approaches. This can be done in several ways. One is to use both methods and select the best output for each sentence. A better solution is to assemble the best sentence parts from the different outputs, which can be a rather complex task, since there is not always a clear correspondence between the sentence parts.

Machine translation is especially difficult for Hungarian. Free word order and separable verbal prefixes cause problems during analysis, while the rich inflectional system makes it challenging to generate word forms with the appropriate case suffix.

Despite the difficulties, both rule-based and data-driven solutions exist on the Hungarian machine translation market. MorphoLogic Kft. offers both desktop machine translation packages and an online translation service. MorphoWord translates between English and Hungarian in both directions. Their system combines rule-based and statistical methods, but the main component assigns an internal representation to the text to be translated and then converts it into target-language text.

Large bilingual texts are of key importance for statistical machine translation. The Hunglish corpus is a freely available, sentence-aligned Hungarian-English parallel corpus containing 54.2 million words in 2.07 million sentences. It is currently the largest Hungarian-English parallel corpus. The sentences were aligned with the hunalign tool, which was developed by BME researchers and is one of the

most widely used sentence-level aligners. The Hunglish sentence store can be accessed and queried through an online interface [25], so it can also be used as a rough translator or as a bilingual dictionary.

The iTranslate4.eu [26] project was launched in March 2010. Its aim is to provide a machine translation service that not only covers all languages of the European Union but also delivers the best available translation quality for every language pair. The plan is carried out by a consortium formed by the operators of Europe's best machine translation systems, whose members each provide the best translation for at least one language pair. The project has two Hungarian participants: the consortium is led by the Research Institute for Linguistics of the Hungarian Academy of Sciences (MTA Nyelvtudományi Intézete), while the common interface to the services is provided by MorphoLogic Kft.

Machine translation can greatly improve efficiency, especially if it can be adapted intelligently to user-specific terminology and integrated into the appropriate working environment. Interactive translation support systems of this kind exist for Hungarian too, for example the memoQ package developed by Kilgray Kft. [27].

There is still great potential for improving the quality of machine translation systems, for example by adapting language resources to a given application domain, or by integrating them with the working environments used for terminology databases and translation memories. A problem is that most current systems are English-centred, and translation from and into Hungarian works only into and from English. This causes disruptions in the translation workflow and forces machine translation users to learn to use the different lexical tools of different systems.

Evaluation campaigns help to compare machine translation systems, the different methods and the systems working on different language pairs. Table 8, produced within the European Commission's Euromatrix+ project, shows the performance by language pair for 22 of the 23 official European languages (Irish is not included in the comparison). The results were ranked by BLEU score [28], under which a better translation receives a higher score. A human translation would score about 80 points.

The best results (marked in green and blue) were achieved by languages that have benefited from coordinated programmes and for which a sufficient amount of parallel corpora is available (e.g. English, French, Dutch, Spanish and German). Languages with weaker results are highlighted in red. These either lacked research investment or differ structurally from the other languages (e.g. Hungarian, Maltese and Finnish).

4.3 OTHER APPLICATION AREAS

Behind language technology applications there is a range of subtasks that generally do not surface at the user level but play an important role within the system, often working in the background as part of larger systems. They form significant research directions and claim scientific fields of their own within computational linguistics.

One such actively researched area is question answering, for which annotated corpora (corpora enriched with linguistic information) are built and scientific competitions are held. The idea is to move beyond keyword-based search (in which the search engine returns a complete list of potentially relevant documents) and to create a system in which

Rows: source language. Columns: target language (Célnyelv).

        EN   BG   DE   CS   DA   EL   ES   ET   FI   FR   HU   IT   LT   LV   MT   NL   PL   PT   RO   SK   SL   SV
EN       - 40.5 46.8 52.6 50.0 41.0 55.2 34.8 38.6 50.1 37.2 50.4 39.6 43.4 39.8 52.3 49.2 55.0 49.0 44.7 50.7 52.0
BG    61.3    - 38.7 39.4 39.6 34.5 46.9 25.5 26.7 42.4 22.0 43.5 29.3 29.1 25.9 44.9 35.1 45.9 36.8 34.1 34.1 39.9
DE    53.6 26.3    - 35.4 43.1 32.8 47.1 26.7 29.5 39.4 27.6 42.7 27.6 30.3 19.8 50.2 30.2 44.1 30.7 29.4 31.4 41.2
CS    58.4 32.0 42.6    - 43.6 34.6 48.9 30.7 30.5 41.6 27.4 44.3 34.5 35.8 26.3 46.5 39.2 45.7 36.5 43.6 41.3 42.9
DA    57.6 28.7 44.1 35.7    - 34.3 47.5 27.8 31.6 41.3 24.2 43.8 29.7 32.9 21.1 48.5 34.3 45.4 33.9 33.0 36.2 47.2
EL    59.5 32.4 43.1 37.7 44.5    - 54.0 26.5 29.0 48.3 23.7 49.6 29.0 32.6 23.8 48.9 34.2 52.5 37.2 33.1 36.3 43.3
ES    60.0 31.1 42.7 37.5 44.4 39.4    - 25.4 28.5 51.3 24.0 51.7 26.8 30.5 24.6 48.8 33.9 57.3 38.1 31.7 33.9 43.7
ET    52.0 24.6 37.3 35.2 37.8 28.2 40.4    - 37.7 33.4 30.9 37.0 35.0 36.9 20.5 41.3 32.0 37.8 28.0 30.6 32.9 37.3
FI    49.3 23.2 36.0 32.0 37.9 27.2 39.7 34.9    - 29.5 27.2 36.6 30.5 32.5 19.4 40.6 28.8 37.5 26.5 27.3 28.2 37.6
FR    64.0 34.5 45.1 39.5 47.4 42.8 60.9 26.7 30.0    - 25.5 56.1 28.3 31.9 25.3 51.6 35.7 61.0 43.8 33.1 35.6 45.8
HU    48.0 24.7 34.3 30.0 33.0 25.5 34.1 29.6 29.4 30.7    - 33.5 29.6 31.9 18.1 36.1 29.8 34.2 25.7 25.6 28.2 30.5
IT    61.0 32.1 44.3 38.9 45.8 40.6 26.9 25.0 29.7 52.7 24.2    - 29.4 32.6 24.6 50.5 35.2 56.5 39.3 32.5 34.7 44.3
LT    51.8 27.6 33.9 37.0 36.8 26.5 21.1 34.2 32.0 34.4 28.5 36.8    - 40.1 22.2 38.1 31.6 31.6 29.3 31.8 35.3 35.3
LV    54.0 29.1 35.0 37.8 38.5 29.7  8.0 34.2 32.4 35.6 29.3 38.9 38.4    - 23.3 41.5 34.4 39.6 31.0 33.3 37.1 38.0
MT    72.1 32.2 37.2 37.9 38.9 33.7 48.7 26.9 25.8 42.4 22.4 43.7 30.2 33.2    - 44.0 37.1 45.9 38.9 35.8 40.0 41.6
NL    56.9 29.3 46.9 37.0 45.4 35.3 49.7 27.5 29.8 43.4 25.3 44.5 28.6 31.7 22.0    - 32.0 47.7 33.0 30.1 34.6 43.6
PL    60.8 31.5 40.2 44.2 42.1 34.2 46.2 29.2 29.0 40.0 24.5 43.2 33.2 35.6 27.9 44.8    - 44.1 38.2 38.2 39.8 42.1
PT    60.7 31.4 42.9 38.4 42.8 40.2 60.7 26.4 29.2 53.2 23.8 52.8 28.0 31.5 24.8 49.3 34.5    - 39.4 32.1 34.4 43.9
RO    60.8 33.1 38.5 37.8 40.3 35.6 50.4 24.6 26.2 46.5 25.0 44.8 28.4 29.9 28.7 43.0 35.8 48.5    - 31.5 35.1 39.4
SK    60.8 32.6 39.4 48.1 41.0 33.3 46.2 29.8 28.4 39.4 27.4 41.8 33.8 36.7 28.5 44.4 39.0 43.3 35.3    - 42.6 41.8
SL    61.0 33.1 37.9 43.5 42.6 34.0 47.0 31.1 28.8 38.2 25.7 42.3 34.6 37.3 30.0 45.9 38.2 44.1 35.8 38.9    - 42.7
SV    58.5 26.9 41.0 35.6 46.6 33.3 46.6 27.4 30.9 38.9 22.7 42.0 28.2 31.0 23.7 45.6 32.2 44.2 32.7 31.3 33.5    -

Table 8: Machine translation between 22 official EU languages [29]
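The scores in Table 8 are BLEU values. As a rough illustration of how such a score rewards overlap with a reference translation, the following Python sketch computes a simplified, single-reference, sentence-level variant: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty and scaled to 0-100. The function names `ngram_counts` and `simple_bleu` are assumptions made for this example. The official metric is computed over a whole test corpus and may use several references, so this sketch does not reproduce the figures in the table.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-100 scale (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(overlap / total) / max_n
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(log_precision_sum)


print(simple_bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))
```

Because this sketch applies no smoothing, a single n-gram order without any match drives the score to zero, which is one reason why the metric is normally reported over a whole test set rather than per sentence.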
the user can ask a specific question and receives a specific answer. For example:

Question: How old was Neil Armstrong when he stepped on the Moon?
Answer: 38.

This research area is very similar to the web search mentioned above, but question answering is rather an umbrella term for research questions such as what types of questions can be distinguished and how they should be handled; how sets of documents that potentially contain the answer can be analysed and compared (what happens if they contain conflicting answers?); and how the answer can be reliably extracted from a document while taking the context into account.

This in turn is related to information extraction, which was a very popular task at the time of the statistical turn in computational linguistics in the early nineties. Information extraction systems aim to identify units that carry specific information in different types of text, for example to recognise the key players of company takeovers in newspaper articles. Another typical use is extracting information from reports on terrorist attacks: who the attacker was, what the target, time and place of the attack were, what its consequences were, and so on. Domain-specific information extraction is also an excellent example of language technology working in the background: it is a well-delimited research area, but it is really useful only when built into other applications.

Text summarisation and text generation are two borderline areas that sometimes appear as standalone applications and sometimes as supporting background components of a larger system. Summarisation produces a shorter version of a longer text. This function is already available, for example, in Microsoft

Word. It works largely on a statistical basis: the system first identifies the important words in the text (typically words that are frequent in the given text but not in general), and then selects the sentences that contain many important words. The summary is built from these sentences. In this case, which is also the most popular one, summarisation amounts to filtering sentences, so that the text is reduced to a subset of its sentences. All commercially available summarisers are based on this idea. Another method, by contrast, creates new sentences, that is, sentences that do not appear in the same form in the source text. This requires a deeper understanding of the text and is therefore less robust.

In the vast majority of cases, text summarisation and generation applications appear embedded in a larger software environment, for example in clinical information systems, in which patient data are collected, stored and processed, and in which report generation is only one of the functions.

Question answering and text generation are far less developed for Hungarian than for English, where scientific competitions have been held regularly in these research areas since the nineties, primarily with the support of the American agencies DARPA and NIST. These competitions have greatly advanced the field, but they have always concentrated on English. Some competitions included multilingual tasks, but Hungarian was never among them. As a result, few Hungarian corpora and other resources of the kind needed for these tasks exist. Summarisation systems generally work on a purely statistical basis and are language-independent, but only a few prototypes of them are available. Text generation modules, by contrast, are language-dependent and likewise work mostly for English only. Despite all this, several named entity recognition, biomedical information extraction, trend analysis and opinion mining applications exist for Hungarian.

4.4 LANGUAGE TECHNOLOGY IN EDUCATION

Language technology is a typical interdisciplinary field: it requires expertise in linguistics, computer science, mathematics, philosophy, psycholinguistics and neuroscience alike. Probably for this reason it has not yet found its place in the Hungarian education system. So far no Hungarian university has a department of computational linguistics. Language technology is nevertheless taught at some related departments. A few universities offer courses on the subject at bachelor's or master's level, while elsewhere language technology modules have been set up within other programmes, mainly within linguistics. But even these courses and modules do not have a long history: they started only in the past few years. At present, language technology is taught in some form at six Hungarian universities.

Despite the considerable efforts of recent years, the education of future generations of language technologists in Hungarian higher education is still far from the appropriate level. The Hungarian language technology community aims to develop the curriculum of a BA/BSc-MA/MSc-PhD sequence that fits into the general European system. A further problem is the low pay of young researchers and, partly as a consequence, their drift away from the profession. Competitive scholarships should be established for them, and the links between industry and educational institutions

should be strengthened by making it possible to place part of their training with industrial partners.

4.5 NATIONAL PROJECTS

As in other countries, the beginnings of natural language processing in Hungary are tied to machine translation. The first attempts took place in the sixties, when translation was still from Russian into Hungarian. In the seventies and eighties lexicographic work gave the next impetus: it led to the development of the first Hungarian morphological system. In those years there were no organised national framework programmes, and Hungary was also cut off from European funding opportunities.

After the change of regime, in the nineties, university departments (e.g. the Language Technology Group of the University of Szeged) and research institute departments (e.g. the Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences) were set up in the field one after another. In the past ten years the number of projects with European and national funding has risen sharply; the latter were supported primarily by the National Office for Research and Technology (NKTH) and the Ministry of Education.

As a consequence, Hungarian researchers have developed a fair number of databases (corpora, dictionaries, lexical databases) and text processing tools (spell checkers, morphological analysers, etc.) over the past decade. The various workshops operated in isolation for a long time, which is why similar tools were created independently of one another (for example, there are at least three Hungarian morphological analysers). These generally have incompatible formats, are not standardised, are poorly documented, and in many cases their legal status is unclear. Despite all this, or perhaps because of it, the international wave of standardisation and unification has reached Hungary too in the past few years. Several projects aiming at integration and interoperability have been launched, for example projects to build a unified Hungarian ontology or to harmonise the different coding systems of morphological analysers.

In 2008 leading Hungarian research and development groups created the Nyelv- és Beszédtechnológiai Platform (Language and Speech Technology Platform) [30] in order to strengthen and promote innovation in language and speech technology through coordinated work. By providing an official framework, the Platform brings together the major Hungarian centres of language and speech technology research and development, and thereby promotes the sharing and integration of the high-level knowledge accumulated in centres that had so far worked in relative isolation; it draws up detailed strategic plans and implementation plans based on them; it conveys the Platform's analyses, strategies and proposals to interested parties in the IT sector; it presents and represents Hungarian views and interests on the international scene; and it raises awareness of the Platform's results among potential users in the Hungarian economy, with special regard to small and medium-sized enterprises. Some members of the Platform also take part in the CLARIN project.

Together with other, smaller projects, the projects listed have brought progress in Hungary both in language technology itself and in building the basic technological infrastructure. However, funding for language technology projects in Hungary and in Europe is still low compared with what the USA spends on translation and multilingual information access [31].

As we have seen, the programmes so far have increased the number of language technology tools and resources for Hungarian. The next section summarises the current state of language technology for Hungarian.

4.6 AVAILABILITY OF TOOLS AND RESOURCES

Table 9 summarises the current state of language technology support for Hungarian. The ratings of the individual technologies and resources are based on estimates by leading experts. The scores for the 7 criteria range from 0 (very low) to 6 (very high).

The key findings on technologies and resources for Hungarian are the following:

- Speech recognition and machine translation are pursued in numerous research groups in Hungary, yet hardly any freely usable tools and resources exist. This is quite typical of Hungarian language technology: apart from a few refreshing exceptions, the number of open-source programs and freely usable databases is rather low.
- Most resources are not standardised; sustainability was not part of the plans when they were created. The existing databases should be standardised in organised programmes, following the appropriate guidelines.
- The standard preprocessing steps (tokenisation, morphological analysis, shallow syntactic parsing, etc.) can be regarded as solved for Hungarian, but more complex semantic processing requires further research.
- The more semantics a tool uses, the larger the gaps in the technology (see the difference between text analysis and text interpretation). Deeper linguistic analysis requires more effort.
- Research has succeeded in creating some high-quality software, but most resources are non-standard and cannot be maintained efficiently. Organised programmes are needed to standardise the data and to establish common data exchange formats.
- There is a shortage of corpora annotated with syntactic, semantic and discourse structure.
- The more semantics a tool uses, the harder it is to produce suitable data for its development.
- Semantic standards (RDF, OWL, etc.) for knowledge bases that capture world knowledge do exist, but they are difficult to apply to natural language processing tasks.
- There are some excellent specialised corpora, but no very large syntactically annotated corpus. There is, however, a manually syntactically annotated corpus of 1.2 million tokens that is freely accessible and has already served as the basis of numerous applications.

In summary, software with limited functionality is available in several specific areas of Hungarian language research. Clearly, further research is needed to fill the current gaps in deeper, semantic-level text processing and in creating the speech corpora needed for speech recognition.

4.7 CROSS-LANGUAGE COMPARISON

The state of language technology differs considerably from one language community to another. To compare the situation of the languages, this section presents an evaluation based on two sample application areas (machine translation and speech technology), one underlying technology (text analysis), and the resources needed to build language technology applications.

The languages were grouped on the following five-point scale:

Cluster 1: excellent language technology support
Cluster 2: good support
Cluster 3: moderate support
Cluster 4: fragmentary support
Cluster 5: weak or no support

Language technology support was grouped according to the following criteria:

Speech technology: the quality of existing speech recognition and speech synthesis technologies, the coverage of domains, the number and size of existing speech corpora, and the number and variety of available speech-based applications.

Machine translation: the quality of existing machine translation technologies, the number of language pairs covered, the coverage of linguistic phenomena and domains, the quality and size of existing parallel corpora, and the number and variety of available machine translation applications.

Text analysis: the quality and coverage of existing text analysis technologies (morphology, syntax, semantics), the coverage of linguistic phenomena and domains, the number and variety of available applications, the quality and size of existing (annotated) text corpora, and the quality and coverage of existing lexical resources (e.g. WordNet) and grammars.

Resources: the quality and size of existing text, speech and parallel corpora, and the quality and coverage of existing lexical resources and grammars.

[Table 9: State of language technology support for Hungarian. Columns: quantity, availability, quality, coverage, maturity, sustainability, adaptability. Rows under language technology (tools, technologies and applications): speech recognition, speech synthesis, grammatical analysis, semantic analysis, text generation, machine translation. Rows under language resources (resources, data and knowledge bases): text corpora, speech corpora, parallel corpora, lexical resources, grammars.]

Tables 10, 11, 12 and 13 show that, thanks to the language technology funding of recent decades, Hungarian is in a relatively favourable position. However, Hungarian resources and tools do not reach the quality and coverage of the corresponding English ones, which lead in almost every area of language technology. And of course English language resources also have gaps, especially as regards high-quality applications.

As for speech technology, current technologies perform well enough to be integrated successfully into industrial applications such as dialogue and dictation systems. The text analysis components and language resources available today cover part of the phenomena observed in Hungarian and form part of several applications offering shallow linguistic analysis, such as spell checking and some information extraction tasks.

Building more complex applications such as machine translation, however, requires resources and technologies that cover a wider range of linguistic phenomena and also allow deep semantic analysis of text. By improving the quality and coverage of basic resources and technologies, we can open up new opportunities for further application areas, including high-quality machine translation.

4.8 CONCLUSIONS

With the white paper series, the first important step has been taken towards a comprehensive survey of language technology for 30 European languages and a comparison of these languages. By identifying gaps and needs, the European language technology community and the leaders of related fields now have the opportunity to draw up a research and development programme aimed at creating a truly multilingual Europe.

We have seen that there are large differences among Europe's languages. While good-quality software and resources exist for some languages and application areas, others (usually the smaller languages) have fundamental gaps. For several languages, the basic technologies needed for text analysis and the resources indispensable for developing them are missing. Others have these tools and resources, but semantic processing causes difficulties for them too. Considerable effort is needed to reach the ambitious goal of providing good-quality machine translation for every European language.

The state of language technology in Hungary gives grounds for cautious optimism. Language technology research does exist in Hungary, although largely with state funding. Numerous technologies and resources are available for Hungarian, though far fewer than for English, and they are not sufficient to meet the needs of a truly multilingual knowledge-based society.

Technologies developed for and optimised for English are difficult to transfer to Hungarian. Because of the special character of Hungarian, English-based systems developed for syntactic analysis typically perform poorly on Hungarian text.

The Hungarian language technology market is relatively small. The market for Hungarian natural language processing is dominated primarily by university research groups and academic institutes, but smaller companies are also present alongside them.

From the above it is clear that more funding is needed for research and development and innovation in Hungary and for the creation of Hungarian-language tools and resources. The need for large amounts of data and the high complexity of language technology systems make it essential to create the common infrastructure required for cooperation.

The continuity of research and development funding is inadequate. Short-term programmes alternate with periods of low funding, and between the EU countries and

the European Commission there is a general lack of programme coordination.

In summary, there is a great need for a well-coordinated programme that focuses on overcoming the differences in language technology readiness among European languages and treats the European languages as a single unit.

The long-term goal of META-NET is to introduce high-quality language technology for all languages in order to achieve political and economic unity. Technology can help to tear down existing barriers and to build bridges between Europe's languages. To this end, all decision-makers in politics, research, business and society must join forces in the future.

Excellent support: none
Good support: English
Moderate support: Czech, Finnish, French, Dutch, German, Italian, Portuguese, Spanish
Fragmentary support: Basque, Bulgarian, Danish, Estonian, Galician, Greek, Irish, Catalan, Polish, Hungarian, Norwegian, Swedish, Serbian, Slovak, Slovene
Weak or no support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian

Table 10: Language clusters for speech technology

Excellent support: none
Good support: English
Moderate support: French, Spanish
Fragmentary support: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
Weak/no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish

11: Language clusters for machine translation


Excellent support: none
Good support: English
Moderate support: Dutch, French, German, Italian, Spanish
Fragmentary support: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
Weak/no support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

12: Language clusters for text analysis

Excellent support: none
Good support: English
Moderate support: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
Fragmentary support: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
Weak/no support: Icelandic, Irish, Latvian, Lithuanian, Maltese

13: Language clusters for resources


5
ABOUT META-NET

META-NET is a Network of Excellence funded by the European Commission, which currently has 54 members from 33 countries [32]. META-NET supports META (Multilingual Europe Technology Alliance), a growing community of experts and institutions working on language technology in Europe. META-NET fosters the creation and maintenance of the technological foundations for a multilingual European information society that: enables communication and cooperation across languages; gives every language user access to information and knowledge; uses and advances networked information technology.

The network supports a Europe that functions as a single digital market and information space. META-NET encourages the creation of multilingual technologies for all European languages. These technologies make automatic translation, information processing and knowledge management available in a wide range of applications, and provide intuitive language-based interfaces in every field, from household electronics through machinery and transport to robotics.

META-NET was launched on 1 February 2010 with the aim of carrying forward activities already under way along three main lines of action: META-VISION, META-SHARE and META-RESEARCH.

META-VISION supports the emergence of a dynamic and influential community of decision-makers organised around a shared vision and a strategic research agenda built on it. The main goal of this activity is to create a cohesive and unified language technology community in Europe by bringing together fragmented and isolated groups of decision-makers. This cooperation is illustrated by the fact that this white paper was prepared together with the publications for 29 other languages, while the technology vision was put together by three different sectoral groups. The META Technology Council was established to discuss and prepare the strategic research agenda based on the shared vision, in close cooperation with the language technology community.

META-SHARE aims to create an open system that makes it possible to share language resources. The peer-to-peer network will contain language data, tools and web services that are documented with metadata and organised into standardised categories. The resources will be easy to access and uniformly searchable. The available resources include free, open-source tools as well as commercially available, paid services.

META-RESEARCH builds bridges between related technology fields. This line of action tries to transfer the innovative results of other fields into language technology and put them to use. Its activities focus primarily on the development of high-quality machine translation, on data collection and data preparation, and on producing the language resources needed to evaluate tools. Its further goals include compiling an inventory of the available tools and methods, and organising workshops and training events for members of the community.

office@meta-net.eu http://www.meta-net.eu


1
EXECUTIVE SUMMARY
Information technology changes our everyday lives. We typically use computers for writing, editing, calculating, and information searching, and increasingly for reading, listening to music, viewing photos and watching movies. We carry small computers in our pockets and use them to make phone calls, write emails, get information and entertain ourselves, wherever we are. How does this massive digitization of information, knowledge and everyday communication affect our language? Will our language change or even disappear?

All our computers are linked together into an increasingly dense and powerful global network. The girl in Ipanema, the officer in Budapest and the engineer in Delhi can all chat with their friends on Facebook, but they are unlikely ever to meet one another in online communities and forums. If they are worried about how to treat earache, they will all check Wikipedia to find out all about it, but even then they won't read the same article. When Europe's netizens discuss the effects of the Fukushima nuclear accident on European energy policy in forums and chat rooms, they do so in cleanly-separated language communities. What the Internet connects is still divided by the languages of its users. Will it always be like this?

Many of the world's 6,000 languages will not survive in a globalised digital information society. It is estimated that at least 2,000 languages are doomed to extinction in the decades ahead. Others will continue to play a role in families and neighbourhoods, but not in the wider business and academic world. What are the Hungarian language's chances of survival?

With its approx. 13 million speakers, Hungarian is 12th on the list of the most populous European languages. It is the official language of the Republic of Hungary, where ca. 97% of the population of 10 million claims Hungarian as their native language. It is also spoken by Hungarian communities in the seven neighbouring countries, the largest one being an approx. 1.5 million community in Romania. Additionally, emigrant communities use it worldwide, primarily in the United States, Canada and Israel.

The Hungarian language is an island in Europe: most European languages belong to the Indo-European family of languages, but not Hungarian. It is a Finno-Ugric language, related to Finnish, Estonian and a number of minority languages spoken in the Baltic states and in Russia. It is the most widely spoken non-Indo-European language in Europe, but contrary to world languages such as English and Chinese, or to more commonly used European languages such as German and French, Hungarian does not play a prominent role on the international scene.

There are plenty of complaints in Hungary about the ever-increasing use of Anglicisms, and some even fear that the Hungarian language will become riddled with English words and expressions. But our study suggests that this is misguided. The Hungarian language has already survived the impact of new words and terms from the Old Turkish on the steppes, then later from the Slavs in the Carpathian basin. Moreover, Hungary was a part of the Ottoman Empire for 150 years in the 16th and 17th centuries, then a part of the Habsburg Empire till the first half of the 20th century. In those times the Latin and German influence was the strongest. One good antidote to losing our lovely little Hungarian words and phrases is to actually use them frequently and consciously; linguistic polemics about foreign influences and government regulations do not usually help. Our main concern should be not the gradual Anglicisation of our language, but rather its complete disappearance from major areas of our personal lives. By this we do not mean science, aviation and the global financial markets, but the many areas of life in which it is far more important to be close to a country's citizens than to international partners: domestic policies, for example, administrative procedures, the law, culture and shopping.

The status of a language depends not only on the number of speakers, but also on the presence of the language in the digital information space and software applications. The existence of a quite active Hungarian-speaking web community is well demonstrated by the fact that the Hungarian Wikipedia is the 19th largest, ranking higher than commonly used European languages such as Turkish, Romanian or Danish, and world languages such as Arabic or Korean. A few important international software products are available in Hungarian versions; however, due to the special characteristics of Hungarian, the adaptation of English-based applications is quite difficult. Another reason that hinders the development of expensive technologies for Hungarian is the fact that the Hungarian market is quite small.

In the field of language technology, we can be cautiously optimistic about the current state of Hungarian language technology support. There is a viable LT research community in Hungary, which has been supported in the past by national and recently, increasingly, European funding. Currently both of the two EU-funded projects that are coordinated by Hungary in the competitive ICT field come from the language technology domain. A number of large-scale resources and state-of-the-art technologies have been produced and distributed for Hungarian. However, the scope of the resources and the range of tools are still very limited when compared to the resources and tools for the English language, and they are simply not sufficient in quality and quantity to develop the kind of technologies required to support a truly multilingual knowledge society.

Information and communication technology are now preparing for the next revolution. After personal computers, networks, miniaturisation, multimedia, mobile devices and cloud-computing, the next generation of technology will feature software that understands not just spoken or written letters and sounds but entire words and sentences, and supports users far better because it speaks, knows and understands their language. Forerunners of such developments are the free online service Google Translate that translates between 57 languages, its European counterpart itranslate4.eu (the product of a Hungarian-led consortium), IBM's supercomputer Watson that was able to defeat the US champion in the game of Jeopardy, and Apple's mobile assistant Siri for the iPhone that can react to voice commands and answer questions in English, German, French and Japanese.

The next generation of information technology will master human language to such an extent that human users will be able to communicate using the technology in their own language. Devices will be able to automatically find the most important news and information from the world's digital knowledge store in reaction to easy-to-use voice commands. Language-enabled technology will be able to translate automatically or assist interpreters; summarise conversations and documents; and support users in learning scenarios.

The next generation of information and communication technologies will enable industrial and service robots (currently under development in research laboratories) to faithfully understand what their users want them to do and then proudly report on their achievements.

This level of performance means going way beyond simple character sets and lexicons, spell checkers and pronunciation rules. The technology must move on from simplistic approaches and start modelling language in an all-encompassing way, taking syntax as well as semantics into account to understand the drift of questions and generate rich and relevant answers.

However, there is a yawning technological gap between English and Hungarian, and it is currently getting wider. There is a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of sparse or zero funding. In addition, there is an overall lack of coordination with programmes in other EU countries and at the European Commission level. As a result, Hungary (and Europe in general) lost several very promising high-tech innovations to the US, where there is greater continuity in strategic research planning and more financial backing for bringing new technologies to the market. In the race for technology innovation, an early start with a visionary concept will only ensure a competitive advantage if you can actually make it over the finish line. Otherwise all you get is an honorary mention in Wikipedia.

Nevertheless, there is still a high research potential in Hungary and the EU. Apart from internationally renowned research centres and universities, there are a number of innovative small- and medium-sized language technology companies that manage to survive through sheer creativity and immense efforts, despite the lack of venture capital or sustained public funding.

Although Hungary has supported important developments in corpus building and language resource generation, language technology resources and tools for Hungarian clearly do not yet reach the quality and coverage of comparable resources and tools for the English language, which is in the lead in almost all language technology areas. Every international technology competition tends to show that results for the automatic analysis of English are far better than those for Hungarian. This holds true for extracting information from texts, grammar checking, machine translation and a whole range of other applications.

Many researchers reckon that these setbacks are due to the fact that, for fifty years now, the methods and algorithms of computational linguistics and language technology application research have first and foremost focused on English. However, other researchers believe that English is inherently better suited to computer processing. And languages such as Spanish and French are also a lot easier to process than Hungarian using current methods. This means that we need a dedicated, consistent, and sustainable research effort if we want to use the next generation of information and communication technology in those areas of our private and work life where we live, speak and write Hungarian.

Summing up, despite the prophets of doom the Hungarian language is not in danger, even from the prowess of English language computing. However, the whole situation could change dramatically when a new generation of technologies really starts to master human languages effectively. Through improvements in machine translation, language technology will help in overcoming language barriers, but it will only be able to operate between those languages that have managed to survive in the digital world. If there is adequate language technology available, then it will be able to ensure the survival of languages with very small populations of speakers. If not, even larger languages will come under severe pressure.

The dentist jokingly warns: Only brush the teeth you want to keep. The same principle also holds true for research support policies: You can study every language under the sun all you want, but if you really intend to keep them alive, you also need to develop technologies to support them.


2
LANGUAGES AT RISK: A CHALLENGE FOR
LANGUAGE TECHNOLOGY
We are witnesses to a digital revolution that is dramatically impacting communication and society. Recent developments in information and communication technology are sometimes compared to Gutenberg's invention of the printing press. What can this analogy tell us about the future of the European information society and our languages in particular?

The digital revolution is comparable to Gutenberg's invention of the printing press.

After Gutenberg's invention, real breakthroughs in communication were accomplished by efforts such as Luther's translation of the Bible into vernacular language. In subsequent centuries, cultural techniques have been developed to better handle language processing and knowledge exchange:

- the orthographic and grammatical standardisation of major languages enabled the rapid dissemination of new scientific and intellectual ideas;
- the development of official languages made it possible for citizens to communicate within certain (often political) boundaries;
- the teaching and translation of languages enabled exchanges across languages;
- the creation of editorial and bibliographic guidelines assured the quality of printed material;
- the creation of different media like newspapers, radio, television, books, and other formats satisfied different communication needs.

In the past twenty years, information technology has helped to automate and facilitate many processes:

- desktop publishing software has replaced typewriting and typesetting;
- Microsoft PowerPoint has replaced overhead projector transparencies;
- e-mail allows documents to be sent and received more quickly than using a fax machine;
- Skype offers cheap Internet phone calls and hosts virtual meetings;
- audio and video encoding formats make it easy to exchange multimedia content;
- web search engines provide keyword-based access;
- online services like Google Translate produce quick, approximate translations;
- social media platforms such as Facebook, Twitter and Google+ facilitate communication, collaboration, and information sharing.

Although these tools and applications are helpful, they are not yet capable of supporting a fully-sustainable, multilingual European society in which information and goods can flow freely.


2.1 LANGUAGE BORDERS HOLD BACK THE EUROPEAN INFORMATION SOCIETY

We cannot predict exactly what the future information society will look like. However, there is a strong likelihood that the revolution in communication technology is bringing together people who speak different languages in new ways. This is putting pressure both on individuals to learn new languages and especially on developers to create new technology applications to ensure mutual understanding and access to shareable knowledge. In the global economic and information space, there is increasing interaction between different languages, speakers and content thanks to new types of media. The current popularity of social media (Wikipedia, Facebook, Twitter, YouTube, and, recently, Google+) is only the tip of the iceberg.

A global economy and information space confronts us with different languages, speakers and content.

Today, we can transmit gigabytes of text around the world in a few seconds before we recognise that it is in a language that we do not understand. According to a recent report from the European Commission, 57% of Internet users in Europe purchase goods and services in non-native languages; English is the most common foreign language followed by French, German and Spanish. 55% of users read content in a foreign language while 35% use another language to write e-mails or post comments on the Web [2]. A few years ago, English might have been the lingua franca of the Web (the vast majority of content on the Web was in English), but the situation has now drastically changed. The amount of online content in other European (as well as Asian and Middle Eastern) languages has exploded.

Surprisingly, this ubiquitous digital linguistic divide has not gained much public attention; yet, it raises a very pressing question: Which European languages will thrive in the networked information and knowledge society, and which are doomed to disappear?

2.2 OUR LANGUAGES AT RISK

While the printing press helped step up the exchange of information in Europe, it also led to the extinction of many European languages. Regional and minority languages were rarely printed and languages such as Cornish and Dalmatian were limited to oral forms of transmission, which in turn restricted their scope of use. Will the Internet have the same impact on our modern languages?

The wide variety of languages in Europe is one of its richest and most important cultural assets.

Europe's approximately 80 languages are one of our richest and most important cultural assets, and a vital part of this unique social model [3]. While languages such as English and Spanish are likely to survive in the emerging digital marketplace, many European languages could become irrelevant in a networked society. This would weaken Europe's global standing, and run counter to the strategic goal of ensuring equal participation for every European citizen regardless of language. According to a UNESCO report on multilingualism, languages are an essential medium for the enjoyment of fundamental rights, such as political expression, education and participation in society [4].


2.3 LANGUAGE TECHNOLOGY IS A KEY ENABLING TECHNOLOGY

In the past, investments in language preservation focussed primarily on language education and translation. According to one estimate, the European market for translation, interpretation, software localisation and website globalisation was €8.4 billion in 2008 and is expected to grow by 10% per annum [5]. Yet this figure covers just a small proportion of current and future needs in communicating between languages. The most compelling solution for ensuring the breadth and depth of language usage in Europe tomorrow is to use appropriate technology, just as we use technology to solve our transport and energy needs among others.

Language technology targeting all forms of written text and spoken discourse can help people to collaborate, conduct business, share knowledge and participate in social and political debate regardless of language barriers and computer skills. It often operates invisibly inside complex software systems to help us already today to:

- find information with a search engine;
- check spelling and grammar in a word processor;
- view product recommendations in an online shop;
- follow the spoken directions of a navigation system;
- translate web pages via an online service.

Language technology consists of a number of core applications that enable processes within a larger application framework. The purpose of the META-NET language white papers is to focus on how ready these core enabling technologies are for each European language.

Europe needs robust and affordable language technology for all European languages.

To maintain our position in the frontline of global innovation, Europe will need language technology, tailored to all European languages, that is robust and affordable and can be tightly integrated within key software environments. Without language technology, we will not be able to achieve a really effective interactive, multimedia and multilingual user experience in the near future.

2.4 OPPORTUNITIES FOR LANGUAGE TECHNOLOGY

In the world of print, the technology breakthrough was the rapid duplication of an image of a text using a suitably powered printing press. Human beings had to do the hard work of looking up, assessing, translating, and summarising knowledge. We had to wait until Edison to record spoken language, and again his technology simply made analogue copies.

Language technology can now simplify and automate the processes of translation, content production, and knowledge management for all European languages. It can also empower intuitive speech-based interfaces for household electronics, machinery, vehicles, computers and robots. Real-world commercial and industrial applications are still in the early stages of development, yet R&D achievements are creating a genuine window of opportunity. For example, machine translation is already reasonably accurate in specific domains, and experimental applications provide multilingual information and knowledge management, as well as content production, in many European languages.

As with most technologies, the first language applications such as voice-based user interfaces and dialogue systems were developed for specialised domains, and often exhibit limited performance. However, there are huge market opportunities in the education and entertainment industries for integrating language technologies into games, edutainment packages, libraries, simulation environments and training programmes. Mobile information services, computer-assisted language learning software, eLearning environments, self-assessment tools and plagiarism detection software are just some of the application areas in which language technology can play an important role. The popularity of social media applications like Twitter and Facebook suggests a need for sophisticated language technologies that can monitor posts, summarise discussions, suggest opinion trends, detect emotional responses, identify copyright infringements or track misuse.

Language technology helps overcome the disability of linguistic diversity.

Language technology represents a tremendous opportunity for the European Union. It can help to address the complex issue of multilingualism in Europe: the fact that different languages coexist naturally in European businesses, organisations and schools. However, citizens need to communicate across the language borders of the European Common Market, and language technology can help overcome this final barrier, while supporting the free and open use of individual languages. Looking even further ahead, innovative European multilingual language technology will provide a benchmark for our global partners when they begin to support their own multilingual communities. Language technology can be seen as a form of assistive technology that helps overcome the disability of linguistic diversity and makes language communities more accessible to each other. Finally, one active field of research is the use of language technology for rescue operations in disaster areas, where performance can be a matter of life and death: future intelligent robots with cross-lingual language capabilities have the potential to save lives.

2.5 CHALLENGES FACING LANGUAGE TECHNOLOGY

Although language technology has made considerable progress in the last few years, the current pace of technological progress and product innovation is too slow. Widely-used technologies such as the spelling and grammar correctors in word processors are typically monolingual, and are only available for a handful of languages. Online machine translation services, although useful for quickly generating a reasonable approximation of a document's contents, are fraught with difficulties when highly accurate and complete translations are required. Due to the complexity of human language, modelling our tongues in software and testing them in the real world is a long, costly business that requires sustained funding commitments. Europe must therefore maintain its pioneering role in facing the technological challenges of a multiple-language community by inventing new methods to accelerate development right across the map. These could include both computational advances and techniques such as crowdsourcing.

Technological progress needs to be accelerated.

2.6 LANGUAGE ACQUISITION IN HUMANS AND MACHINES

To illustrate how computers handle language and why it is difficult to program them to process different tongues, let's look briefly at the way humans acquire first and second languages, and then see how language technology systems work.

Humans acquire language skills in two different ways. Babies acquire a language by listening to the real interactions between their parents, siblings and other family members. From the age of about two, children produce their first words and short phrases. This is only possible because humans have a genetic disposition to imitate and then rationalise what they hear.

Learning a second language at an older age requires more cognitive effort, largely because the child is not immersed in a language community of native speakers. At school, foreign languages are usually acquired by learning grammatical structure, vocabulary and spelling using drills that describe linguistic knowledge in terms of abstract rules, tables and examples.

Humans acquire language skills in two different ways: learning from examples and learning the underlying language rules.

Moving now to language technology, the two main types of systems acquire language capabilities in a similar manner. Statistical (or data-driven) approaches obtain linguistic knowledge from vast collections of concrete example texts. While it is sufficient to use text in a single language for training, e.g., a spell checker, parallel texts in two (or more) languages have to be available for training a machine translation system. The machine learning algorithm then learns patterns of how words, short phrases and complete sentences are translated.

This statistical approach usually requires millions of sentences to boost performance quality. This is one reason why search engine providers are eager to collect as much written material as possible. Spelling correction in word processors, and services such as Google Search and Google Translate, all rely on statistical approaches. The great advantage of statistics is that the machine learns quickly in a continuous series of training cycles, even though quality can vary randomly.

The second approach to language technology, and to machine translation in particular, is to build rule-based systems. Experts in the fields of linguistics, computational linguistics and computer science first have to encode grammatical analyses (translation rules) and compile vocabulary lists (lexicons). This is very time consuming and labour intensive. Some of the leading rule-based machine translation systems have been under constant development for more than 20 years. The great advantage of rule-based systems is that the experts have more detailed control over the language processing. This makes it possible to systematically correct mistakes in the software and give detailed feedback to the user, especially when rule-based systems are used for language learning. However, due to the high cost of this work, rule-based language technology has so far only been developed for a few major languages.

As the strengths and weaknesses of statistical and rule-based systems tend to be complementary, current research focusses on hybrid approaches that combine the two methodologies. However, these approaches have so far been less successful in industrial applications than in the research lab.

As we have seen in this chapter, many applications widely used in today's information society rely heavily on language technology, particularly in Europe's economic and information space. Although this technology has made considerable progress in the last few years, there is still huge potential to improve the quality of language technology systems. In the next section, we describe the role of Hungarian in European information society and assess the current state of language technology for the Hungarian language.
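
The statistical approach described in this chapter, in which a spell checker is trained on text in a single language, can be illustrated with a minimal sketch. It is not taken from any system mentioned in this paper: the toy Hungarian corpus, the function names `train`, `edits1` and `correct`, and the rule of considering only candidates one edit away are all assumptions made for illustration. A real system would be trained on millions of sentences.

```python
import re
from collections import Counter


def train(corpus):
    """Count word frequencies in monolingual example text (the 'training')."""
    return Counter(re.findall(r"\w+", corpus.lower()))


def edits1(word, alphabet):
    """Return all strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + swaps + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit of the input."""
    if word in counts:
        return word
    # The alphabet is learned from the data, so accented letters are covered.
    alphabet = "".join(sorted({ch for known in counts for ch in known}))
    candidates = [w for w in edits1(word, alphabet) if w in counts]
    if not candidates:
        return word  # nothing similar was seen in the training text
    return max(candidates, key=counts.__getitem__)


# Hypothetical toy corpus; assumed purely for illustration.
corpus = "a nyelv a kultúra része a nyelv él a magyar nyelv szép"
counts = train(corpus)
print(correct("nyelw", counts))  # prints "nyelv"
```

Nothing in the sketch encodes Hungarian spelling rules: all of its knowledge comes from the example text, which is why more training text gives better coverage. A rule-based checker would instead rely on a hand-built lexicon and grammar.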


3
THE HUNGARIAN LANGUAGE IN THE
EUROPEAN INFORMATION SOCIETY
3.1 GENERAL FACTS

Hungarian is the most widely spoken non-Indo-European language in Europe. It is the official language of the Republic of Hungary, where ca. 97% of the population of 10 million claims Hungarian as their native language. It is also spoken by Hungarian communities in the seven neighbouring countries, the largest one being an approximately 1.5 million community in Romania. Additionally, emigrant communities use it worldwide, primarily in the United States, Canada and Israel. With its 13 million speakers, Hungarian is 12th on the list of the most populous European languages [6]. Abroad, Hungarian is an official language in Vojvodina, as well as in three municipalities in Slovenia. Hungarian is officially recognised as a minority or regional language in Austria, Croatia, Romania, Ukraine, and Slovakia.

It is interesting that Hungarian barely has any major variety: its dialects differ very little from each other and from the standard, and spelling is particularly uniform. This may be the result of a long-term co-existence, which by continuously clashing with other languages may have launched speakers on a road to standardisation. According to the traditional categorisation, there are seven dialects identified in the present area of Hungary. These dialects are, for the most part, mutually intelligible. Two additional Hungarian dialects exist in Romania: Székely and Csángó.

There is scant difference between the Hungarian used in the Republic of Hungary and that used in neighbouring countries. Of course, minor but characteristic differences are present. While the variety in Hungary developed under fundamentally German influence, Romanian Hungarian has been mostly influenced by Romanian. The Csángó minority group has been largely isolated from other Hungarian people, and they therefore preserved a dialect closely resembling medieval Hungarian.

Hungarian is the most widely spoken non-Indo-European language in Europe.

3.2 PARTICULARITIES OF THE HUNGARIAN LANGUAGE

Most European languages belong to the Indo-European family of languages, but not Hungarian! It is a Uralic language, part of the Ugric group, related to Finnish, Estonian and a number of minority languages spoken in the Baltic states and in Russia.

Uralic languages share a few ancient characteristics, such as:

- There is no gender in Hungarian: the same word (ő) expresses the concepts of both he and she.
- There are only two verb tenses: present and past. Their variations and the future tense may be circumscribed.

43

- The so-called direction triad: there are 3x3 of each set of location cases, as shown by table 1 using the word doboz (box) (the determiner a (the) not being subject to declension).

Hungarian is written in Roman letters; nonetheless, Hungarian texts do not resemble any other European language. Below are two lines from a classic poem (Ferenc Kölcsey's 1823 poem Hymnus, which forms the lyrics of the Hungarian national anthem), in simple literal translation:

Isten, áldd meg a magyart
Jókedvvel, bőséggel.

God, bless the Hungarian
With merriment and plenty.

Not a single word in the text is recognisable on the basis of the average European vocabulary; not only do Hungarians refer to God as Isten, they do not even call themselves Hungarian; they call themselves magyar. But there is more to this than differences in individual words:

Isten    áldd     meg    a      magyart
God      bless    ?      the    Hungarian

The word denoted with the question mark does not exist in most languages: its name is igekötő (verbal prefix). It plays a multitude of roles, here expressing the perfect tense, i.e., it indicates a completed action. One of the beauties (and difficulties) of the Hungarian language lies precisely within the usage of verbal prefixes. Now let us examine the second line:

jókedv-      -vel    bőség-    -gel
merriment    with    plenty    with

Where English uses the preposition 'with', Hungarian uses suffixes. Hungarian does not feature any prepositions. In this example the suffixes -vel and -gel express what English expresses by means of 'with'.

Another important feature of Hungarian is the possessive structure, the reverse of its counterparts in Indo-European languages. For example, in 'Paul's radio', Hungarian does not attach a suffix to the possessor (Paul), but rather to his possession, the radio: Pál rádió-ja, literally: 'Paul radio-his'.

It is more of a cultural-historical rather than a linguistic curiosity that in Hungarian the family name comes first, with the utónév (given name, Christian name) behind, thus the regular order is Liszt Ferenc (=Franz Liszt), Bem József (=Józef Bem), Bartók Béla, Márai Sándor, etc.

Hungarian is called a synthetic language: for the most part, it expresses grammatical elements in a single word form using affixes, as opposed to isolating languages, which tend to employ separate words, e.g., prepositions, pronouns, auxiliaries, for expressing grammatical phenomena. For example, the Hungarian equivalent of the English auxiliary 'can' is the suffix -hat/-het.

Leó-val     a kocsi-ból     utaz-hat      jár-ogat
with Leo    from the car    can travel    usually goes

Suffixes, often multiple ones, must be attached to the word stem in strict order, thus words may grow to stunning lengths. This type of synthetic word formation is called agglutination (meaning 'gluing' of words). For example: bolondozhattunk 'we could fool [around]' (=fool-verb-can-past-we); ösztönözhettünk 'we could stimulate' (=stimulus-verb-can-past-we). The structure of the two words is identical; the apparent difference is caused by the difference of vowels, which is due to the so-called vowel harmony (also known as assimilation). The vowels are relegated into one of two classes: deep: a á o ó u ú, and high: e é i í ö ő ü ű. In the suffixes, the vowel appears to fit the stem: bolond is deep, thus the vowels in the suffixes are also deep: o - o + o - a - u, while ösztön is high, therefore the other vowels are high as well: ö - ö + ö - e - ü [6].
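The vowel-harmony rule described above can be illustrated with a short Python sketch. It is a deliberate simplification under stated assumptions rather than a morphological analyser: the function names harmony_class and pick_suffix are hypothetical, a stem is treated as deep as soon as it contains any deep vowel, and both the rounded third variant of some suffixes and consonant assimilation (as in bőség-gel) are ignored.

```python
# Simplified sketch of Hungarian vowel harmony (illustrative only).
DEEP_VOWELS = set("aáoóuú")      # "deep" vowels, as listed in the text
HIGH_VOWELS = set("eéiíöőüű")    # "high" vowels, as listed in the text


def harmony_class(stem: str) -> str:
    """Return 'deep' or 'high' for a stem.

    Assumption: any deep vowel makes the whole stem deep; a stem that
    contains only high vowels is high.
    """
    vowels = [ch for ch in stem.lower() if ch in DEEP_VOWELS | HIGH_VOWELS]
    if not vowels:
        raise ValueError(f"no vowel found in {stem!r}")
    return "deep" if any(v in DEEP_VOWELS for v in vowels) else "high"


def pick_suffix(stem: str, deep_form: str, high_form: str) -> str:
    """Attach the suffix variant that matches the stem's harmony class."""
    suffix = deep_form if harmony_class(stem) == "deep" else high_form
    return stem + suffix


if __name__ == "__main__":
    # -hat/-het corresponds to the English auxiliary "can".
    print(pick_suffix("utaz", "hat", "het"))       # utazhat
    print(pick_suffix("ösztönöz", "hat", "het"))   # ösztönözhet
    # -ban/-ben corresponds to "inside".
    print(pick_suffix("doboz", "ban", "ben"))      # dobozban
```

The two-way choice is enough for suffix pairs such as -hat/-het or -ban/-ben; suffixes with three variants would need an extra rounded class.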


                    Hova?           Hol?              Honnan?
                    Where to?       Where?            Where from?

belül (inside)      a dobozba       a dobozban        a dobozból
                    into the box    inside the box    out of the box

rajta (on)          a dobozra       a dobozon         a dobozról
                    onto the box    on the box        off the box

közelében (near)    a dobozhoz      a doboznál        a doboztól
                    to the box      at the box        from near the box

Table 1: The so-called direction triad using the word doboz (box)
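The table above can also be read as a small lookup structure: three spatial relations crossed with three directions. The Python sketch below generates the nine forms for a given stem. The dictionary layout and the names TRIAD, variant_index and direction_triad are assumptions made for illustration; the stem is assumed to end in a consonant, and vowel harmony is reduced to a three-way choice (deep, high, high-rounded).

```python
# Sketch: generate the 3x3 "direction triad" of Hungarian location cases.
# Each cell lists the suffix variants as (deep, high, high-rounded).
TRIAD = {
    "inside": {"where to": ("ba", "be", "be"),
               "where": ("ban", "ben", "ben"),
               "where from": ("ból", "ből", "ből")},
    "on": {"where to": ("ra", "re", "re"),
           "where": ("on", "en", "ön"),
           "where from": ("ról", "ről", "ről")},
    "near": {"where to": ("hoz", "hez", "höz"),
             "where": ("nál", "nél", "nél"),
             "where from": ("tól", "től", "től")},
}

DEEP = set("aáoóuú")
ROUNDED_HIGH = set("öőüű")
ALL_VOWELS = DEEP | ROUNDED_HIGH | set("eéií")


def variant_index(stem: str) -> int:
    """Return 0 for deep stems, 1 for high stems, 2 for high-rounded stems.

    Assumption: any deep vowel makes the stem deep; otherwise the last
    vowel decides between the plain and the rounded high variant.
    """
    vowels = [ch for ch in stem.lower() if ch in ALL_VOWELS]
    if not vowels:
        raise ValueError(f"no vowel found in {stem!r}")
    if any(v in DEEP for v in vowels):
        return 0
    return 2 if vowels[-1] in ROUNDED_HIGH else 1


def direction_triad(stem: str) -> dict:
    """Build the nine case forms of a consonant-final stem."""
    idx = variant_index(stem)
    return {relation: {direction: stem + forms[idx]
                       for direction, forms in cells.items()}
            for relation, cells in TRIAD.items()}


if __name__ == "__main__":
    for relation, cells in direction_triad("doboz").items():
        print(relation, cells)
```

For doboz the sketch reproduces the nine forms of the table; vowel-final stems and irregular stems would need additional rules.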

3.3 RECENT DEVELOPMENTS

In a way, Hungarian has always been a minority language that continuously adopted words into its vocabulary from other peoples present in the Carpathian basin: Slavs (primarily Slovaks, Serbs, Croats), and later German, Romanian, Jewish and Roma populations. Latin was used as the official language as late as the beginning of the 19th century, being the language of public administration and science. The Hungarian Parliament introduced legislative sessions in Hungarian only from 1844 onward.

Hungarian has always been more of an importer than an exporter. The current vocabulary contains numerous words borrowed from Slavic, Latin, Romanian, and Italian. The German influence was the strongest, since Hungary was a part of the Habsburg Empire for 400 years. There is a vast number of words of German origin, including tánc 'dance' and hering 'herring'. Lexical borrowing continues to this day: from French fritőz 'friteuse', bagett 'baguette'; from Italian maffiózó 'Mafioso', paparazzi; from English fitnesz 'fitness', szerver 'server', etc. Nowadays loan words are usually Anglicisms, due to the strong influence of American films, popular music, and technology (including the Internet).

3.4 OFFICIAL LANGUAGE PROTECTION IN HUNGARY

Hungary has two main institutional bodies that play an active role in the promotion of the Hungarian language: the Balassi Bálint Institute, which was founded by the Ministry of Education, and the Research Institute for Linguistics of the Hungarian Academy of Sciences.

The Balassi Bálint Institute, founded in 2002, was launched to promote Hungarian language culture, analogously to the well-known British Council and Goethe Institute. It contributes to the teaching of Hungarian language and Hungarian studies for foreigners living in Hungary. In co-operation with the international network of institutions for Hungarian studies, it promotes the education and research of the Hungarian language and culture abroad. The Balassi Institute foresees the cultivation of the Hungarian language and education of Hungarians living beyond the borders of Hungary, participates in the linguistic and terminological follow-up training of teachers of Hungarian and other

experts abroad, as well as organises courses of Hungarian studies and minority rights [7].

The Research Institute for Linguistics is among the leading institutions in the field of research on the Hungarian language. It was founded in 1949, and placed under the direction of the Hungarian Academy of Sciences in 1951. Its primary tasks include research in Hungarian linguistics, general, theoretical and applied linguistics, Uralic linguistics, and phonetics. The Institute's tasks include the preparation of a comprehensive dictionary of the Hungarian language, and the maintenance of its archival materials. Its research projects investigate various aspects of Hungarian as well as minority languages in and outside Hungary, and deal with issues of language policy within the framework of European integration. Further activities include the compilation of linguistic corpora and databases, and the laying of the linguistic groundwork for language technology applications. Besides, the Institute operates a public counselling service on language and linguistics, and runs the Theoretical Linguistics undergraduate and PhD programmes, jointly with Eötvös Loránd University [8].

Hungarian orthography is under strict academic control: the rules of the Spelling Committee of the Hungarian Academy of Sciences are the ones intended for use. The regulations are not obligatory, but misspellings can certainly cause loss of prestige.

These days, many enthusiastic traditionalists argue that the neologisms originating from the English language threaten Hungarian rather than enrich it. As a result of their language protecting activities, the so-called 'language law' was ratified in 2002, which demands that English advertisements and slogans be replaced by Hungarian equivalents. Additional measures for protecting the status of the Hungarian language have also been taken: for example, a television and radio quota regulating the percentage of music sung in Hungarian was introduced at the beginning of 2011.

3.5 LANGUAGE IN EDUCATION

From 1844, when Hungarian became the official language of public administration, science and education, elementary school children have had the possibility to have lessons in Hungarian. After the Education Reform Act of 1868, Hungarian became the language of higher education. Diplomas in Hungarian may be earned at numerous institutions of higher education beyond the border at Hungarian universities and colleges: from Nyitra (Nitra, Slovakia) all the way to Újvidék (Novi Sad, Serbia) or Kolozsvár (Cluj-Napoca, Romania).

From the 19th century onward, Hungarian language and literature have played an important role in education. The study of Hungarian is compulsory from age 6 to 18. In elementary school, from age 6 to 10, the teaching requirements are divided into key areas of reading, writing and composition. After age 10, grammar and literature are taught separately.

According to the PISA 2009 study, which aims to measure the reading literacy of teenagers, Hungary joined the group of countries whose results are not statistically significantly different from the OECD average. The overall reading score in Hungary is comparable with those of Germany, France and the UK [9].

3.6 INTERNATIONAL ASPECTS

Hungary has a great number of world famous physicists (Ede Teller, Jenő Wigner and Leó Szilárd, who contributed to the Manhattan Project), mathematicians (Alfréd Rényi, Paul Erdős, the latter being the namesake of the Erdős number), and musicians (Franz Liszt, Béla Bartók). Hungarian scientists have won several Nobel prizes in physics, chemistry, and medicine.

As everywhere in the scientific world, Hungarian scholars face a great deal of pressure to publish in international, English-language journals, which leads to a self-perpetuating cycle that stresses the importance of English. The situation is similar in the business world: in many large and internationally active companies, English has become the lingua franca, both in written and oral communication.

However, according to a survey in 2005 [10], the number of foreign language speakers in Hungary is below the European average: the percentage of Hungarian people who speak at least one foreign language is 35%. Language technology can address this challenge from a different perspective by offering services such as machine translation or cross-lingual information retrieval, and thereby help diminish personal and economic disadvantages naturally faced by non-native speakers of English.

3.7 HUNGARIAN ON THE INTERNET

In 2009, 61.6% of the people in Hungary were Internet users [11]. Among young people aged 14-17, the proportion was even higher. The Internet penetration was below the European average, but it has been increasing steadily. In January 2011, the number of domains under the .hu public domains was near 600,000, and is rising [12]. About 70,000 domains exist in the country outside of the .hu system (most of them .com) [13].

According to a European study in 2010, the usage of community pages such as Facebook is above the European average, which may be due to the pre-existence of a quite popular Hungarian community site called iWiW. The existence of a quite active Hungarian-speaking web community is also reflected by the fact that the Hungarian Wikipedia is the 19th largest, before more commonly used European languages such as Turkish, Romanian or Danish, and world languages such as Arabic or Korean.

The growing importance of the Internet is critical for language technology. The vast amount of digital language data is a key resource for analysing the usage of natural language, in particular, for collecting statistical information about patterns. And the Internet offers a wide range of application areas for language technology.

The most commonly used web application is search, which involves the automatic processing of language on multiple levels (see Chapter 4). Web search involves sophisticated language technology that differs for each language. For Hungarian this comprises taking into account the different inflectional endings of nouns, adjectives and verbs, and different stem forms like ló (horse, singular) and lovak (horses, plural).

In Hungary there is no officially ratified law on equal opportunities for the disabled, but a design guide for the implementation of complex accessibility has been developed by the Public Foundation for Equal Opportunities of Persons with Disabilities. It recommends that public agencies ensure that the disabled can use their websites and Internet services without any restrictions. User-friendly language technology tools are a key solution, offering for example speech synthesis to enunciate the content of web pages for the blind.

Internet users and providers of web content can also use language technology in less obvious ways, for example, by automatically translating web page contents from one language into another. Despite the high cost of manually translating this content, comparatively little language technology has been developed and applied to the issue of website translation in light of the supposed need. This may be due to the complexity of the Hungarian language and to the range of different technologies involved in typical applications.

The next chapter gives an introduction to language technology and its core application areas, together with an evaluation of current language technology support for Hungarian.

4 LANGUAGE TECHNOLOGY SUPPORT FOR HUNGARIAN

Language technology is used to develop software systems designed to handle human language and is therefore often called human language technology. Human language comes in spoken and written forms. While speech is the oldest and, in terms of human evolution, the most natural form of language communication, complex information and most human knowledge is stored and transmitted through the written word. Speech and text technologies process or produce these different forms of language, using dictionaries, rules of grammar, and semantics. This means that language technology (LT) links language to various forms of knowledge, independently of the media (speech or text) in which it is expressed. Figure 2 illustrates the LT landscape.

[Figure 2: Language technologies. Diagram labels: Speech Technologies; Multimedia & Multimodality Technologies; Language Technologies; Knowledge Technologies; Text Technologies]

When we communicate, we combine language with other modes of communication and information media; for example, speaking can involve gestures and facial expressions. Digital texts link to pictures and sounds. Movies may contain language in spoken and written form. In other words, speech and text technologies overlap and interact with other multimodal communication and multimedia technologies.

In this section, we will discuss the main application areas of language technology, i.e., language checking, web search, speech interaction, and machine translation. These applications and basic technologies include

- spelling correction,
- authoring support,
- computer-assisted language learning,
- information retrieval,
- information extraction,
- text summarisation,
- question answering,
- speech recognition and
- speech synthesis.

Language technology is an established area of research with an extensive set of introductory literature. The interested reader is referred to the following references: [14, 15, 16, 17].

Before discussing the above application areas, we will briefly describe the architecture of a typical LT system.

4.1 APPLICATION ARCHITECTURES

Software applications for language processing typically consist of several components that mirror different aspects of language. While such applications tend to be very complex, figure 3 shows a highly simplified architecture of a typical text processing system. The first three modules handle the structure and meaning of the text input:

1. Pre-processing: cleans the data, analyses or removes formatting, detects the input language, handles the specific accented letters (á, é, í, ó, ö, ő, ú, ü, ű) in Hungarian, and so on.
2. Grammatical analysis: finds the verb, its objects, modifiers and other sentence elements; detects the sentence structure.
3. Semantic analysis: performs disambiguation (i.e., computes the appropriate meaning of words in a given context); resolves anaphora (i.e., which pronouns refer to which nouns in the sentence); represents the meaning of the sentence in a machine-readable way.

After analysing the text, task-specific modules can perform other operations, such as automatic summarisation and database look-ups.

[Figure 3: A typical text processing architecture. Diagram labels: Input Text; Pre-processing; Grammatical Analysis; Semantic Analysis; Task-specific Modules; Output]

In the remainder of this section, we firstly introduce the core application areas for language technology, and follow this with a brief overview of the state of LT research and education today, and a description of past and present research programmes. Finally, we present an expert estimate of core LT tools and resources for Hungarian in terms of various dimensions such as availability, maturity and quality. The general situation of LT for the Hungarian language is summarised in figure 7 (p. 61) at the end of this chapter. LT support for Hungarian is also compared to other languages that are part of this series.

4.2 CORE APPLICATION AREAS

In this section, we focus on the most important LT tools and resources, and provide an overview of LT activities in Hungary.

4.2.1 Language Checking

Anyone who has used a word processor such as Microsoft Word knows that it has a spell checker that highlights spelling mistakes and proposes corrections. The first spelling correction programs compared a list of extracted words against a dictionary of correctly spelled words. Today these programs are far more sophisticated. Using language-dependent algorithms for grammatical analysis, they detect errors related to morphology (e.g., plural formation) as well as syntax-related errors, such as a missing verb or a conflict of verb-subject agreement (e.g., she *write a letter). However, most spell checkers will not find any errors in the following text [18]:

I have a spelling checker,
It came with my PC.
It plane lee marks four my revue
Miss steaks aye can knot sea.

Handling these kinds of errors usually requires an analysis of the context. For example: there are inflected word forms in Hungarian that can hold several meanings, e.g.,

the word várunk can be an inflected form of the verb vár, or the noun vár with possessive inflection.

This type of analysis either needs to draw on language-specific grammars laboriously coded into the software by experts, or on a statistical language model. In this case, a model calculates the probability of a particular word as it occurs in a specific position (e.g., between the words that precede and follow it). For example: várunk is probably not a verb if the sentence contains another finite verb. A statistical language model can be automatically created by using a large amount of (correct) language data (called a text corpus). Both approaches have mostly been developed around data from English. Neither approach can transfer easily to Hungarian because it has a flexible word order and a richer inflection system.

Language checking is not limited to word processors; it is also used in authoring support systems, i.e., software environments in which manuals and other documentation are written to special standards for complex IT, healthcare, engineering and other products. Fearing customer complaints about incorrect use and damage claims resulting from poorly understood instructions, companies are increasingly focusing on the quality of technical documentation while targeting the international market (via translation or localisation) at the same time. Advances in natural language processing have led to the development of authoring support software, which helps the writer of technical documentation use vocabulary and sentence structures that are consistent with industry rules and (corporate) terminology restrictions.

As Hungarian is a highly agglutinative language, a Hungarian spell checker must contain a morphological analyzer that handles the great number of affixes and complex words. The first spell checker for Hungarian was developed in the late 80s by a Hungarian SME called MorphoLogic [19], by combining a spell checking system and a morphological model. Their program (Helyes-e?) is available for MS Office, QuarkXPress, Adobe InDesign and other desktop publishing packages. MorphoLogic developed grammar and style checkers that recognise spelling errors based on the context. The program indicates possible mistakes and leaves it to the user to decide whether it is a real mistake.

An open source spell checker for Hungarian exists as well. Hunspell [20] is based on MySpell, and it has been integrated into OpenOffice, Mozilla Firefox, Mozilla Thunderbird and Google Chrome.

Besides spell checkers and authoring support, language checking is also important in the field of computer-assisted language learning. And language checking applications also automatically correct search engine queries, as found in Google's 'Did you mean' suggestions.

[Figure 4: Language checking (top: statistical; bottom: rule-based). Diagram labels: Input Text; Statistical Language Models; Spelling Check; Grammar Check; Correction Proposals]

4.2.2 Web Search

Searching the Web, intranets or digital libraries is probably the most widely used yet largely underdeveloped language technology application today. The Google search engine, which started in 1998, now handles about 80% of all search queries [21]. The verb guglizni is commonly used in Hungarian, even though it has not made its way into printed dictionaries yet. The Google search interface and results page display has not significantly changed since the first version. Yet in the current version, Google offers spelling correction for misspelled words and has now incorporated basic semantic search capabilities that can improve search accuracy by analysing the meaning of terms in a search query context [22]. The Google success story shows that a large volume of available data and efficient indexing techniques can deliver satisfactory results for a statistically-based approach.

For more sophisticated information requests, it is essential to integrate deeper linguistic knowledge for text interpretation. Experiments using lexical resources such as machine-readable thesauri or ontological language resources (e.g., WordNet for English or Hungarian WordNet for Hungarian) have demonstrated improvements in finding pages using synonyms of the original search terms, such as atomenergia [atomic energy], magenergia [atomic power] and nukleáris energia [nuclear energy], or even more loosely related terms.

The next generation of search engines will have to include much more sophisticated language technology, in particular in order to deal with search queries consisting of a question or other sentence type rather than a list of keywords. For the query 'Give me a list of all companies that were taken over by other companies in the last five years', the LT system needs to analyse the sentence syntactically and semantically as well as provide an index to quickly retrieve relevant documents. A satisfactory answer will require syntactic parsing to analyse the grammatical structure of the sentence and determine that the user wants companies that have been acquired, not companies that acquired other companies. For the expression 'last five years', the system needs to determine the relevant years. And the query needs to be matched against a huge amount of unstructured data to find the piece or pieces of relevant information the user wants. This is called information retrieval, and involves searching and ranking relevant documents. To generate a list of companies, the system also needs to recognise a particular string of words in a document as a company name, a process called named entity recognition.
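Two of the steps named above, interpreting an expression such as 'last five years' and recognising company names, can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the gazetteer, the number-word table and both function names are invented for this example, the time expression is read simply as 'from n years before the current year up to the current year', and real named entity recognisers are typically trained statistically rather than driven by a fixed list.

```python
import re
from datetime import date

# Hypothetical gazetteer; the names are taken from this text for illustration.
COMPANY_GAZETTEER = ("MorphoLogic", "Google")

# Small, assumed table of English number words.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def resolve_last_n_years(query: str, today: date):
    """Map 'last <n> years' to a (first_year, last_year) pair, or None."""
    match = re.search(r"last (\w+) years", query.lower())
    if match is None:
        return None
    word = match.group(1)
    n = int(word) if word.isdigit() else NUMBER_WORDS.get(word)
    if n is None:
        return None
    return (today.year - n, today.year)


def find_companies(text: str):
    """Gazetteer look-up: return known company names in order of appearance."""
    found = [(text.find(name), name) for name in COMPANY_GAZETTEER if name in text]
    return [name for _, name in sorted(found)]


if __name__ == "__main__":
    query = ("Give me a list of all companies that were taken over "
             "by other companies in the last five years")
    print(resolve_last_n_years(query, date(2012, 1, 1)))
    print(find_companies("Hunspell has been integrated into Google Chrome; "
                         "MorphoLogic developed Helyes-e?"))
```

A fixed list cannot recognise unseen names, which is one reason why statistical recognisers are preferred in practice.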


[Figure 5: Web search architecture. Diagram labels: Web Pages; Pre-processing; Semantic Processing; Indexing; Matching & Relevance; User Query; Pre-processing; Query Analysis; Search Results]
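The stages named in the figure can be made concrete with a toy example. The following Python sketch is only an illustration under simplifying assumptions: the three short documents, the SYNONYMS table and all function names are invented for this example, pre-processing is reduced to lower-casing and tokenising, and relevance is a plain count of matching terms. A system for Hungarian would additionally need morphological analysis so that inflected forms match their stems.

```python
import re
from collections import defaultdict

# Hypothetical mini-collection standing in for crawled web pages.
DOCUMENTS = {
    1: "Az atomenergia szerepe",
    2: "A nukleáris energia jövője",
    3: "Megújuló energia és szél",
}

# Hand-made synonym table standing in for a thesaurus or wordnet.
SYNONYMS = {"atomenergia": {"magenergia", "nukleáris"}}


def preprocess(text):
    """Pre-processing: lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Indexing: map every token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in preprocess(text):
            index[token].add(doc_id)
    return index


def analyse_query(query):
    """Query analysis: tokenise the query and add synonyms of each term."""
    terms = set()
    for token in preprocess(query):
        terms.add(token)
        terms |= SYNONYMS.get(token, set())
    return terms


def search(query, index):
    """Matching and relevance: rank documents by number of matching terms."""
    scores = defaultdict(int)
    for term in analyse_query(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(search("atomenergia", index))
```

With the synonym table in place, a query for atomenergia also retrieves the document that only mentions nukleáris energia; without it, only the first document would match.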

A more demanding challenge is matching a query in one language with documents in another language. Cross-lingual information retrieval involves automatically translating the query into all possible source languages and then translating the results back into the target language.

Now that data is increasingly found in non-textual formats, there is a need for services that deliver multimedia information retrieval by searching images, audio files and video data. In the case of audio and video files, a speech recognition module must convert the speech content into text (or into a phonetic representation) that can then be matched against a user query.

For inflectional languages like Hungarian, it is important to be able to search for all the inflected forms of a word simultaneously, instead of having to enter each different form separately. For this purpose, several morphological parsers exist for Hungarian. NP chunkers for identifying noun phrases provide higher level parsing: a statistical and a rule-based application have been developed for Hungarian.

Due to the variable word order characteristic of Hungarian, we cannot rely on exploiting particular linear configurations alone when syntactic parsers are developed. On the other hand, Hungarian is an agglutinative language with rich case marking, and morphological case markers and postpositions lend themselves to being used as cues for parsing. A database of Hungarian verbs and case markers of their arguments was developed at the Research Institute for Linguistics, which has been built into higher level parsing applications, e.g., for automatic acquisition of verb argument frames, or rule-based syntactic parsing. Further syntactic parsers for Hungarian exist, for instance those used in the Hungarian treebank (Szeged Treebank) and in a rule-based Hungarian-English machine translation program (MetaMorpho).

The development focus of HLT companies and research institutes lies on providing trend- and text-analysis tools which integrate natural language processing tools to find the relevant information in unstructured text. For this purpose part-of-speech taggers, dependency parsers and named entity recognisers have been developed for Hungarian, which are mostly based on statistical learning algorithms.

A meta-search and clustering engine is PolyMeta [23]. It enables organisations and individuals to simultaneously search diverse information resources on the Web with a common interface. It employs natural language processing and information retrieval algorithms in its query analysis and refinement, search strategy, relevancy ranking, focused drill-down and exploration of multi-dimensional information spaces.

Certainly, not only SMEs try to extract information by natural language processing tools. Several projects have been running in academia with the aim of developing a model-based semantic search system, creating the framework of a unified Hungarian ontology, or creating a semantically structured, general purpose Hungarian concept set on the basis of the results and formalism of the EuroWordNet language ontology (Hungarian WordNet).

4.2.3 Speech interaction

Speech interaction is one of many application areas that depend on speech technology, i.e., technologies for processing spoken language. Speech interaction technology is used to create interfaces that enable users to interact in spoken language instead of using a graphical display, keyboard and mouse.

Today, these voice user interfaces (VUI) are used for partially or fully automated telephone services provided by companies to customers, employees or partners. Business domains that rely heavily on VUIs include banking, supply chain, public transportation, and telecommunications. Other uses of speech interaction technology include interfaces to car navigation systems and the use of spoken language as an alternative to the graphical or touchscreen interfaces in smartphones.

Speech interaction technology comprises four technologies:

1. Automatic speech recognition (ASR) determines which words are actually spoken in a given sequence of sounds uttered by a user.
2. Natural language understanding analyses the syntactic structure of a user's utterance and interprets it according to the system in question.
3. Dialogue management determines which action to take given the user input and system functionality.
4. Speech synthesis (text-to-speech or TTS) transforms the system's reply into sounds for the user.

One of the major challenges of ASR systems is to accurately recognise the words a user utters. This means restricting the range of possible user utterances to a limited set of keywords, or manually creating language models that cover a large range of natural language utterances. Using machine learning techniques, language models can also be generated automatically from speech corpora, i.e., large collections of speech audio files and text transcriptions. Restricting utterances usually forces people to use the voice user interface in a rigid way and can damage user acceptance; but the creation, tuning and maintenance of rich language models will significantly increase costs. VUIs that employ language models and initially allow a user to express their intent more flexibly, prompted by a 'How may I help you?' greeting, tend to be automated and are better accepted by users.

Companies tend to use utterances pre-recorded by professional speakers for generating the output of the voice user interface. For static utterances where the wording

does not depend on particular contexts of use or personal user data, this can deliver a rich user experience. But more dynamic content in an utterance may suffer from unnatural intonation because different parts of audio files have simply been strung together. Today's TTS systems are getting better (though they can still be optimised) at producing natural-sounding dynamic utterances.

Figure 6: Speech-based dialogue system (components: Speech Input; Signal Processing; Recognition; Natural Language Understanding & Dialogue; Phonetic Lookup & Intonation Planning; Speech Synthesis; Speech Output)

Interfaces in speech interaction have been considerably standardised during the last decade in terms of their various technological components. There has also been strong market consolidation in speech recognition and speech synthesis. The national markets in the G20 countries (economically resilient countries with high populations) have been dominated by just five global players, with Nuance (USA) and Loquendo (Italy) being the most prominent players in Europe. In 2011, Nuance announced the acquisition of Loquendo, which represents a further step in market consolidation.

Due to the specific characteristics of Hungarian, the widely used methods in speech interaction technology are difficult or impossible to adapt for Hungarian. However, the methods developed for Hungarian can be applied to similar languages, e. g., Finnish, Turkish, Arabic, in the field of TTS and ASR.

The Hungarian TTS market is dominated by research groups at Budapest University of Technology and Economics [24]. The most widely used TTS system is Provox, available since 2002, which has been built into SMS and email reader software, into in-car and mobile phone GPS systems, and into e-book and screen reader applications, which can help the integration of blind people into the information society. A high-level interactive development tool is also available for supporting special research (psychology and prosody research). The software supports a supervised generation of speech stimuli with predefined acoustic and prosodic content, based on TTS technology. A Hungarian electronic pronunciation dictionary also exists for 1.5 million word forms. This may be the basis for the development of a Hungarian TTS symbol conversion tool.

On the Hungarian ASR market there are additional smaller companies, such as Applied Logic Laboratory, Aitia, Digital Natives, as well as academic research groups, e. g., at the University of Szeged. In spite of the linguistic difficulties mentioned above, several speech recogniser applications for Hungarian have been developed over the last few years. One of them is a prosodic recogniser that grew out of a cross-lingual study of agglutinative, fixed-stress languages, such as Hungarian and Finnish, on the segmentation of continuous speech at word level by examining supra-segmental parameters. Another system helps the work of doctors: while examining the patient they dictate the diagnosis, which is automatically transcribed. Yet another one is a Hungarian computer-assisted pronunciation learning and training system for people with speech impairments and for language learning, which has been adapted for some other European languages as well. Further application areas are call centres, dialogue systems, or indexing and searching media databases.

Looking ahead, there will be significant changes, due to the spread of smartphones as a new platform for managing customer relationships, in addition to fixed telephones, the Internet and e-mail. This will also affect how speech interaction technology is used. In the long term, there will be fewer telephone-based VUIs, and spoken language apps will play a far more central role as a user-friendly input for smartphones. This will be largely driven by stepwise improvements in the accuracy of speaker-independent speech recognition via the speech dictation services already offered as centralised services to smartphone users.

4.2.4 Machine Translation

The idea of using digital computers to translate natural languages can be traced back to 1946 and was followed by substantial funding for research during the 1950s and again in the 1980s. Yet machine translation (MT) still cannot deliver on its initial promise of providing across-the-board automated translation.

The most basic approach to machine translation is the automatic replacement of the words in a text written in one natural language with the equivalent words of another language. This can be useful in subject domains that have a very restricted, formulaic language such as weather reports. However, in order to produce a good translation of less restricted texts, larger text units (phrases, sentences, or even whole passages) need to be matched to their closest counterparts in the target language. The major difficulty is that human language is ambiguous. Ambiguity creates challenges on multiple levels, such as word sense disambiguation at the lexical level (a jaguar is a brand of car or an animal) or the assignment of case on the syntactic level, for example:

A rendőr látta az embert a távcsővel.
The policeman saw the man with the telescope.
A rendőr látta az embert a revolverrel.
The policeman saw the man with the revolver.

One way to build an MT system is to use linguistic rules. For translations between closely related languages, a translation using direct substitution may be feasible in cases such as the above example. However, rule-based (or linguistic knowledge-driven) systems often analyse the input text and create an intermediary symbolic representation from which the target language text can be generated. The success of these methods is highly dependent on the availability of extensive lexicons with morphological, syntactic, and semantic information, and large sets of grammar rules carefully designed by skilled linguists. This is a very long and therefore costly process.

In the late 1980s when computational power increased and became cheaper, interest in statistical models for machine translation began to grow. Statistical models are derived from analysing bilingual text corpora (parallel corpora), such as the Europarl parallel corpus, which contains the proceedings of the European Parliament in 21 European languages. Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign language text by processing parallel versions and finding plausible patterns of words. Unlike knowledge-driven systems, however, statistical (or data-driven) MT systems often generate ungrammatical output. Data-driven MT is advantageous because less human effort is required, and it can also cover special particularities of the language (e. g., idiomatic expressions) that are often ignored in knowledge-driven systems.

Figure 7: Machine translation (left: statistical; right: rule-based) (components: Source Text; Text Analysis (Formatting, Morphology, Syntax, etc.); Statistical Machine Translation; Translation Rules; Text Generation; Target Text)

The strengths and weaknesses of knowledge-driven and data-driven machine translation tend to be complementary, so that nowadays researchers focus on hybrid approaches that combine both methodologies. One such approach uses both knowledge-driven and data-driven systems, together with a selection module that decides on the best output for each sentence. However, results for sentences longer than, say, 12 words, will often be far from perfect. A more effective solution is to combine the best parts of each sentence from multiple outputs; this can be fairly complex, as corresponding parts of multiple alternatives are not always obvious and need to be aligned.

Machine translation is particularly challenging for the Hungarian language. The free word order and split verb constructions pose problems for analysis; and extensive inflection is a challenge for generating words with proper case markings.

There are knowledge- and data-driven solutions on the Hungarian MT market, as well. MorphoLogic, a private R&D company, offers both desktop machine translation programs and online services. MorphoWord translates between English and Hungarian. Their MT system integrates rule-based and statistical methods, its main component being a parser that creates an intermediary representation, from which it produces the text in the target language.

The availability of large amounts of bilingual text is the key to statistical MT. The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 54.2 million words in 2.07 million sentences. At present this is the largest Hungarian-English parallel corpus. Sentence alignment was performed with hunalign, which is one of the most widely used sentence-level aligners, developed by Hungarian researchers at Budapest University of Technology and Economics. The corpus may be searched through an online sentence search service [25], which can be used as a raw translator or a smart bilingual lexicon.

The iTranslate4.eu [26] project, which started in March 2010, intends to provide an online translation solution for all European languages. It not only offers full coverage of EU languages, but also provides for each language pair the best quality available at the time, and mediates easy transfer to professional translators. The project is carried out by a consortium of European MT companies that have developed the best translation system for at least one language pair. The project has two Hungarian participants: the consortium leader is the Research Institute for Linguistics of the Hungarian

Academy of Sciences, and MorphoLogic provides the common API for the services.

The use of machine translation can significantly increase productivity provided the system is intelligently adapted to user-specific terminology and integrated into a workflow. Special systems for interactive translation support were developed, for example, at Kilgray [27]. They provide translation tools, web-based terminology management systems and translation memories.

There is still a huge potential for improving the quality of MT systems. The challenges involve adapting language resources to a given subject domain or user area, and integrating the technology into workflows that already have term bases and translation memories. Another problem is that most of the current systems are English-centred and only support translation between English and Hungarian. This leads to friction in the translation workflow and forces MT users to learn different lexicon coding tools for different systems.

Evaluation campaigns help to compare the quality of MT systems, the different approaches and the status of the systems for different language pairs. Figure 8 (p. 25), which was prepared during the EC Euromatrix+ project, shows the pair-wise performances obtained for 22 of the 23 official EU languages (Irish was not compared). The results are ranked according to a BLEU score, which indicates higher scores for better translations [28]. A human translator would normally achieve a score of around 80 points.

The best results (in green and blue) were achieved by languages that benefit from a considerable research effort in coordinated programmes and the existence of many parallel corpora (e. g., English, French, Dutch, Spanish and German). The languages with poorer results are shown in red. These languages either lack such development efforts or are structurally very different from other languages (e. g., Hungarian, Maltese and Finnish).

4.3 OTHER APPLICATION AREAS

Building language technology applications involves a range of subtasks that do not always surface at the level of interaction with the user, but they provide significant service functionalities behind the scenes of the system in question. They all form important research issues that have now evolved into individual sub-disciplines of computational linguistics. Question answering, for example, is an active area of research for which annotated corpora have been built and scientific competitions have been initiated. The concept of question answering goes beyond keyword-based searches (in which the search engine responds by delivering a collection of potentially relevant documents) and enables users to ask a concrete question to which the system provides a single answer. For example:

Question: How old was Neil Armstrong when he stepped on the moon?
Answer: 38.

While question answering is obviously related to the core area of web search, it is nowadays an umbrella term for such research issues as which different types of questions exist, and how they should be handled; how a set of documents that potentially contain the answer can be analysed and compared (do they provide conflicting answers?); and how specific information (the answer) can be reliably extracted from a document without ignoring the context.

Question answering is in turn related to information extraction (IE), an area that was extremely popular and influential when computational linguistics took a statistical turn in the early 1990s. IE aims to identify specific pieces of information in specific classes of documents, such as the key players in company takeovers as reported in newspaper stories. Another common scenario that has been studied is reports on terrorist incidents. The task here consists of mapping appropriate parts of the text to a template that specifies the perpetrator, target, time, location and results of the incident. Domain-specific template-filling is the central characteristic of IE, which makes it another example of a "behind the scenes" technology that forms a well-demarcated research area, which in practice needs to be embedded into a suitable application environment.

Text summarisation and text generation are two borderline areas that can act either as standalone applications or play a supporting role. Summarisation attempts to give the essentials of a long text in a short form, and is one of the features available in Microsoft Word. It mostly uses a statistical approach to identify the important words in a text (i. e., words that occur very frequently in the text in question but less frequently in general language use) and determine which sentences contain the most of these important words. These sentences are then extracted and put together to create the summary. In this very common commercial scenario, summarisation is simply a form of sentence extraction, and the text is reduced to a subset of its sentences. An alternative approach, for which some research has been carried out, is to generate brand new sentences that do not exist in the source text.

This requires a deeper understanding of the text, which means that so far this approach is far less robust. On the whole, a text generator is rarely used as a stand-alone application but is embedded into a larger software environment, such as a clinical information system that collects, stores and processes patient data. Creating reports is just one of many applications for text summarisation.

For the Hungarian language, research in these text technologies is much less developed than for the English language. Question answering, information extraction, and summarisation have been the focus of numerous open competitions in the USA since the 1990s, primarily organised by the government-sponsored organisations DARPA and NIST. These competitions have significantly improved the state of the art, but their focus has mostly been on the English language. As a result, there are hardly any annotated corpora or other special resources needed to perform these tasks in Hungarian. When summarisation systems use purely statistical methods, they are largely language-independent and a number of research prototypes are available. For text generation, reusable components have traditionally been limited to surface realisation modules (generation grammars) and most of the available software is for the English language. There are, however, several named entity recognition, biomedical information extraction, trend analysis and opinion mining systems for the Hungarian language.

4.4 EDUCATIONAL PROGRAMMES

Language technology is a very interdisciplinary field that involves the combined expertise of linguists, computer scientists, mathematicians, philosophers, psycholinguists, and neuroscientists among others. As a result, it has not acquired a clear, independent existence in the Hungarian faculty system yet, so in Hungary there is no university with an established department of Computational Linguistics. However, there are relevant programmes offered by related departments, such as the faculty of computer science or the faculty of linguistics. Some universities offer Master or Bachelor courses

only, or modules in Language Technology to students of other courses of study. Many of these programmes and courses have only been introduced recently. Currently, six Hungarian universities offer at least courses in the field of Language Technology.

In spite of the efforts in recent years to integrate the regular teaching of computational linguistics (CL) into the Hungarian faculty system, the education of the next generation of computational linguists has not reached the required level. The aim of the Hungarian CL community is to develop a high-quality curriculum with a BSc-MSc-PhD sequence that fits European standards. The relatively low salaries and scholarships of young researchers pose further problems, which could partly be solved by strengthening the relationships between research and industry.

4.5 NATIONAL PROJECTS AND INITIATIVES

The beginnings of natural language processing in Hungary are connected with Machine Translation. The first attempts were made in the 1960s, at that time from Russian to Hungarian. In the 1970s-1980s the lexicographers' work gave the impetus that led to the development of the first computational morphosyntactic systems for Hungarian. In those years there were no regular nationally financed projects; moreover, Hungary was cut off from European support.

After the political change, in the 1990s new sections were formed at universities (e. g., the Natural Language Processing Group at Szeged University) and in research institutes (e. g., the Department for Corpus Linguistics at the Research Institute for Linguistics). Since 2000, there has been a significant increase in the number of projects supported by European funds and nationally financed projects, supported mainly by the Fund of the Ministry of Education, or the Agency for Research Fund Management and Research Exploitation. As a consequence, over the past decade a number of important electronic language resources (dictionaries, corpora, lexical databases) as well as processing resources (spell checkers, morphological analysers etc.) have been developed. Activities, however, have not been synchronised, and not uncommonly similar resources have been developed in parallel at different locations (e. g., there are at least three morphological analysers for Hungarian). A range of different formalisms or standards have been used in these, which in the majority of cases are either incompatible or difficult to convert from; there is also a lack of documentation and in many cases copyright issues are unclear. Nevertheless, in recent years the international trends of standardisation and unification of existing resources have reached Hungary as well. Several projects started off with the objective of integration and interoperability, e. g., creating a unified Hungarian ontology, or harmonising the different coding systems of separately developed morphological analysers.

In 2008, prominent Hungarian academic institutions and R&D companies formed the Hungarian Platform for Speech and Language Technology [30], which aims to help the sharing and integration of high-quality knowledge accumulated in centres that previously worked in isolation; to work out detailed strategic and implementation plans and to help their subsequent implementation; to disseminate its analyses and proposals among the members of the IT sector; to represent Hungarian interests at the international level; and to disseminate the achievements of the Platform among the potential users of the technology. Hungarian institutions have also been involved in the CLARIN project.

Along with many smaller scale projects that have now been completed, the above projects have led to the development of wide-ranging competence in the field of language technology as well as the creation of a basic technological infrastructure for Hungarian language

tools and resources. Public funding for LT projects in Hungary and in Europe is still relatively low, however, when compared to the amount of money the USA spends on language translation and multilingual information access [31].

As we have seen, previous programmes have led to the development of a number of LT tools and resources for the Hungarian language. The following section summarises the current state of LT support for Hungarian.

4.6 AVAILABILITY OF TOOLS AND RESOURCES

Figure 7 provides a rating for language technology support for the Hungarian language. This rating of existing tools and resources was generated by leading experts in the field who provided estimates based on a scale from 0 (very low) to 6 (very high) using seven criteria.

The key results for Hungarian language technology can be summed up as follows:

• While several specific corpora of high quality exist, a very large syntactically annotated corpus is not available. However, there is a corpus for Hungarian with highly elaborate syntactic annotation (1.2 million tokens). The corpus is available for free, which has resulted in a wide range of applications based on it.

• Many of the resources lack standardisation, i. e., even if they exist, they are not addressing sustainability; concerted programs and initiatives are needed to standardise data and interchange formats.

• The standard preprocessing steps (tokenisation, morphology, shallow parsing, etc.) are completed for Hungarian, but treating the more difficult semantics requires further research.

• The more linguistic and semantic knowledge a tool draws on, the more gaps there are in the technology (text analysis versus text interpretation). There is a need for far more effort to support deep linguistic processing.

• Research has been successful in designing particularly high quality software, but many of the resources are not standardised and cannot be sustained effectively. A concerted program is required to standardise data and interchange formats.

• There is a lack of annotated corpora with syntactic, semantic and discourse structure mark-up. Again, the situation gets worse as the need for deeper linguistic and semantic information grows.

• Standards do exist for semantics in the sense of world knowledge (RDF, OWL, etc.); they are, however, not easily applicable to NLP tasks.

• Speech Recognition and Machine Translation of Hungarian are studied at several universities and workplaces, but free tools and data are currently not available. It is a typical phenomenon of the Hungarian NLP market that the number of free databases and open source programs is quite low.

To conclude, in a number of specific areas of Hungarian language research, we have software with limited functionality available today. Obviously, further research efforts are required to meet the current deficit in processing texts on a deeper semantic level and to address the lack of resources such as speech corpora for speech recognition.

4.7 CROSS-LANGUAGE COMPARISON

The current state of LT support varies considerably from one language community to another. In order to compare the situation between languages, this section will present an evaluation based on two sample application areas (machine translation and speech processing)

and one underlying technology (text analysis), as well as basic resources needed for building LT applications.

Figure 8: State of language technology support for Hungarian (rating table; columns: Quantity, Availability, Quality, Coverage, Maturity, Sustainability, Adaptability; rows under "Language Technology: Tools, Technologies and Applications": Speech Recognition, Speech Synthesis, Grammatical analysis, Semantic analysis, Text generation, Machine translation; rows under "Language Resources: Resources, Data and Knowledge Bases": Text corpora, Speech corpora, Parallel corpora, Lexical resources, Grammars)

The languages were categorised using the following five-point scale:

1. Excellent support
2. Good support
3. Moderate support
4. Fragmentary support
5. Weak or no support

LT support was measured according to the following criteria:

Speech Processing: Quality of existing speech recognition technologies, quality of existing speech synthesis technologies, coverage of domains, number and size of existing speech corpora, amount and variety of available speech-based applications.

Machine Translation: Quality of existing MT technologies, number of language pairs covered, coverage of linguistic phenomena and domains, quality and size of existing parallel corpora, amount and variety of available MT applications.

Text Analysis: Quality and coverage of existing text analysis technologies (morphology, syntax, semantics), coverage of linguistic phenomena and domains, amount and variety of available applications, quality and size of existing text corpora, quality and coverage of existing lexical resources (e. g., WordNet) and grammars.

Resources: Quality and size of existing text corpora, speech corpora and parallel corpora, quality and coverage of existing lexical resources and grammars.

Figures 8 to 11 show that, thanks to large-scale LT funding in recent decades, the Hungarian language is quite well equipped compared to other languages. But LT resources and

tools for Hungarian clearly do not yet reach the quality and coverage of comparable resources and tools for the English language, which is in the lead in almost all LT areas. And there are still plenty of gaps in English language resources with regard to high quality applications.

For speech processing, current technologies perform well enough to be successfully integrated into a number of industrial applications such as spoken dialogue and dictation systems. Today's text analysis components and language resources already cover the linguistic phenomena of Hungarian to a certain extent and form part of many applications involving mostly shallow natural language processing, e. g., spelling correction and some information extraction tasks.

However, for building more sophisticated applications, such as machine translation, there is a clear need for resources and technologies that cover a wider range of linguistic aspects and allow a deep semantic analysis of the input text. By improving the quality and coverage of these basic resources and technologies, we shall be able to open up new opportunities for tackling a vast range of advanced application areas, including high-quality machine translation.

4.8 CONCLUSIONS

In this series of white papers, we have made an important effort by assessing the language technology support for 30 European languages, and by providing a high-level comparison across these languages. By identifying the gaps, needs and deficits, the European language technology community and its related stakeholders are now in a position to design a large scale research and development programme aimed at building truly multilingual, technology-enabled communication across Europe.

The results of this white paper series show that there is a dramatic difference in language technology support between the various European languages. While there are good quality software and resources available for some languages and application areas, others, usually smaller languages, have substantial gaps. Many languages lack basic technologies for text analysis and the essential resources. Others have basic tools and resources but the implementation of, for example, semantic methods is still far away. Therefore a large-scale effort is needed to attain the ambitious goal of providing high-quality language technology support for all European languages, for example through high quality machine translation.

In the case of the Hungarian language, we can be cautiously optimistic about the current state of language technology support. There is a viable LT research community in Hungary, which has been supported in the past mostly by national funds. And a number of large-scale resources and state-of-the-art technologies have been produced and distributed for Hungarian. However, the scope of the resources and the range of tools are still very limited when compared to the resources and tools for the English language, and they are simply not sufficient in quality and quantity to develop the kind of technologies required to support a truly multilingual knowledge society.

Nor can we simply transfer technologies already developed and optimised for the English language to handle Hungarian. English-based systems for parsing (syntactic and grammatical analysis of sentence structure) typically perform far less well on Hungarian texts, due to the specific characteristics of the Hungarian language.

There is a relatively small language technology industry at work on the Hungarian language. Thus the Hungarian NLP market is dominated by research groups at universities and academic institutes, although there are additional smaller companies on the market.

Our findings lead to the conclusion that the only way forward is to make a substantial effort to create language technology resources for Hungarian, as a means to drive forward research, innovation and development. The need for large amounts of data and the extreme complexity of language technology systems make it vital to develop an infrastructure and a coherent research organisation to spur greater sharing and cooperation.

Finally there is a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of sparse or zero funding. In addition, there is an overall lack of coordination with programmes in other EU countries and at the European Commission level.

The long term goal of META-NET is to enable the creation of high-quality language technology for all languages. This requires all stakeholders in politics, research, business, and society to unite their efforts. The resulting technology will help tear down existing barriers and build bridges between Europe's languages, paving the way for political and economic unity through cultural diversity.

Excellent support: (none)
Good support: English
Moderate support: Czech, Dutch, Finnish, French, German, Italian, Portuguese, Spanish
Fragmentary support: Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Hungarian, Irish, Norwegian, Polish, Serbian, Slovak, Slovene, Swedish
Weak/no support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian

9: Speech processing: state of language technology support for 30 European languages

Excellent support: (none)
Good support: English
Moderate support: French, Spanish
Fragmentary support: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
Weak/no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish

10: Machine translation: state of language technology support for 30 European languages


Excellent support: (none)
Good support: English
Moderate support: Dutch, French, German, Italian, Spanish
Fragmentary support: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
Weak/no support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

11: Text analysis: state of language technology support for 30 European languages

Excellent support: (none)
Good support: English
Moderate support: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
Fragmentary support: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
Weak/no support: Icelandic, Irish, Latvian, Lithuanian, Maltese

12: Speech and text resources: State of support for 30 European languages


5
ABOUT META-NET
META-NET is a Network of Excellence funded by the European Commission. The network currently consists of 54 members from 33 European countries [32]. META-NET fosters the Multilingual Europe Technology Alliance (META), a growing community of language technology professionals and organisations in Europe. META-NET fosters the technological foundations for a truly multilingual European information society that:

• makes communication and cooperation possible across languages;
• provides equal access to information and knowledge in any language;
• offers advanced and affordable networked information technology to European citizens.

The network supports a Europe that unites as a single digital market and information space. It stimulates and promotes multilingual technologies for all European languages. These technologies support automatic translation, content production, information processing and knowledge management for a wide variety of applications and subject domains. They also enable intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots. Launched on 1 February 2010, META-NET has already conducted various activities in its three lines of action: META-VISION, META-SHARE and META-RESEARCH.

META-VISION fosters a dynamic and influential stakeholder community that unites around a shared vision and a common strategic research agenda (SRA). The main focus of this activity is to build a coherent and cohesive LT community in Europe by bringing together representatives from highly fragmented and diverse groups of stakeholders. The present White Paper was prepared together with volumes for 29 other languages. The shared technology vision was developed in three sectorial Vision Groups. The META Technology Council was established in order to discuss and to prepare the SRA based on the vision in close interaction with the entire LT community.

META-SHARE creates an open, distributed facility for exchanging and sharing resources. The peer-to-peer network of repositories will contain language data, tools and web services that are documented with high-quality metadata and organised in standardised categories. The resources can be readily accessed and uniformly searched. The available resources include free, open source materials as well as restricted, commercially available, fee-based items.

META-RESEARCH builds bridges to related technology fields. This activity seeks to leverage advances in other fields and to capitalise on innovative research that can benefit language technology. In particular, the action line focuses on conducting leading-edge research in machine translation, collecting data, preparing data sets and organising language resources for evaluation purposes; compiling inventories of tools and methods; and organising workshops and training events for members of the community.

office@meta-net.eu http://www.meta-net.eu

A
REFERENCES
[1] Aljoscha Burchardt, Markus Egg, Kathrin Eichler, Brigitte Krenn, Jörn Kreutel, Annette Leßmöllmann, Georg Rehm, Manfred Stede, Hans Uszkoreit, and Martin Volk. Die Deutsche Sprache im Digitalen Zeitalter (The German Language in the Digital Age). META-NET White Paper Series. Georg Rehm and Hans Uszkoreit (Series Editors). Springer, 2012.
[2] Directorate-General Information Society & Media of the European Commission. User Language Preferences Online, 2011. http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf.
[3] European Commission. Multilingualism: an Asset for Europe and a Shared Commitment, 2008. http://ec.europa.eu/languages/pdf/comm2008_en.pdf.
[4] Directorate-General of the UNESCO. Intersectoral Mid-term Strategy on Languages and Multilingualism, 2007. http://unesdoc.unesco.org/images/0015/001503/150335e.pdf.
[5] Directorate-General for Translation of the European Commission. Size of the Language Industry in the EU, 2009. http://ec.europa.eu/dgs/translation/publications/studies.
[6] Ádám Nádasdy. Did you know? Educational publication about the Hungarian language.
[7] http://www.bbi.hu/index.php?id=99&fid=110.
[8] http://www.nytud.hu/eng/index.html.
[9] PISA 2009 Results: What Students Know and Can Do: Student Performance in Reading, Mathematics and Science (Volume I). http://www.oecd.org/document/61/0,3343,en_2649_35845621_46567613_1_1_1_1,00.html.
[10] http://www.tarki.hu/tarkitekinto/20050412.html.
[11] http://www.google.com/publicdata?ds=wb-wdi&met_y=it_net_user_p2&idim=country:HUN&dl=hu&hl=hu&q=internethaszn%C3%A1lat.
[12] http://www.nic.hu/English/statisztika/domain-teljes.html.
[13] http://www.webhosting.info/registries/country_stats/HU.
[14] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition). Prentice Hall, 2009.

[15] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[16] Language Technology World (LT World). http://www.lt-world.org/.
[17] Ronald Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Battista Varile, Annie Zaenen, and Antonio Zampolli, editors. Survey of the State of the Art in Human Language Technology (Studies in Natural Language Processing). Cambridge University Press, 1998.
[18] Jerrold H. Zar. Candidate for a Pullet Surprise. Journal of Irreproducible Results, page 13, 1994.
[19] http://www.morphologic.hu/.
[20] http://hunspell.sourceforge.net/.
[21] Spiegel Online. Google zieht weiter davon (Google is still leaving everybody behind), 2009. http://www.spiegel.de/netzwelt/web/0,1518,619398,00.html.
[22] Juan Carlos Perez. Google Rolls out Semantic Search Capabilities, 2009. http://www.pcworld.com/businesscenter/article/161869/google_rolls_out_semantic_search_capabilities.html.
[23] http://www.weblib.com/.
[24] http://www.tmit.bme.hu/home.
[25] http://szotar.mokk.bme.hu/hunglish/search/corpus.
[26] http://itranslate4.eu/.
[27] http://kilgray.com/.
[28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of ACL, Philadelphia, PA, 2002.
[29] Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 462 Machine Translation Systems for Europe. In Proceedings of MT Summit XII, 2009.
[30] http://hlt-platform.hu/.
[31] Gianni Lazzari. Sprachtechnologien für Europa (Language Technology for Europe), 2006. http://tcstar.org/pubblicazioni/D17_HLT_DE.pdf.
[32] Georg Rehm and Hans Uszkoreit. Multilingual Europe: A challenge for language tech. MultiLingual, 22(3):51-52, April/May 2011.

B
META-NET MEMBERS

Austria
Zentrum für Translationswissenschaft, Universität Wien: Gerhard Budin

Belgium
Computational Linguistics and Psycholinguistics Research Centre, Univ. of Antwerp: Walter Daelemans
Centre for Proc. Speech and Images, Univ. of Leuven: Dirk van Compernolle

Bulgaria
Inst. for Bulgarian Lang., Bulgarian Academy of Sciences: Svetla Koeva

Cyprus
Lang. Centre, School of Humanities: Jack Burston

Czech Republic
Inst. of Formal and Applied Linguistics, Charles Univ. in Prague: Jan Hajic

Denmark
Centre for Lang. Technology, Univ. of Copenhagen: Bolette Sandford Pedersen, Bente Maegaard

UK
Inst. for Lang., Cognition and Computation, Center for Speech Technology Research, Univ. of Edinburgh: Steve Renals
Research Inst. of Informatics and Lang. Proc., Univ. of Wolverhampton: Ruslan Mitkov
School of Computer Science, Univ. of Manchester: Sophia Ananiadou

Estonia
Inst. of Computer Science, Univ. of Tartu: Tiit Roosmaa

Finland
Computational Cognitive Systems Research Group, Aalto Univ.: Timo Honkela
Dept. of General Linguistics, Univ. of Helsinki: Kimmo Koskenniemi, Krister Linden

France
Centre National de la Recherche Scientifique, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur: Joseph Mariani
Evaluations and Lang. Resources Distribution Agency: Khalid Choukri

Greece
Inst. for Lang. and Speech Proc., R. C. Athena: Stelios Piperidis

Netherlands
Utrecht Inst. of Linguistics, Utrecht Univ.: Jan Odijk
Computational Linguistics, Univ. of Groningen: Gertjan van Noord

Croatia
Inst. of Linguistics, Faculty of Humanities and Social Science, Univ. of Zagreb: Marko Tadić

Ireland
School of Computing, Dublin City Univ.: Josef van Genabith

Iceland
School of Humanities, Univ. of Iceland: Eirikur Rögnvaldsson

Poland
Inst. of Computer Science, Polish Academy of Sciences: Adam Przepiórkowski, Maciej Ogrodniczuk

Univ. of Łódź: Barbara Lewandowska-Tomaszczyk, Piotr Pęzik
Dept. of Computer Linguistics and Artificial Intelligence, Adam Mickiewicz Univ.: Zygmunt Vetulani

Latvia
Tilde: Andrejs Vasiljevs
Inst. of Mathematics and Computer Science, Univ. of Latvia: Inguna Skadina

Lithuania
Inst. of the Lithuanian Lang.: Jolanta Zabarskaitė

Luxembourg
Arax Ltd.: Vartkes Goetcherian

Hungary
Research Inst. for Linguistics, Hungarian Academy of Sciences: Tamás Váradi
Dept. of Telecommunications and Media Informatics, Budapest Univ. of Technology and Economics: Géza Németh, Gábor Olaszy

Malta
Dept. Intelligent Computer Systems, Univ. of Malta: Mike Rosner

Germany
DFKI (German Research Centre for Artificial Intelligence): Hans Uszkoreit, Georg Rehm
Human Lang. Technology and Pattern Recognition, RWTH Aachen Univ.: Hermann Ney
Dept. of Computational Linguistics, Saarland Univ.: Manfred Pinkal

Norway
Dept. of Linguistic, Literary and Aesthetic Studies, Univ. of Bergen: Koenraad De Smedt
Dept. of Informatics, Lang. Technology Group, Univ. of Oslo: Stephan Oepen

Italy
Consiglio Nazionale Ricerche, Istituto di Linguistica Computazionale Antonio Zampolli: Nicoletta Calzolari
Human Lang. Technology, Fondazione Bruno Kessler: Bernardo Magnini

Portugal
Dept. of Informatics, Univ. of Lisbon: Antonio Branco
Spoken Lang. Systems Lab., Inst. for Systems Engineering and Computers: Isabel Trancoso

Romania
Research Inst. for Artificial Intelligence, Romanian Academy of Sciences: Dan Tufis
Faculty of Computer Science, Univ. Alexandru Ioan Cuza: Dan Cristea

Spain
Barcelona Media: Toni Badia
Institut Universitari de Lingüistica Aplicada, Univ. Pompeu Fabra: Núria Bel
Aholab Signal Proc. Lab., Univ. of the Basque Country: Inma Hernaez Rioja
Center for Lang. and Speech Technologies and Applications, Technical Univ. of Catalonia: Asunción Moreno
Dept. of Signal Proc. and Communications, Univ. of Vigo: Carmen García Mateo

Switzerland
Idiap Research Inst.: Hervé Bourlard

Sweden
Dept. of Swedish Lang., Univ. of Gothenburg: Lars Borin

Serbia
Faculty of Mathematics, Belgrade Univ.: Dusko Vitas, Cvetana Krstev, Ivan Obradovic
Pupin Inst.: Sanja Vranes

Slovakia
Ludovit Stur Inst. of Linguistics, Slovak Academy of Sciences: Radovan Garabik

Slovenia
Jozef Stefan Inst.: Marko Grobelnik

About 100 language technology experts, representatives of the countries and languages represented in META-NET, discussed and finalised the key results and messages of the White Paper Series at a META-NET meeting in Berlin, Germany, on October 21/22, 2011.

C
THE META-NET WHITE PAPER SERIES

English: English
Basque: euskara
Bulgarian
Czech: čeština
Danish: dansk
Estonian: eesti
Finnish: suomi
French: français
Galician: galego
Greek
Dutch: Nederlands
Croatian: hrvatski
Irish: Gaeilge
Icelandic: íslenska
Catalan: català
Polish: polski
Latvian: latviešu valoda
Lithuanian: lietuvių kalba
Hungarian: magyar
Maltese: Malti
German: Deutsch
Norwegian Bokmål: bokmål
Norwegian Nynorsk: nynorsk
Italian: italiano
Portuguese: português
Romanian: română
Spanish: español
Swedish: svenska
Serbian
Slovak: slovenčina
Slovene: slovenščina


In everyday communication, Europe's citizens, business partners and politicians are inevitably confronted with language barriers. Language technology has the potential to overcome these barriers and to provide innovative interfaces to technologies and knowledge. This white paper presents the state of language technology support for the Hungarian language. It is part of a series that analyses the available language resources and technologies for 31 European languages. The analysis was carried out by META-NET, a Network of Excellence funded by the European Commission. META-NET consists of 54 research centres in 33 countries, who cooperate with stakeholders from economy, government agencies, research organisations, non-governmental organisations, language communities and European universities. META-NET's vision is high-quality language technology for all European languages.

META-NET is making a significant contribution to innovation, research and development in Europe and to an effective implementation of the European idea.
Valéria Csépe (Deputy General Secretary of Hungarian Academy of Sciences)

www.meta-net.eu
