
International Journal on Computational Sciences & Applications (IJCSA), Vol. 3, No. 1, February 2013
DOI: 10.5121/ijcsa.2013.3101

NAMED ENTITY RECOGNITION IN ENGLISH USING HIDDEN MARKOV MODEL


Deepti Chopra and Sudha Morwal
Department of Computer Engineering, Banasthali Vidyapith, Jaipur (Raj.), INDIA
deeptichopra11@yahoo.co.in, sudha_morwal@yahoo.co.in

ABSTRACT
Named Entity Recognition (NER) is the task of detecting and categorizing proper nouns in documents. It is a useful component of many Natural Language Processing (NLP) applications because it makes the extraction of important information from text much easier. In this paper we discuss NER, the main approaches to NER, and the results obtained by performing NER in English using a Hidden Markov Model (HMM).

KEYWORDS
Named Entity Recognition (NER), Hidden Markov Model (HMM), Performance Metrics, Named Entities

1. INTRODUCTION
Named Entity Recognition (NER) is a subtask of Information Extraction. The aim of NER is first to locate the proper nouns, or Named Entities (NEs), in a given text and then to classify these NEs into categories such as Name of Person, Location, Organization, Sport, River, Time, Quantity, Percentage, etc. Applications of NER include machine translation, automatic summarization, question answering, information retrieval and information extraction. NER has so far been performed for English, several European languages, Korean, Chinese, Japanese and Indian languages, among others. Consider the sentence:

Rohit/PERSON plays/O cricket/SPORT

Here Rohit is the name of a person, so an NER system tags it as PERSON. Cricket is the name of a sport, so it receives the SPORT tag. The tag O indicates that the word is not a proper noun or Named Entity.
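As a small illustration (not code from the paper), such a tagged sentence can be represented in Python as a list of (word, tag) pairs, from which the Named Entities are the tokens whose tag is not O:

    # Hypothetical representation of the tagged example sentence.
    tagged_sentence = [("Rohit", "PERSON"), ("plays", "O"), ("cricket", "SPORT")]

    # Keep only the Named Entities (tokens whose tag is not O).
    named_entities = [(word, tag) for word, tag in tagged_sentence if tag != "O"]
    print(named_entities)   # [('Rohit', 'PERSON'), ('cricket', 'SPORT')]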

2. METHODOLOGIES OF NER
The following approaches are used to perform NER [1][5][6][16][18]:

1. Rule Based Approach
   1.1 Linguistic Approach [3][11]
   1.2 List Look-up Approach
2. Machine Learning Based Approach
   2.1 Hidden Markov Model (HMM) [12]
   2.2 Maximum Entropy Markov Model (MEMM) [16]
   2.3 Conditional Random Fields (CRF) [3]
   2.4 Support Vector Machine (SVM) [4]
   2.5 Decision Tree [7][9]

3. NER IN ENGLISH USING HMM


We trained on 6,680 words of English (25 files) taken from the Treebank corpus available in NLTK. The tags used to annotate the documents are shown in Table 1. Of these tags, 8 are Named Entity tags and OTHER marks words that are not Named Entities.

TABLE 1. Tags used for NER in English

S.No.   Tag        Meaning
1       PER        Name of Person
2       OTHER      Not a Named Entity
3       ORG        Name of Organization
4       CO         Name of Country
5       MAGAZINE   Name of Magazine
6       WEEK       Name of Week
7       LOC        Name of Location
8       PC         Name of Personal Computer
9       MONTH      Name of Month
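The paper does not show the code used to read the corpus; the following is a minimal sketch, assuming the Penn Treebank sample distributed with NLTK and simply taking its first 25 files (the paper does not state which 25 files were used):

    # Load 25 files of the Treebank sample shipped with NLTK.
    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank")            # fetch the corpus sample if not already installed

    files = treebank.fileids()[:25]      # assumption: the first 25 files
    words = treebank.words(files)        # flat list of tokens from those files
    print(len(files), "files,", len(words), "words")

The tokens are then annotated with the tags of Table 1 to produce the training data.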

Example of NER in English using HMM. Consider the following raw text:

NormanRicken was 49 years old
DonaldPardus has been to Argentina
NormanRicken was named senior vice president
He was vice president of CrayComputerCorp.
NormanRicken was president of Australia BassStraitfields

After tagging, we obtain the following annotated text:

NormanRicken/PER was/OTHER 49/OTHER years/OTHER old/OTHER
DonaldPardus/PER has/OTHER been/OTHER to/OTHER Argentina/CO
NormanRicken/PER was/OTHER named/OTHER senior/OTHER vice/OTHER president/OTHER
He/other was/other vice/OTHER president/OTHER of/other CrayComputerCorp./ORG
NormanRicken/PER was/OTHER president/OTHER of/other Australia/CO BassStraitfields/ORG

An HMM has three parameters: the start probability, the transition probability and the emission probability. The start probability of a tag is the probability that the tag occurs first in a sentence. The transition probability is the number of transitions from a given tag to the next tag divided by the number of occurrences of the given tag. The emission probability is the number of times a particular word occurs with tag t divided by the number of occurrences of tag t. The HMM parameters can be described mathematically as follows:

π = πi = (Number of sentences whose first tag is si) / (Total number of sentences).
A = aij = (Number of transitions from state si to state sj) / (Number of transitions from state si).
B = bj(k) = (Number of times in state sj and observing symbol k) / (Number of times in state sj).
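The paper does not include its training code; the sketch below (an assumption, not the authors' implementation) computes these three parameter sets by the count ratios above, from training data stored as sentences of (word, tag) pairs. Sentence-final tags contribute no outgoing transition here; the paper does not say how it handles that case.

    from collections import Counter, defaultdict

    def train_hmm(tagged_sentences):
        # tagged_sentences: list of sentences, each a list of (word, tag) pairs
        start_counts = Counter()
        tag_counts = Counter()
        trans_counts = defaultdict(Counter)
        emit_counts = defaultdict(Counter)

        for sent in tagged_sentences:
            start_counts[sent[0][1]] += 1                 # tag of the first word in the sentence
            for i, (word, tag) in enumerate(sent):
                tag_counts[tag] += 1
                emit_counts[tag][word] += 1
                if i + 1 < len(sent):
                    trans_counts[tag][sent[i + 1][1]] += 1   # tag -> next tag

        n_sents = len(tagged_sentences)
        start_p = {t: c / n_sents for t, c in start_counts.items()}
        trans_p = {t: {u: c / sum(nxt.values()) for u, c in nxt.items()}
                   for t, nxt in trans_counts.items()}
        emit_p = {t: {w: c / tag_counts[t] for w, c in ws.items()}
                  for t, ws in emit_counts.items()}
        return start_p, trans_p, emit_p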
In the above example we have:

States: ['PER', 'OTHER', 'CO', 'other', 'ORG']

Start probability:
  {'other': 0.2, 'PER': 0.8}

Transition probability:
  'ORG':   {'ORG': 0.0, 'OTHER': 0.0, 'CO': 0.0, 'other': 0.0, 'PER': 0.5}
  'OTHER': {'ORG': 0.0, 'OTHER': 0.6875, 'CO': 0.0625, 'other': 0.1875, 'PER': 0.0625}
  'CO':    {'ORG': 0.5, 'OTHER': 0.0, 'CO': 0.0, 'other': 0.0, 'PER': 0.5}
  'other': {'ORG': 0.25, 'OTHER': 0.25, 'CO': 0.25, 'other': 0.25, 'PER': 0.0}
  'PER':   {'ORG': 0.0, 'OTHER': 1.0, 'CO': 0.0, 'other': 0.0, 'PER': 0.0}

Emission probability (only non-zero entries shown; all other word-tag pairs have probability 0):
  'ORG':   {'CrayComputerCorp.': 0.5, 'BassStraitfields': 0.5}
  'OTHER': {'was': 0.1875, 'president': 0.1875, 'vice': 0.125, 'has': 0.0625, 'named': 0.0625, 'old': 0.0625, '49': 0.0625, 'been': 0.0625, 'years': 0.0625, 'to': 0.0625, 'senior': 0.0625}
  'CO':    {'Argentina': 0.5, 'Australia': 0.5}
  'other': {'of': 0.5, 'He': 0.25}
  'PER':   {'NormanRicken': 0.75, 'DonaldPardus': 0.25}
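The paper does not show how the trained model is applied to new text. A standard choice is Viterbi decoding; the sketch below is a minimal, unsmoothed version that would work with dictionaries shaped like those above (unseen words simply receive probability 0, so real use would need smoothing).

    def viterbi(words, states, start_p, trans_p, emit_p):
        # V[i][s] = probability of the best tag sequence for words[0..i] ending in state s
        V = [{}]
        path = {}
        for s in states:
            V[0][s] = start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0)
            path[s] = [s]

        for i in range(1, len(words)):
            V.append({})
            new_path = {}
            for s in states:
                prob, prev = max(
                    (V[i - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(words[i], 0.0), p)
                    for p in states)
                V[i][s] = prob
                new_path[s] = path[prev] + [s]
            path = new_path

        prob, best = max((V[-1][s], s) for s in states)
        return path[best]

    # Example (assuming the dictionaries above are bound to these variable names):
    # viterbi("DonaldPardus has been to Argentina".split(),
    #         states, start_probability, transition_probability, emission_probability)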

4. RESULTS
Figure 1 displays the results obtained by performing NER in English using HMM. We obtained an F-Measure of 73.8%, and more than 70% accuracy in correctly identifying the Named Entities, especially the names of persons.
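The paper reports the F-Measure but not how it was computed. A minimal sketch of the standard precision/recall/F-Measure calculation over aligned gold and predicted tag sequences is given below; treating only non-OTHER tags as Named Entities is an assumption.

    def evaluate(gold_tags, predicted_tags, not_ne_tag="OTHER"):
        # gold_tags and predicted_tags are aligned lists of tags, one per token.
        tp = sum(1 for g, p in zip(gold_tags, predicted_tags)
                 if g == p and g != not_ne_tag)              # correctly identified NEs
        predicted_ne = sum(1 for p in predicted_tags if p != not_ne_tag)
        gold_ne = sum(1 for g in gold_tags if g != not_ne_tag)

        precision = tp / predicted_ne if predicted_ne else 0.0
        recall = tp / gold_ne if gold_ne else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure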


[Figure 1: bar chart showing, for each tag (PER, ORG, CO, MAGAZINE, WEEK, LOC, PC, MONTH), the total number of tags in the training sentences and the percentage of tags correctly identified.]

Figure 1. Results of NER in English using HMM

5. CONCLUSION
We performed NER on 25 files (6,680 words) of the Treebank corpus using a Hidden Markov Model. We considered a total of 8 Named Entity tags and obtained an F-Measure of 73.8%. The remaining challenge is to increase the performance of this NER system; the F-Measure can be enhanced by combining other approaches with HMM.

REFERENCES
[1] Animesh Nayan, B. Ravi Kiran Rao, Pawandeep Singh, Sudip Sanyal and Ratna Sanyal, "Named Entity Recognition for Indian Languages", In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, pp. 97-104, 2008. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5014.pdf
[2] Asif Ekbal and Sivaji Bandyopadhyay, "Named Entity Recognition using Support Vector Machine: A Language Independent Approach", International Journal of Electrical and Electronics Engineering, 4:2, 2010. Available at: http://www.waset.org/journals/ijeee/v4/v4-2-19.pdf
[3] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay, "Language Independent Named Entity Recognition in Indian Languages", In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pp. 33-40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[4] Asif Ekbal and Sivaji Bandyopadhyay, "Bengali Named Entity Recognition using Support Vector Machine", In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pp. 51-58, Hyderabad, India, January 2008. Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5008.pdf
[5] B. Sasidhar, P. M. Yohan, A. Vinaya Babu and A. Govardhan, "A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011. Available at: http://www.ijcsi.org/papers/IJCSI-8-2-438-443.pdf
[6] Darvinder Kaur and Vishal Gupta, "A Survey of Named Entity Recognition in English and other Indian Languages", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010. Available at: http://ijcsi.org/papers/7-6-239-245.pdf
[7] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos, "Learning Decision Trees for Named-Entity Recognition and Classification". Available at: http://users.iit.demokritos.gr/~petasis/Publications/Papers/ECAI-2000.pdf
[8] G. V. S. Raju, B. Srinivasu, S. Viswanadha Raju and K. S. M. V. Kumar, "Named Entity Recognition for Telugu Using Maximum Entropy Model". Available at: http://www.jatit.org/volumes/research-papers/Vol13No2/4Vol13No2.pdf
[9] Hideki Isozaki, "Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning". Available at: http://acl.ldc.upenn.edu/acl2001/MAIN/ISOZAKI.PDF
[10] James Mayfield, Paul McNamee and Christine Piatko, "Named Entity Recognition using Hundreds of Thousands of Features". Available at: http://acl.ldc.upenn.edu/W/W03/W03-0429.pdf
[11] Kamaldeep Kaur and Vishal Gupta, "Name Entity Recognition for Punjabi Language", IRACST International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555, Vol. 2, No. 3, June 2012.
[12] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", In Proceedings of the IEEE, 77(2), pp. 257-286, February 1989. Available at: http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
[13] Padmaja Sharma, Utpal Sharma and Jugal Kalita, "Named Entity Recognition: A Survey for the Indian Languages", Language in India: Strength for Today and Bright Hope for Tomorrow, Volume 11:5, May 2011, ISSN 1930-2940. Available at: http://www.languageinindia.com/may2011/v11i5may2011.pdf
[14] Praveen Kumar P and Ravi Kiran V, "A Hybrid Named Entity Recognition System for South Asian Languages". Available at: http://www.aclweb.org/anthology-new/I/I08/I08-5012.pdf
[15] S. Pandian, K. A. Pavithra and T. Geetha, "Hybrid Three-stage Named Entity Recognizer for Tamil", INFOS2008, March 2008, Cairo, Egypt. Available at: http://infos2008.fci.cu.edu.eg/infos/NLP_08_P045-052.pdf
[16] Shilpi Srivastava, Mukund Sanglikar and D. C. Kothari, "Named Entity Recognition System for Hindi Language: A Hybrid Approach", International Journal of Computational Linguistics (IJCL), Volume (2), Issue (1), 2011. Available at: http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL19.pdf
[17] Sujan Kumar Saha, Sudeshna Sarkar and Pabitra Mitra, "Gazetteer Preparation for Named Entity Recognition in Indian Languages". Available at: http://www.aclweb.org/anthology-new/I/I08/I08-7002.pdf
[18] Sujan Kumar Saha, Sanjay Chatterji and Sandipan Dandapat, "A Hybrid Approach for Named Entity Recognition in Indian Languages". Available at: http://aclweb.org/anthology-new/I/I08/I08-5004.pdf
[19] S. Biswas, M. K. Mishra, Sitanath Biswas, S. Acharya and S. Mohanty, "A Two Stage Language Independent Named Entity Recognition for Indian Languages", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 1(4), 2010, pp. 285-289. Available at: http://www.ijcsit.com/docs/vol1issue4/ijcsit2010010416.pdf
[20] Vishal Gupta and Gurpreet Singh Lehal, "Named Entity Recognition for Punjabi Language Text Summarization", International Journal of Computer Applications (0975-8887), Volume 33, No. 3, November 2011. Available at: http://www.advancedcentrepunjabi.org/pdf/NER%20for%20Summarization.pdf

ABOUT AUTHORS

Deepti Chopra received a B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011. She is currently pursuing an M.Tech degree in Computer Science and Engineering at Banasthali University, Rajasthan. Her research interests include Artificial Intelligence, Natural Language Processing and Information Retrieval. She has published many papers in international journals and conferences.

Sudha Morwal is an active researcher in the field of Natural Language Processing. She is currently working as an Associate Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She holds an M.Tech (Computer Science), NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University (Rajasthan), India. She has published many papers in international conferences and journals.
