(Presentation) Preliminary Study On Automatic Induction of Rules For Recognition of Semantic Relations Between Proper Names in Polish Texts

Preliminary Study on Automatic Induction of Rules for Recognition of Semantic Relations between Proper Names in Polish Texts
M. Marcińczuk and M. Ptak

Wrocław University of Technology
13 September 2012
Introduction Task definition
Task definition
Our task
To recognize semantic relations between proper names in running text.
Categories of semantic relations

We started with 8 coarse-grained relation categories (inspired by ACE):
- alias: Woody Allen (born Allan Stewart Konigsberg) ...
- affiliation: Steven Ballmer is chief executive officer of Microsoft.
- composition: The Chicago Loop is one of 77 community areas in Chicago.
- creator: Mark Zuckerberg was one of the creators of Facebook.
- location: The Eiffel Tower is located in Paris.
- nationality: Adam Małysz is a former Polish ski jumper.
- neighbourhood: Czech Republic is bordered by Poland to the north.
- origin: Walter Disney was born in Chicago.
Introduction Assumptions & Examples
Assumptions & Examples

Assumptions
1. Proper names are recognized a priori (56 categories).
2. Relations hold between proper names within one sentence.
3. The existence of a relation must be stated in the sentence: it must be possible to point out a pattern which indicates the relation.
Examples
1. Facebook was created by Mark Zuckerberg.
2. Mark Zuckerberg is a creator of Facebook.
3. Mark Zuckerberg has a profile on Facebook.
Introduction Our goal
Our goal
Major goal
To create a supervised method that will learn to recognize semantic relations of given categories on the basis of positive and negative examples.
Current goal
To determine to what extent rule creation can be automated on the basis of positive and negative examples and existing tools and resources (e.g., morphological analysis, wordnet). We wanted a method that can identify repeating patterns in the data which indicate the existence of relations.
Introduction Problem analysis
Problem analysis 1/2

Sample sentences with creator relations:
Mark is a developer of a freeware game known as SimCountry
Tom and Mike are authors of a shareware game named BusTycon
Problem analysis 2/2

Sample sentences with creator relations:
Mark is a developer of a freeware game known as SimCountry
PERSON be a creator of a software game called TITLE

Tom and Mike are authors of a shareware game named BusTycon
Generalizations linking the words of both sentences to the pattern: base form, hypernym, synonym.
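The generalization step can be sketched in Python. This is only an illustration: the lexicon below is hypothetical and hand-written for the two sample sentences, whereas in the study such information would come from morphological analysis and a wordnet.

```python
# Hypothetical hand-made lexicon for the two sample sentences only.
GENERALIZE = {
    "is": "be", "are": "be",                          # base form
    "developer": "creator", "authors": "creator",     # hypernym
    "freeware": "software", "shareware": "software",  # hypernym
    "known as": "called", "named": "called",          # synonym
}
# Proper names are assumed to be recognized a priori.
NAMES = {"Mark": "PERSON", "Tom": "PERSON", "Mike": "PERSON",
         "SimCountry": "TITLE", "BusTycon": "TITLE"}


def generalize(sentence):
    """Replace each word (or two-word expression) by its general label."""
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in GENERALIZE:
            out.append(GENERALIZE[pair])  # two-word expression, e.g. "known as"
            i += 2
        else:
            word = words[i]
            out.append(NAMES.get(word, GENERALIZE.get(word, word)))
            i += 1
    return out


s1 = "Mark is a developer of a freeware game known as SimCountry"
s2 = "Tom and Mike are authors of a shareware game named BusTycon"
print(" ".join(generalize(s1)))
print(" ".join(generalize(s2)))
```

The first sentence generalizes exactly to the pattern shown on the slide; the second shares its tail ("creator of a software game called TITLE"), which is the kind of repeating pattern the method is meant to find.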
Data representation
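Sentence: John created a freeware game called BusTycon

Token      Part of speech   Base form   Annotation
John       N                John        Person
created    V                create
a          DT               a
freeware   N                freeware
game       N                game
called     V                call
BusTycon   N                BusTycon    Title

Labels shown above the tokens on the slide: #code #software #make #game #freeware #name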
Introduction Experimental dataset
Experimental dataset: KPWr corpus

                     KPWr   Training   Held-out   Testing
General statistics
documents             761        418        135       208
tokens            204,354    105,613     43,184    55,557
relations           3,641      2,317        516       808

Detailed relation statistics
affiliation           446        250         61       135
alias                 187         87         33        67
composition           385        277         50        58
creator               187         94         31        62
location            1,141        816        156       169
nationality            27         14          6         7
neighbourhood         159         98         28        33
origin                106         70         12        26
Manual rule creation Procedure
Manual rule creation

- WCCL formalism,
- rules created in limited time on the basis of the training set,
- more attention paid to the origin relation,
- 117 rules created in total.

Sample WCCL rule for the origin relation:

    apply(
      match(
        is(person_nam),             // person_nam
        equal(base[0], urodzić),    // born
        equal(base[0], się),
        equal(base[0], w),          // in
        is(city_nam),               // city_nam
      ),
      actions(
        link(1, person_nam, 5, city_nam, origin)
      )
    )
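What the rule does can be mimicked in a few lines of Python. This is a rough sketch, not the WCCL engine: each annotation is collapsed into a single element, the base forms are assumed to be "urodzić", "się" and "w", and the example sentence is made up for illustration.

```python
def match_origin(elements):
    """Slide a 5-step pattern over the sentence and return the linked pairs."""
    steps = [
        lambda e: e.get("ann") == "person_nam",   # step 1: a person name
        lambda e: e.get("base") == "urodzić",     # step 2: "born"
        lambda e: e.get("base") == "się",         # step 3: reflexive particle
        lambda e: e.get("base") == "w",           # step 4: "in"
        lambda e: e.get("ann") == "city_nam",     # step 5: a city name
    ]
    links = []
    for start in range(len(elements) - len(steps) + 1):
        window = elements[start:start + len(steps)]
        if all(step(e) for step, e in zip(steps, window)):
            # link(1, person_nam, 5, city_nam, origin)
            links.append((window[0]["orth"], window[4]["orth"], "origin"))
    return links


# Hypothetical sentence: "Jan Nowak urodził się w Krakowie."
sentence = [
    {"orth": "Jan Nowak", "ann": "person_nam"},
    {"orth": "urodził", "base": "urodzić"},
    {"orth": "się", "base": "się"},
    {"orth": "w", "base": "w"},
    {"orth": "Krakowie", "ann": "city_nam"},
    {"orth": ".", "base": "."},
]
print(match_origin(sentence))
```

Because every step must match a directly adjacent element, any extra word between the name and the verb breaks the match, which is consistent with the high precision and low recall of such hand-written rules.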
Manual rule creation Baseline results
Baseline results
Relation        Training                  Testing
                P [%]   R [%]   F [%]     P [%]   R [%]   F [%]
affiliation     72.22    9.56   16.88     80.95   11.49   20.12
alias          100.00   13.01   23.02    100.00    7.69   14.29
composition     79.17    5.56   10.38    100.00   18.87   31.75
creator         77.78    6.25   11.57    100.00    1.49    2.94
location        88.79   11.16   19.83    100.00    9.36   17.11
nationality     66.67   28.57   40.00     66.67   28.57   40.00
neighbourhood   85.71    5.36   10.08    100.00    2.78    5.41
origin          87.50   67.12   75.97     84.62   40.74   55.00

- good precision but very low recall,
- best results for the origin relation.
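The F column is the harmonic mean of precision and recall. A minimal Python sketch, assuming the standard balanced F-measure (the slides do not state the formula), reproduces the reported values from the P and R columns:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training set, affiliation and origin rows of the table
print(round(f_measure(72.22, 9.56), 2))
print(round(f_measure(87.50, 67.12), 2))
```

The harmonic mean is dominated by the smaller value, which is why rules with 100% precision still score an F-measure below 25% when recall stays under 15%.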
Automatic rule creation How to discover the rules?
How to discover the rules?

- a sentence can be represented as a graph,
- the nodes in the graph represent tokens, annotations and their attributes,
- a pattern is a path (or, in general, a set of paths) in the graph,
- such a problem can be solved using Inductive Logic Programming.
[Figure: the layered representation of the sentence "John created a freeware game called BusTycon" from the Data representation slide, repeated here.]
Automatic rule creation Data representation
Data representation: basic predicates

- token_after_token(token, token): describes the order of tokens,
- token_orth(token, word): word is the orthographic form of token,
- token_tag(token, tag): tag is a morphological tag assigned to token,
- token_pattern(token, pattern): the orthographic form of token matches pattern, where pattern is one of upper case, lower case, etc.,
- tag_base(tag, word): word is the base form for tag,
- tag_morph(tag, morph): morph is a morphological feature of tag,
- annotation_type(annotation, annotation_type): annotation is of type annotation_type,
- annotation_range(annotation, token, token): annotation starts at the first token and ends at the second token.
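To make the encoding concrete, the Python sketch below turns a tokenized sentence into ground facts for these predicates. The constant names (t0, tag0, a0) and the input format are hypothetical, and token_pattern is left out because the slides do not list the pattern names.

```python
def basic_facts(tokens, annotations):
    """Encode a sentence as ground facts.

    tokens: list of (orth, base, morph) triples
    annotations: list of (type, first_index, last_index), 0-based
    """
    facts = []
    for i, (orth, base, morph) in enumerate(tokens):
        if i > 0:
            # the second argument directly follows the first one
            facts.append(f"token_after_token(t{i - 1}, t{i})")
        facts.append(f"token_orth(t{i}, '{orth}')")
        facts.append(f"token_tag(t{i}, tag{i})")
        facts.append(f"tag_base(tag{i}, '{base}')")
        facts.append(f"tag_morph(tag{i}, '{morph}')")
    for j, (ann_type, first, last) in enumerate(annotations):
        facts.append(f"annotation_type(a{j}, {ann_type})")
        facts.append(f"annotation_range(a{j}, t{first}, t{last})")
    return facts


tokens = [("John", "John", "N"), ("created", "create", "V"),
          ("a", "a", "DT"), ("freeware", "freeware", "N"),
          ("game", "game", "N"), ("called", "call", "V"),
          ("BusTycon", "BusTycon", "N")]
annotations = [("person", 0, 0), ("title", 6, 6)]
for fact in basic_facts(tokens, annotations):
    print(fact + ".")
```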
Automatic rule creation Data representation
Data representation: extended predicates

The role of these predicates is to simplify the search space by shortening the chain of predicates.
- next_orth(token-1, token-2, word): token-2 has orthographic form word and appears directly after token-1,
      next_orth(A, B, C) :- token_after_token(A, B), token_orth(B, C).
- next_pattern(token-1, token-2, pattern): token-2 matches pattern and appears directly after token-1,
- prev_orth(token-1, token-2, word): token-2 has orthographic form word and appears directly before token-1,
- prev_pattern(token-1, token-2, pattern): token-1 matches pattern and appears directly before token-2.
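The definition of next_orth is a join of two basic predicates. As a small Python sketch (relations modelled as sets of tuples, token identifiers hypothetical), the same join looks like this:

```python
def next_orth(token_after_token, token_orth):
    """next_orth(A, B, C) holds when token_after_token(A, B) and token_orth(B, C)."""
    return {(a, b, c)
            for (a, b) in token_after_token
            for (b2, c) in token_orth
            if b == b2}


after = {("t0", "t1"), ("t1", "t2")}
orth = {("t0", "John"), ("t1", "created"), ("t2", "a")}
print(sorted(next_orth(after, orth)))
```

A clause that uses next_orth needs one literal where the basic representation needs two, so a pattern spanning several tokens fits within a smaller clause length and the search reaches it earlier.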
Automatic rule creation ILP configuration
ILP configuration
- general-purpose ILP systems: FOIL, GOLEM, Progol, Aleph,
- the values of the parameters were obtained experimentally on the held-out set,
- parameters:
  - breadth-first search strategy,
  - i=8: upper bound on layers of new variables,
  - clauselength=8: upper bound on the number of literals in a clause,
  - nodes=320000: upper bound on the number of nodes to be explored,
  - minpos=2: lower bound on the number of positive examples to be covered by a clause,
  - noise=10: upper bound on the number of negative examples covered by a clause.
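The parameter names above match settings of Aleph. Assuming Aleph was the system actually used (this slide does not say so explicitly), the configuration could be written as set/2 directives; the Python sketch below only generates those lines and is not taken from the authors' setup.

```python
# Values from the slide; "search": "bf" is Aleph's name for breadth-first search.
SETTINGS = {
    "search": "bf",
    "i": 8,
    "clauselength": 8,
    "nodes": 320000,
    "minpos": 2,
    "noise": 10,
}


def aleph_directives(settings):
    """Render each setting as an Aleph ':- set(Name, Value).' directive."""
    return [f":- set({name}, {value})." for name, value in settings.items()]


print("\n".join(aleph_directives(SETTINGS)))
```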
Automatic rule creation Sample rule
Sample rule
Sample rule for origin relation
    relation(A,B,origin) :-
        annotation_range(B,C,D),
        prev_orth(E,C,word_w),                 % in
        annotation_range(A,F,G),
        next_orth(G,H,meta_BRACKET_LEFT),
        next_orth(H,I,word_ur),                % born
        next_orth(I,J,meta_DOT),
        next_pattern(J,K,PATTERN_NUM),
        next_pattern(K,L,PATTERN_LOWERCASE).

Sample sentences matched by the rule

Jan Nowak (ur. 29 stycznia w Krakowie
  (English: Jan Nowak (born 29 January in Kraków)
Marek Kowalski (ur. 1 marca 1980 roku we Wrocławiu
  (English: Marek Kowalski (born 1 March 1980 in Wrocław)
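The induced rule can be read as two separate checks: the person name is followed by "(", "ur", ".", a number and a lowercase word, and the city name is directly preceded by "w". The Python sketch below follows that reading. It rests on assumptions: prev_orth(E, C, word_w) is taken to mean "the token directly before C is w", the tokenization is hypothetical, and str.isdigit and str.islower stand in for PATTERN_NUM and PATTERN_LOWERCASE.

```python
def matches_origin_rule(tokens, person, city):
    """person, city: (first, last) token index ranges of the two annotations."""
    g = person[1]   # G: last token of annotation A (the person)
    c = city[0]     # C: first token of annotation B (the city)
    if g + 5 >= len(tokens) or c == 0:
        return False
    return (tokens[c - 1] == "w"             # prev_orth(E, C, word_w)
            and tokens[g + 1] == "("         # next_orth(G, H, meta_BRACKET_LEFT)
            and tokens[g + 2] == "ur"        # next_orth(H, I, word_ur)
            and tokens[g + 3] == "."         # next_orth(I, J, meta_DOT)
            and tokens[g + 4].isdigit()      # next_pattern(J, K, PATTERN_NUM)
            and tokens[g + 5].islower())     # next_pattern(K, L, PATTERN_LOWERCASE)


tokens = ["Jan", "Nowak", "(", "ur", ".", "29", "stycznia", "w", "Krakowie"]
print(matches_origin_rule(tokens, person=(0, 1), city=(8, 8)))
```

The two checks are not tied to each other, so any number of tokens may separate the date from the preposition. The preposition test here is an exact comparison with "w", so this sketch does not cover the variant "we".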
Automatic rule creation Evaluation
Evaluation
Comparison with WCCL rules on the testing set:

Relation        WCCL rules                ILP                       F +/- [%]
                P [%]   R [%]   F [%]     P [%]   R [%]   F [%]
affiliation     80.95   11.49   20.12     43.79   49.63   46.53     +26.41
alias          100.00    7.69   14.29     62.86   32.84   43.14     +28.84
composition    100.00   18.87   31.75     40.63   44.83   42.62     +10.85
creator        100.00    1.49    2.94     30.77   12.90   18.18     +15.24
location       100.00    9.36   17.11     31.30   42.60   36.09     +18.98
nationality     66.67   28.57   40.00      0.00    0.00    0.00     -40.00
neighbourhood  100.00    2.78    5.41     12.00    9.38   10.53      +5.12
origin          84.62   40.74   55.00     48.28   53.85   50.91      -4.09

- higher recall but lower precision,
- no rules for nationality (only 14 examples in the training set),
- better overall, but not as good as expected.
Summary Conclusions
Conclusions
1. A generic ILP system can be effectively used to discover rules in the task of semantic relation recognition.
2. Search space control is an option to improve the performance (definition of pruning rules).
3. Creating rules with the * operator is inefficient and needs more attention.
4. The weak point: rules consisting of two disjoint sequences.
5. Insufficient amount of data for some relation categories.
Summary Latest & future work
Latest & future work

Latest work
1. The background knowledge was extended with dependency information about tokens to reduce word order variety.
2. The rules were used as features for classifiers.
3. For some categories (i.e., composition, origin, nationality and affiliation) we reached an F-measure of 50-60%.
Future work
1. Make use of shallow parsers.
2. Force the continuity of patterns or introduce a maximum distance between the parts of a pattern.
3. Apply bootstrapping to extend the training set.
The end
Thank you for your attention.