
Preliminary Study on Automatic Induction of Rules for Recognition of Semantic Relations between Proper Names in Polish Texts

M. Marcińczuk and M. Ptak


Wrocław University of Technology

13 September 2012

Task definition
Our task
To recognize semantic relations between proper names in running text.

Categories of semantic relations


We started with 8 coarse-grained relation categories (inspired by ACE):

- alias: Woody Allen (born Allan Stewart Konigsberg) ...
- affiliation: Steven Ballmer is chief executive officer of Microsoft.
- composition: The Chicago Loop is one of 77 community areas in Chicago.
- creator: Mark Zuckerberg was one of the creators of Facebook.
- location: The Eiffel Tower is located in Paris.
- nationality: Adam Małysz is a former Polish ski jumper.
- neighbourhood: Czech Republic is bordered by Poland to the north.
- origin: Walter Disney was born in Chicago.


Assumptions & Examples


Assumptions

1. Proper names are recognized a priori (56 categories).
2. Only relations between proper names within one sentence are considered.
3. The existence of a relation must be stated in the sentence, i.e., it is possible to point out a pattern which indicates the relation.

Examples

1. Facebook was created by Mark Zuckerberg.
2. Mark Zuckerberg is a creator of Facebook.
3. Mark Zuckerberg has a profile on Facebook.


Our goal
Major goal
To create a supervised method that will learn to recognize semantic relations of given categories on the basis of positive and negative examples.

Current goal
To determine to what extent rule creation can be automated on the basis of positive and negative examples and existing tools and resources (e.g., morphological analysis, wordnet). We wanted a method able to identify repeating patterns in the data that indicate the existence of relations.


Problem analysis 1/2


Sample sentences with creator relations:
Mark is a developer of a freeware game known as SimCountry

Tom and Mike are authors of a shareware game named BusTycon


Problem analysis 2/2


Sample sentences with creator relations, and the generalized pattern that covers both:

Mark is a developer of a freeware game known as SimCountry

PERSON be a creator of a software game called TITLE

Tom and Mike are authors of a shareware game named BusTycon

Generalization steps marked in the figure: base form, hypernym, synonym.
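The generalization steps named in the figure (base form, hypernym, synonym) can be sketched in Python. This is only an illustration: the three lookup tables are hypothetical, hand-written stand-ins for a morphological analyser and a wordnet, and "known as" is assumed to arrive as a single pre-merged token.

```python
# Hypothetical, hand-written lookup tables (stand-ins for real resources).
BASE_FORM = {"is": "be", "are": "be", "authors": "author"}
SYNONYM = {"developer": "creator", "author": "creator",
           "known as": "called", "named": "called"}
HYPERNYM = {"freeware": "software", "shareware": "software"}


def generalize(tokens, entities):
    """Replace proper names by their category, then apply the
    base-form, synonym and hypernym lookups to every other token."""
    pattern = []
    for token in tokens:
        if token in entities:
            pattern.append(entities[token])
            continue
        word = BASE_FORM.get(token, token)
        word = SYNONYM.get(word, word)
        word = HYPERNYM.get(word, word)
        pattern.append(word)
    return pattern


SENTENCE_1 = ["Mark", "is", "a", "developer", "of", "a", "freeware",
              "game", "known as", "SimCountry"]
SENTENCE_2 = ["Tom", "and", "Mike", "are", "authors", "of", "a",
              "shareware", "game", "named", "BusTycon"]
ENTITIES = {"Mark": "PERSON", "Tom": "PERSON", "Mike": "PERSON",
            "SimCountry": "TITLE", "BusTycon": "TITLE"}

PATTERN_1 = generalize(SENTENCE_1, ENTITIES)
PATTERN_2 = generalize(SENTENCE_2, ENTITIES)
```

With these toy tables the first sentence reduces exactly to the pattern from the slide, and both sentences share the same tail ("of a software game called TITLE").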


Data representation
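Sample sentence: John created a freeware game called BusTycon

Token      POS   Base form   Annotation
John       N     John        Person
created    V     create
a          DT    a
freeware   N     freeware
game       N     game
called     V     call
BusTycon   N     BusTycon    Title

Labels shown above the sentence in the figure: #code, #software, #make, #game, #freeware, #name.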


Experimental dataset: KPWr corpus

General statistics

                  KPWr   Training   Held-out   Testing
documents          761        418        135       208
tokens         204,354    105,613     43,184    55,557
relations        3,641      2,317        516       808

Detailed relation statistics

                  KPWr   Training   Held-out   Testing
affiliation        446        250         61       135
alias              187         87         33        67
composition        385        277         50        58
creator            187         94         31        62
location         1,141        816        156       169
nationality         27         14          6         7
neighbourhood      159         98         28        33
origin             106         70         12        26


Manual rule creation


- WCCL formalism,
- rules created in limited time on the basis of the training set,
- more attention paid to the origin relation,
- 117 rules created in total.

Sample WCCL rule for origin relation:

    apply(
      match(
        is(person_nam),            // person_nam
        equal(base[0], urodzić),   // born
        equal(base[0], się),
        equal(base[0], w),         // in
        is(city_nam),              // city_nam
      ),
      actions(
        link(1, person_nam, 5, city_nam, origin)
      )
    )
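To make the behaviour of the rule concrete, here is a rough Python emulation of the same five-element sequence. It is not a WCCL engine: the token dictionaries, the `ann` key and the example sentence are assumptions made for this sketch, and each annotation is simplified to a single token.

```python
def match_origin(tokens):
    """Return (person_index, city_index) pairs for the sequence
    person_nam, 'urodzić', 'się', 'w', city_nam over consecutive tokens."""
    links = []
    for i in range(len(tokens) - 4):
        window = tokens[i:i + 5]
        if (window[0].get("ann") == "person_nam"
                and window[1]["base"] == "urodzić"   # born
                and window[2]["base"] == "się"
                and window[3]["base"] == "w"         # in
                and window[4].get("ann") == "city_nam"):
            # Mirrors link(1, person_nam, 5, city_nam, origin):
            # the 1st and 5th matched elements are linked.
            links.append((i, i + 4))
    return links


# Hypothetical sentence: "Kowalski urodził się w Krakowie"
# ("Kowalski was born in Kraków").
SENTENCE = [
    {"orth": "Kowalski", "base": "Kowalski", "ann": "person_nam"},
    {"orth": "urodził", "base": "urodzić"},
    {"orth": "się", "base": "się"},
    {"orth": "w", "base": "w"},
    {"orth": "Krakowie", "base": "Kraków", "ann": "city_nam"},
]

LINKS = match_origin(SENTENCE)
```

Because the pattern is a fixed sequence of base forms, any extra word between the person and the city breaks the match, which is consistent with the low recall of hand-written rules.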

Baseline results

                 Training                  Testing
Relation         P [%]   R [%]   F [%]     P [%]   R [%]   F [%]
affiliation      72.22    9.56   16.88     80.95   11.49   20.12
alias           100.00   13.01   23.02    100.00    7.69   14.29
composition      79.17    5.56   10.38    100.00   18.87   31.75
creator          77.78    6.25   11.57    100.00    1.49    2.94
location         88.79   11.16   19.83    100.00    9.36   17.11
nationality      66.67   28.57   40.00     66.67   28.57   40.00
neighbourhood    85.71    5.36   10.08    100.00    2.78    5.41
origin           87.50   67.12   75.97     84.62   40.74   55.00

- good precision but very low recall,
- best results for the origin relation.
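The F column is consistent with the usual balanced F-measure, the harmonic mean of precision and recall. A minimal check in Python, assuming that definition (the slides do not state the formula):

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); inputs and output in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Origin relation, values taken from the table above.
ORIGIN_TRAINING_F = round(f_measure(87.50, 67.12), 2)
ORIGIN_TESTING_F = round(f_measure(84.62, 40.74), 2)
```

The harmonic mean is dominated by the smaller of the two values, which is why rules with 100% precision still end up with single-digit F when recall is only a few percent.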


How to discover the rules?


- a sentence can be represented as a graph,
- the nodes in the graph represent tokens, annotations and their attributes,
- a pattern is a path (or, in general, a set of paths) in the graph,
- such a problem can be solved using Inductive Logic Programming.

[Figure: the sentence "John created a freeware game called BusTycon" drawn as a graph of tokens, their part-of-speech tags and base forms, the Person and Title annotations, and the labels #code, #software, #make, #game, #freeware, #name.]


Data representation: basic predicates

- token_after_token(token, token): describes the order of tokens,
- token_orth(token, word): word is an orthographic form of token,
- token_tag(token, tag): tag is a morphological tag assigned to token,
- token_pattern(token, pattern): token orthographic form is pattern, where pattern is one of upper case, lower case, etc.,
- tag_base(tag, word): word is a base form for tag,
- tag_morph(tag, morph): morphological feature for tag,
- annotation_type(annotation, annotation_type): annotation is of type annotation_type,
- annotation_range(annotation, token, token): annotation starts at the first token and ends on the second token.
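As an illustration of how a sentence might be turned into these predicates, the sketch below emits facts as Python tuples. Several details are assumptions made for the example: the identifiers `t0`, `g0`, `a0`, the use of the part of speech as the only `tag_morph` value, and the pattern names other than `PATTERN_NUM` and `PATTERN_LOWERCASE`.

```python
def orth_pattern(orth):
    """Coarse shape of an orthographic form (pattern names partly assumed)."""
    if orth.isdigit():
        return "PATTERN_NUM"
    if orth.islower():
        return "PATTERN_LOWERCASE"
    if orth.isupper():
        return "PATTERN_UPPERCASE"
    if orth[:1].isupper():
        return "PATTERN_CAPITALIZED"
    return "PATTERN_OTHER"


def basic_facts(tokens, annotations):
    """tokens: list of (orth, pos, base); annotations: (type, first, last)."""
    facts = set()
    for i, (orth, pos, base) in enumerate(tokens):
        tok, tag = f"t{i}", f"g{i}"
        facts.add(("token_orth", tok, orth))
        facts.add(("token_tag", tok, tag))
        facts.add(("token_pattern", tok, orth_pattern(orth)))
        facts.add(("tag_base", tag, base))
        facts.add(("tag_morph", tag, pos))
        if i > 0:
            # The second argument directly follows the first one.
            facts.add(("token_after_token", f"t{i - 1}", tok))
    for j, (ann_type, first, last) in enumerate(annotations):
        ann = f"a{j}"
        facts.add(("annotation_type", ann, ann_type))
        facts.add(("annotation_range", ann, f"t{first}", f"t{last}"))
    return facts


TOKENS = [("John", "N", "John"), ("created", "V", "create"), ("a", "DT", "a"),
          ("freeware", "N", "freeware"), ("game", "N", "game"),
          ("called", "V", "call"), ("BusTycon", "N", "BusTycon")]
ANNOTATIONS = [("Person", 0, 0), ("Title", 6, 6)]
FACTS = basic_facts(TOKENS, ANNOTATIONS)
```

Each tuple corresponds to one ground fact of the background knowledge, for example `("tag_base", "g1", "create")` for `tag_base(g1, create)`.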

Data representation: extended predicates

The role of these predicates is to simplify the search space by shortening the chain of predicates.

- next_orth(token-1, token-2, word): token-2 has orthographic form word and appears directly after token-1,
      next_orth(A, B, C) :- token_after_token(A, B), token_orth(B, C).
- next_pattern(token-1, token-2, pattern): token-2 matches pattern and appears directly after token-1,
- prev_orth(token-1, token-2, word): token-2 has orthographic form word and appears directly before token-1,
- prev_pattern(token-1, token-2, pattern): token-1 matches pattern and appears directly before token-2.
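The effect of the `next_orth` definition can be sketched in Python by joining the two basic predicates it is built from. Facts are represented as tuples, and the token identifiers are hypothetical.

```python
def derive_next_orth(facts):
    """next_orth(A, B, W) holds when token_after_token(A, B)
    and token_orth(B, W) both hold."""
    orth_of = {f[1]: f[2] for f in facts if f[0] == "token_orth"}
    derived = set()
    for f in facts:
        if f[0] == "token_after_token" and f[2] in orth_of:
            derived.add(("next_orth", f[1], f[2], orth_of[f[2]]))
    return derived


FACTS = {
    ("token_orth", "t0", "John"),
    ("token_orth", "t1", "created"),
    ("token_orth", "t2", "a"),
    ("token_after_token", "t0", "t1"),
    ("token_after_token", "t1", "t2"),
}
DERIVED = derive_next_orth(FACTS)
```

A learner that would otherwise need two literals to step to the next word now needs one, so a pattern of the same textual length fits into a shorter clause.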

ILP configuration

- general-purpose ILP systems: FOIL, GOLEM, Progol, Aleph,
- the values of the parameters were obtained experimentally on the held-out set,
- parameters:
  - breadth-first search strategy,
  - i=8: upper bound on layers of new variables,
  - clauselength=8: upper bound on the number of literals in a clause,
  - nodes=320000: upper bound on the nodes to be explored,
  - minpos=2: lower bound on the number of positive examples to be covered by a clause,
  - noise=10: upper bound on the number of negative examples covered by a clause.
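The parameter names listed above coincide with settings of Aleph, one of the four systems mentioned. Assuming Aleph is the system being configured (the slide does not say so explicitly), the settings could be rendered as `set/2` directives with a small Python helper. The helper and the dictionary are illustrative only.

```python
# Values from the slide; "bf" is Aleph's name for breadth-first search.
SETTINGS = {
    "search": "bf",
    "i": 8,
    "clauselength": 8,
    "nodes": 320000,
    "minpos": 2,
    "noise": 10,
}


def aleph_directives(settings):
    """Render each setting as a Prolog directive of the form
    ':- set(Name, Value).'."""
    return [f":- set({name}, {value})." for name, value in settings.items()]


DIRECTIVES = aleph_directives(SETTINGS)
```

Printing `"\n".join(DIRECTIVES)` gives a block that could be placed at the top of a background-knowledge file.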


Sample rule
Sample rule for origin relation

    relation(A,B,origin) :-
        annotation_range(B,C,D),
        prev_orth(E,C,word_w),               % in
        annotation_range(A,F,G),
        next_orth(G,H,meta_BRACKET_LEFT),
        next_orth(H,I,word_ur),              % born
        next_orth(I,J,meta_DOT),
        next_pattern(J,K,PATTERN_NUM),
        next_pattern(K,L,PATTERN_LOWERCASE).

Sample sentences matched by the rule

- Jan Nowak (ur. 29 stycznia w Krakowie
- Marek Kowalski (ur. 1 marca 1980 roku we Wrocławiu
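The induced rule can be read as two separate token sequences: one directly after the person name and one directly before the place name. The Python sketch below checks both parts on the first sample sentence. It assumes that `prev_orth(E, C, word_w)` means "the token directly before C is w" and that matching is done on plain orthographic forms. The tokenization and annotation spans are supplied by hand.

```python
def origin_rule(tokens, person, place):
    """person, place: (first, last) token indexes of the two annotations."""
    g = person[1]   # last token of A
    c = place[0]    # first token of B
    # Part 1: the token directly before the place name is "w" (in).
    if c == 0 or tokens[c - 1] != "w":
        return False
    # Part 2: "(", "ur", ".", a number and a lowercase word
    # directly after the person name.
    if g + 5 >= len(tokens):
        return False
    following = tokens[g + 1:g + 6]
    return (following[:3] == ["(", "ur", "."]
            and following[3].isdigit()
            and following[4].islower())


# "Jan Nowak (ur. 29 stycznia w Krakowie"
# ("Jan Nowak (born 29 January in Kraków").
TOKENS = ["Jan", "Nowak", "(", "ur", ".", "29", "stycznia", "w", "Krakowie"]
PERSON = (0, 1)
PLACE = (8, 8)

MATCHED = origin_rule(TOKENS, PERSON, PLACE)
```

Nothing in the rule constrains the distance between the two parts, so any amount of text may separate the date from the preposition.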

Evaluation
Comparison with WCCL rules on the testing set:

                 WCCL rules                ILP
Relation         P [%]   R [%]   F [%]     P [%]   R [%]   F [%]    F +/- [%]
affiliation      80.95   11.49   20.12     43.79   49.63   46.53     +26.41
alias           100.00    7.69   14.29     62.86   32.84   43.14     +28.84
composition     100.00   18.87   31.75     40.63   44.83   42.62     +10.85
creator         100.00    1.49    2.94     30.77   12.90   18.18     +15.24
location        100.00    9.36   17.11     31.30   42.60   36.09     +18.98
nationality      66.67   28.57   40.00      0.00    0.00    0.00     -40.00
neighbourhood   100.00    2.78    5.41     12.00    9.38   10.53      +5.12
origin           84.62   40.74   55.00     48.28   53.85   50.91      -4.09

- higher recall but lower precision,
- no rules for nationality (only 14 examples in the training set),
- better, but not as good as expected.

Conclusions

1. A generic ILP system can be effectively used to discover rules in the task of semantic relation recognition.
2. Search space control is an option to improve the performance (definition of pruning rules).
3. Creating rules with the * operator is inefficient and needs more attention.
4. The weak point: rules consisting of two disjoint sequences.
5. Insufficient data for some relation categories.


Latest & future work


Latest work
1. The background knowledge was extended with dependency information for tokens to reduce word order variety.
2. The rules were used as features for classifiers.
3. For some categories we reached an F-measure of 50-60% (i.e., composition, origin, nationality and affiliation).

Future work
1. Make use of shallow parsers.
2. Force the continuity of patterns or introduce a maximum distance between the parts of a pattern.
3. Apply bootstrapping to extend the training set.

The end

Thank you for your attention.
