Section 1 Introduction
Natural Language Processing Group, Department of Computer Science, University of Leipzig
Archaeological Computing Research Group, Department of Archaeology, University of Southampton
Marco Büchler
Tom Brughmans
Tom Brughmans
University of Southampton (since 2008)
Research Assistant (2009-2010)
PhD student (since 2010)
Research interests:
- Archaeological network analysis
- Networks in the humanities
- Citation analysis
Marco Büchler
Leipzig University (since 2006)
Project Manager eAQUA ('08-'11)
Project Manager eTRACES (since 07/2011)
Research interests:
- Historical text re-use
- Authorship attribution
- OCR correction
Agenda
- Introduction (30 min)
- What is a graph/network? (30 min)
- Introduction to text re-use graphs (20 min)
- Working with text re-use graphs (30 min)
- Conclusion/final discussion (10 min)
What is a network/graph?
The world consists of objects such as persons and buildings, but also of events such as a celebration, a conference or a war. These objects and events do not merely move through time and space; they are also connected.
Object 1: Tom
Connection: Colleagues
Object 2: Marco
A network/graph consists of two sets:
- A set of objects (nodes)
- A set of pairwise connections (edges)
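The two-set definition can be written down directly. The following is a minimal sketch in Python, assuming edges are stored as unordered pairs; the `neighbours` helper and the "colleagues" label are illustrative names, not part of the workshop material.

```python
# Nodes: the two objects named in the example above.
nodes = {"Tom", "Marco"}

# Edges: unordered pairs of nodes. The label is an illustrative attribute.
edges = {frozenset({"Tom", "Marco"}): "colleagues"}


def neighbours(node, edges):
    """Return every node that shares an edge with `node`."""
    result = set()
    for pair in edges:
        if node in pair:
            result |= pair - {node}
    return result


print(neighbours("Tom", edges))  # {'Marco'}
```

Storing each edge as a `frozenset` makes the pair unordered, so (Tom, Marco) and (Marco, Tom) are the same connection.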
Application: Archaeology
Source: F. Baumgardt: Visualisierung von Kookkurrenzgraphen. Bachelorarbeit, Abteilung Automatische Sprachverarbeitung, Universität Leipzig, 2010.
Networks/graphs: Moretti's distant vs. close reading
- What do you think? Is network/graph analysis closer to close reading or to distant reading?
This workshop uses networks/graphs to highlight macro-structure information, i.e. as a mode of distant reading.
Thinking through networks: generating, visualizing and analysing complex re-use graphs in the Humanities
Graph VS network
A graph is a set of vertices and a set of lines between pairs of vertices.
A network consists of a graph and additional information on the vertices or the lines of the graph. (de Nooy et al. 2005, 6-7)
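The distinction can be made concrete with two small data structures. This is only a sketch: the class names `Graph` and `Network` and the attribute keys are assumptions chosen for illustration, and the attribute values are taken from the speaker details given earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    vertices: set
    lines: set  # each line is a frozenset of two vertices


@dataclass
class Network:
    graph: Graph
    vertex_info: dict = field(default_factory=dict)  # extra data per vertex
    line_info: dict = field(default_factory=dict)    # extra data per line


line = frozenset({"Tom", "Marco"})
graph = Graph(vertices={"Tom", "Marco"}, lines={line})

# The network adds information on top of the bare graph structure.
network = Network(
    graph=graph,
    vertex_info={
        "Tom": {"university": "Southampton"},
        "Marco": {"university": "Leipzig"},
    },
    line_info={line: {"relation": "colleagues"}},
)
```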
History of research
Graph visualization
Geographical visualization
Topological visualization
LinkedIn Maps
Circular visualization
Grouped visualization
Figure legend: ITS, ESC, sites
Analytical techniques
Degree ($k_i$)
Indegree ($k_i^{in}$)
Outdegree ($k_j^{out}$)
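As a rough illustration, the three measures can be counted from a list of directed edges. The edge list below is hypothetical, and the sketch assumes that the degree of a vertex in a directed graph is the sum of its indegree and outdegree.

```python
from collections import Counter

# Hypothetical directed edges (source, target); not workshop data.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

# Outdegree: number of edges leaving a vertex.
outdegree = Counter(source for source, _ in edges)
# Indegree: number of edges arriving at a vertex.
indegree = Counter(target for _, target in edges)

# Degree: assumed here to be indegree plus outdegree.
vertices = {vertex for edge in edges for vertex in edge}
degree = {v: indegree[v] + outdegree[v] for v in vertices}

for v in sorted(vertices):
    print(v, "in:", indegree[v], "out:", outdegree[v], "total:", degree[v])
```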
Shortest path
Average shortest path = average of the shortest path lengths between all possible pairs of vertices in the network
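A minimal sketch of this measure, assuming an undirected, unweighted and connected graph stored as an adjacency dictionary. The graph is hypothetical, and breadth-first search is used because every edge counts as one step.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected path graph a - b - c - d.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: number of steps from `start` to each reachable vertex."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


def average_shortest_path(adjacency):
    """Mean shortest path length over all vertex pairs (graph must be connected)."""
    lengths = [
        shortest_path_lengths(adjacency, u)[v]
        for u, v in combinations(sorted(adjacency), 2)
    ]
    return sum(lengths) / len(lengths)


print(average_shortest_path(adjacency))
```

For the four-vertex path the six pairwise lengths are 1, 2, 3, 1, 2 and 1, so the average is 10/6.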
Clustering coefficient (C) = average, over all nodes, of the fraction of possible connections among a node's direct neighbours that actually exist.
$k_i = 3$
$c_i = 3/6 = 0.5$
$C = (c_i + c_j + \dots + c_n) / n$
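The averaging step can be sketched as follows. The sketch assumes the common undirected definition, in which the local coefficient of a node with k neighbours is the number of links among those neighbours divided by k(k-1)/2. This convention may differ from the counting used in the worked example above, and the small graph is hypothetical.

```python
from itertools import combinations

# Hypothetical undirected graph: node "i" has three neighbours,
# of which only "a" and "b" are connected to each other.
adjacency = {
    "i": {"a", "b", "c"},
    "a": {"i", "b"},
    "b": {"i", "a"},
    "c": {"i"},
}


def local_clustering(adjacency, node):
    """Fraction of possible links among the neighbours of `node` that exist."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours, so no possible link
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


print(local_clustering(adjacency, "i"))
print(average_clustering(adjacency))
```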
Small-world networks
Degree distribution
Figure: degree distribution (axis label: number of nodes)
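A degree distribution counts how many nodes have each degree. The sketch below assumes an undirected graph stored as an adjacency dictionary; the star-shaped example graph is hypothetical.

```python
from collections import Counter

# Hypothetical star graph: one hub connected to four leaves.
adjacency = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}

# Degree of each node = number of neighbours.
degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}

# Distribution: degree value -> number of nodes with that degree.
distribution = Counter(degrees.values())

for degree in sorted(distribution):
    print("degree", degree, "number of nodes", distribution[degree])
```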
Similarity of branches of the same knowledge; knowledge changes over time
What is of interest?
Functionality of objects vs. object of interest; critical amount of re-use
- What is text re-use?
- Text re-use techniques
- Bible data
- What decisions need to be made to make a text re-use applicable?
Definitions/Terminology
Citation/quotation
- Modern: (SOURCE, <TEXT>)
- Ancient: mostly (, <TEXT>)
Plagiarism
- Modern: mostly (, <TEXT>)
Text re-use
- Legal aspects are ignored; for this reason: (<TEXT>)
- Literal citations
- Parallel texts
- Paraphrases
Text re-use graph
- G = (V, E), where V is the set of sentences and E is the set of links between elements of V (hyper-textual structure in a digital library)
- A text re-use graph is undirected.
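The definition G = (V, E) with undirected links might be represented as below. This is a sketch under stated assumptions: the identifiers for the sentences and the helper names `add_link` and `linked` are hypothetical, and the two sentences are verse texts from the Bible data used in this workshop.

```python
# V: sentences, keyed by a hypothetical identifier.
V = {
    "ASV:Genesis/1/1": "In the beginning God created the heavens and the earth",
    "KJV:Genesis/1/1": "In the beginning God created the heaven and the earth",
}

# E: links between elements of V, stored as unordered pairs.
E = set()


def add_link(E, u, v):
    """Add an undirected re-use link between sentences u and v."""
    if u == v:
        raise ValueError("a sentence is not linked to itself")
    E.add(frozenset((u, v)))


def linked(E, u, v):
    """True if u and v are linked, regardless of argument order."""
    return frozenset((u, v)) in E


add_link(E, "ASV:Genesis/1/1", "KJV:Genesis/1/1")
add_link(E, "KJV:Genesis/1/1", "ASV:Genesis/1/1")  # same link, other direction
print(len(E))  # 1
```

Because the graph is undirected, adding the link in both directions leaves a single edge.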
The data: 7 different versions of the Holy Bible
* American Standard Version (ASV)
* Bible in Basic English (Basic)
* Darby Bible (Darby)
* King James Version (KJV)
* World English Bible (WEB)
* Webster Bible (Webster)
* Young's Literal Translation (YLT)
The 28,632 verses that occur in all versions were selected.
Bible version   Word tokens   Word types   Token/type ratio
ASV             741267        13485        54.97
Basic           791367        7350         100.85
Darby           732928        14971        48.96
KJV             746746        13466        55.45
WEB             722817        13556        54.68
Webster         744137        13655        54.50
YLT             745422        13973        53.34
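The token/type ratio is the number of word tokens divided by the number of distinct word types. The sketch below assumes a simple tokeniser (lowercasing and splitting on whitespace), which may differ from the one used to produce the table; the function names are illustrative.

```python
def token_type_ratio(token_count, type_count):
    """Average number of occurrences per distinct word form."""
    return token_count / type_count


def count_tokens_and_types(text):
    """Count tokens and types with a naive lowercase/whitespace tokeniser."""
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


# ASV row of the table: 741267 tokens, 13485 types.
asv_ratio = token_type_ratio(741267, 13485)
print(round(asv_ratio, 2))  # 54.97

# A single verse: 10 tokens, 8 types ("the" occurs three times).
counts = count_tokens_and_types("In the beginning God created the heaven and the earth")
print(counts, token_type_ratio(*counts))
```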
Level 1: Pre-processing
- Capitalisation (e.g. all letters to lowercase)
- Normalisation (e.g. removing all diacritics)
- Lemmatisation (e.g. replacing inflected words by their base form)
- Synonym replacement (e.g. replacing a word by its most frequent synonym)
- String similarity (words that are written similarly)
- Reduced strings by word length
Result: text cleaned with regard to language evolution, dialects, spelling or OCR errors, etc.
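The first four steps can be sketched with the Python standard library. The lemma and synonym tables below are tiny hypothetical stand-ins; a real pipeline would use proper lexical resources, and the order of the steps is an assumption.

```python
import unicodedata

# Hypothetical lookup tables, for illustration only.
LEMMAS = {"created": "create", "heavens": "heaven"}
SYNONYMS = {"made": "create"}


def strip_diacritics(text):
    """Normalisation: decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(sentence):
    """Lowercase, remove diacritics, lemmatise, then replace synonyms."""
    words = strip_diacritics(sentence).lower().split()
    words = [LEMMAS.get(word, word) for word in words]
    words = [SYNONYMS.get(word, word) for word in words]
    return words


print(strip_diacritics("Universität"))
print(preprocess("God created the heavens"))
print(preprocess("God made the heaven"))
```

After pre-processing, the two differently worded phrases map to the same word sequence, which is the purpose of this level.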
N-gram feature
Syntactical feature
Overlapping
Non-overlapping
Shingling
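The difference between overlapping and non-overlapping n-grams can be sketched as follows. The function names are illustrative, and the sketch assumes the common reading of shingling as the set of overlapping word n-grams.

```python
def overlapping_ngrams(words, n):
    """Every window of n consecutive words, moving one word at a time."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def non_overlapping_ngrams(words, n):
    """Consecutive blocks of n words; an incomplete final block is dropped."""
    return [tuple(words[i:i + n]) for i in range(0, len(words) - n + 1, n)]


words = "in the beginning god created".split()

print(overlapping_ngrams(words, 2))
print(non_overlapping_ngrams(words, 2))

# Shingles: the set of overlapping n-grams (assumed reading of "shingling").
shingles = set(overlapping_ngrams(words, 2))
```

With five words and n = 2, the overlapping variant yields four bigrams and the non-overlapping variant two, leaving the last word unused.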
Level 3: Selection of training data building a re-use fingerprint
- Local selection strategies work with the knowledge within a re-use unit, such as
  - Local 0 mod p (e.g. position of a word within a re-use unit)
  - Random selection
  - Winnowing
- Global selection strategies work with global knowledge, such as a word list of the entire corpus, like
  - Global 0 mod p (e.g. rank of a word (cf. Zipfian law))
  - Selection of special word classes such as nouns
  - Inverted Document Frequency (IDF) score
  - Minimum feature frequency selection
  - Maximum feature frequency selection
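Three of the listed strategies can be sketched in a few lines. Everything below is an illustrative assumption rather than the workshop's implementation: positions are counted from 0, ranks from 1 in a frequency-sorted word list, and the winnowing variant hashes single words with CRC32 instead of hashing n-grams.

```python
import zlib
from collections import Counter


def local_0_mod_p(words, p):
    """Local 0 mod p: keep words whose position in the unit is divisible by p."""
    return [word for position, word in enumerate(words) if position % p == 0]


def global_0_mod_p(words, corpus_words, p):
    """Global 0 mod p: keep words whose corpus frequency rank is divisible by p."""
    ranked = [word for word, _ in Counter(corpus_words).most_common()]
    rank = {word: r for r, word in enumerate(ranked, start=1)}
    return [word for word in words if word in rank and rank[word] % p == 0]


def winnow_positions(words, window):
    """Winnowing: in every window keep the position of the smallest hash.

    Ties are resolved in favour of the rightmost position.
    """
    hashes = [zlib.crc32(word.encode("utf-8")) for word in words]
    selected = set()
    for start in range(len(hashes) - window + 1):
        block = hashes[start:start + window]
        smallest = min(block)
        offset = max(i for i, h in enumerate(block) if h == smallest)
        selected.add(start + offset)
    return sorted(selected)


unit = "in the beginning god created the heaven and the earth".split()
print(local_0_mod_p(unit, 3))
print(global_0_mod_p(unit, unit, 2))
print([unit[i] for i in winnow_positions(unit, 4)])
```

The local strategies need only the re-use unit itself, whereas the global variant needs a word list for the whole corpus, here passed in as `corpus_words`.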
Level 4: Linking types comparing re-use units
- Intra corpus detection (text re-use)
- Inter corpus detection (modern: plagiarism; ancient: e.g. Bible)
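The two linking types differ only in which pairs of re-use units are compared. The sketch below assumes Jaccard overlap of word sets as the similarity score and a fixed threshold; both are illustrative choices, and the tiny corpora are hypothetical.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Shared features divided by all features of two re-use units."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def intra_corpus_links(corpus, threshold):
    """Compare every pair of units within one corpus."""
    return [
        (u, v)
        for u, v in combinations(sorted(corpus), 2)
        if jaccard(corpus[u], corpus[v]) >= threshold
    ]


def inter_corpus_links(corpus_a, corpus_b, threshold):
    """Compare every unit of one corpus with every unit of another."""
    return [
        (u, v)
        for u, v in product(sorted(corpus_a), sorted(corpus_b))
        if jaccard(corpus_a[u], corpus_b[v]) >= threshold
    ]


# Hypothetical corpora: unit identifier -> list of features (words).
corpus_a = {"a1": ["in", "the", "beginning"], "a2": ["god", "created"]}
corpus_b = {"b1": ["in", "the", "beginning", "god"], "b2": ["the", "earth"]}

print(intra_corpus_links(corpus_a, 0.5))
print(inter_corpus_links(corpus_a, corpus_b, 0.5))
```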
Figure: percentage of all references (0% to 18%) plotted against similarity (0 to 1)
Some results (problem-focused)
- What is a good similarity threshold (literal citations)?
- Dissimilarity vs. fragments
- Plato: a low threshold provides good results as well
- Atthidographers: poor quality, precision less than 20%
- Multi-word expressions like "King Alexander the Great" (literal citations)
- Phrases (Engl.: "in the Name of Our Lord Jesus Christ")
- Again: "We are the people!"
- Editorial references to publications
- Works in different editions
- Embedded text re-use (relation between linking and scoring)
- Detection boundaries of text re-use
Genesis/1/1
ASV: In the beginning God created the heavens and the earth
BasicEnglish: At the first God made the heaven and the earth
Darby: In the beginning God created the heavens and the earth.
KJV: In the beginning God created the heaven and the earth
Webster: In the beginning God created the heaven and the earth
WEB: In the beginning God created the heavens and the earth
YLT: In the beginning of God's preparing the heavens and the earth
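The seven renderings of Genesis 1:1 can be compared mechanically. The sketch below assumes a simple normalisation (lowercasing and stripping punctuation) and groups versions whose normalised wording is identical; the helper name `normalise` is illustrative.

```python
import string
from collections import defaultdict

verses = {
    "ASV": "In the beginning God created the heavens and the earth",
    "BasicEnglish": "At the first God made the heaven and the earth",
    "Darby": "In the beginning God created the heavens and the earth.",
    "KJV": "In the beginning God created the heaven and the earth",
    "Webster": "In the beginning God created the heaven and the earth",
    "WEB": "In the beginning God created the heavens and the earth",
    "YLT": "In the beginning of God's preparing the heavens and the earth",
}


def normalise(text):
    """Lowercase, strip punctuation and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


# Group versions by their normalised wording.
groups = defaultdict(list)
for version, text in verses.items():
    groups[tuple(normalise(text))].append(version)

identical = sorted(sorted(group) for group in groups.values() if len(group) > 1)
print(identical)

# Words shared by two versions that are not identical.
shared = set(normalise(verses["KJV"])) & set(normalise(verses["BasicEnglish"]))
print(sorted(shared))
```

Under this normalisation ASV, Darby and WEB coincide, as do KJV and Webster, while BasicEnglish and YLT remain distinct and overlap only partially with the others.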
A text re-use from a document with a high text re-use coverage is more trustworthy than from a less frequently re-used text.
A text re-use from a section of a document with a high text re-use temperature is more trustworthy than from a less frequently re-used part of a document.
Source (Plot): John Lee: A Computational Model of Text Reuse in Ancient Literary Texts, 2009.
http://groups.google.com/group/the-networks-network?hl=en-GB
Section 5 Discussion
Tom Brughmans
tb2g08@soton.ac.uk http://archaeologicalnetworks.wordpress.com/
Marco Büchler
mbuechler@e-humanities.net http://www.asv.informatik.uni-leipzig.de/