
Thinking through networks: generating, visualizing and analysing complex re-use graphs in the Humanities

Section 1 Introduction

Marco Büchler
Natural Language Processing Group, Department of Computer Science, University of Leipzig

Tom Brughmans
Archaeological Computing Research Group, Department of Archaeology, University of Southampton



Who are we?

Tom Brughmans
University of Southampton (since 2008)
Research Assistant (2009-2010)
PhD student (since 2010)
Research interests: archaeological network analysis, networks in the humanities, citation analysis

Marco Büchler
Leipzig University (since 2006)
Project Manager eAQUA ('08-'11)
Project Manager eTRACES (since 07/2011)
Research interests: historical text re-use, authorship attribution, OCR correction


Agenda

- Introduction (30 min)
- What is a graph/network? (30 min)
- Introduction to text re-use graphs (20 min)
- Working with text re-use graphs (30 min)
- Conclusion/final discussion (10 min)


Who are you?



What do you associate with networks/graphs?

What is a network/graph?

The world consists of objects such as persons and buildings, but also of events such as a celebration, a conference or a war. These objects and events do not simply move through time and space in isolation; they are connected.

Object 1: Tom
Connection: Colleagues
Object 2: Marco

A network/graph consists of two sets:
- A set of objects (nodes)
- A set of pairwise connections (edges)
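As a minimal sketch of the two-set definition, the Tom/Marco example can be written down directly in Python. The names `nodes`, `edges`, `edge_labels` and `are_connected` are illustrative choices, not part of the workshop material.

```python
# Nodes: the set of objects (names taken from the example above).
nodes = {"Tom", "Marco"}

# Edges: the set of pairwise connections. A frozenset makes each pair
# unordered, so (Tom, Marco) and (Marco, Tom) are the same edge.
edges = {frozenset({"Tom", "Marco"})}

# Optional extra information about an edge (here: the kind of connection).
edge_labels = {frozenset({"Tom", "Marco"}): "Colleagues"}


def are_connected(a, b):
    """Return True if an edge joins nodes a and b."""
    return frozenset({a, b}) in edges


print(are_connected("Marco", "Tom"))
print(edge_labels[frozenset({"Marco", "Tom"})])
```

Storing each edge as an unordered pair is one possible design; an ordered tuple would be the natural choice if the direction of a connection mattered.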


Application: Social networks
- From Socrates to Alexander the Great


Application: Archaeology


Application: Archaeology


Application: co-occurrences in text graphs


A visualisation of the contrastive semantics

Source: F. Baumgardt: Visualisierung von Kookkurrenzgraphen. Bachelor's thesis, Abteilung Automatische Sprachverarbeitung, Universität Leipzig, 2010.


Networks/graphs: Moretti's distant vs. close reading
- What do you think? Is network/graph analysis closer to close reading or to distant reading?

This workshop uses networks/graphs to highlight macro-structure information, that is, as a mode of distant reading.



Section 2 What is a graph/network


Graph VS network

A graph is a set of vertices and a set of lines between pairs of vertices.
A network consists of a graph and additional information on the vertices or the lines of the graph. (de Nooy et al. 2005, 6-7)

Simple undirected graph


Graph VS network

Simple directed graph

Weighted directed graph
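The three graph types named on these slides differ only in what is stored per edge. The following sketch, with hypothetical helper names, builds an adjacency structure for each: an undirected edge is stored in both directions, a directed edge only from source to target, and a weighted directed edge additionally carries a number.

```python
from collections import defaultdict


def undirected(edge_list):
    """Each edge (a, b) is stored in both directions."""
    adj = defaultdict(set)
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def directed(edge_list):
    """Each edge (src, dst) is stored only from src to dst."""
    adj = defaultdict(set)
    for src, dst in edge_list:
        adj[src].add(dst)
        adj.setdefault(dst, set())  # make sure dst exists as a node
    return dict(adj)


def weighted_directed(edge_list):
    """Each edge (src, dst, weight) keeps its weight as extra information."""
    adj = defaultdict(dict)
    for src, dst, weight in edge_list:
        adj[src][dst] = weight
        adj.setdefault(dst, {})
    return dict(adj)


print(undirected([("A", "B"), ("B", "C")]))
print(directed([("A", "B"), ("B", "C")]))
print(weighted_directed([("A", "B", 3), ("B", "C", 1)]))
```

In the terminology of the definition above, the weights are the "additional information" that turns a graph into a network.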


History of research

Social Network Analysis (SNA)
- Sociometry
- Actors are social entities
- Actors are interdependent
- Relations are channels for transfer or 'flow' of resources
- Typical applications: the diffusion and adaptation of innovations, belief systems, markets, exchange and power, occupational mobility

Complex network science
- Complexity theory
- Small-world network
- Scale-free network
- Actors can be anything
- Actors are interdependent
- Relations are channels for transfer or 'flow' of resources
- Typical applications: WWW, internet, movie actors, scientific collaboration, citations, human sexual contacts, cellular networks, ecological networks, phone call networks, flight patterns, linguistics, neural networks, protein networks


Graph visualization

Geographical visualization


Graph visualization

Topological visualization

Linkedin Maps


Graph visualization

Circular visualization


Graph visualization

Grouped visualization

[Figure: grouped layout with the group labels ITS, ESA, ESB, ESC, ESD and "sites"]

Analytical techniques

Degree ($k_i$)

Indegree ($k_i^{in}$)

Outdegree ($k_j^{out}$)

Average degree = average of all degree scores in a single network
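The degree measures can be sketched in a few lines. The node names and the function names below are assumptions made for illustration; the edge lists are toy data.

```python
def degrees(nodes, edges):
    """Degree: number of edges attached to each node (undirected graph)."""
    deg = {v: 0 for v in nodes}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def in_out_degrees(nodes, arcs):
    """Indegree and outdegree for a directed graph given as (src, dst) pairs."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for src, dst in arcs:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg


def average_degree(deg):
    """Average of all degree scores in a single network."""
    return sum(deg.values()) / len(deg)


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
deg = degrees(nodes, edges)
print(deg)                  # {'A': 3, 'B': 2, 'C': 2, 'D': 1}
print(average_degree(deg))  # 2.0
```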



Analytical techniques

Shortest path

Average shortest path = average of all shortest path scores between all possible pairs of vertices in the network
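For an unweighted graph, shortest paths can be found with a breadth-first search. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets and, as a simplification, skips pairs of vertices that cannot reach each other; the function names are illustrative.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def average_shortest_path(adj):
    """Average shortest path length over all pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for a in adj:
        for b, d in shortest_path_lengths(adj, a).items():
            if b != a:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# A simple chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(shortest_path_lengths(chain, "A"))  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(average_shortest_path(chain))
```

In the chain the six pairs have path lengths 1, 2, 3, 1, 2 and 1, so the average is 10/6.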

Analytical techniques

Clustering coefficient (C) = average, over all nodes, of the fraction of all possible relationships between a node and its direct neighbours that are actually present.

$k_i = 3$

$c_i = 3/6 = 0.5$

$C = (c_i + c_j + \dots + c_n) / n$
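A minimal sketch of the averaging step follows. It assumes the undirected definition used by Watts and Strogatz, in which a node with k neighbours has k(k-1)/2 possible links among those neighbours; the worked example on the slide may count pairs differently, since its figure is not reproduced here. Function names are illustrative.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of possible links among the neighbours of v that exist."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: no pairs of neighbours, so no clustering
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """C: the average of the local coefficients over all n nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle A-B-C with one extra node D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(local_clustering(adj, "A"))  # 1 of 3 possible neighbour pairs is linked
print(average_clustering(adj))
```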

Analytical techniques

Small-world networks

(Watts and Strogatz 1998)

L = average shortest path length; C = clustering coefficient; p = probability of randomly rewiring an edge


How are persons connected?


Analytical techniques

Degree distribution

[Plot: number of nodes (y-axis) against degree (x-axis)]

Scale-free networks (Barabási and Albert 1999)
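A degree distribution simply counts how many nodes have each degree. The sketch below uses a small hub-and-spoke graph as toy data; in a scale-free network the same count would show many low-degree nodes and a few very high-degree hubs. The function name is an illustrative choice.

```python
from collections import Counter


def degree_distribution(adj):
    """Map each degree value to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


# A hub connected to four otherwise unconnected nodes.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
for degree, count in sorted(degree_distribution(star).items()):
    print(degree, count)
```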


Section 3 Introduction to text re-use graphs


What do you want to measure?

Similarity of branches of the same knowledge; knowledge changes over time

What is of interest?

Functionality of objects vs. object of interest, critical amount of re-use

Agenda for this section

- What is text re-use?
- Text re-use techniques
- Bible data
- What decisions need to be made to make text re-use applicable?


Definitions/Terminology

Citation/quotation
- Modern: (SOURCE, <TEXT>)
- Ancient: mostly (, <TEXT>)

Plagiarism
- Modern: mostly (, <TEXT>)

Text re-use
- Legal right aspects are ignored. For this reason: (<TEXT>)
- Literal citations
- Parallel texts
- Paraphrases

Text re-use graph
- G=(V,E), where V is the set of sentences and E the set of links between elements of V (hyper-textual structure in a Digital Library)
- A text re-use graph is not directed.

7 different versions of the Holy Bible

The data:
* American Standard Version (ASV)
* Bible in Basic English (Basic)
* Darby Bible (Darby)
* King James Version (KJV)
* World English Bible (WEB)
* Webster Bible (Webster)
* Young's Literal Translation (YLT)

28,632 verses that occur in all versions were selected.

Bible version   Word tokens   Word types   Token/type ratio
ASV             741267        13485        54.97
Basic           791367        7350         100.85
Darby           732928        14971        48.96
KJV             746746        13466        55.45
WEB             722817        13556        54.68
Webster         744137        13655        54.50
YLT             745422        13973        53.34
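The token/type ratio is the number of running words divided by the number of distinct words; for the ASV row, 741267 / 13485 is approximately 54.97. A minimal sketch, assuming whitespace tokenisation and lowercasing (the tokenisation actually used for the table is not stated), with a hypothetical function name:

```python
def token_type_stats(text):
    """Return (number of word tokens, number of word types, token/type ratio)."""
    tokens = text.lower().split()  # assumption: split on whitespace only
    types = set(tokens)
    ratio = len(tokens) / len(types) if types else 0.0
    return len(tokens), len(types), ratio


verse = "In the beginning God created the heavens and the earth"
print(token_type_stats(verse))  # (10, 8, 1.25)
```

A low number of types for a similar number of tokens, as in the Basic row, indicates a deliberately restricted vocabulary.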

6 levels of text re-use

Level 1: Pre-processing
Level 2: Feature training
Level 3: Feature selection (Fingerprinting)
Level 4: Linking
Level 5: Scoring
Level 6: Post-processing

Level 1: Pre-processing

- Capitalisation (e.g. all letters to lowercase)
- Normalisation (e.g. removing all diacritics)
- Lemmatisation (e.g. replacing inflected words by their base form)
- Synonym replacements (e.g. replacing a word by its most common (most frequent) synonym)
- String similarity (words that are written similarly)
- Reduced strings by word length

Result: Cleaned text regarding language evolution, dialects, spelling or OCR errors, etc.
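The first four pre-processing steps can be sketched as a small pipeline. The lemma and synonym tables are passed in as plain dictionaries here; these lookups, like the function names, are hypothetical stand-ins for whatever linguistic resources a real project would use, and punctuation is not handled.

```python
import unicodedata


def strip_diacritics(text):
    """Normalisation: decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, lemmas=None, synonyms=None):
    """Lowercase, remove diacritics, then apply lemma and synonym lookups."""
    lemmas = lemmas or {}
    synonyms = synonyms or {}
    words = strip_diacritics(text.lower()).split()
    words = [lemmas.get(w, w) for w in words]    # lemmatisation (toy lookup)
    words = [synonyms.get(w, w) for w in words]  # synonym replacement (toy lookup)
    return words


print(strip_diacritics("Büchler"))  # Buchler
print(preprocess(
    "God created the Heavens",
    lemmas={"created": "create", "heavens": "heaven"},
    synonyms={"create": "make"},
))  # ['god', 'make', 'the', 'heaven']
```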

Level 2: Syntactical training details

N-gram feature

Syntactical feature

Overlapping

Non overlapping

Property of overlapping features

Shingling

Longest Common Consecutive Words

Local hash breaking

Global hash breaking

Property of constant or variable n-gram length
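The difference between overlapping and non-overlapping n-gram features can be sketched as follows. Shingling moves a window of n words forward one word at a time; the non-overlapping variant cuts the text into consecutive blocks. This covers only constant n-gram length, and the function names are illustrative.

```python
def shingles(words, n):
    """Overlapping word n-grams: the window advances by one word."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def chunks(words, n):
    """Non-overlapping word n-grams: the window advances by n words."""
    return [tuple(words[i:i + n]) for i in range(0, len(words), n)]


words = "in the beginning god created".split()
print(shingles(words, 2))
# [('in', 'the'), ('the', 'beginning'), ('beginning', 'god'), ('god', 'created')]
print(chunks(words, 2))
# [('in', 'the'), ('beginning', 'god'), ('created',)]
```

Overlapping features are more robust to insertions and deletions at the cost of producing more features per re-use unit.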


Level 3: Selection of training data (building a re-use fingerprint)

- Local selection strategies work with the knowledge within a re-use unit, such as
  - Local 0 mod p (e.g. position of a word within a re-use unit)
  - Random selection
  - Winnowing
- Global selection strategies work with global knowledge, such as a word list of the entire corpus, like
  - Global 0 mod p (e.g. rank of a word (cf. Zipfian law))
  - Selection of special word classes such as nouns
  - Inverse Document Frequency (IDF) score
  - Minimum feature frequency selection
  - Maximum feature frequency selection
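Two of the local selection strategies can be sketched briefly. `local_zero_mod_p` keeps every feature whose position is divisible by p. `winnow` follows the general idea of winnowing: hash every feature, slide a window over the hashes, and keep the feature with the smallest hash in each window (the rightmost one on ties). The use of CRC32 as hash, the parameter names and the function names are assumptions for illustration, not the workshop's implementation.

```python
import zlib


def local_zero_mod_p(features, p):
    """Local 0 mod p: keep features at positions 0, p, 2p, ... in the unit."""
    return [f for i, f in enumerate(features) if i % p == 0]


def stable_hash(feature):
    """Deterministic hash of a word n-gram given as a tuple of strings."""
    return zlib.crc32(" ".join(feature).encode("utf-8"))


def winnow(features, window, hash_fn=stable_hash):
    """Keep, for every window of hashes, the feature with the minimum hash."""
    hashes = [hash_fn(f) for f in features]
    picked = set()
    for start in range(max(len(hashes) - window + 1, 1)):
        end = min(start + window, len(hashes))
        if start >= end:
            break
        # Smallest hash wins; among equal hashes the rightmost position wins.
        best = max(range(start, end), key=lambda i: (-hashes[i], i))
        picked.add(best)
    return [features[i] for i in sorted(picked)]


print(local_zero_mod_p(list("abcdefg"), 3))               # ['a', 'd', 'g']
print(winnow([5, 3, 8, 1, 9, 2], 3, hash_fn=lambda x: x))  # [3, 1]
```

Both strategies shrink the fingerprint; winnowing additionally guarantees that every window contributes at least one selected feature.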

Level 4: Linking types (comparing re-use units)

Intra corpus detection (Text re-use)
Inter corpus detection (Modern: Plagiarism, Ancient: e.g. Bible)

Level 5: Usage of text re-use by similarity

[Chart: Similarity distribution. x-axis: similarity score of textual references (0 to 1); y-axis: percentage of all references (0% to 20%)]
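A similarity distribution like the one charted here can be produced by scoring each linked pair and binning the scores. The slides do not name the similarity measure, so the sketch below assumes Jaccard overlap of feature sets purely for illustration; the bin width of 0.1 mirrors the chart's x-axis, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Shared features divided by all features of the two re-use units."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_distribution(scores, bins=10):
    """Share of all references falling into each equal-width score bin."""
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[index] += 1
    total = len(scores)
    return [c / total if total else 0.0 for c in counts]


print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))         # 0.5
print(similarity_distribution([0.05, 0.5, 0.55, 1.0]))
```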

Some results (problem focused)

- What is a good similarity threshold (literal citations)?
- Dissimilarity vs. Fragments
  - Plato: Low threshold provides good results as well
  - Atthidographers: Poor quality, precision less than 20%
- Multi word expressions like King Alexander the Great (literal citations)
- Phrases (Engl.: in the Name of Our Lord Jesus Christ)
- Again: We are the people!
- Editorial references to publications
- Works in different editions
- Embedded text re-use (relation between linking and scoring)
- Detection boundaries of text re-use


Section 4 Working with text re-use graphs


Accessing text re-use graphs I


Accessing text re-use graphs II


Macro view (very distant reading) I


Macro view (very distant reading) II

Middle Platonism Neoplatonism


Micro view (close reading) I

Plato: Timaeus 91b7 ff.



Micro view (close reading) II

Plato: Timaeus 91b7 ff.



Micro view (close reading) III


Micro view (close reading) IV


Micro view (close reading) V

Genesis/1/1
ASV: In the beginning God created the heavens and the earth
BasicEnglish: At the first God made the heaven and the earth
Darby: In the beginning God created the heavens and the earth.
KJV: In the beginning God created the heaven and the earth
Webster: In the beginning God created the heaven and the earth
WEB: In the beginning God created the heavens and the earth
YLT: In the beginning of God's preparing the heavens and the earth
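The seven renderings of Genesis 1:1 make a convenient toy example of an undirected text re-use graph: each version is a node, and two versions are linked when their wording is similar enough. The similarity measure (Jaccard overlap of lowercased word sets) and the threshold of 0.5 are assumptions chosen for illustration, not values from the workshop.

```python
from itertools import combinations

verses = {
    "ASV": "In the beginning God created the heavens and the earth",
    "BasicEnglish": "At the first God made the heaven and the earth",
    "Darby": "In the beginning God created the heavens and the earth.",
    "KJV": "In the beginning God created the heaven and the earth",
    "Webster": "In the beginning God created the heaven and the earth",
    "WEB": "In the beginning God created the heavens and the earth",
    "YLT": "In the beginning of God's preparing the heavens and the earth",
}


def word_set(text):
    """Lowercased words with surrounding punctuation removed."""
    return {w.strip(".,;:!?").lower() for w in text.split()}


def jaccard(a, b):
    """Shared words divided by all words of the two verses."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def reuse_graph(units, threshold):
    """Undirected edges (as frozensets) between units scoring at least threshold."""
    sets = {name: word_set(text) for name, text in units.items()}
    edges = {}
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            edges[frozenset({a, b})] = score
    return edges


edges = reuse_graph(verses, threshold=0.5)
for pair, score in sorted(edges.items(), key=lambda item: -item[1]):
    print(sorted(pair), round(score, 3))
```

With these assumptions the Basic English rendering ends up without any edge, while ASV, Darby and WEB form a fully connected group with identical wording.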


Micro view (close reading) VI


Temperature view (distant reading)

A text re-use from a document with a high text re-use coverage is more trustworthy than from a less frequently re-used text.

A text re-use from a section of a document with a high text re-use temperature is more trustworthy than from a less frequently re-used part of a document.
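The two statements above can be made concrete with a small sketch. It assumes, for illustration only, that "temperature" is the number of re-uses pointing at a section and that "coverage" is the share of a document's sections that are re-used at least once; the slides do not give formal definitions, and the document, section and target identifiers below are hypothetical.

```python
from collections import Counter


def temperature(reuse_links):
    """Number of re-uses per (document, section) source."""
    return Counter(source for source, _target in reuse_links)


def coverage(reuse_links, document, sections_in_document):
    """Share of the document's sections that are re-used at least once."""
    reused = {sec for (doc, sec), _target in reuse_links if doc == document}
    return len(reused) / sections_in_document


# Each link: ((source document, source section), re-using text).
links = [
    (("doc1", "s1"), "t1"),
    (("doc1", "s1"), "t2"),
    (("doc1", "s3"), "t3"),
    (("doc2", "s1"), "t4"),
]
print(temperature(links)[("doc1", "s1")])  # 2
print(coverage(links, "doc1", 4))          # 0.5
```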


Dotplot view (mid distant reading)

Source (Plot): John Lee: A Computational Model of Text Reuse in Ancient Literary Texts, 2009.


Section 5 Discussion

Join the networks network Google group!
http://groups.google.com/group/the-networks-network?hl=en-GB

Tom Brughmans
tb2g08@soton.ac.uk
http://archaeologicalnetworks.wordpress.com/

Marco Büchler
mbuechler@e-humanities.net
http://www.asv.informatik.uni-leipzig.de/
