Semi-structured Data
Soumen Chakrabarti
Indian Institute of Technology, Bombay
www.cse.iitb.ac.in/~soumen/
Acknowledgments
S. Sudarshan Arvind Hulgeri
B. Aditya Parag
1
Convergence?
SQLXML search Web searchIR
Trees, reference links Documents are nodes
Labeled edges in a graph
Nodes may contain Hyperlink edges have
Structured data important but
Free text fields unspecified semantics
Data vs. document Google, HITS
Query involves node Query language
data and edge labels remains primitive
Partial knowledge of No data types
schema ok No use of tag-tree
Answer = set of paths Answer = URL list
2
Text indexing basics
“Inverted index” maps from
term to document IDs ;<= >?@AB CD EFDD FG >?@A
Term offset info enables
D1
HCIJ FEK >?@A KFLA
to map terms to
LAH
D2: 7
FEK
Table names, column names
D1: 7
EFDD
Primary keys, RIDs
D1: 3
3
Map
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
4
WHIRL scoring function
A where-clause in WHIRL is a
Boolean predicate as in SQL (age=35)
Score for such clauses are 0/1
Similarity predicate (job ~ ‘Web design’)
Score = cosine(job, ‘Web design’)
Conjunction or disjunction of clauses
Sub-clause scores interpreted as probabilities
score(B ∧ … ∧B ; θ)=Π ≤ ≤ score(B ,θ)
score(B ∨ … ∨B ; θ)=1 — Π ≤ ≤ (1—score(B ,θ))
5
XQuery
Quilt + Lorel + YATL + XML-QL
Path expressions recipes.xml
<dishes_with_flour> { FOR $r IN
document("recipes.xml")
//recipe[//ingredient[@name="flour"]]
RETURN <dish>{$r/title/text()}</dish> }
</dishes_with_flour>
recipe $r
title
name
Tortilla ingredient “flour”
¡¢£ ¤¥¥¤ ¦§¨©ª¨«¨ª¬ ®®
6
Tutorial outline
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
7
How ELIXIR works
ELIXIR VLDB.xml Macbeth.xml Base XML
query documents
Flatten to WHIRL
Rewrite to XML
Result
ÏÐÑÒ ÓÔÔÓ ÕÖ×ØÙ×Ú×ÙÛÜ ÝÞ
ÿ
A more detailed view
àáââãäåàæçèãéäåêëàìå àíðï üãé äî ñ
å
àíîï áðèäåñàìå ñàìå àâðäüä üãé äî ñ
å
àáââãäåàæçèãéäåòóàìå àâýääðåç
îäèíü
çãî
àíîï áðèäåàï áï èäåôáõä ö÷øùúùûä ç îï ãüä àìå
ö÷øùúùû áçü âýíï áíè þçáüàìåñàìåàìå àìåàìå
WHIRL query
Result
ÏÐÑÒ ÓÔÔÓ ÕÖ×ØÙ×Ú×ÙÛÜ Ýß
8
Observations
SQL/XQuery + IR-like result ranking
Schema knowledge remains essential
“Free-form” text vs. tagged, typed field
Element hierarchy, element names,
IDREFs
Typical Web search is two words long
End-users don’t type SQL or XQuery
Possible remedy: HTML form access
Limitation: restricted views and queries
¡ ¢££¢ ¤¥¦§¨¦©¦¨ª« ¬
9
Two paradigms of proximity search
A single node as query response
Find node that matches query terms…
…or is “near” nodes matching query terms
(Goldman et al., 1998)
A connected subgraph as query response
Single node may not match all keywords
No natural “page boundary”
10
Basic search strategy
Node subset A activated because they
match query keyword(s)
Look for node near nodes that are
activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)
11
Scoring single response nodes
Additive
score(r ) = ∑a∈A b(a, r )
Belief
score(r ) = 1 − ∏a∈A (1 − b(a, r ))
Hub indexing
Decompose APSP problem using sparse
vertex cuts
|A|+|B | shortest paths to p A B
|A|+|B | shortest paths to q
d(p,q) p
a b
To find d(a,b) compare
d(apb) not through q q
d(aqb) not through p
d(apqb)
d(aqpb)
Greatest savings when |A|≈|B|
Heuristics to find cuts, e.g. large-degree nodes
12
Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
Two scenarios
Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data
• Keywords spread among DOM nodes and
subtrees
Tutorial outline
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
13
Keyword search on relational data
Tuple = node
GHIJK TUVJW
LMN MOP XYZQ[\]
Normalization
A2 P2 A2 Sudarshan
A3 P2 A3 Hulgeri
Join may be needed Citing Cited PaperID PaperName
to generate results P2 P1 P1 DBXplorer
Cycles may exist in P2 BANKS
schema graph: ‘Cites’
89:; <==< >?@AB@C@BDE <F
T1 T2 T3 K2 T5
T4 T2 T3
T5
T2 ~
T3
T5
xyz{ |}}| K3 |
14
Discussion
Exploits relational Coarse-grained
schema information to ranking based on
contain search schema tree
Pushes final Does not model
extraction of joined proximity or (dis)
tuples into RDBMS similarity of individual
Faster than dealing tuples
with full data graph No recipe for data
directly with less regular (e.g.
XML) or ill-defined
schema
15
Motivation from Web search
“Linux modem driver §¨© ª«¬®¯°±²
Hyperlink path
Á»ÃÄÂÅ
³ÆÇÈÉÊËÌ Í Î
matches query ßÊË ÈüÊýÉ
³ÏÇÈÐÑ
collectively
éÈÌì ýüüýìÇÊÈ ìǸÌ
³þÊÉî·
P work together
ÒÓÔÕ Ö×ØÕ ÓÙ
³û
ÖÚÓÙÕÛÛÓÚ Ü
³Í
Conjunction may
Ý¿¾ÄÂÅ
³Þ Ïßàá
retrieve wrong page âãäÀļãÅ
General notion of
³Î ÖæÛ çÓÔÕ è×ØÕ
³å é Ë Êêë ÊÈ ìíî à
16
Setting edge weights
Edges are generally directed
Paper1
Foreign to primary key in relational data
Containing to contained element in XML
Paper2
IDREFs have clear source and target
Consider the RDMS scenario Cites
Forward edge weight for edge (u,v) Citing (Src) Cited (Dst)
Paper1 Paper2
u, v are tuples in tables R(u), R(v)
Weight s(R(u),R(v)) between tables Paper1
' ()*+,-./01 20./,34 ,56778 96301 )* 30:6*4 ,53
' ;<=>? @ABC=D=>A ?D= @AA 677 3.5 2 4 .E70 E6 ,/3 >? @
Paper2
Proximity search must traverse edges in
both directions … what should w F(u,v) be?
!" # "$% &&
17
Node weight = relevance + prestige
Relevance w.r.t. keyword(s)
0/1: node contains term or it does not
Cosine score in [0,1] as in IR
Uniform model: a
node for each keyword
(e.g. DataSpot)
Popularity or prestige
W.p. d jump to
a random node
Indegree
jump to an
out-neighbor
PageRank
u.a.r.
p(v ) = + (1 − d ) ∑
d p(u )
pqrs tuut vw
N|} u →v OutDegree( u )
~
xyzx{xz
18
Data structures for search
Answer = tree with at least one leaf
containing each keyword in query
Group Steiner tree problem, NP-hard
Query term t found in source nodes S
Single-source-shortest-path SSSP iterator
Initialize with a source (near-) node
Consider edges backwards
getNext() returns next nearest node
For each iterator, each visited node v
maintains for each t a set v.R ® of nodes in
S ® which have reached v
¡¢¢¡ £¤¥¦§¥¨¥§©ª «¬
19
Search example (“Vu Kleinberg”)
Quoc Vu Jon Kleinberg
writes
writes
writes
Organizing Web pages
by “Information Unit” Authoritative sources in a
hyperlinked environment
cites
writes A metric
cites labeling problem
cites
Divyakant Agrawal writes
First response
Quoc Vu Jon Kleinberg
writes
writes
writes
Organizing Web pages
by “Information Unit” Authoritative sources in a
hyperlinked environment
cites
writes A metric
cites labeling problem
cites
Divyakant Agrawal writes
20
Folding in user feedback
As in IR systems, results may be imperfect
Unlike SQL or XQuery, no exact control over
matching, ranking and answer graph form
Ad-hoc choices for node and edge weights
Per-user and/or per-session
By graph/path/node type, e.g. “want author
citing author,” not “author coauthoring with
author”
Across users
Modifying edge costs to favor nodes (or node
types) liked by users
ÕÖ×Ø ÙÚÚÙ ÛÜÝÞßÝàÝßáâ ãä
Re-iterate to convergence
∑u '→v p(u' )
åæçè éêêé ëìíîïíðíïñò óé
21
Prototypes and products
DTL DataSpot Mercado Intuifind
www.mercado.com/
EasyAsk www.easyask.com/
ELIXIR www.smi.ucd.ie/elixir/
XIRQL ls6-www.informatik.uni-
dortmund.de/ir/projects/hyrex/
Microsoft DBXplorer
BANKS www.cse.iitb.ac.in/banks/
üýþÿ
Summary
Confluence of structured and free-format,
keyword-based search
Extend SQL, XQuery, Web search, IR
Many useful applications: product catalogs,
software libraries, Web search
Key idiom: proximity in a graph
representation of textual data
Implicit joins on foreign keys
Proximity via IDREF and other links
Several working systems
Not enough consensus on clean models
üýþÿ
22