
Technologies for Automated Indexing
David D. Lewis, Ph.D.
David D. Lewis Consulting
www.DavidDLewis.com

A talk presented at the NFAIS Forum, “Automated Indexing and Abstracting:
Current Status and Future Trends”, St. John's University, April 22, 2005
Outline
• Indexing tasks
• Indexing technologies
– Inputs, outputs, and what’s in between
• Labeling vs. discovery of labels

• GOAL: buzzword-independent ways to
understand advanced indexing technologies
Two Dimensions of Indexing
• Vocabulary control
– Degree to which possible outputs are
determined in advance
• Number of possible outputs
• Whether portions of the text appear directly in the output
• Complexity
– Simple labels vs. structured annotations
• How elaborate the structures are
[Figure: indexing tasks plotted on two axes. CONTROL runs from free to canonical; STRUCTURE runs from simple to complex. Tasks shown: free text indexing, summarization/abstracting, back of book indexing, terminology extraction, named entity extraction, information extraction, faceted indexing, categorization, natural language understanding.]
Indexing Technologies
• Matching
• Transformation
• Classification
• “Understanding”
NASA
asteroid belt

• NASA's Spitzer Space Telescope has
detected the first-ever asteroid belt
encircling a star similar to the sun, a finding
that might help scientists understand how
planets have formed. Astronomers used the
infrared telescope to look at 85 stars similar in
age and mass to our sun, searching for
evidence of dusty discs that may indicate the
presence of planets, NASA said in a
statement.
Matching
• Output: identical to parts of input
• Evidence: exact match
• Tuning to Task: create vocabulary list
• Applications: categorization, faceted
indexing, named entity, terminology
• Platform Support: all
• Maturity: high
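
As a rough illustration of matching, the sketch below checks a text for exact occurrences of terms from a vocabulary list. The vocabulary contents and the function name `match_terms` are assumptions made for this example, not part of the talk.

```python
import re

# Hypothetical vocabulary list; in practice this is the resource
# that gets created when tuning matching to a task.
VOCABULARY = ["NASA", "asteroid belt", "Spitzer Space Telescope", "black hole"]

def match_terms(text, vocabulary):
    """Return the vocabulary terms that occur verbatim in text."""
    found = []
    for term in vocabulary:
        # Word boundaries stop a term from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text):
            found.append(term)
    return found

TEXT = ("NASA's Spitzer Space Telescope has detected the first-ever asteroid "
        "belt encircling a star similar to the sun.")

print(match_terms(TEXT, VOCABULARY))
# ['NASA', 'asteroid belt', 'Spitzer Space Telescope']
```

Every output term is identical to part of the input, which is why the only task-specific resource is the vocabulary list itself.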
G-class stars

• NASA's Spitzer Space Telescope has
detected the first-ever asteroid belt encircling
a star similar to the sun, a finding that might
help scientists understand how planets have
formed. Astronomers used the infrared
telescope to look at 85 stars similar in age
and mass to our sun, searching for evidence
of dusty discs that may indicate the presence
of planets, NASA said in a statement.
planet formation
Structured Rewriting
• Output: restructured version of input
– Reorder, delete, substitute components
• Evidence:
– Proximity/syntactic link of label parts in text
– Cooccurrence of label parts across documents
– Thesaurus links from text units to label parts
• Tuning to task:
– Manual: patterns, thresholds, thesaurus
– Automated: corpus statistics if used
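
One way to picture structured rewriting is the sketch below. It combines two of the evidence types listed above: thesaurus links from text units to label parts, and proximity of those parts in the text. The thesaurus entries, the window size, and the function name `rewrite_label` are all assumptions chosen for illustration.

```python
import re
from itertools import product

# Hypothetical thesaurus: word as it appears in text -> label part.
THESAURUS = {
    "planet": "planet", "planets": "planet",
    "formed": "formation", "forming": "formation", "formation": "formation",
}

def rewrite_label(text, label_parts, window=5):
    """Return the label if every part is evidenced within `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Token positions at which each label part is evidenced.
    positions = {part: [] for part in label_parts}
    for i, tok in enumerate(tokens):
        part = THESAURUS.get(tok)
        if part in positions:
            positions[part].append(i)
    # Try every combination of one position per label part.
    for combo in product(*positions.values()):
        if max(combo) - min(combo) <= window:
            return " ".join(label_parts)
    return None

TEXT = "a finding that might help scientists understand how planets have formed"
print(rewrite_label(TEXT, ["planet", "formation"]))
# planet formation
```

The output "planet formation" never appears verbatim in the input: "have" is deleted and "planets" and "formed" are substituted, which is what separates this from plain matching.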
Structured Rewriting, cont.
• Applications: all, esp. faceted indexing,
named entity, back of book indexing and
terminology extraction
• Platform Support:
– Minimal in general text processing
• proximity + manual effort can do a lot
– Fair to good in back of book indexing, terminology
extraction, named entity
– Highly variable in software for other tasks
Structured Rewriting, cont.
• Maturity: variable, rapidly improving
– Growth of web scraping, site wrapping
– Integration of machine learning and pattern
matching
– Merging with natural language processing
[Figure: the class label C1730 (Astronomy), with weights 0.7, 0.2, 1.8, and 0.5 attached to parts of the passage below.]
• NASA's Spitzer Space Telescope has
detected the first-ever asteroid belt encircling
a star similar to the sun, a finding that might
help scientists understand how planets have
formed. Astronomers used the infrared
telescope to look at 85 stars similar in age
and mass to our sun, searching for evidence
of dusty discs that may indicate the presence
of planets, NASA said in a statement.
Classifiers
• Output: class labels
– Label structure not related to text structure
• Evidence:
– Statistical relationship between dispersed text
units and label occurrence
– Semantic link between text units and label
meaning
• Tuning to task:
– machine learning from labeled examples
– manual rule writing
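
A minimal sketch of a classifier follows, assuming a simple linear model: weights on dispersed text units are summed and compared with a threshold. The weight values echo the numbers on the example slide, but their assignment to particular words, the threshold, and the function names are assumptions. The weights here are set by hand; a learning method would estimate them from labeled examples.

```python
import re

# Hypothetical weights for the class "C1730 (Astronomy)".
# Which word carries which weight is an assumption for illustration.
WEIGHTS = {"telescope": 1.8, "stars": 0.7, "planets": 0.5, "sun": 0.2}
THRESHOLD = 2.0  # assumed decision threshold

def score(text, weights):
    """Sum the weights of the terms that occur anywhere in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(w for term, w in weights.items() if term in tokens)

def classify(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Assign the class label when the summed evidence reaches the threshold."""
    return score(text, weights) >= threshold

TEXT = ("Astronomers used the infrared telescope to look at 85 stars "
        "similar in age and mass to our sun")
print(classify(TEXT))  # True: telescope, stars, and sun together pass 2.0
```

The label "C1730 (Astronomy)" shares no structure with the text; only the statistical association between scattered words and the label matters.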
Classifiers, cont.
• Applications: all, esp. categorization, faceted
indexing, named entity and information
extraction
• Platform Support:
– Fair to good in categorization
– Mostly poor in other software
• Maturity: low to medium, improving
– Usability and use of domain knowledge are areas
for improvement
• ObserveEvent3
– IS-A: ObserveEvent
– ACTOR: Astronomer5
– OBJECT: Group23
– INSTRUMENT: Telescope17
• Telescope…
– IS-A: Tele…
– WAVELEN…
• M…
• Group23
– ITEM-TYPE: Star
– CARDINALITY: 85

• …Astronomers used the infrared telescope to
look at 85 stars similar in age and mass to
our sun…
Natural Language “Understanding”
• Output: knowledge representation structures
– Complex, indirect relation to input
• Evidence:
– Morphology, syntax, semantics, discourse, domain
relationships
• Tuning to task:
– Manual: domain knowledge, lexicon, linguistic
patterns
– Automated: some lexical knowledge, some
patterns
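
To make "knowledge representation structures" concrete, the sketch below stores the ObserveEvent3 example as frames with named slots and expands slot values that point to other frames. Only the frames that are fully legible in the example are included. The dictionary layout and the function name `resolve` are assumptions, and the sketch assumes frames do not refer to each other in a cycle.

```python
# Frames keyed by identifier; each frame maps slot names to values.
FRAMES = {
    "ObserveEvent3": {
        "IS-A": "ObserveEvent",
        "ACTOR": "Astronomer5",
        "OBJECT": "Group23",
        "INSTRUMENT": "Telescope17",
    },
    "Group23": {"ITEM-TYPE": "Star", "CARDINALITY": 85},
}

def resolve(frame_id, frames):
    """Expand slot values that name another known frame into nested dicts."""
    resolved = {}
    for slot, value in frames[frame_id].items():
        if isinstance(value, str) and value in frames:
            resolved[slot] = resolve(value, frames)  # follow the reference
        else:
            resolved[slot] = value  # literal value or unknown frame
    return resolved

event = resolve("ObserveEvent3", FRAMES)
print(event["OBJECT"])
# {'ITEM-TYPE': 'Star', 'CARDINALITY': 85}
```

Nothing in this structure is a substring of the sentence, which reflects the complex, indirect relation between input and output.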
NL Understanding, cont.
• Applications: all (overkill for most)
• Platform Support: poor except in specialized
software
• Maturity: poor; near-term improvements
unclear
Labeling vs. Discovering Labels
• Variants of these technologies can also
be used to build vocabularies
– Terminology extraction vs. discovery
– Classification vs. clustering
• Usefulness to your task must always be
carefully examined!
– Aiding manual creation is often the best one can
hope for
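
The contrast between classification and clustering can be sketched as follows. Classification assigns labels fixed in advance, whereas the greedy single-pass clustering below only groups similar documents and leaves a person to name the groups. The similarity measure (Jaccard overlap of word sets), the threshold, and the sample documents are assumptions picked for illustration.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap of two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.2):
    """Put each document in the first cluster whose seed is similar enough."""
    clusters = []  # lists of document indices; the first index is the seed
    for i, doc in enumerate(docs):
        for members in clusters:
            if jaccard(tokens(doc), tokens(docs[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

DOCS = [
    "telescope detects asteroid belt around star",
    "court rules on patent dispute",
    "infrared telescope observes distant star",
    "patent court hears appeal",
]
print(cluster(DOCS))
# [[0, 2], [1, 3]]
```

The result is two unnamed groups. Turning them into vocabulary entries such as "astronomy" or "patent law" is still a manual step.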
Summary
• Needs of modern indexing are rich and
varied
• Claims of modern software companies
are even more rich and varied
Summary, cont.
• Concentrate on
– What are the actual and possible outputs
– What technology does the work to produce
them
– What resources does that technology use
– Where do those resources come from
The End
• Write DaveLewis@DavidDLewis.com
– Questions?
– To join text classification mailing list
• high signal to noise ratio
• in its 12th year
– Consulting inquiries
