Search
Collection Engine
HTTP Server 8080
Other
Farsi Browser
Additions
Images of
Type Foundry words and
lines
library by Columbia University [18]. bound on information content (in the current version,
By using an embedded HTTP server, we can handle minimum stem length = 3). Another difference is that
non-Latin scripts easily. The search engine performs all the Farsi stemmer identifies prefixes, unlike the Porter
processing with Unicode encoded documents. When an Stemmer.
item is displayed, the Unicode encoding is sent to the em- The stemmer is aggressive, removing every terminal
bedded HTTP server. If the browser supports Unicode substring that matches a known suffix, sometimes remov-
encoded fonts, the item is displayed using the browser. ing part of the root word. However, it rarely removes
If the browser doesn’t support Unicode fonts, the item is enough of a root to overconflate.
converted by the “type foundry” into a series of portable The stemmer attempts to match a suffix to a word end-
network graphics (PNG) images. We use the freetype1 ing by feeding the word to a finite state machine. A word
engine by the Freetype Project to render the Unicode may be fed the machine as many as three times, identi-
script into the images [17]. These images are then sent fying as many as three suffixes stacked on the same root.
to the browser for display. Figure 2 depicts an overview The state machine reads the word backwards, i.e., it first
of the system. reads the final character (the last character in reading or-
der), then proceeds toward the first character. If the ma-
5 Stemmer and Stopword List chine halts in an accepting state, we know that a suffix
matches the end of the word and that the input word is
The Farsi stemmer identifies common Farsi affixes in
long enough to retain the minimum stem length after re-
Farsi words. Conditionally, it removes the affixes, pro-
moving that suffix. The stemmer associates each accept-
ducing an effective stem.
ing state with a specific suffix and a suffix class such as
The Farsi stemmer is similar to the Porter stem-
plural.
mer [13]. For example, both are based on the mor-
For example, suppose the input is ÍÛw³ÛÌÏ (“we were
phological rules of their respective languages. Both
not going”). The finite state machine will identify the
stemmers match words with a set of suffixes and use
suffix ÍÚ as the longest matching suffix. The final state of
multiple phases to conform to the rules of suffix stack-
the machine tells the stemmer that the suffix belongs to
ing. Also, they both enforce lower bounds on how much
the verb class.
information a stem must retain. However, there are
The suffix classes determine the modifications the
differences. For example, the Porter stemmer counts
stemmer makes and the additional affixes the stemmer
substrings of consecutive consonants and vowels, es-
attempts to match. The general rules for modifying
timating the information content, before deciding to
words according to suffix class are:
remove a suffix. In Farsi, many spoken vowels are not
written, so the stemmer cannot count them. Therefore, Verb Remove the identified suffix. In accordance with
the Farsi stemmer uses stem length to define a lower the rules of prefix stacking, remove as many prefixes
Farsi Word English Equivalent ÎmÚm ÜwmÔp ¥pmØ xÚn« Ê« zm nÚa
Îa that # xm ¤ ÊË xËÉ
mE why Does the lack of health standards in Iran endanger
n Ï a there the population?
xm is
Únp must Figure 4: An Example Query
Øp be
with oil, social problems, and historical changes in the
np with/by
Middle East.
Figure 3: Potential Farsi Stopwords The sixty queries were designed by native speakers of
Farsi familiar with these documents. An example query
as possible while retaining sufficient stem length. and its translation appears in Figure 4. These queries
The expression {ÜË}{o|Î} describes the language were input using our keyboard applet for the purpose of
of relevant prefixes. relevancy judgment gathering.
We also built a web-based application to display the
Possessive Remove the identified suffix. Feed the re- queries and the articles side-by-side for our Farsi experts.
mainder to the finite state machine to check for non- After an expert reads the article, he makes binary deci-
possessive noun suffixes. sions as to the relevance of the document to each query.
Plural Remove the identified suffix. Feed the remain- No relevancy ranking is assessed.
der to the finite state machine to check for non-
7 Search Engine
possessive, non-plural noun suffixes.
Our search engine will contain several methodologies be-
All other suffix types Remove the identified suffix. cause we plan to study their effectiveness in retrieving
Consider the previous example, ÍÛw³ÛÌÏ. After identi- Farsi language texts. Initially two methods will be ap-
fying the suffix and suffix class, the stemmer will remove plied: the vector space and the probabilistic.
the suffix ÍÚ, leaving x³ÛÌÏ. Then it will look for the In the well-known vector space model, documents and
prefix o, and not find it. It will then find the prefix Î. It queries are represented as n-dimensional vectors of term
will check the word length, and remove the prefix, leaving weights. For a given query vector Q documents are
x³ÛË. It will then find the prefix ÜË, check length, and ranked in terms of their similarity to the query as deter-
remove the prefix. It outputs x³, which is the infinitive mined by the cosine measure:
“to go”, with the terminal Î removed. Q · Dd
Pn
wq,t × wd,t
Now, suppose the input is ÊnÔpnw», “my books.” The cosine(Q, Dd ) = = t=1
| Q || Dd | Wq Wd
finite state machine will identify the suffix Ê, and remove
it, leaving nÔpnw». Because Ê is a possessive suffix, the Dd represents the vector of a document d. The inner
stemmer feeds the remainder to the finite state machine product Q · Dd is interpreted in the standard way as
to look for more noun suffixes of the same word. Now, the sum of the products of the weights of correspond-
the finite state machine will identify the suffix nÓ, and re- ing terms in the query and document, wq,t and wd,t re-
move it, leaving onw». Because nÓ is a plural suffix, the spectively. Weights are determined using the usual tf×idf
stemmer feeds the remainder to the finite state machine scheme. The values Wq and Wd are the Euclidean lengths
to look for more noun suffixes. This time, the finite state of the query and a document [19].
machine halts in a rejecting state, signifying an absence Our Farsi retrieval engine will also use probabilistic
of more suffixes, so the stemmer returns onw», “book.” techniques. In a probabilistic search engine, documents
In addition, we are using our Farsi document collection are ranked by the probability that given a query q, doc-
to generate a list of stopwords. Stopwords are identified ument d should be retrieved. Using Bayes theorem and
by frequency analysis and expert judgment. Figure 3 dropping the constant probability for the fixed query from
shows potential stopwords from the list. the denominator, we have:
Our document collection consists of 1850 documents and Recent studies have shown that the specific probabilistic
60 queries. These documents were collected within a six approach called statistical language modeling has proven
month period from many websites. Some of these web- more effective than others on English collections [12, 20].
sites originate in Iran and they typically represent elec- Some success has been shown for Arabic texts as well [9].
tronic versions of popular Iranian newspapers and mag- Statistical language modeling rejects the assumption of
azines. The rest of the articles are from websites origi- other probabilistic approaches that there is an underlying
nating in the U.S. or in Europe. These documents mainly distribution model which the data should fit. For exam-
cover topics such as politics, economic issues associated ple, Bookstein and Swanson assumed that terms would
occur in documents with a Poisson distribution [1]. In- Future work includes performance testing of our stem-
stead, statistical language modeling uses non-parametric mer and search engine. We are also planning to make de-
techniques to estimate p(q|d). tailed reports available on our input/display technology,
Based on the concept of a language model developed stemmer, stopword list, and search engine implementa-
for speech recognition, a query is viewed as a randomly tion at www.isri.unlv.edu.
generated set of terms. The expression p(q|d) is then
understood as the probability that certain terms will oc- References
cur in a query given a particular model of the docu-
ments [12, 20]. Of course, queries are not typically gen- [1] Abraham Bookstein and Don Swanson. Probablis-
erated randomly, but from the “perspective” of the doc- tic models for automatic indexing. Journal of the
ument model, how the query is created is unknown and American Society for Information Science, pages
thus “appears” to be random. 312–318, 1974.
Since p(d) is typically assumed to be uniform in equa- [2] Stanley Chen and Joshua Goodman. An empirical
tion (1), the expression p(q|d) alone requires an estimate. study of smoothing techniques for language mod-
The basic approach in statistical language modeling is to eling. Technical Report TR-10-98, Harvard Univ.,
begin with the maximum likelihood estimate, 1998.
tft,d
pmle (t|d) = [3] Center for Advanced Research on Language Acqui-
DLd
sition (CARLA). Less commonly taught languages
where tft,d is the number of occurrences of a term t in a
(LCTL) project. http://carla.acad.umn.edu/
document d, commonly called term frequency, and DLd
LCTL/.
is the document length of a document d, defined P as the
total number of term occurrences in d, that is, t tft,d . [4] Manou & Associates Inc. Eurofarsi experiment.
However, the maximum likelihood estimate has two pri- http://www.iranonline.com/eurofarsi/
mary shortcomings in this context. First, the estimate will experiment.html.
assign a probability of zero to a document if one or more
query terms are not present, and second, documents only [5] Neda Rayaneh Institute. Persian nimrooz. http:
contain sparse data [12, 20, 15]. //neda.net.ir/downloads/fonts.shtml.
To solve these problems, smoothing techniques are em-
ployed. The general form of a smoothing technique can [6] ISIRI. Institute of standards and industrial research
be expressed as of Iran. http://www.isiri.org/.
ps (t|d) if t occurs in d [7] Shereen Khoja. Stemming. http://www.comp.
p(t|d) =
αd p(t|C) otherwise lancs.ac.uk/computing/users/khoja/
where ps (t|d) is the smoothed version of the maximum index.htm#stemming.
likelihood estimate, p(t|C) is the “collection language
model” and coefficient αd determines the amount of [8] Tore Kjeilen, Sidahmed Abubakr, and D. Josiya Ne-
probability transferred to terms not in the query. A gahban. Encyclopedia of the orient. http://
number of techniques have been proposed to define i-cias.com/e.o/.
these expressions [2, 20, 21, 12]. We plan to investigate [9] Leah Larkey and Margaret Connell. Arabic infor-
several of these to see how they impact effectiveness mation retrieval at umass in trec-10. In The Tenth
of retrieval on Farsi text, especially those which move Text Retrieval Conference, TREC 2001 NIST Spe-
beyond the independence assumption of the unigram cial Publication 500-250, pages 562–570, 2002.
model [15, 10].
[10] D. H. Miller, T. Leek, and R Schwartz. A hidden
8 Conclusion and Future Work markov model information retrieval system. In SI-
The explosion in availability of Farsi websites and the GIR’99, pages 214–221.
eventual electronic conversion of Farsi literature are the
[11] Terence J. Parr and Thomas Moog. Pccts 1.3mr33
driving forces behind our project. We believe standard
toolkit. http://www.polhode.com/pccts.
fonts, operating system independence, and input/display
html.
technology are critical factors in future development of
search tools for Farsi particularly. Because of the close [12] Jay Ponte and W. Bruce Croft. A language model-
relationship between Arabic and Farsi, our research can ing approach to information retrieval. In SIGIR’98,
indirectly benefit Arabic as well. For example, over 40% pages 275–281, 1998.
of all words in Farsi are Arabic, and thus one should ex-
pect common research issues between our stemmer and [13] M.F. Porter. An algorithm for suffix stripping. Pro-
an Arabic stemmer [7]. gram, 14, 1980.
[14] Roozbeh Pournader. Farsi keyboard.
http://lists.sharif.edu/pipermail/
persiancomputing/2000-August/000129.
html.
[15] Fei Song and W. Bruce Croft. A general language
model for information retrieval. In SIGIR’99, pages
279–280, 1999.
[16] Hughes Technologies. libhttpd. http://www.
Hughes.com.au.
[17] David Turner, Robert Wilhelm, and Werner Lem-
berg. Freetype project. http://freetype.
sourceforge.net.
[18] Columbia University. Cli processing rou-
tines. ftp://ftp.cc.columbia.edu/mm/
mm-ccmd-0.91.tar.
[19] I. Witten, A. Moffat, and T. Bell. Managing Gi-
gabytes: Compressing and indexing documents and
images. Morgan Kaufmann, 2nd edition, 1999.
[20] ChengXian Zhai and John Lafferty. A study of
smoothing methods for language models applied to
ad hoc information retrieval. In SIGIR’01, pages
334–342, 2001.
[21] ChengXiang Zhai and John Lafferty. Two-stage
language models for information retrieval. In SI-
GIR’02, pages 49–56.