
• QueryParser's static parse method is thread-safe: it creates a new QueryParser instance each time it is called. Individual QueryParser instances, however, should not be shared across threads.

• IndexSearcher is thread-safe. Multiple search threads may access the index concurrently
without any problems.

• If you call IndexWriter.addDocument(doc) with a document that is already in the index, the index will contain multiple copies of that document; Lucene does not detect duplicates.

• Question: How do I restrict searches to only return results from a limited subset of
documents in the index (e.g. for privacy reasons)?

• The QueryFilter class is designed precisely for such cases. Another way of doing it is the following:

• Just before calling IndexSearcher.search(), add a clause to the query that excludes documents in categories not permitted for this search (see the sketch after this list).
• The query parser always returns a BooleanQuery, so you can simply add required or prohibited
terms to restrict access to documents that have (or don't have) these terms.
• If you are restricting access with a prohibited term, and someone tries to require that term, then
the prohibited restriction wins.
• If you are restricting access with a required term, and they try prohibiting that term, then they
will get no documents in their search result.
• As for deciding whether to use required or prohibited terms, if possible, you should choose the
method that names the less frequent term. That will make queries faster.
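A minimal sketch of this approach against the Lucene 2.x API. The "contents" field, the "category" field and its "public" value, and the userInput and searcher variables are invented for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    // Parse the user's query, then AND it with a required category term so
    // only permitted documents can match.
    Query userQuery = new QueryParser("contents", new StandardAnalyzer()).parse(userInput);
    BooleanQuery restricted = new BooleanQuery();
    restricted.add(userQuery, BooleanClause.Occur.MUST);
    restricted.add(new TermQuery(new Term("category", "public")), BooleanClause.Occur.MUST);
    Hits hits = searcher.search(restricted);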

• Deleting documents:
• If you know the document number of a document that you want to delete you may use:
IndexReader.deleteDocument(docNum)
• That will delete the document numbered docNum from the index. Once a document is deleted it
will not appear in TermDocs nor TermPositions enumerations.
• Attempts to read its field with the document method will result in an error. The presence of this
document may still be reflected in the docFreq statistic, though this will be corrected eventually
as the index is further modified.
• If you want to delete all (1 or more) documents that contain a specific term you may use:
IndexReader.deleteDocuments(Term)
• This is useful if one uses a document field to hold a unique ID string for the document. To delete such a document, construct a term with the appropriate field and the unique ID string as its text and pass it to this method. Because a variable number of documents can be affected by the call, the method returns the number of documents deleted.
• If you delete by Term, be sure that the Field is Field.Keyword. In my experience, Fields added
to the index as Field.Text cannot be used for the purposes of deleting (Update: Fields in
Lucene 2 are created differently).
• Basically, in order to delete a Document, Lucene first has to find it (via IndexReader).
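A sketch of delete-by-unique-ID; the "id" field (assumed to have been indexed as Field.Keyword), the ID value, and the index path are invented:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Open the index, delete every document whose "id" term matches, then
    // close the reader to commit the deletions.
    IndexReader reader = IndexReader.open("/path/to/index");
    int deleted = reader.deleteDocuments(new Term("id", "doc-42"));
    reader.close();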

• Fields are returned in the same order they were added to the document.

• Question: How does one determine which documents do not have a certain term?
• There is no direct way of doing that. You could add a marker term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y" (sketched below).
• Note that for large collections this would be slow because of the high term frequency for term
"x".

• Question: How do I get the last document added that has a particular term?
• Call: TermDocs td = IndexReader.termDocs(Term);
• Then step through the enumeration and keep the last document it returns; document numbers come back in increasing order, so the last one belongs to the most recently added document containing the term.
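A sketch of that iteration, assuming an open IndexReader named reader and an example term:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Walk the TermDocs enumeration; the final doc number belongs to the
    // last-added matching document.
    TermDocs td = reader.termDocs(new Term("id", "doc-42"));
    int lastDoc = -1;
    while (td.next()) {
        lastDoc = td.doc();
    }
    td.close();
    Document d = (lastDoc != -1) ? reader.document(lastDoc) : null;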
• Question: Does MultiSearcher do anything particularly efficient to search multiple indices, or does it simply search one after the other?
• MultiSearcher searches indices sequentially. Multi-CPU machines would benefit from a multi-threaded version of MultiSearcher that performs multiple searches in parallel.

• Question: Is there a way to limit the size of an index?
• This question is often brought up because of the 2GB file size limit of some 32-bit operating systems.
• The easiest thing is to set IndexWriter.maxMergeDocs (see the sketch after this list).
If, for instance, you hit the 2GB limit at 8M documents, set maxMergeDocs to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. Lucene effectively rounds this down to the next lower power of IndexWriter.mergeFactor, so with the default mergeFactor of 10 and maxMergeDocs set to 7M, Lucene will generate a series of 1M-document indexes, since merging 10 of those would exceed the maximum.
• A slightly more complex solution:
Once you have added 7M documents, optimize the index and start a new one; this further minimizes the number of segments. Then use a MultiSearcher to search the indexes.
• An even more complex and optimal solution:
Write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and
represents the file as a series of files.
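A sketch of the maxMergeDocs approach; recent versions expose a setter (older ones a public maxMergeDocs field), and the path and analyzer are invented:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Cap merges so no single segment grows past 7M documents.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.setMaxMergeDocs(7000000);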

• Proximity operator:
• There is a variable called slop in PhraseQuery that allows you to perform NEAR/WITHIN-like queries. By default, slop is set to 0 so that only exact phrases will match. However, you can alter the value using the setSlop(int) method. For instance, with setSlop(3) a phrase query for "monkeys bananas" will also match documents containing "monkeys love bananas".
• There is currently no way to specify the slop in the query, although there has been some
discussion about it on the Lucene mailing list recently.
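A minimal sketch, assuming a "contents" field and an open searcher:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.PhraseQuery;

    // Phrase query with slop 3: "monkeys bananas" matches text such as
    // "monkeys love bananas".
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("contents", "monkeys"));
    pq.add(new Term("contents", "bananas"));
    pq.setSlop(3);
    Hits hits = searcher.search(pq);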
• Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are case
sensitive.
• That is because those types of queries are not passed through the Analyzer, which is the
component that performs operations such as stemming and lowercasing.
• The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. A simple workaround is to lowercase the entire query string before passing it to the query parser.
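A sketch of that workaround; the field name and the rawQuery variable are examples:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // Lowercase the raw query so wildcard, prefix, and fuzzy terms match
    // the lowercased terms the analyzer put into the index.
    Query q = new QueryParser("contents", new StandardAnalyzer())
            .parse(rawQuery.toLowerCase());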

• IndexWriter.addIndexes(Directory[]) method is a final synchronized method.

• Get all documents:
• Lucene does not support this out of the box, but there is a common workaround: add a field with a known, constant value to each document in the index. Searching for that field and value will then return all documents in the index (see the sketch below).
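A sketch in the Lucene 2.x API; the marker field name "all", its value, and the doc and searcher variables are invented:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.TermQuery;

    // At index time, tag every document with a constant, untokenized marker.
    doc.add(new Field("all", "1", Field.Store.NO, Field.Index.UN_TOKENIZED));

    // At search time, match the marker to retrieve every document.
    Hits all = searcher.search(new TermQuery(new Term("all", "1")));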

• For date ranges one can use a RangeQuery containing String representations of dates (DateTools.dateToString()) as the lower and upper terms. At least one of the terms must be non-null; pass null for one end if you only want to search before or after a given date.
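A sketch using DateTools and RangeQuery; the "date" field, the DAY resolution, and the startDate/endDate java.util.Date variables are assumptions:

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    // Inclusive range over dates indexed with DateTools at DAY resolution.
    String lower = DateTools.dateToString(startDate, DateTools.Resolution.DAY);
    String upper = DateTools.dateToString(endDate, DateTools.Resolution.DAY);
    Query q = new RangeQuery(new Term("date", lower), new Term("date", upper), true);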

• Document IDs change after merging indices or after document deletion.

• The write.lock is used to keep processes from concurrently attempting to modify an index.
It is obtained by an IndexWriter while it is open, and by an IndexReader once documents
have been deleted and until it is closed.

• The commit.lock file is used to coordinate the contents of the 'segments' file with the files
in the index. It is obtained by an IndexReader before it reads the 'segments' file, which
names all of the other files in the index, and until the IndexReader has opened all of these
other files.
• The commit.lock is also obtained by the IndexWriter when it is about to write the segments
file and until it has finished trying to delete obsolete index files.
• The commit.lock should thus never be held for long, since while it is obtained files are only
opened or deleted, and one small file is read or written.
• You can forcibly release the commit.lock using the IndexReader.unlock(Directory) method. This does, however, also delete the write.lock file.
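A sketch of clearing stale locks, which is only safe when no other process is using the index; the directory variable is assumed to be an open Directory:

    import org.apache.lucene.index.IndexReader;

    // Remove leftover lock files, e.g. after a crashed process.
    if (IndexReader.isLocked(directory)) {
        IndexReader.unlock(directory);
    }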

• The Segments file:
• All segments in the index are listed in the segments file. There is no hard limit on how many there can be.
• For an un-optimized index, the number of segments is proportional to the log of the number of documents in the index. An optimized index contains a single segment.

• Question: What happens when I open an IndexWriter, optimize the index, and then close
the IndexWriter? Which files will be added or modified?
• All of the segments are merged into a single new segment file. If the index was empty to begin
with, no segments will be created, only the segments file.

• If I have two indexes and use the MultiSearcher will it be faster than only one index with
all my documents?
• That depends on the environment where MultiSearcher is used.
• If you have a single computer with a single CPU, then it may actually be a bit slower. However, it
could be faster if you're either running on a multiple-CPU machine, or your MultiSearcher is
composed of RemoteSearchers, each on a different machine.
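A sketch of a MultiSearcher over two local indexes; the paths and the query variable are invented:

    import org.apache.lucene.search.*;

    // One searcher per index, combined behind a single MultiSearcher.
    Searchable[] searchables = {
        new IndexSearcher("/path/to/index1"),
        new IndexSearcher("/path/to/index2")
    };
    MultiSearcher searcher = new MultiSearcher(searchables);
    Hits hits = searcher.search(query);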

• Question: If I use a compound file-style index, do I still need to optimize my index?

• Yes. Each .cfs file created in the compound file-style index represents a single segment, which
means you can still merge multiple segments into a single segment by optimizing the index.

• Question: What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?
• When merging lots of indexes (more than the mergeFactor), the Directory-based method will
use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once,
while the IndexReader-based method requires that all indexes be open when passed.
• The primary advantage of the IndexReader-based method is that one can pass it IndexReaders
that don't reside in a Directory.

• Caching: Lucene does come with a simple cache mechanism, if you use Lucene Filters. The
classes to look at are CachingWrapperFilter and QueryFilter.
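A sketch of wrapping a QueryFilter in a CachingWrapperFilter so its bit set is reused across searches; the categoryQuery, userQuery, and searcher variables are placeholders:

    import org.apache.lucene.search.*;

    // The wrapped filter's results are cached per index reader.
    Filter cached = new CachingWrapperFilter(new QueryFilter(categoryQuery));
    Hits hits = searcher.search(userQuery, cached);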

• Lucene is not limited to English, nor any other language. To index text properly, you need to
use an Analyzer appropriate for the language of the text you are indexing. Lucene's default
Analyzers work well for English. There are a number of other Analyzers in Lucene
Sandbox, including those for Chinese, Japanese, and Korean.
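A sketch of indexing with a language-specific analyzer; GermanAnalyzer (from the sandbox/contrib analyzers) and the index path are just examples:

    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Use an analyzer that matches the language of the text being indexed.
    IndexWriter writer = new IndexWriter("/path/to/index", new GermanAnalyzer(), true);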

• Question: Can Lucene do a "search within search", so that the second search is constrained
by the results of the first query?
• Yes. There are two primary options:
• Use QueryFilter with the previous query as the filter. (you can search the mailing list archives
for QueryFilter and Doug Cutting's recommendations against using it for this purpose)
• Combine the previous query with the current query using BooleanQuery, using the previous
query as required.
• The BooleanQuery approach is the recommended one.
• Reviewing javadocs and previous posts, search refinement or 'search within search' is best done with a Filter. To fill the Filter's BitSet with the results of a search, using a HitCollector is the obvious solution. Unfortunately, when using a HitCollector one has to re-implement all the functionality the Hits class usually provides.
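A sketch of both options in the Lucene 2.x API, assuming previousQuery and currentQuery have already been built and searcher is open:

    import org.apache.lucene.search.*;

    // Recommended: AND the previous query with the current one.
    BooleanQuery combined = new BooleanQuery();
    combined.add(previousQuery, BooleanClause.Occur.MUST);
    combined.add(currentQuery, BooleanClause.Occur.MUST);
    Hits hits = searcher.search(combined);

    // Alternative: use the previous query as a filter.
    Hits filtered = searcher.search(currentQuery, new QueryFilter(previousQuery));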

• An index can be searched and optimized simultaneously.
