lucene scanning

Hi,

1. Does every search query scan all documents in the Lucene index?

2. I want my queries to run faster. My thought is that if I could reduce the number of documents Lucene has to search for a given set of search parameters, it would be a little faster.

Can we define a subset (sub-indexes), or otherwise tell Lucene which subset to scan, so that the search is more efficient (the smaller the subset to scan, the faster the result)? I do not want to use a filtered query, as it will search the index twice, and I have no need to scan again based on the same set of parameters. I mean that every query will be unique.

Thanks, Suman
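On question 1, a small sketch may help. Lucene's index is an inverted index, so a term query reads the posting list of each query term rather than visiting every document. The class below is a toy illustration in plain Java, not Lucene code; TinyInvertedIndex and its methods are hypothetical names made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class TinyInvertedIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text on whitespace and records the document id under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the documents containing every term; only those terms' lists are read. */
    SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> list = postings.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(list);   // start from the first term's list
            } else {
                result.retainAll(list);         // intersect with each further list
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, "lucene index scanning");
        index.add(1, "faster search with lucene");
        index.add(2, "unrelated document");
        System.out.println(index.searchAll("lucene", "search")); // prints [1]
    }
}
```

Document 2 is never touched by the query above, because neither query term points at it. The work grows with the length of the posting lists involved, not with the total number of documents.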

Wanting batch update to avoid high disk usage

Don’t bother calling expungeDeletes() so often; it makes no sense. Call it once at the end instead. That said, you are calling optimize() at the end anyway, which takes care of the deleted documents itself, so adding a call to expungeDeletes() should make no difference other than degrading performance.

slow search threads during a disk copy

Hi all,

We’re observing search threads slowing down during directory copies performed during updates to the index. The thread dump shows search threads blocked on a FSDirectory$FSIndexInput$Descriptor instance:

"Worker Thread - 12" daemon prio=10 tid=0x082b2400 nid=0x4654 waiting for monitor entry [0x988ed000..0x988edf30]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:542)
    - waiting to lock (a org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:133)
    at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.read(MultiSegmentReader.java:573)
    at org.apache.lucene.search.TermScorer.next(TermScorer.java:106)
    at org.apache.lucene.search.ConjunctionScorer.init(ConjunctionScorer.java:80)
    at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
    at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:319)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
    at org.apache.lucene.search.Searcher.search(Searcher.java:118)

The directory copy we do is a ‘cp -pr /* ‘ command prior to applying changes (addDocument calls) on the current “available” segment. This takes >7 mins to copy a directory of size 1.4G. During this time window, the searches are slow and the above thread stacks are observed.

Could there be any system level limits we’re hitting?

Our test environment is: lucene-2.3.2, 4×2.6 GHz, 16G memory, Red Hat 3.4.6-9

Thanks and regards,

– Gagan
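A note on the trace: the search thread is waiting for a monitor inside readInternal, which suggests that reads through one open file are serialized (seek, then read, under a lock on the shared descriptor). If the disk is busy serving the copy, whichever thread holds that lock is slow and the rest queue up behind it. The sketch below is plain Java, not Lucene's code. It contrasts that locked seek-then-read pattern with a positional read that needs no shared file pointer. SharedFileReads and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SharedFileReads {
    /** Seek-then-read on a shared handle: the lock has to cover both steps. */
    static int lockedRead(RandomAccessFile file, long pos, byte[] dst) throws IOException {
        synchronized (file) {
            file.seek(pos);
            return file.read(dst, 0, dst.length);
        }
    }

    /** Positional read: the offset is an argument, so no shared file pointer is moved. */
    static int positionalRead(FileChannel channel, long pos, byte[] dst) throws IOException {
        return channel.read(ByteBuffer.wrap(dst), pos);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".bin");
        try {
            Files.write(tmp, new byte[] {10, 20, 30, 40, 50});
            byte[] a = new byte[2];
            byte[] b = new byte[2];
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r");
                 FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                lockedRead(raf, 3, a);      // other threads wait while this runs
                positionalRead(ch, 3, b);   // can overlap with other positional reads
            }
            System.out.println(a[0] + "," + a[1] + " " + b[0] + "," + b[1]);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Removing the lock would not make the disk any faster while cp is running, so this is only one piece of the picture. As far as I know, later Lucene releases include an NIO-based directory (NIOFSDirectory) built around positional reads, but whether it helps in this setup is untested here.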

) not deleting

On Aug 22, 2010, at 1:47 PM, Erick Erickson wrote:

I originally wrote:

So the answer is “yes.”

Yes.

Never mind. I figured it out. (Don’t you hate it when you can’t figure something out, you write up a detailed question, post it, then go off and figure it out afterwards?)

The problem was the directory was being stored in the index like:

/path/to/file/

(with the trailing slash). The delete query, however, didn’t have the trailing slash since File.getAbsolutePath() doesn’t return trailing file separator characters. D’oh!

- Paul
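For anyone hitting the same mismatch, here is a minimal sketch of one way to avoid it: build the key through java.io.File both at index time and at delete time, so the two strings agree. This is plain Java with no Lucene calls. PathKeys and key() are hypothetical names, not part of the original code.

```java
import java.io.File;

final class PathKeys {
    /** Absolute path as java.io.File reports it, i.e. without a trailing separator. */
    static String key(String rawPath) {
        return new File(rawPath).getAbsolutePath();
    }

    public static void main(String[] args) {
        String indexed = "/path/to/file/";   // what ended up in the index
        String atDelete = new File("/path/to/file").getAbsolutePath();

        // The raw strings differ, so an exact term match on them finds nothing.
        System.out.println(indexed.equals(atDelete));

        // Running both through the same helper makes the keys agree.
        System.out.println(key(indexed).equals(atDelete));
    }
}
```

Since NOT_ANALYZED fields are matched as exact terms, a single trailing character is enough to make a delete silently match nothing.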

) not deleting

Did you commit (or close) the IndexWriter after you deleted the documents? I’m also assuming that nothing really weird happened, like a case change. Your NOT_ANALYZED should take care of that at index time, but are you sure the cases match when you submit your term queries?

An interesting test would be to write out the file names you create your terms from and see what happens if you search on those fields.

HTH Erick

On Sun, Aug 22, 2010 at 12:24 PM, Paul J. Lucas wrote:

) not deleting

Hi -

Using Lucene 2.9.3, I’m indexing the metadata in image files. For each image (“document” in Lucene), I have 2 additional special fields: “FILE-PATH” (containing the full path of the file) and “DIR-PATH” (containing the full path of the directory the file is in).

The FILE-PATH Field is created only once like:

private final Field m_fieldFilePath = new Field( "FILE-PATH", "INIT", Field.Store.YES, Field.Index.NOT_ANALYZED );

and reused; the DIR-PATH Field is created once per document like:

new Field( "DIR-PATH", file.getParentFile().getAbsolutePath(), Field.Store.NO, Field.Index.NOT_ANALYZED )

(The reason the DIR-PATH Field is created once per document is because it’s part of indexing the rest of the image metadata and isn’t a special-case like FILE-PATH. I don’t believe this is relevant to the problem at hand, however.)

If an image file (or an entire directory of image files) gets deleted, I need to delete it (them) from the index. When deleting a single image, I could do:

Term fileTerm = new Term( "FILE-PATH", file.getAbsolutePath() );
writer.deleteDocuments( new TermQuery( fileTerm ) );

When deleting an entire directory of images, I could do:

Term dirTerm = new Term( "DIR-PATH", file.getAbsolutePath() );
writer.deleteDocuments( new TermQuery( dirTerm ) );

However, at the time of deletion, I don’t know whether “file” refers to a single image file or to a directory of image files. I can’t do file.isFile() or file.isDirectory() because “file” no longer exists (it was deleted). So to cover both cases, I do:

Query[] queries = new Query[]{ new TermQuery( fileTerm ), new TermQuery( dirTerm ) };
writer.deleteDocuments( queries );

I have non-Lucene code that monitors the filesystem for changes. For Mac OS X, I can only get directory-level change notifications. So if a file is deleted from a directory, I get a notification that the directory has changed. So I delete all the documents in that directory then re-add them.

However (and here’s the problem), the deletes never happen. If I delete a file from a directory, the directory is (it looks like) unindexed and reindexed, but a query for that image file still returns a result. So it’s like the delete never happened.

Why not?

Additional information: I create/close a new IndexWriter for the delete. Even if I quit the application, relaunch, and run the query, the result still shows up (hence it’s not that the current reader isn’t seeing the deletion change).

- Paul

How to convert WAR application into console application (Making Unicorn as console application)

Hi all, with Unicorn you just provide a URI and push the button. It will call a series of validation services and report the results. I have already downloaded and installed Unicorn. The source code is only available from the Mercurial repository. To download it, use the command “hg clone https://dvcs.w3.org/hg/unicorn”.

To compile Unicorn, Apache Ant and Ivy are required. From Unicorn’s directory, run: “ant retrieve generate_observer generate_tasklist default_conf war”

It works fine in Apache Tomcat. What I want to know is how to make Unicorn a console application. The input should be passed through command-line arguments and the output should be displayed on the console itself. I don’t want to use any web server to deploy it.

asking about incremental update

Hello all, you may remember me as the one who asked how to understand Lucene in a previous email. I have now been able to create a sample Lucene application. I read the book and was able to test it, which to me is great, as I am a new learner.

Here is my proof:

http://jacobian.web.id/2010/08/09/how-to-use-lucene-part-1/

But now I am taking Lucene to a higher level: I was tasked with creating an index that can update itself, which is called incremental update. Basically, Lucene will index the text files periodically, storing the index in memory first and then, after a while, on the hard disk.

Can anyone give me an idea of how this can be done? Maybe there is a sample application out there that I have missed but that would help me learn about incremental update.

Any help would be greatly appreciated.
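One common reading of “incremental update” is to re-index only the files that changed since the previous pass. Below is a minimal sketch of that bookkeeping in plain Java, with no Lucene calls. ChangedFileTracker, collectChanged() and the .txt filter are assumptions made for illustration. Each file it returns would then be handed to your own indexing code. As for memory first and disk later, IndexWriter already buffers added documents in RAM and flushes them to the index directory as it goes, so that part may not need to be built by hand.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangedFileTracker {
    // absolute path -> last-modified timestamp seen on the previous pass
    private final Map<String, Long> lastIndexed = new HashMap<>();

    /** Returns the .txt files that are new or modified since the last call, and records them. */
    List<File> collectChanged(File dir) {
        List<File> changed = new ArrayList<>();
        File[] files = dir.listFiles((d, name) -> name.endsWith(".txt"));
        if (files == null) {
            return changed;             // not a directory, or not readable
        }
        Arrays.sort(files);             // stable order between passes
        for (File f : files) {
            long stamp = f.lastModified();
            Long seen = lastIndexed.put(f.getAbsolutePath(), stamp);
            if (seen == null || seen != stamp) {
                changed.add(f);
            }
        }
        return changed;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("incremental").toFile();
        File doc = new File(dir, "a.txt");
        Files.write(doc.toPath(), "hello".getBytes());

        ChangedFileTracker tracker = new ChangedFileTracker();
        System.out.println(tracker.collectChanged(dir).size()); // first pass: the new file
        System.out.println(tracker.collectChanged(dir).size()); // second pass: nothing changed
    }
}
```

A timer or scheduled task could call collectChanged() periodically. Files that were deleted are not handled in this sketch and would need their own pass over the recorded paths.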

Sorting a Lucene index

Hi Anshum,

I require sorted results for all my queries, and the field on which I need sorting is fixed, so this led me to the idea of storing the documents in sorted order to avoid the sorting cost with every query.

Thanks and Regards,

Shelly Singh Center For Knowledge Driven Information Systems, Infosys Email: shelly_singh@infosys.com Phone: (M) 91 992 369 7200, (VoIP)2022978622

Solr SynonymFilter in Lucene analyzer

I am trying to get multi-word synonyms to work in Lucene using Solr’s SynonymFilter.

I need to match synonyms at index time, since many of the synonym lists are huge. Actually they are really not synonyms, but are words that belong to a concept. For example, I would like to map {“New York”, “Los Angeles”, “New Orleans”, “Salt Lake City”…}, a bunch of city names, to the concept called “city”. While searching, the user query for the concept “city” will be translated to a keyword like, say “CONCEPTcity”, which is the synonym for any city name.

Using Lucene’s SynonymAnalyzer, as explained in Lucene in Action (p. 131), all I could match for “CONCEPTcity” was single-word city names like “Chicago”, “Seattle”, “Boston”, etc. It would not match multi-word city names like “New York”, “Los Angeles”, etc.

I tried using Solr’s SynonymFilter in the tokenStream method of a custom Analyzer (that extends org.apache.lucene.analysis.Analyzer, Lucene ver. 2.9.3) using:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SynonymFilter( new WhitespaceTokenizer(reader), synonymMap);
    return result;
}

where synonymMap is loaded with synonyms using

synonymMap.add(conceptTerms, listOfTokens, true, true);

where conceptTerms is of type ArrayList and holds all the terms in a concept, and listOfTokens is of type List and contains only the generic synonym identifier, like CONCEPTcity.

When I print synonymMap using synonymMap.toString(), I get the output like

<{New York=<{Chicago=<{Seattle=<{New Orleans=….}>}>}>….}>

so it looks like all the synonyms are loaded. But if I search for “CATEGORYcity”, it says no matches were found. I am not sure whether I have loaded the synonyms correctly in the synonymMap.

Any help will be deeply appreciated. Thanks!
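A sketch of the matching idea may be useful here. A whitespace tokenizer splits “New York” into two tokens, so a multi-word entry has to be matched as a sequence of tokens, with each phrase registered on its own. The class below is plain Java and is not Solr’s SynonymFilter. ConceptTagger, addPhrase() and tag() are hypothetical names. It does a greedy longest match and keeps the original tokens next to the concept token. One guess, untested: the nested toString() output above may mean the whole concept list was registered as a single long sequence rather than as separate phrases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConceptTagger {
    // token sequence of one phrase -> concept token
    private final Map<List<String>, String> phrases = new HashMap<>();
    private int maxLen = 0;

    /** Registers ONE phrase (for example "New York") as its own token sequence. */
    void addPhrase(String phrase, String concept) {
        List<String> tokens = Arrays.asList(phrase.trim().split("\\s+"));
        phrases.put(tokens, concept);
        maxLen = Math.max(maxLen, tokens.size());
    }

    /** Emits the original tokens and, after each matched phrase, its concept token. */
    List<String> tag(String text) {
        List<String> in = Arrays.asList(text.trim().split("\\s+"));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            int matched = 0;
            String concept = null;
            // Try the longest candidate first so "Salt Lake City" wins over a shorter match.
            for (int len = Math.min(maxLen, in.size() - i); len >= 1; len--) {
                concept = phrases.get(in.subList(i, i + len));
                if (concept != null) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                out.add(in.get(i));
                i++;
            } else {
                out.addAll(in.subList(i, i + matched));
                out.add(concept);
                i += matched;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ConceptTagger tagger = new ConceptTagger();
        tagger.addPhrase("New York", "CONCEPTcity");
        tagger.addPhrase("Salt Lake City", "CONCEPTcity");
        tagger.addPhrase("Chicago", "CONCEPTcity");
        System.out.println(tagger.tag("flights from New York to Chicago"));
        // prints [flights, from, New, York, CONCEPTcity, to, Chicago, CONCEPTcity]
    }
}
```

In a real analyzer the concept token would normally share a position with the phrase instead of following it. The flat list here only shows which tokens get matched.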