Combining multi-word Lucene Search terms in GuideStar Search V1_1 API

Document created by JackCowardin Administrator on Oct 30, 2017. Last modified by JackCowardin Administrator on Nov 9, 2017.
Version 3

A GuideStar Search API customer recently asked this question:

 

Hi Jack,

Hope you are doing good. The SOAP to REST feature is working fine on our end. Just today we noticed that some of the users are not able to search the charities which they previously used to donate. Can you please look into this.

Volunteers of America of Michigan  (ein 38-1566662)

First Presbyterian Church of Charlotte  (ein 56-0529970)

The search is by organization name. 

 

Smitha Mahishi

.NET Developer II | Business Technology Solutions (BTS)

 

Answering this question prompted a look at how the different ways of combining search terms work in Lucene. Each separate word in a Lucene search query is considered a search "term". By default, each term is used to collect records from the "keyword" index, which is built by indexing terms in several data fields, and each of these fields has its own ranking when the index is built. In the case of "first presbyterian church of charlotte", Lucene will try to find records that have all of these words. If the terms are combined into a search phrase by surrounding them with double quotes ("), then Lucene will select results that have all of those terms in that order.

 

If, on the other hand, search terms are combined with an "AND", then Lucene will find records that have all of those individual terms with no regard for order.
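The difference between the two forms can be sketched in a few lines of Python. This is only an illustration: the helper names phrase_query and and_query are hypothetical, the AND form mirrors the query shape shown later in this document, and escaping of special characters inside the name is not handled.

```python
def phrase_query(field, text):
    """Quote the whole text so the terms must appear together, in order."""
    return f'{field}:"{text}"'


def and_query(field, text):
    """Join the individual words with AND so all must appear, in any order."""
    terms = text.split()
    return f"{field}:" + " AND ".join(terms)


name = "First Presbyterian Church"
print(phrase_query("organization_name", name))
print(and_query("organization_name", name))
```

The first call produces a single quoted phrase; the second produces one term per word joined by AND.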

 

Here's what I found trying various query structures, and how I replied to Smitha's question:

 

Hi Smitha – There are nearly 800 “First Presbyterian Church” entries in GuideStar. The one in question is actually listed only as “First Presbyterian Church”, as most of them are, rather than as “First Presbyterian Church of Charlotte”. Here’s what you are looking for:

 

<hit>

<organization_id>7295729</organization_id>

<ein>25-6875685</ein>

<organization_name>FIRST PRESBYTERIAN CHURCH TR</organization_name>

<mission/>

<city>Charlotte</city>

<state>NC</state>

<zip>28262</zip>

 

So you also need to provide the “city” to identify a particular “First Presbyterian Church”. But Lucene operates on search “terms”. Each word is a term. So “First” is a term, “Presbyterian” is a term, and so is “Church”.

 

You can combine these into a search “phrase” by surrounding them with double quotes (“). But in this case, it’s better to AND the terms together, then add the “city:Charlotte” parameter.

 

Separating the search terms as “organization_name:First AND Presbyterian AND Church” and then adding the “city:Charlotte” term results in finding this organization as the first result of the search.

 

Here’s the URL:

 

https://data.guidestar.org/v1_1/search?q=organization_name:first AND presbyterian AND church AND city:charlotte
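When a query like this is sent from code rather than pasted into a browser, the spaces (and any double quotes) usually need to be percent-encoded. A minimal Python sketch follows, assuming the endpoint accepts a standard percent-encoded q parameter; the function name search_url is hypothetical, and authentication and other request details are left out.

```python
from urllib.parse import quote

# Base endpoint as it appears in the URL above.
BASE_URL = "https://data.guidestar.org/v1_1/search"


def search_url(query):
    """Build a search URL with the Lucene query percent-encoded.

    Spaces become %20 and double quotes become %22; the colon is kept
    as-is so the field:term form stays readable.
    """
    return f"{BASE_URL}?q={quote(query, safe=':')}"


print(search_url("organization_name:first AND presbyterian AND church AND city:charlotte"))
```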

 

 

The same is true for “Volunteers of America”. A search for “Volunteers of America of Michigan” results in almost 73,000 hits. In this case, simply grouping the search terms together as a “phrase” by surrounding the search terms with double quotes, like this:

 

https://data.guidestar.org/v1_1/search?q=organization_name:"Volunteers of America of Michigan"

 

returns the one you want as the first result. That’s because the organization name is almost an exact match, and the search terms are adjacent.

 

<hit>

<organization_id>7527898</organization_id>

<ein>38-1566662</ein>

<organization_name>Volunteers of America of Michigan, Inc.</organization_name>

<mission>

Volunteers of America Michigan, Inc. is a charitable organization founded and driven on Christian values and principles. Our programs address the needs of our community’s disadvantaged, focusing on all levels of our continuum of care. We believe in the dignity of every person and strive to achieve self-sufficiency. We encourage service to God, by serving others.

</mission>

 

Using “AND” to combine the search terms does not work as well for this organization, as running the query below will show:

 

https://data.guidestar.org/v1_1/search?q=organization_name:Volunteers AND of AND America AND of AND Michigan

 

So deciding whether to combine search terms into a phrase with quotes or to AND them together is not an easy decision to make programmatically. But as you can see, ANDing the terms together, with an additional discriminator (“city:charlotte”), works better where the search query is for a very common phrase.

 

Combining discrete terms into a phrase with double quotes works better if the terms appear in the same sequence in the organization name.

 

I realize that this does not give you specific guidance, but maybe you can implement a design that takes these facts of Lucene searching into account.

 

Regards,

Jack
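One possible design along the lines suggested in the reply is to generate both query forms and try them in order, taking the first that returns an acceptable top hit. The Python sketch below only builds the candidate query strings. The function name candidate_queries and the "phrase first, then AND plus city" ordering are assumptions for illustration, not GuideStar guidance, and a multi-word city would need extra handling.

```python
def candidate_queries(name, city=None):
    """Return Lucene query strings to try in order.

    Assumed heuristic: try the exact phrase first, then fall back to
    ANDed terms, narrowed by city when one is known.
    """
    queries = [f'organization_name:"{name}"']
    and_form = "organization_name:" + " AND ".join(name.split())
    if city:
        and_form += f" AND city:{city}"
    queries.append(and_form)
    return queries


for q in candidate_queries("First Presbyterian Church", city="Charlotte"):
    print(q)
```

A caller would send each query in turn and compare the returned EIN or name against what the user expects before accepting a result.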
