Apache Solr Search with Solr > 4.7

Solr 4.x brings a plethora of improvements over 3.x and 1.x. All our new projects use 4.x, and we try to upgrade existing client implementations where and when possible. Last week we upgraded another client. The transition was smooth, except for odd entries in the indexing log and, as it turned out, nodes missing from the index.

java.lang.Thread.run(Thread.java:745)
Caused by:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="sm_field_body"(whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[32, 67, 97, 116, 104, 101, 114, 105, 110, 101, 32, 66, 101, 97, 114, 100, 115, 104, 97, 119, 32, 67, 97, 116, 104, 101, 114, 105, 110, 101]...', original message: bytes can be at most 32766 in length; got 108809
[...]
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 108809

As it happens, the Solr notes on upgrading from prior versions contain the following:

Prior to Solr 4.8, terms that exceeded Lucene’s MAX_TERM_LENGTH were silently ignored when indexing documents. Beginning with Solr 4.8, an error will be generated when attempting to index a document with a term that is too large. If you wish to continue to have large terms ignored, use solr.LengthFilterFactory in all of your Analyzers. See LUCENE-5472 for more details.

Field names in the Drupal Apache Solr module carry a short prefix that maps each field to a Solr dynamic field, following the Solr convention. For example, ss_ means “single-value string field” and sm_ means “multi-value string field”.

In our case sm_field_body, like every sm_* field, is declared as a solr.StrField. Such fields are not analyzed, so each value is indexed as is, as a single term. Previously, values larger than the 32766-byte limit were silently ignored. Now they raise the error above.
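To see which values would trip the limit before they reach Solr, it helps to measure them in bytes, since that is the unit the exception reports. The following is a minimal sketch, not part of the apachesolr module: the function name, the constant, and the plain field name => values array are assumptions made for illustration.

```php
<?php
// Per-term ceiling in bytes, taken from the exception message above.
define('EXAMPLE_MAX_TERM_BYTES', 32766);

/**
 * Returns the byte length of every value that exceeds the limit.
 *
 * Hypothetical helper. $fields is assumed to be an array of
 * field name => string or list of strings.
 */
function example_find_immense_values(array $fields, $max_bytes = EXAMPLE_MAX_TERM_BYTES) {
  $oversized = array();
  foreach ($fields as $name => $values) {
    foreach ((array) $values as $index => $value) {
      // strlen() counts bytes, not characters.
      $bytes = strlen($value);
      if ($bytes > $max_bytes) {
        $oversized[$name][$index] = $bytes;
      }
    }
  }
  return $oversized;
}
```

Called with a 108809-byte body value, it would flag that value and leave shorter ones alone, which makes it easy to log offending nodes during a dry run.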

In a typical Solr configuration, a StrField can be truncated using TruncateFieldUpdateProcessorFactory:

<processor class="solr.TruncateFieldUpdateProcessorFactory">
  <str name="typeClass">solr.StrField</str>
  <int name="maxLength">100</int>
</processor>

However, since the sm_* fields are not processed this way in our setup, we need a different solution, one that does not involve modifying the core Solr configuration. It comes as a simple hook implementation in a custom module.

<?php

/**
 * Implements hook_apachesolr_index_documents_alter().
 *
 * Fix for https://issues.apache.org/jira/browse/LUCENE-5472
 */
function apachesolr_tweaks_apachesolr_index_documents_alter(array &$documents, $entity, $entity_type, $env_id) {
  foreach ($documents as $id => $document) {
    // Skip documents that have no body values to truncate.
    if (empty($documents[$id]->sm_field_body)) {
      continue;
    }
    // Truncate each value so it stays under Lucene's term length limit.
    foreach ($documents[$id]->sm_field_body as $index => $value) {
      $documents[$id]->sm_field_body[$index] = truncate_utf8($value, 31000);
    }
  }
}

The above hook_apachesolr_index_documents_alter() implementation looks specifically at sm_field_body, as that was the culprit in our case. You can follow the same approach to alter any indexed field.
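One caveat worth checking: the limit in the exception is measured in bytes of UTF-8. If truncate_utf8() counts characters in your Drupal version (it appears to in Drupal 7), 31000 characters of mostly multibyte text could still exceed 32766 bytes. Below is a minimal sketch of a byte-based alternative that also covers every field with a given prefix. Both function names are hypothetical, the plain field name => values array is an assumption for illustration, and the code assumes the mbstring extension is available.

```php
<?php
/**
 * Truncates a string to at most $max_bytes bytes of UTF-8.
 *
 * Hypothetical helper, not part of Drupal or the apachesolr module.
 */
function apachesolr_tweaks_truncate_bytes($value, $max_bytes = 32766) {
  // strlen() returns bytes, so values within the limit pass through untouched.
  if (strlen($value) <= $max_bytes) {
    return $value;
  }
  // mb_strcut() cuts by byte offset without splitting a multibyte character.
  return mb_strcut($value, 0, $max_bytes, 'UTF-8');
}

/**
 * Applies the byte limit to all values of fields whose name has a prefix.
 *
 * Hypothetical helper. $fields is assumed to be an array of
 * field name => list of string values.
 */
function apachesolr_tweaks_truncate_fields(array $fields, $prefix = 'sm_', $max_bytes = 32766) {
  foreach ($fields as $name => $values) {
    // Leave fields with a different prefix alone.
    if (strpos($name, $prefix) !== 0) {
      continue;
    }
    $fields[$name] = array_map(function ($value) use ($max_bytes) {
      return apachesolr_tweaks_truncate_bytes($value, $max_bytes);
    }, (array) $values);
  }
  return $fields;
}
```

Inside the hook, apachesolr_tweaks_truncate_bytes($value) could stand in for the truncate_utf8($value, 31000) call. Cutting by bytes trades a slightly less predictable character count for a guarantee that matches what Lucene measures.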

Once the module was deployed, indexing continued without issues, and all of the content is now searchable.