Methodology

I've created a test suite to take a Solr schema.xml and run it through its paces. The schema file is used to configure an instance of Solr that is fed a collection of one million documents up to 2,000 words in length each. Each document consists of random text with words chosen with an English-like frequency, and the total corpus is around 6GB in size.
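
A minimal sketch of how text with an "English-like" word frequency could be generated. This is not the test suite's actual code: it assumes a Zipf-style distribution (weight proportional to 1/rank), and the vocabulary, `zipf_weights` and `make_document` are hypothetical names used only for illustration.

```python
import random

def zipf_weights(vocab_size, exponent=1.0):
    """Weight for the word at rank r is proportional to 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]

def make_document(vocabulary, weights, max_words=2000, rng=random):
    """Build one document of 1..max_words words, sampled by frequency weight."""
    length = rng.randint(1, max_words)
    return " ".join(rng.choices(vocabulary, weights=weights, k=length))

# Hypothetical vocabulary, ordered from most to least frequent.
vocabulary = ["the", "of", "to", "in", "fish", "river", "schema", "index"]
weights = zipf_weights(len(vocabulary))

rng = random.Random(42)  # fixed seed so a run can be repeated
doc = make_document(vocabulary, weights, max_words=2000, rng=rng)
print(len(doc.split()), "words")
```

With a real word list the same approach means common words dominate the corpus, which is what makes term frequencies in the index look roughly realistic.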

Once the documents have been indexed, the suite simulates query load with 5 concurrent users performing a total of 10,000 queries (2,000 each). The terms in each query are chosen using the same technique as above, with various operators (AND/OR/NOT) thrown in for good measure. I've left out phrase queries for now because my "lean" ngram modifications mean they're no longer supported when searching the ngram field (I'm not sure we want to phrase search ngrams anyway...)
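
A rough sketch of what such a query generator and load driver might look like. It is an illustration under assumptions, not the suite itself: `make_query`, `run_user` and `run_load` are hypothetical names, the vocabulary is made up, and the `search` callable is a stub standing in for a real request to Solr.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

OPERATORS = ["AND", "OR", "NOT"]

def make_query(vocabulary, weights, max_terms=4, rng=random):
    """Pick 1..max_terms weighted-random terms and join them with random operators."""
    terms = rng.choices(vocabulary, weights=weights, k=rng.randint(1, max_terms))
    parts = [terms[0]]
    for term in terms[1:]:
        parts.append(rng.choice(OPERATORS))
        parts.append(term)
    return " ".join(parts)

def run_user(search, queries):
    """Run one simulated user's queries in order; return per-query latencies (seconds)."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append(time.perf_counter() - start)
    return latencies

def run_load(search, queries_per_user):
    """One thread per simulated user; returns (queries/second, mean latency in seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(queries_per_user)) as pool:
        results = list(pool.map(lambda qs: run_user(search, qs), queries_per_user))
    elapsed = time.perf_counter() - start
    latencies = [t for user in results for t in user]
    return len(latencies) / elapsed, sum(latencies) / len(latencies)

# Hypothetical vocabulary with Zipf-style weights (1 / rank).
vocabulary = ["the", "of", "to", "in", "fish", "river", "schema", "index"]
weights = [1.0 / rank for rank in range(1, len(vocabulary) + 1)]
rng = random.Random(7)

# 5 users; only 20 queries each here to keep the demo quick (the real run used 2,000).
queries_per_user = [
    [make_query(vocabulary, weights, rng=rng) for _ in range(20)] for _ in range(5)
]
qps, mean_latency = run_load(lambda query: None, queries_per_user)  # stub search
```

Each simulated user sends its next query as soon as the previous one returns, so throughput and response time are tightly coupled in this kind of test.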

Keeping in mind these are generated by choosing random words with the frequency that people actually use them (NSFW?):

Test cases

The test cases I've run so far are:

ngrams-5 -- The current Solr schema (from github/master) with ngram settings changed to only generate ngrams of length 3, 4 and 5. (schema.xml file)
ngrams-5-lean -- As above, but with further modifications to remove term vectors from the "content" field and positional information from ngrams. (schema.xml file)
no-ngrams -- The current Solr schema (from github/master) with all ngram fields removed. (schema.xml file)
no-ngrams-lean -- As above, but with term vectors removed. (schema.xml file)
wildcards-lean -- Shares the same schema as the above, but every term of every query has a "*" appended to turn it into a wildcard query. Our "if we didn't use ngrams..." case.
edge-ngrams-5 -- ngram settings generating ngrams of length 3, 4 and 5 plus edge ngrams generating left-anchored ngrams up to length 15.
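
To make the ngram settings above concrete, here is a small sketch of the substrings each setting would produce for a single term. It only illustrates the idea and is not how Solr's tokenizers are implemented; the function names are hypothetical, and the minimum edge ngram length of 3 is an assumption (the test case only states the maximum of 15).

```python
def ngrams(term, min_len=3, max_len=5):
    """All substrings of term whose length is between min_len and max_len."""
    return [
        term[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(term) - n + 1)
    ]

def edge_ngrams(term, min_len=3, max_len=15):
    """Left-anchored substrings (prefixes) of term, min_len..max_len characters long."""
    return [term[:n] for n in range(min_len, min(max_len, len(term)) + 1)]

print(ngrams("search"))
# ['sea', 'ear', 'arc', 'rch', 'sear', 'earc', 'arch', 'searc', 'earch']
print(edge_ngrams("search"))
# ['sea', 'sear', 'searc', 'search']
```

A six-letter word already yields nine ngrams at lengths 3 to 5, and every extra length adds more, which gives a feel for why raising the maximum ngram length inflates the index.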

I had intended to benchmark the github/master schema file (with its max ngram length of 15), but indexing performance was really poor (~6 documents/second, and at the 12% mark the index size was already 25GB!). This worried me enough that I didn't bother finishing the benchmark...

For whatever reason, I've used "lean" to signify schema changes that don't change the search semantics but reduce the overall size of the index on disk.

The setup

All benchmarks run on an otherwise idle AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ with 4GB RAM. About 3GB of that is free for the OS disk cache.
Search indexes are being served off a mirrored pair of crummy 5400RPM HDDs with LVM encryption. Pretty poor IO, but hopefully that doesn't matter for the sake of comparison...
Java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) Server VM (build 19.1-b02, mixed mode)
JVMs run with the following switches: "-Xmx1g", "-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps", "-XX:+UseParallelGC"

Indexing performance

Indexing is performed by two processes POSTing batches of 1,000 documents via the Solr web interface. The indexing process is CPU-bound in all cases tested so far.
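
As a sketch of the batching step, the code below splits documents into batches of 1,000 and builds a Solr XML `<add>` update message for each batch. It stops short of sending anything. The `id` field name, the helper names and the sample documents are assumptions for illustration; only the "content" field comes from the schema described here.

```python
import xml.etree.ElementTree as ET

def batches(items, size=1000):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def to_add_xml(docs):
    """Build a Solr XML <add> message from (doc_id, content) pairs."""
    add = ET.Element("add")
    for doc_id, content in docs:
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="id").text = str(doc_id)  # assumed field name
        ET.SubElement(doc, "field", name="content").text = content
    return ET.tostring(add, encoding="unicode")

# Hypothetical documents; the real corpus held one million.
documents = [(i, "random text %d" % i) for i in range(2500)]
payloads = [to_add_xml(batch) for batch in batches(documents, size=1000)]
print(len(payloads), "payloads")
```

Each payload would then be POSTed to Solr's update handler by one of the indexing processes.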

ngrams-5

Indexing performance:	175 documents/second
Indexing time:	5704 seconds
Total index size:	20GB

ngrams-5-lean

Indexing performance:	294 documents/second
Indexing time:	3403 seconds
Total index size:	5.4GB

no-ngrams

Indexing performance:	608 documents/second
Indexing time:	1643 seconds
Total index size:	8.8GB

no-ngrams-lean

Indexing performance:	977 documents/second
Indexing time:	1023 seconds
Total index size:	2.3GB

wildcards-lean

[Shares an index with no-ngrams-lean]

edge-ngrams-5

Indexing performance:	150 documents/second
Indexing time:	6633 seconds
Total index size:	20GB
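
The documents/second figures above follow from the indexing times and the one-million-document corpus. The sketch below redoes that arithmetic and also relates each index size to the roughly 6GB corpus. The dictionary layout and function names are just for illustration.

```python
TOTAL_DOCS = 1_000_000
CORPUS_GB = 6  # approximate size of the raw corpus

# name: (indexing time in seconds, index size in GB), copied from the figures above
indexing = {
    "ngrams-5": (5704, 20),
    "ngrams-5-lean": (3403, 5.4),
    "no-ngrams": (1643, 8.8),
    "no-ngrams-lean": (1023, 2.3),
    "edge-ngrams-5": (6633, 20),
}

def docs_per_second(seconds, total_docs=TOTAL_DOCS):
    """Average indexing throughput over the whole run."""
    return total_docs / seconds

def size_ratio(index_gb, corpus_gb=CORPUS_GB):
    """Index size as a multiple of the raw corpus size."""
    return index_gb / corpus_gb

for name, (seconds, index_gb) in indexing.items():
    print(f"{name:15s} {docs_per_second(seconds):6.1f} docs/s, "
          f"index is {size_ratio(index_gb):.2f}x the corpus")
```

The computed rates agree with the reported ones to within one document per second.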

JVM Memory usage

All tests show a similar pattern: reasonable GC activity while indexing, reduced while queries are running. Memory usage seems mostly unchanged by the different schema definitions.

Query performance

Note: for the ngrammed indexes, the ngram field was queried by repeating the original query (against the content field) and joining the two with OR. E.g.: content:fish OR ngram:fish.
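
A minimal sketch of that query rewriting, plus the "*" suffixing used for the wildcards-lean case. These helpers are hypothetical and assume queries are plain whitespace-separated terms and AND/OR/NOT operators. The parentheses are added so that multi-term queries keep their grouping; for a single term the result is equivalent to the example above.

```python
OPERATORS = {"AND", "OR", "NOT"}

def qualify(query, field):
    """Prefix every term (but not the operators) with the given field name."""
    return " ".join(
        tok if tok in OPERATORS else f"{field}:{tok}" for tok in query.split()
    )

def with_ngram_field(query):
    """Repeat the query against the content and ngram fields, joined with OR."""
    return f"({qualify(query, 'content')}) OR ({qualify(query, 'ngram')})"

def with_wildcards(query):
    """Append '*' to every term, as in the wildcards-lean test case."""
    return " ".join(
        tok if tok in OPERATORS else tok + "*" for tok in query.split()
    )

print(with_ngram_field("fish"))
# (content:fish) OR (ngram:fish)
print(with_ngram_field("fish AND chips"))
# (content:fish AND content:chips) OR (ngram:fish AND ngram:chips)
print(with_wildcards("fish NOT chips"))
# fish* NOT chips*
```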

5 concurrent searchers performing 2,000 queries each:

                                   wildcards-lean  ngrams-5  ngrams-5-lean  no-ngrams  no-ngrams-lean  edge-ngrams-5
Queries per second:                7               11        16             27         30              9
Average query response time (ms):  699             393       272            159        148             450
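
As a sanity check on these figures: if each of the 5 searchers sends its next query as soon as the previous one returns (an assumption about the suite, not something confirmed here), then throughput cannot exceed the number of searchers divided by the mean response time. The sketch below computes that ceiling for each test case. The names are illustrative.

```python
CONCURRENT_USERS = 5

# name: (reported queries/second, reported mean response time in ms)
reported = {
    "wildcards-lean": (7, 699),
    "ngrams-5": (11, 393),
    "ngrams-5-lean": (16, 272),
    "no-ngrams": (27, 159),
    "no-ngrams-lean": (30, 148),
    "edge-ngrams-5": (9, 450),
}

def throughput_ceiling(users, mean_response_ms):
    """Upper bound on queries/second when users never wait between queries."""
    return users / (mean_response_ms / 1000.0)

ceilings = {
    name: throughput_ceiling(CONCURRENT_USERS, ms)
    for name, (_, ms) in reported.items()
}

for name, (qps, _) in reported.items():
    print(f"{name:15s} reported {qps:3d} q/s, ceiling {ceilings[name]:5.1f} q/s")
```

Every reported rate sits a little below its ceiling, which is what you would expect once client-side overhead between queries is included.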

Each point below represents the response time of a single query:

As you would expect, adding ngrams has slowed query performance, but I think there are two effects contributing to that slowdown:

Slowness directly caused by the ngrams: more terms and having to search across two different fields means more work at query time.
Slowness incidental to the fact that adding ngrams has made the indexes bigger on disk. The ngrams-5 and ngrams-5-lean indexes differ only in the size of their indexes (the former has unused data that the latter doesn't), but there's a difference in their query performance. Less of the larger index fits in the OS caches, and that means more queries end up hitting my (lousy) disks.

Appendix: compensating for poor IO performance

To better isolate the query-time overheads caused by ngrams from the performance hit caused by growing index sizes, I repeated the tests for ngrams-5-lean and no-ngrams-lean using only 250,000 documents for testing. The indexing stats were:

ngrams-5-lean (small dataset)

Indexing time:	802 seconds
Total index size:	1.4GB

no-ngrams-lean (small dataset)

Indexing time:	235 seconds
Total index size:	571MB

Both indexes are small enough to fit comfortably in the OS cache, so the performance hit caused by disk IO should be reduced.

Query performance against these two indexes (and throwing in our wildcards-lean baseline for good measure):

                                   ngrams-5-lean  no-ngrams-lean  wildcards-lean
Queries per second:                66             105             26
Average query response time (ms):  66             41              185

and their query performance graphs: