Stemmer Testing

The Paice-Husk Stemmer, developed by Chris D Paice with the assistance of Gareth Husk at Lancaster University, features an externally stored set of stemming rules. This flexibility over the Porter stemmer made it of interest to several researchers, including ourselves.

Apart from the stemmer itself, Chris Paice developed a method for directly measuring the performance of stemmers using grouped lists of words. We made use of his 'ERRT' method in our List Analyser software while developing and testing our own stemming rules in several languages for the dtSearch Stemmer, which, much like the Paice-Husk Stemmer, stores its rules in an external file.
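To make the idea of testing against grouped word lists concrete, here is a minimal Python sketch. It is not the List Analyser code and it does not compute ERRT itself; it only counts the two kinds of error that grouped-list evaluation rests on, namely pairs of words from the same group that the stemmer fails to merge (understemming) and pairs from different groups that it merges wrongly (overstemming). The function names, the sample groups and the toy suffix stripper are all assumptions made for illustration.

```python
from itertools import combinations


def stemming_error_rates(groups, stem):
    """Return (understemming_rate, overstemming_rate) for grouped words.

    groups: list of word lists; words in one list should share a stem.
    stem:   any function mapping a word to its stem.
    """
    # Tag every word with the index of the group it belongs to.
    tagged = [(word, gid) for gid, group in enumerate(groups) for word in group]
    desired_merges = unachieved_merges = 0
    desired_non_merges = wrong_merges = 0
    for (w1, g1), (w2, g2) in combinations(tagged, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 == g2:
            desired_merges += 1
            if not same_stem:
                unachieved_merges += 1  # understemming error
        else:
            desired_non_merges += 1
            if same_stem:
                wrong_merges += 1  # overstemming error
    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_non_merges if desired_non_merges else 0.0
    return ui, oi


def toy_stem(word):
    """Hypothetical stemmer: strip at most one suffix, keep 3+ letters."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Illustrative groups only, not taken from any published word list.
groups = [
    ["connect", "connected", "connecting", "connection"],
    ["wand", "wands"],
    ["wander", "wandering", "wandered"],
]
ui, oi = stemming_error_rates(groups, toy_stem)
print(f"understemming rate: {ui:.3f}, overstemming rate: {oi:.3f}")
```

With these sample groups the toy stemmer leaves 'connection' unmerged with its group and wrongly merges 'wander' with 'wand', so both rates come out above zero. A real test would use much larger grouped lists and the stemmer's actual rule file.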

Interestingly, we had converted the original Java stemming test software used at Lancaster University into C#, and early in 2016, while investigating minor differences between our results and those published by Hooper and Paice, we discovered some bugs in the Java software.

Chris Paice, one of the pioneers of research into stemming, passed away on 21 April 2016.

Our own research covered several years, and our conclusion was that stemming rules are best tuned according to the task in hand and the actual corpus being searched. If you don't have hundreds of hours to spare, then best not to mess about with stemming rules, since with some stemmers it is very easy to end up with endless loops or whole groups of words not being found!

These days our focus has moved on to other things, so we decided to release the source code used in our List Analyzer to a public repository on GitHub. We made a few changes to our software so that it can replicate the bugs in the Lancaster Java software, and we hope that others will try to improve on the software and refine the method that Chris Paice started way back in the '90s.

dtSearch uses the stemming rules at search time, commonly called search term conflation. This has the advantage that the index can contain many languages, and the stemming rules can easily be swapped about to optimize search in different languages. It also doesn't matter a jot whether the 'words' produced by the stemmer are actual 'dictionary' words, since these are not seen by the user: dtSearch displays the actual words* in the document collection in a 'word list', so that searches are not 'blind'.
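As a rough illustration of search term conflation, the Python sketch below keeps the indexed words unstemmed and applies a stemmer only when a search term arrives, returning every indexed word that shares the term's stem. This is not how dtSearch is implemented internally; `expand_term`, `build_stem_map` and the toy suffix stripper are hypothetical names invented for this example.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer: strip at most one suffix, keep 3+ letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_map(word_list, stem):
    """Group the indexed words by the stem the current rules give them."""
    by_stem = defaultdict(set)
    for word in word_list:
        by_stem[stem(word)].add(word)
    return by_stem


def expand_term(term, word_list, stem):
    """Return every indexed word that conflates with the search term."""
    by_stem = build_stem_map(word_list, stem)
    return sorted(by_stem.get(stem(term), set()))


# The index holds the words as they occur in the documents, unstemmed.
indexed_words = ["search", "searched", "searches", "searching", "sea", "seas"]
matches = expand_term("searching", indexed_words, toy_stem)
print(matches)
```

Because the stems are only used as internal grouping keys, swapping `toy_stem` for a stemmer with rules for another language changes which words are matched without any need to rebuild the word list.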

If you're looking for a research topic for your higher degree, check out the list below for some ideas. When reading articles and academic papers, don't take every word as gospel; question every assumption. You might be surprised, as we were, to see the conflict in assumptions, and the citations made by authors who clearly did not have time to read or verify the original data but simply borrowed others' conclusions to save time.

* In dtSearch Desktop the word list by default displays 'normalized' words: all accents and diacritics are removed and all letters are converted to lower case. The user can optionally choose to create an 'accent sensitive' index and/or a case sensitive index, and these indexes can be searched at the same time, which makes it a popular choice for translators and editors who need a fast way of checking documents for missing accents or capitalization.
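The kind of normalization described in this note can be sketched in Python as follows. This is only an approximation built on the standard `unicodedata` module, not dtSearch's own rules; the function name and its two flags are assumptions standing in for the 'accent sensitive' and case sensitive index options.

```python
import unicodedata


def normalize_word(word, accent_sensitive=False, case_sensitive=False):
    """Approximate word-list normalization (hypothetical helper)."""
    if not accent_sensitive:
        # Decompose letters into base letter plus combining marks,
        # then drop the combining marks (accents and diacritics).
        decomposed = unicodedata.normalize("NFD", word)
        word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        word = word.lower()
    return word


default_form = normalize_word("Caf\u00e9")
accent_form = normalize_word("Caf\u00e9", accent_sensitive=True)
case_form = normalize_word("Caf\u00e9", case_sensitive=True)
print(default_form, accent_form, case_form)
```

Letters that have no decomposed form, such as 'ø' or 'ß', pass through this sketch unchanged, so a production normalizer would need extra mapping rules for them.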

