OpenEars 2.04 and compatibility versions of all plugins out now, with no more uppercase requirements
Today I’m happy to announce that OpenEars 2.04 and all plugins are out now. This is primarily a bugfix release that reduces memory overhead in OEPocketsphinxController and RapidEars while listening, and prevents a very rare crash that could happen when stopping listening with RapidEars while a lattice search was still in progress. However, there is one significant change that should be a nice improvement for many developers, and I wanted to quickly point it out and explain it so that everyone can start taking advantage of it ASAP.
When I first designed (OE)LanguageModelGenerator years ago, I made the decision to require uppercase text input for best results, because it allowed for the fastest possible creation of dynamic language models. This didn’t seem like a big trade-off at the time: a language model needed to be quite small to perform well during speech recognition on the supported devices of the day, such as the 2nd-gen iPhone, which meant that for the most part it was command-and-control applications that were being developed with OpenEars. For a command-and-control vocabulary, word case is not a big consideration in a UI because the words appear out of context. Rather than transforming the developer’s text input automatically, I decided to support both all-caps and mixed-case input, but to explain in the docs and in the logging output that mixed-case input would be sent to the fallback phoneme lookup technique, which would result in fewer available pronunciations and therefore an accuracy impact for words with multiple pronunciations. This felt like the least-bad compromise between the strongly-competing concerns of speed, minimizing complexity, and not discarding the developer’s intentional choices.
Over the last couple of years, as devices, the framework, and its dependencies have gotten faster, larger vocabularies have become a viable choice with OpenEars, and as a result more app developers have been using it with a broader variety of input sources such as written texts, speeches, etc., which is delightful to see. For that kind of application, the casing of the input and output matters to the developer and the user. The uppercase requirement no longer served developers’ goals or a pleasing UX and needed to be improved, so I revisited this early decision and found a way to do case-insensitive lookup without changing the baseline generation speed, while also improving the generation speed for larger models. That means you can use normal word and sentence casing in your input text and it will be returned in your speech recognition hypotheses with the same casing intact, and larger text input will generate models faster (this doesn’t affect recognition speed, just how long dynamic model and grammar generation take).
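To make that concrete, here’s roughly what model generation with natural casing looks like now – a minimal sketch assuming the standard OELanguageModelGenerator API and the English acoustic model that ships with OpenEars (the model name “MixedCaseModel” and the word list are just examples of mine):

```objc
#import <OpenEars/OELanguageModelGenerator.h>
#import <OpenEars/OEAcousticModel.h>

OELanguageModelGenerator *generator = [[OELanguageModelGenerator alloc] init];

// Mixed-case input no longer gets routed to the less-accurate fallback
// lookup, and hypotheses will come back with this casing intact.
NSArray *words = @[@"Hello", @"Sand Snakes", @"Dorne"];

NSError *error = [generator generateLanguageModelFromArray:words
                                            withFilesNamed:@"MixedCaseModel"
                                    forAcousticModelAtPath:[OEAcousticModel pathToModel:@"AcousticModelEnglish"]];

if(error) {
    NSLog(@"Error while creating language model: %@", error);
}
```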
There has also been an improvement in the handling of punctuation in input, for cases where developers don’t do their own text cleaning to remove symbols that are too ambiguous to transcribe and probably not intended to be spoken (for instance, symbols like { or ^ or `). OELanguageModelGenerator will now clean the input, and the cleaning is consistent across all the plugins and the different model/grammar types. Symbols that can’t be transcribed will be removed. Symbols that can be transcribed will usually be transcribed by the best-effort fallback grapheme generator, so you should still take a look at your input when you know it in advance and decide whether it would be better to transcribe your symbols into words yourself. This is especially true for numbers, because only you know for sure whether you want 1600 to be transcribed as ‘sixteen-hundred’ or ‘one thousand six hundred’ or ‘a thousand six hundred’ or ‘one six oh oh’. Finally, symbols that aren’t significant for recognition purposes (such as . or , or ; or ? or !) will be left in place and will become part of your model.
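As an example of handling numbers yourself, a pre-cleaning step of your own might look like this sketch (illustrative only – the point is making the transcription choice before generation, not the particular string API):

```objc
// Decide yourself how a number should be spoken before handing the text
// to OELanguageModelGenerator, since only you know whether 1600 here
// should be "sixteen hundred" or "one six oh oh".
NSString *rawSentence = @"Meet me at 1600 Pennsylvania Avenue";
NSString *cleanedSentence = [rawSentence stringByReplacingOccurrencesOfString:@"1600"
                                                                   withString:@"sixteen hundred"];

// cleanedSentence can now be passed to OELanguageModelGenerator as usual.
NSArray *input = @[cleanedSentence];
```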
An example of this last point: suppose you used the sentence “The Sand Snakes are with me.” as input. OELanguageModelGenerator will successfully find multiple pronunciations for any word in this sentence that has more than one pronunciation – it will leave the case intact and there will be no accuracy decline from that. The period (full stop) at the end will stay attached to the word “me” in the model, meaning that when OEPocketsphinxController returns a hypothesis matching an utterance of the sentence, the returned text will still have the period attached. If this isn’t the desired result and you don’t want the individual words in this input to carry hints about their position in a sentence or statement, you can still give the original text to OELanguageModelGenerator without sentence punctuation; the assumption now is that if you give sentence punctuation as input, it’s because you intend for it to be returned in a hypothesis. That also means that if you create a language model rather than a grammar, you can sometimes see a word with a period or comma appear in a different position in the sentence than in your input. This is something to think about when using punctuation and deciding between a language model (a statistical model, so words can be returned out of order, and a word with a period attached can theoretically appear in the middle of a sentence if someone walks by the user and says it) and a grammar (a ruleset, so the order you choose is the order that will be returned).
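If you’d rather not have punctuation returned in hypotheses, one straightforward approach is to strip sentence punctuation from your input yourself before generating the model. This is a sketch of my own, not part of the OpenEars API:

```objc
// Remove the sentence punctuation that would otherwise stay attached to
// words in the model, so hypotheses come back without it.
NSString *sentence = @"The Sand Snakes are with me.";
NSCharacterSet *sentencePunctuation = [NSCharacterSet characterSetWithCharactersInString:@".,;?!"];
NSString *withoutPunctuation = [[sentence componentsSeparatedByCharactersInSet:sentencePunctuation]
                                 componentsJoinedByString:@""];
// withoutPunctuation is now @"The Sand Snakes are with me"
```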
The decision tree I use for these punctuation transformations and non-transformations is basically a simplified, non-interactive version of my interactive text-cleaning tool TheKnownUnknowns, so please feel free to take a look at TheKnownUnknowns alongside OELanguageModelGenerator for more info about the considerations with different symbols. Please also feel free to use TheKnownUnknowns for preparing texts for OpenEars when you’d like to make your own decisions in advance about how to transcribe difficult cases. It is primarily designed to quickly clean text corpora before creating an acoustic model using long alignment, and for similar tasks on large texts that have to be prepared for some kind of transcription-related norm, but it is also a good tool for interactively cleaning text you want to use with OpenEars in advance, since the two have their design and major assumptions in common.
Although this is not directly a recognition accuracy change, my sense is that there was a cluster of minor accuracy-related symptoms in some apps, stemming from non-transcribable symbols entering the generator, from mixed-case input being used without realizing it affected how many pronunciations were found, and from the possibility that unknown transcribable or ignorable symbols were being handled differently by the language model/grammar lookup than by the phonetic dictionary lookup, which could theoretically result in words that never match. Projects that were experiencing any of these issues should see an improvement in accuracy from this change.
As always, OpenEars can be downloaded here, and the new plugins can be downloaded either from your demo link or from your licensed framework link. I hope this little improvement helps you make great apps!