- This topic has 13 replies, 2 voices, and was last updated 11 years, 6 months ago by Halle Winkler.
October 22, 2012 at 1:00 pm #11651 rl1987 (Participant)
I have generated a language model using OpenEarsSampleApp from a list of 194 single-word phrases. However, it tends to recognize words incorrectly, especially if the device is held closer to the speaker or if the speaker pronounces words relatively rapidly. How do I improve recognition? Would reducing microphone sensitivity and/or introducing Automatic Gain Control help? If so, how would I implement these changes?
October 22, 2012 at 1:16 pm #11652 Halle Winkler (Politepix)
Welcome,
Can you tell me a bit more about your application? Which language, OpenEars version, device, and kind of audio recording input are you using? Are you using any other media playback objects in your app, like AVPlayer or MPMoviePlayerController? What are the accent, gender and age of the people you are testing with? What accuracy rate are you seeing? Have you verified that you are never sending any messages to the audio session or AVAudioSession? Are the phrases found in the lookup dictionary or not (it’s the file cmu07a.dic that ships with the framework)?
Recognition should work best when users are closer to the device, so that is a sign that there could be an implementation issue.
October 22, 2012 at 1:48 pm #11653 rl1987 (Participant)
Language is English, closest to American dialect I believe, although neither of the speakers are native Americans. Both males and females tested this feature. All of the speakers have adult-sounding voices. Also, for testing I have been trying the say(1) program on Mac OS X and have been getting the same kind of problems.
We have been testing the application on an iPhone 3GS, a 4th-generation iPod Touch and an iPhone 4. In all cases, the device's internal microphone was used; there were no external microphones. The OpenEars version is the latest one, 1.2.2. None of the audio-related iOS APIs are used in the code we have written, since speech recognition is the only audio-related feature in the application.
It seems that most of the phrases in my language model are found in the pre-packaged cmu07a.dic file, but there are exceptions.
October 22, 2012 at 1:55 pm #11654 Halle Winkler (Politepix)
You can’t use synthesized speech for testing recognition; it has to be a real speaker.
I’m confused about the idea of single word phrases — are they single words, or phrases?
Unfortunately you are always going to see reduced accuracy when the speaker has an accent, unfair as it is. What is the accuracy rate you are seeing?
Take a look at the .dic file that is output for the words which aren’t found in the cmu dictionary, because if the fallback method gets the pronunciation wrong, it won’t be recognized correctly.
October 22, 2012 at 2:15 pm #11655 rl1987 (Participant)
We have entered single words (one word per line) into the phrases1.txt file and used the following code in the modified OpenEarsSampleApp to generate the language model:
LanguageModelGenerator *languageModelGenerator = [[LanguageModelGenerator alloc] init];
NSString *phrasesPath = [[NSBundle mainBundle] pathForResource:@"phrases1" ofType:@"txt"];
NSError *error = [languageModelGenerator generateLanguageModelFromTextFile:phrasesPath withFilesNamed:@"OpenEarsDynamicGrammar"];
Should the generated OpenEarsDynamicGrammar.dic contain only the words that aren’t found in the CMU dictionary? As of now, it contains all the words from phrases1.txt.
At worst, we would like to have at least 60% accuracy (that is, 6 successes from 10 experiments). 80% and higher accuracy would be good enough.
October 22, 2012 at 2:28 pm #11657 Halle Winkler (Politepix)
My suggestion is to see which words in your OpenEarsDynamicGrammar.dic file are not present in cmu07a.dic, look at the pronunciations listed for them, and make sure that they are accurate descriptions of the way those words are pronounced. If they are not, those words will never be successfully recognized.
>At worst, we would like to have at least 60% accuracy (that is, 6 successes from 10 experiments). 80% and higher accuracy would be good enough.
Could you please tell me your current accuracy rate? It is improbable that you will get 80% for non-native speakers.
I would turn on logging (both verboseLanguageModelGenerator and OpenEarsLogging) and see if there are any error or warning messages.
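The dictionary check suggested above can be sketched as a small desktop script. This is only a sketch, assuming both .dic files use the usual one-entry-per-line layout (a word, whitespace, then its phonemes, with alternates written as WORD(2)). The function names parse_dic and fallback_entries are hypothetical, and the sample entries are made up for illustration rather than taken from the real files.

```python
import re


def parse_dic(text):
    """Map each word to its list of pronunciations.

    Alternate entries such as USED(2) are merged under the base word.
    """
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = re.sub(r"\(\d+\)$", "", parts[0]).upper()
        entries.setdefault(word, []).append(" ".join(parts[1:]))
    return entries


def fallback_entries(generated_text, reference_text):
    """Return entries of the generated dictionary that are missing from the reference."""
    reference = parse_dic(reference_text)
    generated = parse_dic(generated_text)
    return {word: prons for word, prons in generated.items() if word not in reference}


if __name__ == "__main__":
    # Made-up sample data; in practice, read the two .dic files from disk.
    reference_sample = "GAS\tG AE S\nUSED\tY UW Z D\nUSED(2)\tY UW S T\n"
    generated_sample = "GAS\tG AE S\nLONGFORD\tL AO NG F ER D\nUSED\tY UW Z D\n"
    missing = fallback_entries(generated_sample, reference_sample)
    for word, prons in sorted(missing.items()):
        print(word, "->", "; ".join(prons))
```

Each word it prints is one whose pronunciation came from the fallback method, so those are the entries worth reading by hand.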
October 22, 2012 at 2:39 pm #11658 Halle Winkler (Politepix)
Can you be very specific with me about what this means: “Language is English, closest to American dialect I believe, although neither of the speakers are native Americans.” Where are the speakers from and what is their native language and why did you choose them for evaluating accuracy levels for English speech recognition?
The reason I ask is because it’s unusual to say that a non-native speaker has a native dialect or close to one — that’s an exceedingly unusual outcome in language learning. As an example, I speak German as a non-native speaker and the regional accent of German that I speak is probably closest to the Northwestern German pronunciation, but no one from that region would say that I had a Northwestern German dialect because my US accent is at least as strong as a regional German accent in my speech. I would not be a good test subject for evaluating accuracy of German speech recognition.
October 22, 2012 at 2:58 pm #11659 rl1987 (Participant)
Our application is intended for users who live in Ireland and speak the corresponding dialect of English. We want OpenEars to be able to recognize words spoken in this dialect. Some of the testers belong to this set of intended users. We, the developers, are from Eastern Europe, but we can speak English fairly well, although we do have our accents.
An Irish female tester reported that accuracy was as low as 3% when she was holding iPhone as one normally would. When she increased the distance between herself and device, the accuracy got better.
I have uncommented
[OpenEarsLogging startOpenEarsLogging];
and tried generating the language model again. When OpenEars generates the model, I get a fair number of warnings like this:
2012-10-22 16:41:09.436 OpenEarsSampleApp[84775:11f03] The word LOCKSMITHS was not found in the dictionary /Users/rimantasl/Library/Application Support/iPhone Simulator/6.0/Applications/9309879C-9CFD-48C1-8A12-305A6EC7FDA5/OpenEarsSampleApp.app/cmu07a.dic.
2012-10-22 16:41:09.437 OpenEarsSampleApp[84775:11f03] Now using the fallback method to look up the word LOCKSMITHS
2012-10-22 16:41:09.438 OpenEarsSampleApp[84775:11f03] Using convertGraphemes for the word or phrase LOCKSMITHS which doesn't appear in the dictionary
2012-10-22 16:41:09.440 OpenEarsSampleApp[84775:11f03] If this is happening more frequently than you would expect, the most likely cause for it is since you are using the default phonetic lookup dictionary is that your words are not in English or aren't dictionary words, or that you are submitting the words in lowercase when they need to be entirely written in uppercase.
2012-10-22 16:41:09.451 OpenEarsSampleApp[84775:11f03] The word LONGFORD was not found in the dictionary /Users/rimantasl/Library/Application Support/iPhone Simulator/6.0/Applications/9309879C-9CFD-48C1-8A12-305A6EC7FDA5/OpenEarsSampleApp.app/cmu07a.dic.
2012-10-22 16:41:09.452 OpenEarsSampleApp[84775:11f03] Now using the fallback method to look up the word LONGFORD
2012-10-22 16:41:09.452 OpenEarsSampleApp[84775:11f03] Using convertGraphemes for the word or phrase LONGFORD which doesn't appear in the dictionary
These warnings are mostly related to Irish location names.
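To get the full list of words that went through the fallback lookup, the console output can be filtered with a short script. This is a minimal sketch that assumes the log has been saved as text and relies only on the "was not found in the dictionary" message shown above. The function name words_using_fallback is hypothetical, and the dictionary path in the sample is a placeholder.

```python
import re

# Matches the OpenEarsLogging message quoted in this thread.
NOT_FOUND = re.compile(r"The word (\S+) was not found in the dictionary")


def words_using_fallback(log_text):
    """Return each word reported as missing, in first-seen order, without repeats."""
    seen = []
    for match in NOT_FOUND.finditer(log_text):
        word = match.group(1)
        if word not in seen:
            seen.append(word)
    return seen


if __name__ == "__main__":
    # Placeholder path; the real log prints the full path to cmu07a.dic.
    sample_log = (
        "The word LOCKSMITHS was not found in the dictionary /path/to/cmu07a.dic.\n"
        "Now using the fallback method to look up the word LOCKSMITHS\n"
        "The word LONGFORD was not found in the dictionary /path/to/cmu07a.dic.\n"
        "Now using the fallback method to look up the word LONGFORD\n"
    )
    for word in words_using_fallback(sample_log):
        print(word)
```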
In addition, these warnings appeared just after running OpenEarsSampleApp in the iPhone Simulator:
sih_add WARNING: repeated hashing of 'GAMES', older value will be overridden.
sih_add WARNING: repeated hashing of 'GAS', older value will be overridden.
sih_add WARNING: repeated hashing of 'LEITRIM', older value will be overridden.
sih_add WARNING: repeated hashing of 'LESSONS', older value will be overridden.
sih_add WARNING: repeated hashing of 'MEATH', older value will be overridden.
sih_add WARNING: repeated hashing of 'MONTESSORI', older value will be overridden.
sih_add WARNING: repeated hashing of 'MORTGAGE', older value will be overridden.
sih_add WARNING: repeated hashing of 'RENTAL', older value will be overridden.
sih_add WARNING: repeated hashing of 'SALON', older value will be overridden.
sih_add WARNING: repeated hashing of 'SEWER', older value will be overridden.
sih_add WARNING: repeated hashing of 'SUPPLIES', older value will be overridden.
sih_add WARNING: repeated hashing of 'TARMAC', older value will be overridden.
sih_add WARNING: repeated hashing of 'TARMACADAM', older value will be overridden.
sih_add WARNING: repeated hashing of 'TYRES', older value will be overridden.
sih_add WARNING: repeated hashing of 'USED', older value will be overridden.
sih_add WARNING: repeated hashing of 'VETINARY', older value will be overridden.
sih_add WARNING: repeated hashing of 'WINDSCREEN', older value will be overridden.
October 22, 2012 at 3:31 pm #11662 Halle Winkler (Politepix)
OK, for future reference it would be helpful to lead with this information, or give it in response to the questions asked about it. It’s been some work for me to find out what accents you are really trying to recognize and what accuracy levels you are seeing. I would not expect great accuracy for Irish accent recognition with the default acoustic model, which is entirely made up of US accents.
This sounds like a subjective report: “An Irish female tester reported that accuracy was as low as 3% when she was holding iPhone as one normally would. When she increased the distance between herself and device, the accuracy got better.”
The reason it sounds like a subjective report is that I doubt she tested 100 times (and if she did, was it 100 repetitions of the same phrase or 100 different phrases?), so 3% is more likely to be a qualitative statement for “recognition was bad in my testing round”. The reasons for this could be diverse: it could have something to do with your UI; it could have something to do with the environment she is testing in; it could have to do with the non-English (MEATH) and misspelled (VETINARY) words in your vocabulary, which will have pronunciation entries in the dictionary that will never be spoken; or it could have to do with her expectations of what can be recognized (most end users don’t realize that the vocabulary is limited and/or that saying a lot of extra things outside the vocabulary will affect recognition quality).
The symptom that recognition is worse when she is close to the phone is unlikely to be strictly true since closeness to the phone improves recognition under normal testing, so what is more likely is that the other variables mentioned above changed at the time that she got farther from the phone. I’m sure there is something to it but it is an isolated data point that is unexpected so it needs replication from your side.
I can’t really remote bugfix an issue you are receiving as a remote report — at some point, someone needs to make a first-hand observation of the issue and test it in an organized way and replicate. If something was wrong with her test session (she was saying words that aren’t in the vocabulary, or it was really noisy, or the UI in the app was giving her the impression it was listening at a time that it wasn’t listening) it’s harmful to try to adapt your approach to that limited data.
My recommendation is to obtain some WAV recordings of your speakers saying phrases that should be recognizable with your vocabulary, and put them through PocketsphinxController to find out what the accuracy levels are for them. It is also important to check the words in your OpenEarsDynamicGrammar.dic file that were not found in cmu07a.dic, make sure that the phonetic transcription there is a real description of how someone would say those words, and remove any typos. If you have “VETINARY” and someone says “VETRINARY”, not only will you not recognize “VETRINARY”, but it will hurt the recognition of any other words in the statement, since you now have an out-of-vocabulary utterance in the middle of the statement.
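Once the recordings have been run through recognition, the results can be tallied into concrete numbers. The following is a minimal sketch, assuming each trial has been written down as a pair of the expected word and the hypothesis that came back. The function name score and the sample trials are hypothetical.

```python
from collections import defaultdict


def score(trials):
    """Tally recognition results.

    trials: iterable of (expected_word, recognized_hypothesis) pairs.
    Returns (overall_accuracy, {word: [correct, attempts]}).
    """
    per_word = defaultdict(lambda: [0, 0])
    for expected, hypothesis in trials:
        key = expected.strip().upper()
        per_word[key][1] += 1
        # Count only an exact match of the whole hypothesis as correct.
        if hypothesis.strip().upper() == key:
            per_word[key][0] += 1
    correct = sum(c for c, _ in per_word.values())
    attempts = sum(n for _, n in per_word.values())
    overall = correct / attempts if attempts else 0.0
    return overall, dict(per_word)


if __name__ == "__main__":
    # Made-up trials for illustration only.
    sample_trials = [
        ("GAS", "GAS"),
        ("GAS", "GAMES"),
        ("TYRES", "TYRES"),
        ("SALON", "SALON"),
        ("SALON", ""),  # empty hypothesis: nothing was recognized
    ]
    overall, per_word = score(sample_trials)
    print(f"overall accuracy: {overall:.0%}")
    for word, (correct, attempts) in sorted(per_word.items()):
        print(f"{word}: {correct}/{attempts}")
```

Keeping the per-word counts alongside the overall rate makes it easier to see whether failures cluster on the words that used the fallback pronunciation.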
October 22, 2012 at 4:04 pm #11663 rl1987 (Participant)
What kind of work needs to be done to make OpenEars recognize words spoken in a non-American dialect? As far as I know, PocketSphinx is a general-purpose speech recognition engine and can be used to perform speech recognition in multiple languages. Do I need recordings of Irish people pronouncing the words of interest, and do I need to perform model training to generate models that correspond specifically to the Irish dialect of English? What are the specifics of doing this?
October 22, 2012 at 4:12 pm #11664 Halle Winkler (Politepix)
I don’t think you have enough information yet to commit to going down that path, since you haven’t replicated the issue in-house with real numbers under known-working conditions, which means you don’t have a way of measuring improvement. But you can adapt the acoustic model to an Irish accent; ask at the CMU board for specifics, since it is a question about Pocketsphinx and SphinxTrain. They will also ask for concrete accuracy numbers and directly observed behavior if you ask about accuracy, so it’s a good idea to set up your tests locally before getting started with that.
October 22, 2012 at 4:20 pm #11665 Halle Winkler (Politepix)
Basically, you know you have bad data because there are misspelled and non-English entries in the phonetic dictionary, and a test that really had a 3% accuracy rate has to have been mis-administered somehow or done on too little data to be meaningful. So if you start to adapt the acoustic model based on this bad data, you will get more bad data out. The first step is getting rid of the issues you already know about, and then getting into the accent adaptation once you are seeing tests with believable results (for me this would be at least better than a 40% accuracy rate).
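To illustrate why a small test round says little, here is a sketch that puts a confidence interval around an observed accuracy rate. It uses the Wilson score interval, which is a standard statistical choice brought in for illustration and not something proposed in this thread. The function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a success rate (z=1.96)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed rate (60%), very different certainty.
    for successes, trials in [(6, 10), (60, 100)]:
        low, high = wilson_interval(successes, trials)
        print(f"{successes}/{trials}: {low:.0%} to {high:.0%}")
```

With 6 successes out of 10, the interval runs from roughly 31% to 83%, so ten experiments cannot separate a 40% recognizer from an 80% one. A hundred trials narrows that range considerably.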
October 22, 2012 at 4:28 pm #11666 rl1987 (Participant)
Okay. Thank you.
October 22, 2012 at 4:38 pm #11667 Halle Winkler (Politepix)
No problem. The best way to replicate for an accent that is not your own is to obtain WAV recordings of speech which should work (i.e. it contains the words in your language model) and run them through the following method:
- (void) runRecognitionOnWavFileAtPath:(NSString *)wavPath usingLanguageModelAtPath:(NSString *)languageModelPath dictionaryAtPath:(NSString *)dictionaryPath languageModelIsJSGF:(BOOL)languageModelIsJSGF;
This should show very quickly whether the issue is in the language model/dictionary and/or is due to an issue with how the app is being tested.