[Resolved] Clarifications on the improved background noise cancellation feature – Politepix

wfilleman — Mon, 08 Dec 2014 20:47:45 +0000

Hi Halle,

Overall this update is a nice big step forward for voice recognition. I thought the 1.x series was good, this is just that much better. Great work.

Something I’m now running into is the .vadThreshold setting. In my testing, if this is left as the default or even 2.0, my 6 Plus is constantly entering the speech detect loop with the slightest noise. That’s not really a problem, as I’m seemingly getting MUCH better recognition across the room in a quiet environment, which is going to be excellent for some of my customers.

If I use the low end of 1.5-2.0 for vadThreshold, then when there is noise in the room the recognition engine seems to get pretty unreliable and flooded with the noise, and has a really hard time detecting speech. If I bump this up to the max of 3.9 (looks like the framework has an upper limit of 4.0), then I can have lots of noise in the room without OE entering its speech detecting loop until I say something directly into the mic. Again, this is pretty good behavior considering all the noise/music I’ve got playing and seeing it actually work.

The problem comes in when the environment has variable noise levels. Quiet at some parts of the day and noisy during other parts. (My customers use my app in a wall mounted scenario). Maybe I’m wrong on this, but I thought that the OE 1.x series did an auto-tune (calibration) for background noise and would make continual internal adjustments based on noise/recognition. Do I have that right? If so, is that something that can be turned on in 2.0 or is this behavior now up to us to implement by fine tuning the .vadThreshold in real time?

Thanks Halle!
Wes

Halle Winkler — Mon, 08 Dec 2014 21:08:47 +0000

Hi Wes,

Thank you for your kind words. It should still be reacting to the environment and updating itself – the vadThreshold isn’t an absolute volume level, but a s/n ratio setting, so it should continue to make sense as those values change. But you might not be able to use a value which is as aggressive as 3.9, since it will tend to reject real speech under normal circumstances.

What is your experience under changing circumstances using a value like 3.0? Your feedback is appreciated on this, since even though I have a lot of different audio in my tests, there’s no substitute for real-world feedback and this is part of the new Pocketsphinx code.

wfilleman — Mon, 08 Dec 2014 21:43:22 +0000

Thanks Halle,

I was digging into the framework when I saw that vadThreshold is, as you describe it, a relative speech/silence threshold. That’s good to know.

I’ve set the vadThreshold to 3.0 and ran a couple of tests with background music from the radio at different volume levels. Overall it seems to be a little better. Now that I know what I’m looking for, I can see that it is indeed adjusting to the various sound levels. When the music levels are above what I would say is just background music, it’s really tough to get OE to process the speech, but again, I’m asking a lot of the engine to throw out louder-than-background music and pull out my speech.

You are right, it’s a fine balance between upping the threshold and keeping it within speech detecting tolerance.

I may offer my users an option to say if they are installing this in a noisy room. If YES then I can set the vadThreshold to 3.0. If no, leave it as the default. What do you think?
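The noisy-room option described above could be sketched roughly as follows. This is only an illustration in plain C: `choose_vad_threshold` is a hypothetical helper, not part of OpenEars, and the framework default is passed in rather than hard-coded because its value is not fixed by this thread.

```c
#include <stdbool.h>

/* Hypothetical helper (not an OpenEars API): map an installer-facing
 * "noisy room" switch to a VAD threshold. The caller supplies the
 * framework's default so this sketch does not guess at its value. */
static float choose_vad_threshold(bool noisy_room, float default_threshold)
{
    const float noisy_room_threshold = 3.0f; /* value proposed in this thread */

    if (!noisy_room) {
        return default_threshold;
    }
    /* Never lower the threshold just because the noisy option was chosen. */
    return default_threshold > noisy_room_threshold ? default_threshold
                                                    : noisy_room_threshold;
}
```

The returned value would then be assigned to vadThreshold before listening starts.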

Wes

wfilleman — Mon, 08 Dec 2014 22:10:35 +0000

I’m looking at this a little deeper and I *think* what I’m actually seeing is the OE framework adjusting to the different volume levels quite rapidly. For example, if I have a steady tone as background noise, OE pretty quickly sees this as noise and ignores it. I can then issue speech and it does pretty well.

If I’m playing music with various beat levels, I see OE struggle a little bit trying to determine what to ignore as noise since it’s seeing the threshold cross all over the place with the beat of the music.

I’m wondering if there’s a way to level this auto-adjustment out by increasing the number of frames OE considers for the “noise level” if that makes sense. For example, if OE only looks at a few frames, then the “noise” level would be rapidly changing from low to high and back. If OE looks at a larger group of frames as a moving average, then these intermediate spikes of noise could be leveled out and ignored.
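To make the moving-average idea concrete, here is a small sketch in plain C. It does not reflect how OE or Pocketsphinx actually estimate noise; `trailing_mean` and `estimate_spread` are hypothetical names, and the per-frame "levels" are made-up numbers. It only shows that a noise estimate averaged over more frames swings less when the input has a regular beat.

```c
#include <stddef.h>

/* Mean of the `window` samples ending at index `end` (inclusive).
 * Requires window >= 1 and end + 1 >= window. */
static double trailing_mean(const double *levels, size_t end, size_t window)
{
    double sum = 0.0;
    for (size_t i = end + 1 - window; i <= end; ++i) {
        sum += levels[i];
    }
    return sum / (double)window;
}

/* How far the moving-average noise estimate swings (max minus min)
 * across the whole signal. Requires 1 <= window <= n. */
static double estimate_spread(const double *levels, size_t n, size_t window)
{
    double lo = trailing_mean(levels, window - 1, window);
    double hi = lo;
    for (size_t end = window; end < n; ++end) {
        double m = trailing_mean(levels, end, window);
        if (m < lo) lo = m;
        if (m > hi) hi = m;
    }
    return hi - lo;
}
```

With a loud frame on every fourth beat, a 2-frame window produces an estimate that jumps with each beat, while an 8-frame window stays flat. The cost is that the longer window also reacts more slowly when the room genuinely gets louder or quieter.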

Just my guess, but I think that’s what I’m actually seeing. Adjusting the vadThreshold is a way to work around this issue by forcing a larger discrepancy, but if there is (as I suspect) a single frame of louder noise, it punches through the vadThreshold, since the low/high detection appears to analyze only a small number of frames.

There’s no easy answer here as what I’m suggesting would have other tradeoffs as well if I’m even close to the issue.

Wes

Halle Winkler — Tue, 09 Dec 2014 08:38:13 +0000

I checked in with the CMU project and verified that this is correct (I’ll probably post the response later if I get permission to quote) – recalibration is definitely happening as you’ve seen, and 3.0 is probably the highest value that is correct to use. It is designed to be adaptive to changing environments but expects stationary noise, i.e. no dramatic oscillations that need to be reacted to in very short timeframes (this was also the case that would get the old VAD stuck, so we have an improvement if there’s no stuckness but recognition is sub-optimal).

It might be possible to change the VAD timeslice although it’s probably dangerous or possibly pointless to optimize in that area at the same time it continues to be developed by the Sphinx project.

If you feel like recompiling the framework, there are some config settings you can look at in OEPocketsphinxRunConfig.h related to VAD activity:

// #define kVAD_PRESPEECH //"-vad_prespeech", int, default ARG_STRINGIFY(DEFAULT_PRESPCH_STATE_LEN), Num of speech frames to trigger vad from silence to speech.
// #define kVAD_POSTSPEECH //"-vad_postspeech", int, default ARG_STRINGIFY(DEFAULT_POSTSPCH_STATE_LEN), Num of speech frames to trigger vad from speech to silence.
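One way to picture what these two frame counts do is a small state machine that only changes state after enough consecutive frames disagree with the current state. The following is a sketch in plain C under that assumption, not the Sphinx implementation; `vad_state`, `vad_state_init` and `vad_state_update` are hypothetical names, and it assumes the post-speech count applies to consecutive non-speech frames.

```c
#include <stdbool.h>

/* Hypothetical smoothing state, loosely modeled on the two settings above. */
typedef struct {
    int  prespeech_frames;  /* frames needed to go from silence to speech */
    int  postspeech_frames; /* frames needed to go from speech to silence */
    int  run;               /* consecutive frames contradicting the state */
    bool in_speech;
} vad_state;

static void vad_state_init(vad_state *s, int prespeech, int postspeech)
{
    s->prespeech_frames = prespeech;
    s->postspeech_frames = postspeech;
    s->run = 0;
    s->in_speech = false;
}

/* Feed one raw per-frame decision; returns the smoothed state. */
static bool vad_state_update(vad_state *s, bool frame_is_speech)
{
    if (frame_is_speech != s->in_speech) {
        int needed = s->in_speech ? s->postspeech_frames : s->prespeech_frames;
        if (++s->run >= needed) {
            s->in_speech = frame_is_speech; /* enough evidence: switch */
            s->run = 0;
        }
    } else {
        s->run = 0; /* frame agrees with current state: reset the count */
    }
    return s->in_speech;
}
```

Under this model, raising the pre-speech count would make a single loud frame less likely to start an utterance, and raising the post-speech count would make brief pauses less likely to end one.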

Or if the issue is that recognition is getting stuck, you can also reduce this check for a stuck utterance in OEContinuousModel.m to something lower than 25:

if(([NSDate timeIntervalSinceReferenceDate] - self.stuckUtterance) > 25.0)
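If the limit needs to vary by use case, the same comparison can be written with the limit as a parameter. This is a plain C sketch, not framework code: `utterance_is_stuck` is a hypothetical name, and it assumes the stored value is the time at which the utterance started, measured in seconds on the same clock as the current time.

```c
#include <stdbool.h>

/* Hypothetical helper: true once an utterance has been running for
 * longer than limit_s seconds. Both timestamps must come from the
 * same clock and be expressed in seconds. */
static bool utterance_is_stuck(double now_s, double utterance_start_s,
                               double limit_s)
{
    return (now_s - utterance_start_s) > limit_s;
}
```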

Remember that the framework project has to be archived rather than just built or it won’t build a universal framework and you’ll get object errors with either a device or a simulator, depending.

Question, does your app play back audio or does it just take in mic audio?

wfilleman — Tue, 09 Dec 2014 14:23:13 +0000

Thanks Halle,

Ok, that’s good. That all makes sense with what I’m seeing. No problem with the framework rebuild. I already rebuilt it yesterday to add back in a custom feature I need in my app to be able to disable the bluetooth input option with OE via a BOOL variable on the PocketSphinxController.

I’ll play around with these settings and post back with what I find. The 25 seconds needs to come down for my use cases. Thanks for pointing me in the right direction there.

Yes, my app plays audio as well. One of the features is an IP Camera streaming option that can play video/audio from IP Cameras. While not in use 100% of the time, it’s possible a user could have voice recognition ON while watching their camera. I did look at this yesterday and it appeared to work like the 1.x framework. So, no concerns from me on that front.

Wes

Halle Winkler — Tue, 09 Dec 2014 14:28:10 +0000

Great, I will be happy to hear about what you discover. This is a .0 version so as more info comes in from real-world usage there can be adjustments where it makes sense.

OT — Tue, 09 Dec 2014 17:48:51 +0000

To add to this thread: in my (somewhat limited) testing with both my app and the sample app, the threshold between 2.5 and 3 works well. Default value of 1.5 seems to be too low.

(my testing was with Apple’s headset; with default levels ‘speech’ gets detected even with minor noise that’s far from the mic)

The main issue with the lower values is the end-pointing, which can affect the flow of the application. In other words, even if there is a false speech detection trigger (e.g. noise, background speech, etc.), the decoder will typically deal with that without any problems. But the noise levels that triggered the VAD and started recognition will also prevent it from ending after the user has said what they were supposed to say.

This may not be an issue with RapidEars, where decoding is done in real time (so the decoder is effectively doing VAD).

Halle Winkler — Tue, 09 Dec 2014 18:57:21 +0000

Yup, I’ve raised vadThreshold to 2.0 for the current version and we’ll see what the feedback on that is. Here is the CMU commentary on the VAD:

Our VAD does track the noise level continuously, it updates noise estimation every frame with sliding average of about 5 seconds.[….] it tracks the noise level and raises speech signal when the signal in some frequency band is higher than threshold * noise.

On the other hand, the VAD is designed to work with slowly changing colored (different levels in different bands) noise. It is not supposed to deal with non-stationary noise. The recommended threshold is about current value (2.0) or it could be 3.0 if you expect slightly more noise variation. Values over 3 are not very reasonable. The value of threshold describes how the noise changes (in what boundaries you consider the change as noise), not how the speech change so it should not be tuned.
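As a rough illustration of that description, here is a sketch in plain C of a per-band detector that flags speech when any band exceeds threshold * noise and updates its noise estimate on every frame. It is not the Sphinx code: the names and the band count are hypothetical, and an exponential moving average stands in for the sliding average. The 500-frame window used in the usage note assumes 100 frames per second, which is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_BANDS 4 /* hypothetical band count, chosen for illustration */

typedef struct {
    double noise[NUM_BANDS]; /* running noise estimate per band */
    double alpha;            /* smoothing factor, about 1 / window length */
    double threshold;        /* plays the role of vadThreshold */
} band_vad;

static void band_vad_init(band_vad *v, double threshold,
                          double window_frames, double initial_noise)
{
    v->threshold = threshold;
    v->alpha = 1.0 / window_frames;
    for (size_t b = 0; b < NUM_BANDS; ++b) {
        v->noise[b] = initial_noise;
    }
}

/* Process one frame of per-band energies; returns true if any band is
 * louder than threshold * noise. The noise estimate is then nudged
 * toward the current frame, so a steady sound is gradually absorbed. */
static bool band_vad_frame(band_vad *v, const double *energy)
{
    bool speech = false;
    for (size_t b = 0; b < NUM_BANDS; ++b) {
        if (energy[b] > v->threshold * v->noise[b]) {
            speech = true;
        }
    }
    for (size_t b = 0; b < NUM_BANDS; ++b) {
        v->noise[b] += v->alpha * (energy[b] - v->noise[b]);
    }
    return speech;
}
```

Initialised with `band_vad_init(&v, 2.0, 500.0, 1.0)`, a steady tone at four times the initial noise level is flagged at first and then ignored once the estimate has caught up, which matches the steady-tone behaviour reported earlier in the thread. A beat that changes faster than the window can follow keeps crossing the threshold instead.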

wfilleman — Tue, 09 Dec 2014 19:09:39 +0000

Thanks Halle,

Based on their response, how would you expect the pre and post speech values to change and their impacts to overall speech detection? I’m not sure I’m following the link between the code change suggestion and the CMU response. It sounds to me that their sliding 5 sec average is fixed?

Wes