Voice Recognition / Speech Recognition and speech for the iPhone

Halle Winkler — Wed, 08 Sep 2010 12:31:15 +0000

Decibel metering from an iPhone audio unit

Halle Winkler — Fri, 18 Jun 2010 10:04:52 +0000

Hello visitor!

If Core Audio and iOS development are your cup of tea, you might also want to check out OpenEars, Politepix’s shared source library for continuous speech recognition and text-to-speech for iPhone and iPad development. As of version 1.7 it even has an API for defining rules-based recognition grammars dynamically, which is pretty neat! On to decibel metering:

There are three levels of abstraction for audio on the iPhone. AVAudioPlayer is the easiest to use (great for 75% of cases) but offers the least fine control and the highest latency. Audio Queue Services is the middle step, with less latency and a callback where you can do a lot of useful stuff. At the lowest level there are two types of Audio Unit: Remote I/O (or remoteio) and the Voice Processing Audio Unit subtype.

Audio Units are a little less forgiving than Audio Queues in their setup, have a few more low-level settings to account for, and are a little less documented. Their sample code on the developer site (aurioTouch) is also less transparent than the one for Audio Queues (SpeakHere). All of this has led to the impression that they are ultra-difficult and should be approached with caution. In practice, the code is almost identical to that for Audio Queues if you aren’t mixing sounds and have a single callback. At least, I’ve spent as much time being mystified by a non-working Audio Queue as by a non-working Audio Unit on the iPhone. It needs to be said, though, that the main reason Audio Units aren’t much harder than Audio Queues at this point is that a lot of independent developers have put a lot of time into experimenting, asking questions, and publishing their results. A year ago they were much more of a black box.

The decision process on which technology to use is something like:

Q. Are any of the following statements true: “I need the lowest possible latency”, “I need to work with network streams of audio or audio in memory”, “I need to do signal processing”, “I need to record voice with maximum clarity”?
A. If yes, Audio Units are probably best. If no, go on to the next question.
Q. Do you still need to be able to work with sound at the buffer level?
A. If yes, use Audio Queues or Audio Units, whichever is more comfortable. If no, use AVAudioPlayer/AVAudioRecorder.

In my experience there is just one big downside to the Audio Unit on the iPhone: it has no working metering property. A metering property does appear in the audio unit properties header and in the iPhone Audio Units docs, but it isn’t really turned on, and you can lose a lot of time discovering this via experimentation. So, if you’ve chosen to use Audio Units and your implementation is working, you have a render callback function, and that is where you can meter your samples. I have only written and tested this for 16-bit mono PCM data, so if you are using something else, adaptations might be required.

To meter the samples in the render callback requires six steps.

Step 1: get an array of your samples that you can loop through. Each sample contains the amplitude.
Step 2: for each sample, get its amplitude’s absolute value.
Step 3: for each sample’s absolute value, run it through a simple low-pass filter.
Step 4: for each sample’s filtered absolute value, convert it into decibels.
Step 5: for each sample’s filtered absolute value in decibels, add an offset value that normalizes the clipping point of the device to zero.
Step 6: keep the highest value you find.

That end value will be more or less the same thing you’d get when using the metering property for an Audio Queue or AVAudioRecorder/AVAudioPlayer.


Now, the actual code:

	
static OSStatus	AudioUnitRenderCallback (void *inRefCon,
        AudioUnitRenderActionFlags *ioActionFlags,
        const AudioTimeStamp *inTimeStamp,
        UInt32 inBusNumber,
        UInt32 inNumberFrames,
        AudioBufferList *ioData) {

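		// audioUnitWrapper is assumed to be reachable from here (for example a global,
		// or a pointer recovered from inRefCon); it is defined elsewhere in the real code.
		OSStatus err = AudioUnitRender(audioUnitWrapper->audioUnit, 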
                                               ioActionFlags, 
                                               inTimeStamp,  
                                               1, 
                                               inNumberFrames, 
                                               ioData);

		if(err != 0) NSLog(@"AudioUnitRender status is %d", err);
		// These values should be in a more conventional location 
                //for a bunch of preprocessor defines in your real code
#define DBOFFSET -74.0 
		// DBOFFSET is an offset that will be used to normalize 
                // the decibels to a maximum of zero.
		// This is an estimate; you can choose your own or construct 
                // an experiment to find the right value
#define LOWPASSFILTERTIMESLICE .001 
		// LOWPASSFILTERTIMESLICE is part of the low pass filter 
                // and should be a small positive value

		SInt16* samples = (SInt16*)(ioData->mBuffers[0].mData); // Step 1: get an array of 
                // your samples that you can loop through. Each sample contains the amplitude.

		Float32 decibels = DBOFFSET; // When we have no signal we'll leave this on the lowest setting
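		Float32 currentFilteredValueOfSampleAmplitude; // We'll need these two in the low-pass filter
		static Float32 previousFilteredValueOfSampleAmplitude = 0.0; // static, so the filter starts from zero
                                                                     // and keeps its state between callbacks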
		
                Float32 peakValue = DBOFFSET; // We'll end up storing the peak value here

		for (UInt32 i=0; i < inNumberFrames; i++) { 

			Float32 absoluteValueOfSampleAmplitude = abs(samples[i]); //Step 2: for each sample, 
                                                                      // get its amplitude's absolute value.

			// Step 3: for each sample's absolute value, run it through a simple low-pass filter
			// Begin low-pass filter
			currentFilteredValueOfSampleAmplitude = LOWPASSFILTERTIMESLICE * absoluteValueOfSampleAmplitude + (1.0 - LOWPASSFILTERTIMESLICE) * previousFilteredValueOfSampleAmplitude;
			previousFilteredValueOfSampleAmplitude = currentFilteredValueOfSampleAmplitude;
			Float32 amplitudeToConvertToDB = currentFilteredValueOfSampleAmplitude;
			// End low-pass filter

			Float32 sampleDB = 20.0*log10(amplitudeToConvertToDB) + DBOFFSET; 
			// Step 4: for each sample's filtered absolute value, convert it into decibels
			// Step 5: for each sample's filtered absolute value in decibels, 
                        // add an offset value that normalizes the clipping point of the device to zero.

			if(isfinite(sampleDB)) { // skip NaN and infinite values 
                                                 // (log10(0) is -infinity)

				if(sampleDB > peakValue) peakValue = sampleDB; // Step 6: keep the highest value 
                                                                                  // you find.
				decibels = peakValue; // final value
			}
		}

		NSLog(@"decibel level is %f", decibels); // Fine for testing; logging is too slow to leave 
                                                         // in a real-time render callback

		for (UInt32 i=0; i < ioData->mNumberBuffers; i++) { // This is only if you need to silence 
                                                                          // the output of the audio unit
			memset(ioData->mBuffers[i].mData, 0, ioData->mBuffers[i].mDataByteSize); // Delete if you 
                                                                                  // need audio output as well as input
		}

		return err;
}

That should give you a metered decibel value which is analogous to the output of the metering property for an Audio Queue. If anyone has any corrections to this or comments I hope they’ll get in touch.

My starting point for learning this technique was a helpful response email from iWillApps’ Will to a silly question I had, which got me on track analyzing the actual samples. Also useful were this page, where the math behind displaying dB is broken down pretty thoroughly, and this post on Stack Overflow, which explains that the process needs to be done on a rectified signal and has the low-pass filter code example.
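
To get a feel for what the low-pass filter does to a rectified signal, it helps to feed it a constant level and watch how quickly the output catches up. The sketch below is illustrative only: `low_pass_step` and `low_pass_settle` are hypothetical names, and the filter is the same one-line formula used in the callback.

```c
#include <stddef.h>

/* One step of the one-pole low-pass filter:
   output = coefficient * input + (1 - coefficient) * previous_output */
static double low_pass_step(double input, double previous_output, double coefficient) {
    return coefficient * input + (1.0 - coefficient) * previous_output;
}

/* Feeds `count` copies of a constant rectified amplitude through the filter,
   starting from silence, and returns the final filter output. */
static double low_pass_settle(double amplitude, size_t count, double coefficient) {
    double output = 0.0;
    for (size_t i = 0; i < count; i++) {
        output = low_pass_step(amplitude, output, coefficient);
    }
    return output;
}
```

With the coefficient of .001 used in this post, a steady input reaches about 63% of its level after 1000 samples. That would be roughly 23 ms if the audio unit runs at 44.1 kHz, which is an assumption here since the post does not state a sample rate. A larger coefficient makes the meter react faster but jump around more.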
