SaveThatWav and VAD



    Hi Halle,

    Could you please shed a bit more light on how SaveThatWave works with respect to the pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech events and the secondsOfSilenceToDetect attribute?

    More specifically:
    – Can I assume there is always (approximately) secondsOfSilenceToDetect seconds of silence at the end of the wav file (silence as determined by the Pocketsphinx VAD)?

    – What does the pocketsphinxDidDetectFinishedSpeech event correspond to in the wav file? Can I (roughly) assume that the event fired secondsOfSilenceToDetect seconds before the end of the wav file?

    – What is the padding at the beginning of the wav file? (The wavs always seem much longer than the span between the detected start and end of speech.)

    – Given the answers to the above questions, what is fed to the decoder?

    If there is some extra “padding” at the beginning, how does that affect the situation where multiple wav files are saved almost immediately one after the other?

    Halle Winkler


    Are you experiencing a bug or do you have a question about how to use SaveThatWave’s API in an app? I can’t help with questions about how the plugins are implemented, sorry.



    The API is trivial, so no questions there… I don’t know if I am experiencing a bug, because I could not find a description of what I am supposed to get in the wav file (thus my original question).

    I am trying to understand how I can interpret what is captured via the plugin and how it relates to the events fired via Pocketsphinx. I think making things a bit more transparent on that front would help. I am not asking about the details of implementation; I am happy to pay for the plugin — which I did — and use it, but it would be nice to know what I’m getting in those wav files.

    To be more specific: I look at the time between pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech and take the secondsOfSilenceToDetect value (0.4 sec) into account. Sometimes a signal of 250 ms triggers the VAD (I measured this from the wav file: I see silence, a very short word of 250 ms, and then silence again), yet 400-450 ms is reported between the two events, and I can’t quite understand how. When I look at the corresponding wav file saved via the SaveThatWave plugin, I get something longer, with some leading and trailing silence (the trailing silence sometimes seems to correspond to secondsOfSilenceToDetect, but not always)…
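
For reference, the kind of measurement described above (silence, a short word, then silence) can be approximated in a few lines of Python. This is only a rough sketch: it assumes the saved file is 16-bit mono PCM, and the helper names, the 10 ms frame size and the RMS threshold are hypothetical choices for illustration. It is a plain energy threshold, not the Pocketsphinx VAD, so its boundaries will not necessarily match what the recognizer decides.

```python
import array
import math
import sys
import wave


def read_mono_pcm16(path):
    """Load a WAV file, assumed to be 16-bit mono PCM; returns (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        return samples, wav.getframerate()


def voiced_span_ms(samples, sample_rate, frame_ms=10, rms_threshold=500.0):
    """Return (start_ms, end_ms, duration_ms) of the span between the first and
    last frame whose RMS energy reaches the threshold, or None if no frame does.

    frame_ms and rms_threshold are illustrative values, not tuned ones.
    """
    frame_len = sample_rate * frame_ms // 1000
    loud_frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms >= rms_threshold:
            loud_frames.append(start // frame_len)
    if not loud_frames:
        return None
    start_ms = loud_frames[0] * frame_ms
    end_ms = (loud_frames[-1] + 1) * frame_ms
    return start_ms, end_ms, end_ms - start_ms
```

Calling `voiced_span_ms(*read_mono_pcm16(path))` on a saved file gives the leading silence (`start_ms`), the length of the loud part (`duration_ms`), and, together with the total file length, the trailing silence, which can then be compared against secondsOfSilenceToDetect.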

    Why am I doing this? Because the VAD in Pocketsphinx doesn’t do a very good job with mouth noise, clicks, etc., and sometimes (even when using Rejecto) those still end up mapped to something from the grammar. So I was hoping to filter out some of those false positives by looking at the relevant durations. Does that make sense?
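
The duration-based filtering idea can be sketched as follows. This is a minimal illustration, not something the plugin or OpenEars provides: it assumes the app records its own timestamps when pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech fire, and the `Utterance` type, function name and `min_gap_seconds` value are hypothetical. Since the thread does not establish how the gap between the two events relates to the actual speech length, the threshold would have to be calibrated empirically on recorded examples.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speech_detected_at: float   # app-recorded time of pocketsphinxDidDetectSpeech, in seconds
    speech_finished_at: float   # app-recorded time of pocketsphinxDidDetectFinishedSpeech, in seconds
    hypothesis: str             # text returned by the recognizer for this utterance


def filter_short_utterances(utterances, min_gap_seconds=0.5):
    """Split utterances into (kept, rejected) by the gap between the two events.

    min_gap_seconds is an illustrative placeholder: utterances whose event gap
    is shorter are treated as probable clicks or mouth noise.
    """
    kept, rejected = [], []
    for utt in utterances:
        gap = utt.speech_finished_at - utt.speech_detected_at
        (kept if gap >= min_gap_seconds else rejected).append(utt)
    return kept, rejected
```

With the numbers mentioned in this thread, a 400-450 ms gap would fall under a 0.5 s threshold and be rejected, while longer utterances would pass; whether that cut-off separates noise from real short words is exactly what would need testing.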

    Halle Winkler

    I appreciate your purchase! I don’t think that SaveThatWave can be usefully put to task as a tool for analyzing the Sphinx project’s VAD.
