Hypothesis changes for the same audio file. #10
The decoder requires CMN estimates (the numbers printed in the log) at the start. We might implement proper CMN one day, we're just not there yet. You can set the initial CMN estimate for your device with the -cmninit option or by editing model/feat.params.
@nshmyrev I see, thanks for pointing that out. However, I'm unclear what the initial value would be, since it's a moving target based on the input device. My understanding is that any time a different microphone is used, or a microphone channel input level is changed on a device, the CMN needs to be adjusted (CMN = Cepstral Mean Normalization, for anyone new to this who is reading). I found the CMUSphinx ticket from 2010 where you discuss this issue in more detail. In it, you also mentioned a possible workaround:
From that ticket, if I understand it correctly, you are saying that you will always need to decode the first utterance (probably poorly) to determine a CMN for the device, and then re-decode the same utterance with a properly set CMN. For example:
If that's the case, does that mean every new device basically needs to be decoded in two passes for each user session?
Very interesting, I hadn't realized this. It might be nice to have some facility in pocketsphinx-ruby which can figure out some values for a given device and serialize them for later re-use. Would these values be worth re-using for a given device with the same sensitivity/noise level? Are they independent of speaker/accent? Perhaps an implementation could even detect when the pre-supplied values were out by more than a certain tolerance?
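As a rough illustration of the "serialize them for later re-use" idea, here is a minimal sketch of a per-device store. The `CmnStore` class, the device identifier, and the JSON file layout are all hypothetical and not part of pocketsphinx-ruby; whether stored values are actually reusable across speakers or noise levels is the open question raised above.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical store (not part of pocketsphinx-ruby): persists one CMN vector
# per device identifier in a JSON file so it can be reused across sessions.
class CmnStore
  def initialize(path)
    @path = path
    @values = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Returns the stored vector as a 'cmninit'-style string, or nil if unknown.
  def cmninit_for(device_id)
    vector = @values[device_id]
    vector && vector.join(',')
  end

  # Records the vector for a device and writes the whole store back to disk.
  def save(device_id, vector)
    @values[device_id] = vector
    File.write(@path, JSON.generate(@values))
  end
end

path = File.join(Dir.mktmpdir, 'cmn.json')
CmnStore.new(path).save('usb-mic', [48.76, 9.26, -4.89])

# A fresh instance reads the values back, as a later session would.
restored = CmnStore.new(path).cmninit_for('usb-mic')
puts restored
```

The returned string has the same comma-separated shape as the 'cmninit' values shown elsewhere in this thread, so it could be assigned to the configuration before the first decode.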
Well, this is what I saw for the "hello" example above:
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Since the
The main problem I see here from a production application standpoint is that each device (i.e. each session) will have a different value that needs to be warmed up (i.e. Pass 1 from above) and persisted (i.e. Pass 2 from above, for example via a unique device ID or session cookie). In any real-world application, it is the job of the application controller to ingest the session identifier and of the application model to persist it, and not so much that of the
@watsonbox This is such a critically important issue to get right; otherwise the gem will perform very poorly (and be unusable in most cases), even though it's not a gem limitation per se. I'm trying to think of the most sensible way to handle this sort of issue (i.e. detailed
@watsonbox Another possibility is to enable a brute-force approach by default. Something like:
This looks like a software bug; it should be 40,3,-1 from the beginning. Let me check this issue.
@nshmyrev I'm having trouble locating any documentation regarding how
@nshmyrev @watsonbox Actually, it looks as though there are a bunch of parameters that are not being set during the initialization of
> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.changes
=> []
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.changes
=> [{:name=>"cmninit", :type=>:string, :default=>"8.0", :required=>false, :value=>"40,3,-1", :info=>"Initial values (comma-separated) for cepstral mean when 'prior' is used"},
{:name=>"fdict", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/noisedict", :info=>"Noise word pronunciation dictionary input file"},
{:name=>"featparams", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/feat.params", :info=>"File containing feature extraction parameters."},
{:name=>"lifter", :type=>:integer, :default=>0, :required=>false, :value=>22, :info=>"Length of sin-curve for liftering, or 0 for no liftering."},
{:name=>"lowerf", :type=>:float, :default=>133.33334, :required=>false, :value=>130.0, :info=>"Lower edge of filters"},
{:name=>"mdef", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/mdef", :info=>"Model definition input file"},
{:name=>"mean", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/means", :info=>"Mixture gaussian means input file"},
{:name=>"nfilt", :type=>:integer, :default=>40, :required=>false, :value=>25, :info=>"Number of filter banks"},
{:name=>"sendump", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/sendump", :info=>"Senone dump (compressed mixture weights) input file"},
{:name=>"svspec", :type=>:string, :default=>nil, :required=>false, :value=>"0-12/13-25/26-38", :info=>"Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)"},
{:name=>"tmat", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/transition_matrices", :info=>"HMM state transition matrix input file"},
{:name=>"transform", :type=>:string, :default=>"legacy", :required=>false, :value=>"dct", :info=>"Which type of transform to use to calculate cepstra (legacy, dct, or htk)"},
{:name=>"upperf", :type=>:float, :default=>6855.4976, :required=>false, :value=>6800.0, :info=>"Upper edge of filters"},
{:name=>"var", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/variances", :info=>"Mixture gaussian variances input file"}]
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
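The `changes` output above appears to list the settings whose current value differs from the default. A minimal sketch of that comparison, using hand-written hashes in place of a real configuration (the `changed_settings` helper and the sample data are illustrative assumptions, not the gem's implementation):

```ruby
# Illustrative only: report which settings no longer hold their default value.
# `details` mimics the shape of the hashes printed above; it is not read from
# a real Pocketsphinx configuration.
def changed_settings(details)
  details.reject { |setting| setting[:value] == setting[:default] }
end

details = [
  { name: 'cmninit', default: '8.0', value: '40,3,-1' },
  { name: 'nfilt',   default: 40,    value: 25 },
  { name: 'lifter',  default: 0,     value: 0 } # left at its default for contrast
]

names = changed_settings(details).map { |setting| setting[:name] }
puts names.inspect
```

Run before and after the first decode, a comparison like this makes it easy to see which settings only take effect once the model's feat.params has been loaded.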
Ok, so it appears that something is going a bit haywire in the
At this point, I'm pretty confused as to what's actually going on with
Example when using default model files (ie. not setting
> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Example when using a custom adapted model (ie. setting
> configuration = Pocketsphinx::Configuration.default
> configuration['hmm'] = '/tmp/custom_sphinxtrain_acoustic_model_folder'
> decoder = Pocketsphinx::Decoder.new(configuration)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
cloaked
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Here's a link to the hello.wav file I'm using. Any thoughts or help with fixing this would be great, as I'm pretty sure it makes the gem unusable without hacking around and decoding with multiple passes. Thx!
@nshmyrev Also, I tried your suggestion regarding setting the
You should be aware that the
INFO: cmn_prior.c(131): cmn_prior_update: from < 40.00 3.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49 >
These are the updated CMN values after the first decoding. What you can do is remove the
decoder = Decoder.new(Configuration.default)
decoder.configuration['cmninit'] = %w{48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49}.join(',')
Then you'll get correct recognition the first time. Doing this automatically would require a way of getting these dynamic values out of Pocketsphinx and then perhaps comparing them to the previous values using some tolerance to decide whether audio needed re-decoding. I'm not against putting this kind of thing into
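Copying those numbers by hand can be avoided if the log text is available to the application. Here is a minimal sketch that pulls the last "to < ... >" vector out of log text and formats it for 'cmninit'. The `cmninit_from_log` helper is hypothetical, and how the log output gets captured in the first place is assumed rather than shown.

```ruby
# Hypothetical helper (not part of pocketsphinx-ruby): pull the updated CMN
# vector out of a Pocketsphinx "cmn_prior_update: to < ... >" log line and
# format it as a comma-separated string suitable for 'cmninit'.
def cmninit_from_log(log_text)
  # Take the last "to < ... >" line, since later updates supersede earlier ones.
  match = log_text.scan(/cmn_prior_update: to\s*<\s*([^>]*?)\s*>/).last
  return nil unless match

  match.first.split.join(',')
end

log = <<~LOG
  INFO: cmn_prior.c(131): cmn_prior_update: from < 40.00 3.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
  INFO: cmn_prior.c(149): cmn_prior_update: to < 48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49 >
LOG

cmninit = cmninit_from_log(log)
puts cmninit
```

The resulting string can be assigned with `decoder.configuration['cmninit'] = cmninit`, in the same way as the hand-written value above.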
@watsonbox Regarding This is verified in the example above where
@watsonbox Thanks for the help. I removed
# Apply initialization values from my machine
decoder.configuration['cmninit'] = %w{60.34 9.26 8.85 -8.44 -20.02 3.01 -6.74 -2.74 -9.94 -1.46 1.94 0.64 9.29}.join(',')
Based on my experience with this process, I'm more convinced that the default behavior for the gem would be greatly improved by a brute-force configuration as suggested above. This is mainly because although there's so much great work in this gem, the current out-of-the-box configuration doesn't really work as expected and gives the wrong impression of the otherwise powerful tools. It would suck if other developers were to pass it by because they were unaware of all the configuration nuances contained in this issue ticket. I'd personally rather see an implementation that favors decoder accuracy and configuration clarity over speed by default, and then allows for developers to improve execution speed via optimizations (configuration files or monkeying around with cutting and pasting
What do you think?
Why do you lazy init here? There are many points of failure during initialization, and I think they must be reported in the constructor.
Okay so I think there are two issues here:
I've created #12 to track point 1, so this issue is only concerned with point 2.
I agree that this is an important issue, and it has been mentioned by our users frequently. I believe we can fix this in pocketsphinx itself; it just needs some work. I'd suggest moving this issue to pocketsphinx.
@nshmyrev Yes, I agree that the best solution would be to resolve this in Pocketsphinx itself. In the meantime I've had a play with a CMNDecoder implementation which will repeat the decoding if the CMN values are not within a certain tolerance of the previous set. However, this is really just an experiment since any likely solution would need to address the same issues with
That would require pocketsphinx-ruby to cache each utterance for possible replay, which is not currently the case and leads me to think that this would be better done in the C library.
I did a workaround technique
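For readers following along, the tolerance check described above can be sketched roughly as follows. This is not the CMNDecoder code itself: `FakeDecoder`, its `cmn` accessor, and the tolerance of 2.0 are stand-ins invented for illustration, since reading the live CMN values out of Pocketsphinx is exactly the missing piece discussed in this thread.

```ruby
# Hypothetical helper: true when every coefficient of two CMN vectors differs
# by no more than `tolerance`. The vectors are plain arrays of floats.
def cmn_within_tolerance?(previous, current, tolerance)
  return false unless previous.length == current.length

  previous.zip(current).all? { |a, b| (a - b).abs <= tolerance }
end

# Stand-in for a decoder, used only to make the sketch runnable. It records
# each utterance it is asked to decode and reports a CMN vector that settles
# after the first pass.
class FakeDecoder
  attr_reader :cmn, :decoded

  def initialize
    @cmn = [40.0, 3.0, -1.0]
    @decoded = []
  end

  def decode(utterance)
    @decoded << utterance
    @cmn = [48.76, 9.26, -4.89]
  end
end

# Decode once; if the CMN estimate moved by more than the tolerance,
# replay the same (cached) utterance with the updated estimate.
def decode_with_cmn_check(decoder, utterance, tolerance: 2.0)
  before = decoder.cmn.dup
  decoder.decode(utterance)
  decoder.decode(utterance) unless cmn_within_tolerance?(before, decoder.cmn, tolerance)
end

decoder = FakeDecoder.new
decode_with_cmn_check(decoder, 'hello.wav') # CMN moved, so the utterance is replayed
decode_with_cmn_check(decoder, 'hello.wav') # CMN stable, so a single pass is enough
puts decoder.decoded.length
```

Note that the replay needs the whole utterance to still be available, which is the caching concern raised above for live audio.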
I'm seeing unexpected behavior while processing a fixed audio file. The hypothesis will occasionally change each time I decode the same file. I'm not sure if this is intended behavior, a byproduct of how the decoder works, or a configurable option (some sort of random/pseudorandom generator, noise-reduction, phonetic hash sorting issue, warm-up, etc).
Here's an example of what I'm seeing using a 16-bit, 16000Hz PCM Wave file containing the spoken word _"hello"_:
Anybody have any insight as to what might be happening?