Would be very neat indeed. Voice is a tricky thing though. Not sure about Sphinx. One of my ex-professors was a digital signal processing guy at Westinghouse for many, many years, so I could ask him about voice recognition stuff. He was helping a group of undergrads with a project at one time: people sing into their computer, and it shows the notes and pitch they're singing on the screen in real time. They did it all in Java, using some open-source FFT libs. Pretty cool stuff actually.
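For anyone curious how the "sing a note, see the note" idea could work, here is a rough sketch. It is not the undergrads' Java code (I haven't seen it). It's plain Python with a hand-rolled radix-2 FFT instead of a library, and it uses a synthetic 440 Hz sine wave in place of a microphone. The names `dominant_frequency` and `note_name`, the sample rate, and the buffer size are all made up for illustration.

```python
import cmath
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest FFT bin, ignoring DC."""
    spectrum = fft(samples)
    half = len(samples) // 2  # bins above this mirror the lower half
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)

def note_name(freq_hz):
    """Map a frequency to the nearest equal-tempered note (A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

# Assumed values: a synthetic tone stands in for real microphone input.
sample_rate = 8192
n = 4096
tone = [math.sin(2 * math.pi * 440.0 * t / sample_rate) for t in range(n)]

freq = dominant_frequency(tone, sample_rate)
print(freq, note_name(freq))
```

With these numbers each FFT bin is 2 Hz wide, so the test tone lands exactly on a bin and comes out as A4. A real voice is messier (harmonics, noise, notes between bins), so a real-time version would probably need windowing and something smarter than "pick the biggest bin".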
I did a project with a prof back in the day on encoding messages into recorded voice. It worked almost in real time, but the DSP board we used was too crappy for the amount of calcs and transforms required, so it was delayed by about 2 seconds. The setup: get a recorded audio track, talk into a microphone while playing the track, and your voice is encoded into the stream. A computer picks that up and runs a program over it to remove the audio track (we had to "parse" that one first). It was pretty cool. Anyway, if you want to do voice recognition yourself, you need to look at fuzzy logic (yes, that's real) and things like that. Basically an "M" sound has a different shape from other sounds, but voices have different tones, pitches, inflections, and accents. You need to match the waveforms the same way your mind knows a car when it sees one, even if it has never seen one before (4 wheels, vaguely *this* shape, steering wheel, etc.).
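To make the "same shape, different voice" point concrete, here is a toy sketch. Note that it swaps in dynamic time warping (DTW), a classic template-matching technique, rather than fuzzy logic, because DTW shows the idea in a few lines. The sequences are made-up numbers, not real audio features, and `dtw_distance` and `normalize` are hypothetical helper names.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]

def normalize(seq):
    """Scale so the largest absolute value is 1 (ignores loudness)."""
    peak = max(abs(x) for x in seq) or 1.0
    return [x / peak for x in seq]

# Made-up "shapes": a stored template, the same shape spoken slower and
# louder, and an unrelated shape.
template = [0, 1, 2, 1, 0]
slow_loud = [0, 0, 2, 2, 4, 4, 2, 2, 0, 0]
other = [2, 0, 2, 0, 2]

same = dtw_distance(normalize(template), normalize(slow_loud))
different = dtw_distance(normalize(template), normalize(other))
print(same, different)
```

The stretched, louder copy scores as a perfect match because normalizing removes the loudness difference and DTW absorbs the speed difference, while the unrelated shape scores worse. Real recognizers work on spectral features rather than raw samples, but the matching idea is similar.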