CMU Sphinx
Authors: Ramyad, Sayuj (Updated by Bryce Plunkett; Fall 2019) Date: 7/25/2019
Set-Up (Fast)
Prerequisites:
- Python 2.7
- Numpy
- Ubuntu-based OS (Ubuntu, PopOS, ...)
Clone and install dependencies:
$ git clone https://github.com/parallel-ml/sphinxSpeech2Text
$ cd ./sphinxSpeech2Text
$ sudo chmod u+x ./install.sh
$ ./install.sh
Before compiling the C code:
- You might need to edit line 12 and the arecord command parameters to match your environment (such as changing the recording device); see the example below.
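As a quick sketch, you can list the available capture devices with arecord and then change the CARD= number in the -D argument of the arecord call (the same call appears in decode.c further down this page) to point at your microphone:
$ arecord -l
$ arecord --format=S16_LE --duration=5 --rate=16k -D sysdefault:CARD=1 --file-type=wav testfiles/noisy.wav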
Compile:
$ make
Before you can run the code:
- Set the SPEECH_RECOGNITION environment variable to point to the repository. For instance, if you cloned the repository into ~, then the SPEECH_RECOGNITION environment variable should be set to ~/sphinxSpeech2Text, as in the example below.
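For example, assuming a bash-style shell and the ~ clone location mentioned above:
$ export SPEECH_RECOGNITION=~/sphinxSpeech2Text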
Running the code:
You now have compiled the code. A demo can be run via ./decode. It will record your voice for 5 seconds, filter the noise, and try to parse commands from the limited corpus. The output text can be found in the /output folder and the raw and filtered recordings can be found in the /inputs folder.
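For example, from the repository root:
$ ./decode
$ ls ./output ./inputs    # decoded text, plus the raw and filtered recordings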
Details
Parallel-ml repo: https://github.com/parallel-ml/sphinxSpeech2Text
I used pocketsphinx to decode the audio files on the Raspberry Pis. I installed it on the Raspberry Pi by following these instructions: link
Then I used the pocketsphinx_continuous command-line tool. There are multiple options, such as -inmic, which will use the system's default microphone to detect and live-decode speech. You can also decode files using the -infile flag, passing the path of the file relative to the directory you are calling the command from.
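For example, live decoding from the system's default microphone looks roughly like this:
pocketsphinx_continuous -inmic yes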
You can change the dictionary and the language model that the program uses with the -dict and -lm flags. I created my own dictionary and language model using a tool I found online (link), specifically made for pocketsphinx. I did this so that we could reduce the language model size to improve performance and accuracy. I found that decoding was about 6x faster when I used my reduced dictionary, and the accuracy is better as well, but it loses flexibility.
The next steps are to expand the dictionary to include a greater variety of words and to increase the flexibility of commands that can be given to the Raspberry Pi. Below is terminal output that shows the difference in performance. The first output shows performance with the smaller dictionary and language model; the second uses the original dictionary that pocketsphinx comes with. It took more than 6x longer and was less accurate.
pi@n1:/Research$ ./decode.out
MOVE DOWN
MOVE UP
TURN TO ME
Time Elapsed: 2.049368
pi@n1:/Research$ ./decode.out
uh got caught
move up
learn to make
Time Elapsed: 2.049368
By default, it verbosely outputs every step while processing the audio, which made it hard to find the actual output, so I created a command that sends all the unwanted logs to a separate file and writes the decoded speech into its own file (shown after the basic example below).
Example of Running in Terminal
pocketsphinx_continuous -infile testfiles/Untitled.wav -dict dicts/8050.dic -lm dicts/8050.lm
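And, as described above, the same command with the verbose logs redirected to their own file and the decoded speech captured separately (this is the invocation used inside decode.c below):
pocketsphinx_continuous -infile testfiles/Untitled.wav -dict dicts/8050.dic -lm dicts/8050.lm 2> ./output/unwanted-stuff.log | tee ./output/words.txt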
Note: If you get an error such as error while loading shared libraries: libpocketsphinx.so.3, check your linker configuration and the LD_LIBRARY_PATH environment variable as described below:
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
Installation
sudo apt-get install bison
sudo apt-get install swig
cd sphinxbase-5prealpha
./autogen.sh
./configure
make
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
cd ../pocketsphinx-5prealpha
./autogen.sh
./configure
make
sudo make install
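As a quick sanity check after installation (optional), confirm the binary is on your PATH:
which pocketsphinx_continuous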
Example of Running with C
Compile decode.c with:
gcc -o decode decode.c
Contents of decode.c:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define BILLION 1000000000.0

int main(void) {
    struct timespec start, end;

    /* Note: this export only affects the sub-shell spawned by this system()
       call; LD_LIBRARY_PATH should really be set in the parent shell before
       running ./decode (see the exports above). */
    system("export LD_LIBRARY_PATH=/usr/local/lib");

    /* Record 5 seconds of 16 kHz audio from the chosen capture device. */
    system("arecord --format=S16_LE --duration=5 --rate=16k -D sysdefault:CARD=1 --file-type=wav testfiles/noisy.wav");
    system("echo done recording...");

    /* Filter the recording with the moving-average script below. */
    system("python testfiles/noiseClean.py");
    system("echo done cleaning...");

    /* Time only the decoding step. */
    clock_gettime(CLOCK_REALTIME, &start);
    system("\
    pocketsphinx_continuous \
    -infile testfiles/filtered.wav \
    -dict dicts/8050.dic \
    -lm dicts/8050.lm \
    2>./output/unwanted-stuff.log | tee ./output/words.txt");
    // pocketsphinx_continuous -infile testfiles/Untitled.wav -dict dicts/8050.dic -lm dicts/8050.lm 2>./output/unwanted-stuff.log | tee ./output/words.txt
    system("echo done decoding...");
    clock_gettime(CLOCK_REALTIME, &end);

    double time_spent = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / BILLION;

    /* Buffer sized generously; the original malloc(25) was too small for the
       formatted string. */
    char timerOutput[64];
    snprintf(timerOutput, sizeof timerOutput, "echo Time Elapsed: %f\n", time_spent);
    system(timerOutput);

    return 0;
}
System Mic Noise Fix
Recordings from the system/USB microphone are noisy. To clean them, here is the content of noiseClean.py, which applies a simple moving-average low-pass filter:
import contextlib
import math
import wave

import numpy as np

# Input/output paths; the input matches where the arecord call in decode.c records to.
fname = 'testfiles/noisy.wav'
outname = 'testfiles/filtered.wav'

cutOffFrequency = 400.0

# from http://stackoverflow.com/questions/13728392/moving-average-or-running-mean
def running_mean(x, windowSize):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[windowSize:] - cumsum[:-windowSize]) / windowSize

# from http://stackoverflow.com/questions/2226853/interpreting-wav-data/2227174#2227174
def interpret_wav(raw_bytes, n_frames, n_channels, sample_width, interleaved=True):
    if sample_width == 1:
        dtype = np.uint8   # unsigned char
    elif sample_width == 2:
        dtype = np.int16   # signed 2-byte short
    else:
        raise ValueError("Only supports 8 and 16 bit audio formats.")
    channels = np.fromstring(raw_bytes, dtype=dtype)
    if interleaved:
        # channels are interleaved, i.e. sample N of channel M follows sample N of channel M-1 in raw data
        channels.shape = (n_frames, n_channels)
        channels = channels.T
    else:
        # channels are not interleaved. All samples from channel M occur before all samples from channel M-1
        channels.shape = (n_channels, n_frames)
    return channels

with contextlib.closing(wave.open(fname, 'rb')) as spf:
    sampleRate = spf.getframerate()
    ampWidth = spf.getsampwidth()
    nChannels = spf.getnchannels()
    nFrames = spf.getnframes()

    # Extract raw audio from the (possibly multi-channel) wav file
    signal = spf.readframes(nFrames)
    channels = interpret_wav(signal, nFrames, nChannels, ampWidth, True)

    # get window size
    # from http://dsp.stackexchange.com/questions/9966/what-is-the-cut-off-frequency-of-a-moving-average-filter
    freqRatio = (cutOffFrequency / sampleRate)
    N = int(math.sqrt(0.196196 + freqRatio**2) / freqRatio)

    # Use moving average (only on first channel)
    filtered = running_mean(channels[0], N).astype(channels.dtype)

    # Write the filtered mono signal back out
    wav_file = wave.open(outname, "w")
    wav_file.setparams((1, ampWidth, sampleRate, nFrames, spf.getcomptype(), spf.getcompname()))
    wav_file.writeframes(filtered.tobytes('C'))
    wav_file.close()
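The script can also be run on its own from the repository root, assuming a recording already exists at testfiles/noisy.wav (where the arecord command in decode.c writes); it produces testfiles/filtered.wav:
python testfiles/noiseClean.py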