Since convolutional neural networks work well for image recognition, I recast the speech recognition task as an image recognition task by generating spectrograms from the .wav files. I then trained a custom CNN on those spectrograms and it worked decently, achieving 85% accuracy.
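To make the .wav-to-image step concrete, here is a minimal sketch of one way to turn audio into a 2D array that a CNN can consume. It is not necessarily how I produced my images: the function names `read_wav_mono` and `log_spectrogram`, the 16-bit PCM restriction, and the frame length and hop size are all assumptions chosen for illustration.

```python
import wave

import numpy as np


def read_wav_mono(path):
    """Read a 16-bit PCM .wav file; return (sample_rate, float samples in [-1, 1))."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Average interleaved channels down to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return rate, samples


def log_spectrogram(samples, frame_len=256, hop=128, eps=1e-10):
    """Return a (freq_bins, frames) log-magnitude spectrogram.

    frame_len and hop are illustrative defaults (assumptions), not tuned values.
    """
    if len(samples) < frame_len:
        # Zero-pad very short clips so at least one frame exists.
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = 1 + (len(samples) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [samples[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # FFT of each frame; keep only the non-negative frequencies.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log scaling compresses the dynamic range; eps avoids log(0).
    return np.log(magnitude + eps).T
```

The resulting array can be treated as a single-channel image: rows are frequency bins, columns are time frames. Before feeding it to a CNN you would typically normalize it and pad or crop every clip to the same number of frames so that all inputs share one shape.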