I’m trying to make a Mycroft/Picroft respond in a voice like the classic BBC Dr Who baddie, a Dalek.
I started with the standard British male Mimic diphone voice; it’s already pretty robotic, so it’s well suited. For those who may be interested, I’ve altered it so that it does a passable Dalek impression, which involved two main steps:
The first is to break the response up into individually delivered words (as in ‘you … will … be … exterminated’) rather than running the words together as in human speech. To do this on Mycroft I’ve intercepted the code at the point where the response has been turned into text (/mycroft-core/mycroft/audio/speech.py, at ‘def handle_speak(event):’) and changed the code at the ‘else’ point. Before I show any code, I should say that, while I’ve been coding for many years, I’m a complete newbie to Python (and Mycroft/Picroft). If I’m treading on toes or infringing anything, please let me know or delete this. If you copy any of this you do so at your own risk (always make copies of the original files so that you can get back to the original code). This is what I changed it to:
# insert pauses ('. ') between words for that Dalek sound
utterance = utterance.replace(" ", ". ")
utterance = utterance.replace(",", ". . ")
utterance = utterance + ". "
mute_and_speak(utterance, ident, listen)
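To see what those three lines do to a sentence without running Mycroft, here is a minimal standalone sketch. The function name dalek_pauses is made up for this example and is not part of Mycroft; the body is just the replacements above.

```python
def dalek_pauses(utterance):
    """Insert '. ' pauses between words (hypothetical helper, not a Mycroft API)."""
    # every space becomes a full stop plus a space, so each word is its own sentence
    utterance = utterance.replace(" ", ". ")
    # a comma becomes a longer pause
    utterance = utterance.replace(",", ". . ")
    # finish with a final pause
    return utterance + ". "


print(repr(dalek_pauses("you will be exterminated")))
# prints 'you. will. be. exterminated. '
```

Note the order matters: the spaces are replaced first, so a comma followed by a space ends up as three pauses in a row.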
The second step was to add the Dalek electronic twang to the voice. After extensive Googling I found that this was originally created by passing the actor’s voice through a ‘ring modulator’ (?). On another site (which I can’t find at the moment, but whose author deserves much of the credit for this bit) I found that a ‘software only’ approximation of ring modulation is to merge a sine wave with the original voice, i.e. multiply the two together sample by sample. A sawtooth wave is a decent approximation of a sine wave and, I thought, might be faster, so I chose that instead. Mycroft was reluctant to let me add the code as a separate module so, again, I’ve had to butcher the original code, in this case ‘/mycroft-core/mycroft/tts/tts.py’ at ‘def _execute(self, sentence, ident, listen):’. The code was changed (at the point shown) to:
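if os.path.exists(wav_file):
    LOG.debug("TTS cache hit")
    phonemes = self.load_phonemes(key)
else:
    wav_file, phonemes = self.get_tts(sentence, wav_file)
vis = self.viseme(phonemes) if phonemes else None
# step size (tooth_w) and lowest point (tooth_h) of the sawtooth
tooth_w = 0.01
tooth_h = 0.0
try:
    # read the parameters and frames of the synthesised wav
    ifile = wave.open(wav_file, 'rb')
    channels = ifile.getnchannels()
    frames = ifile.getnframes()
    width = ifile.getsampwidth()
    rate = ifile.getframerate()
    audio = ifile.readframes(frames)
    ifile.close()
    # remove the original file
    os.remove(wav_file)
    # Convert buffer to int16 using NumPy
    audio16 = numpy.frombuffer(audio, dtype=numpy.int16)
    empty16 = []
    h = 1
    d = tooth_w
    for x in audio16:
        # scale each sample by the current height of the sawtooth
        empty16.append(x * h)
        h = h - d
        if h > 1 or h < tooth_h:
            d = d * -1
    # clip so the slight overshoot of h above 1 cannot overflow int16
    outarray = numpy.clip(empty16, -32768, 32767).astype(numpy.int16)
    # write the modulated samples back under the same file name
    dalek_file = wave.open(wav_file, 'wb')
    dalek_file.setnchannels(channels)
    dalek_file.setsampwidth(width)
    dalek_file.setframerate(rate)
    dalek_file.writeframes(outarray.tobytes())
    dalek_file.close()
except Exception as e:
    LOG.error("Dalek effect failed: " + repr(e))
self.queue.put((self.audio_ext, wav_file, vis, ident, l))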
I also had to import the modules this needs (wave and numpy).
The tooth_h and tooth_w variables set the shape of the sawtooth. tooth_h is its lowest point; I normally set it to 0, which means the sawtooth goes back and forth between 1 and 0. tooth_w is the value deducted or added at each sample; it should be between 0 and 1, preferably low, and changing it can alter the effect dramatically. There are hours of fun to be had messing about with tooth_w: there is a balance to be found between making the voice more ‘Dalek’ and keeping it intelligible.
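As a rough guide to what a given tooth_w means in practice, the sketch below estimates how many times per second the gain swings down and back up. The helper name sawtooth_frequency is made up for this example, and the 16000 Hz sample rate is only an assumption for illustration; the real value is whatever getframerate() reports for the voice. The estimate ignores the one-step overshoot at each turn, so treat it as approximate.

```python
def sawtooth_frequency(tooth_w, rate, tooth_h=0.0):
    """Approximate cycles per second of the gain wave (hypothetical helper)."""
    # one cycle = ramp down from 1 to tooth_h, then back up to 1,
    # moving tooth_w per sample
    samples_per_cycle = 2.0 * (1.0 - tooth_h) / tooth_w
    return rate / samples_per_cycle


# assumed sample rate of 16000 Hz, purely for illustration
for w in (0.005, 0.01, 0.02):
    print(w, round(sawtooth_frequency(w, 16000), 1))
```

Doubling tooth_w doubles the rate of the wobble, which may help when hunting for the balance between ‘Dalek’ and intelligible.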
My problem is that adding the code at this point involves reopening the .wav file, getting all the frames, processing each one, and then rebuilding the file. This adds a ‘noticeable’ (read irritating) delay to the response, probably at least doubling the original response delay. My understanding of diphone voices is that they are created by concatenating tiny speech sounds held in some sort of database inside the original flitevox voice file. What would make it much faster would be to sawtooth each of these tiny fragments and return them to the file, so that the Dalek voice was built in. Since each sawtoothed fragment would be the same size as the original, this shouldn’t be a problem, if I could get at them. So my question is: is there an easy way to do this, or a complete description of the structure of a diphone file somewhere, or some kindly genius out there who could help?
I posted this initially on GitHub but am reproducing it here in case anyone might be able to help. Cheers
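Separately from the voice-file question, some of the delay may come from looping over every sample in Python rather than from reopening the file. That is a guess; it hasn’t been timed on a Pi. Below is a minimal sketch of the same modulation done on the whole array at once with NumPy. The names triangle_envelope and dalek_modulate are made up for this example, and it assumes 16-bit samples as in the code above. Strictly speaking, the loop traces a triangle wave (it ramps down and back up rather than snapping back), so that is what the sketch builds. It doesn’t reproduce the loop’s one-step overshoot at each turn, so the output will be close to the loop’s but not sample-for-sample identical.

```python
import numpy


def triangle_envelope(n, tooth_w, tooth_h=0.0):
    """Return n gain values bouncing linearly between 1 and tooth_h."""
    span = 1.0 - tooth_h  # height of the wave; must be greater than 0
    # total distance moved after each sample, at tooth_w per sample
    travelled = numpy.arange(n) * tooth_w
    # fold that distance into one down-and-up cycle
    phase = numpy.mod(travelled, 2.0 * span)
    drop = numpy.where(phase <= span, phase, 2.0 * span - phase)
    return 1.0 - drop


def dalek_modulate(audio16, tooth_w=0.01, tooth_h=0.0):
    """Multiply int16 samples by the envelope and return int16 samples."""
    env = triangle_envelope(len(audio16), tooth_w, tooth_h)
    out = audio16.astype(numpy.float64) * env
    # the envelope never exceeds 1, so rounding is enough to stay in range
    return numpy.rint(out).astype(numpy.int16)


# a constant test signal makes the envelope easy to see
audio16 = numpy.full(201, 1000, dtype=numpy.int16)
print(dalek_modulate(audio16)[[0, 50, 100, 200]])
```

In the _execute change, a call like this would stand in for the for loop, with the file reading and writing left as they are.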