If you want to run STT or TTS anywhere near real time, you need a GPU. Even an i7 can’t come close to the performance of a GPU for workloads like this.
At the same time, the underlying technology is changing very rapidly. With DeepSpeech, the training sessions this spring took 2 weeks on 2 multi-GPU machines. After some major rearchitecture this summer, it can now run a training session against even more data on 1 of those machines in 3 days. So… speccing exact hardware is probably not the way to start. I think we should begin building the technology and see what is needed to support it once we have all the basic pieces in place.