Use Nvidia Riva as TTS/STT

I have an Nvidia AGX Xavier, and recently the Riva speech services Docker container has been made available. I have been testing it for a while and it is by far the most accurate and responsive speech recognition service I have tried. What would be required for me to adapt the code to make use of it? I feel fairly certain that I can make a crossover script to accept whatever API calls Mycroft makes to something like Mozilla TTS, but is there a more direct way? Thanks!

I own a Xavier AGX too and had a similar idea. Unfortunately Nvidia likes to lock you into their ecosystem. I understand Riva has some kind of API that could be used to build an "adapter" for Mycroft calls, but I found it to be overly complex, let alone setting up a Triton server to run it.

I tried to run a NeMo-ASR STT model "standalone" as a mini-server, but it didn't work out, and the developer team wasn't too helpful when "3rd party developers" asked for new features.

Now I use Mimic3 for TTS (or Coqui-TTS for even better quality). I am still searching for a good STT solution; maybe I'll give NeMo-ASR another try…

Yeah, I have been speaking to someone on the Rhasspy site who also seems to be finding Nvidia a little unfriendly. When it's running it's supposedly really good, even though a Xavier AGX is at the far extreme of the Raspberry Pi price range.

I am not sure why Mycroft supplies specific STT/TTS modules rather than just a framework that lets you 'wire in' any, though, since STT simply takes audio in and returns the text of what was said, and that is all it does.
This refactoring and rebranding of permissively licensed code has always confused me, and it seems a waste of precious dev time compared to building a framework that allows any.

Mycroft has a plugin system to allow integration with any 3rd party STT/TTS.

We also have Mycroft-compatible plugins for OpenVoiceOS, with the advantage that they can be used standalone in any other project.

I had never heard of Riva before, but I don't see why it wouldn't be compatible.

It's the new Nvidia framework to replace NeMo.

I did not bother to respond at first, as I am starting to worry that everything I say about Mycroft comes across as negative.
The plugins assume an all-in-one design where the plugged-in item runs as Python code on the same host, and I am not a fan.
I feel that each element should be standalone and connected by a network layer, so that we are not later trying to retrofit a distributed model onto a base designed to be singular.
If you have multiple hosts, servers or containers, you should be able to link modules without embedding them in Python code.
Mycroft has always had this all-in-one focus rather than a distributed infrastructure, which is a shame, as it doesn't naturally scale up, whilst a distributed infrastructure of multiple containers can always be scaled down to a single host.
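
To illustrate what I mean, the STT element could be a tiny standalone service that anything on the network can call. This is only a rough sketch, assuming the riva.client calls from the Riva docs plus Flask; the /stt route, port and response shape are arbitrary choices of mine:

import riva.client
from flask import Flask, request, jsonify

app = Flask(__name__)

# One connection to the Riva server, shared by all requests
auth = riva.client.Auth(uri="localhost:50051")
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig()
config.language_code = "en-US"
config.max_alternatives = 1
config.enable_automatic_punctuation = True
config.audio_channel_count = 1

@app.route("/stt", methods=["POST"])
def stt():
    # Body is expected to be a mono wav file; Riva detects encoding and sample rate
    response = asr.offline_recognize(request.data, config)
    text = ""
    if response.results and response.results[0].alternatives:
        text = response.results[0].alternatives[0].transcript
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Any client, whether a Pi, a container or another server, could then POST a wav and get text back without embedding anything Riva-specific.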

I just copied a chunk out of the example transcription script, and as far as I can tell I have two blocks of code: audio in, returns a string of what was said; and string in, text to speech reads it. However, I have no idea where to put this. I tried editing the source STT file, using the DeepSpeech server module as a starting point, and I have it successfully making a call to the inference API, but there appears to be more that I am missing.

The code from the Nvidia Riva docs is basically this for speech to text:


import riva.client

# Connect to the Riva server (gRPC, default port 50051)
auth = riva.client.Auth(uri='localhost:50051')
riva_asr = riva.client.ASRService(auth)

# Read the audio clip to transcribe (raw wav bytes)
with open("audio_sample.wav", "rb") as fh:
    content = fh.read()

# Set up an offline/batch recognition request
config = riva.client.RecognitionConfig()
#config.encoding = riva.client.AudioEncoding.LINEAR_PCM  # Audio encoding can be detected from wav
#config.sample_rate_hertz = 0                            # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return
config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
config.audio_channel_count = 1                    # Mono channel

response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)
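
The text to speech half is basically this (a sketch from the same docs; the voice name depends on which voices your Riva server has deployed, so treat it as a placeholder):

import wave
import riva.client

auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

resp = riva_tts.synthesize(
    "Hello from Riva",
    voice_name="English-US.Female-1",            # depends on the voices deployed on your server
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=44100,
)

# resp.audio is raw 16-bit PCM; wrap it in a wav container to save or play it
with wave.open("tts_output.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(44100)
    out.writeframes(resp.audio)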

Seems fairly simple to implement; check the Chromium plugin for an example.

Basically, init the riva_asr object in the __init__ method, reading any relevant values from self.config, and return the transcript in execute.

self.config comes from mycroft.conf and can be used for any values the end user may want to modify, such as the host URL in your case.
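
A minimal sketch of that shape, assuming the riva.client calls from the snippet above and Mycroft's STT base class; the class name and the mycroft.conf keys shown in the comment are just illustrative choices:

import riva.client
from mycroft.stt import STT

class RivaSTT(STT):
    """Offline Riva recognition wrapped as a Mycroft STT backend."""

    def __init__(self):
        super().__init__()
        # e.g. in mycroft.conf:  "stt": {"module": "riva", "riva": {"uri": "localhost:50051"}}
        uri = self.config.get("uri", "localhost:50051")
        auth = riva.client.Auth(uri=uri)
        self.asr = riva.client.ASRService(auth)

        self.recognition_config = riva.client.RecognitionConfig()
        self.recognition_config.language_code = self.config.get("lang", "en-US")
        self.recognition_config.max_alternatives = 1
        self.recognition_config.enable_automatic_punctuation = True
        self.recognition_config.audio_channel_count = 1

    def execute(self, audio, language=None):
        # audio arrives as a speech_recognition.AudioData; Riva accepts its wav bytes directly
        response = self.asr.offline_recognize(audio.get_wav_data(), self.recognition_config)
        if response.results and response.results[0].alternatives:
            return response.results[0].alternatives[0].transcript
        return ""

From there it is mostly packaging: expose the class the way the Chromium plugin does and point the stt section of mycroft.conf at it.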