# From Here to There: 2016 Mycroft Roadmap
One of the great things about what we are doing together as the Mycroft community is the breadth of the problem we are tackling and the huge impact it can have on the world. This is incredibly exciting, and daunting at the same time. But as a community we can break this down into smaller core technologies and tasks that improve the whole as each of the parts iteratively improves.
The Mycroft of today is built upon five Core Technologies:
- Wake Word
- Text to Speech (Mimic TTS)
- Intent Parsing (Adapt)
- Speech to Text (OpenSTT)
- Framework (aka mycroft-core)
The first four Core Technologies have intrinsic value as stand-alone tools, enabling other efforts outside of Mycroft. The last pulls the rest together to build something even greater, enabling the AI Assistant we are all working towards.
In addition to those Core Technologies, the Mycroft experience is strengthened by additional pieces we are developing which make the whole valuable to everyone. Those are:
- Mycroft Backend (API key management, deep learning dataset management, device management)
- Skills Ecosystem
- Individual Skills
- “Enclosures” – Mycroft Mark 1, Linux desktops, Android, cloud/web, …
Finally, there is research that will be valuable as the above technologies mature. This is incredibly vast, but I think several specific areas are worth noting as they will rapidly become invaluable:
- Machine learning frameworks
- Emotional understanding
Sections below will go through each of these in more detail discussing how they each can improve. As a community effort each can and will improve at different rates, but the whole will benefit regardless of the order of the improvements.
State of the Art
Today’s Mycroft is a strong technical preview, but certainly not the AI assistant promised by sci-fi masters like Heinlein, Roddenberry or Scalzi. Mycroft is “fun and interesting” and will soon hit the milestone of “very useful”; both important milestones on the path to becoming a trusted and pervasive AI for Everyone.
The maturity of the core technologies varies. But the pieces being developed for Mycroft are currently in demand and already proving useful outside of the Mycroft system. Sonar GNU Linux, for example, is incorporating Mimic TTS into their Linux distribution for the visually impaired.
Mycroft, Inc. – a subset of the Mycroft community
Although not part of the technical roadmap, it is important to understand the distinction between Mycroft AI, Inc. and the Mycroft community. These two are closely related and often intermingled, but distinctly different.
As we began developing Mycroft, we quickly realized that what we wanted to build was too important to remain in the hands of one private entity. We drew a line between what we had created that was core technology and what we needed to build to support our interactive assistant vision. The core technology was then made Open Source, which allows others to leverage it in ways we can’t even anticipate, and to contribute back to help everyone.
Mycroft AI, Inc. remained necessary for several reasons. One was the practical need for some entity to sign contracts and pay for the servers and services required by the unified Mycroft ecosystem we imagined. Another was to create a dedicated core team that could focus on building Mycroft technology, not distracted by a competing “day job”. And just as important as building the technology was the need to let the world know what has been built, for many who can benefit from it are not typical Open Source users.
In the subsequent sections you will see many bullet points of work for each individual technology. Each of these bullet points can be tackled by an individual or small team. Some of these will be performed by Mycroft AI, Inc., but we expect many will be tackled by individuals with unique talents and interest in the particular area. In those cases, Mycroft AI, Inc. will just help coordinate and support that work.
Over the next several weeks and months we will flesh out frameworks to help organize the community and achieve these goals. Of course, individuals and sub-teams are welcome to use whatever tools they are most comfortable with for organization and development, but if the work of identifying and building tools has already been done, people can focus on the real creative and important work.
Individual Technologies and near-term goals
Wake Word
The current system for waking up Mycroft is good. But as the kick-off point for all voice interactions, it needs to be great. Work in these areas will all help achieve that:
- Improve false positive/false negative triggers
- Easy wake word customization
- Voice registration / voice printing for user identification
- Extend for cloud-less commands (“Stop”, “Pause”, etc.)
- Come up with a catchy name!
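To make the first bullet concrete, here is a minimal sketch of how false positive and false negative rates could be measured for a wake word detector at different sensitivity thresholds. The `Trial` records, the scores and the thresholds are all made-up illustrations, not data or code from the actual Mycroft wake word system.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    score: float        # hypothetical detector confidence in [0, 1]
    is_wake_word: bool  # ground truth: was the wake word really spoken?


def error_rates(trials, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold."""
    positives = [t for t in trials if t.is_wake_word]
    negatives = [t for t in trials if not t.is_wake_word]
    # A false negative is a real wake word that scored below the threshold.
    false_neg = sum(1 for t in positives if t.score < threshold)
    # A false positive is other audio that scored at or above the threshold.
    false_pos = sum(1 for t in negatives if t.score >= threshold)
    fpr = false_pos / len(negatives) if negatives else 0.0
    fnr = false_neg / len(positives) if positives else 0.0
    return fpr, fnr


# Made-up evaluation data for illustration only.
trials = [
    Trial(0.95, True), Trial(0.80, True), Trial(0.40, True), Trial(0.70, True),
    Trial(0.10, False), Trial(0.65, False), Trial(0.30, False), Trial(0.20, False),
]

for threshold in (0.25, 0.5, 0.75):
    fpr, fnr = error_rates(trials, threshold)
    print(f"threshold={threshold:.2f} false_pos={fpr:.2f} false_neg={fnr:.2f}")
```

Raising the threshold trades false triggers for missed wake-ups, so any improvement claim is best stated as a pair of rates measured on the same recordings.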
Speech to Text (OpenSTT)
Of all the Mycroft technologies, OpenSTT is still in the earliest stages of development. Initial testing has been performed on top of the Kaldi ASR engine with high-quality results in English. Further validation needs to occur here, however, to achieve the type of high-accuracy, low-latency results users have come to expect from proprietary technologies like Siri or Google Assistant.
Other pieces needed to build a strong STT engine:
- Training data collection and preparation
- Multi-language support
- Language/Accent detection
- Mixed language support
Intent Parsing (Adapt)
Accurately determining user intent from conversational interaction is critical to providing Mycroft users with a high-quality user experience. The current Adapt engine uses a rules-based approach, which is an excellent solution for many applications but is just the beginning. Improvements to come include:
- Implementing deep learning to run in parallel with known entity approach
- Interface for training deep learning intents
- Disambiguation of training sets
- Designing conversational interaction to refine intents
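For readers unfamiliar with the rules-based, known-entity idea mentioned above, the following is a minimal sketch of it in plain Python. It is not Adapt's API: the vocabulary, the intent definitions, the function names and the confidence formula are all assumptions made up for illustration.

```python
import re

# Hypothetical vocabulary: entity type -> known words or phrases.
VOCAB = {
    "WeatherKeyword": ["weather", "forecast"],
    "TimeKeyword": ["time", "clock"],
    "Location": ["seattle", "kansas city", "berlin"],
}

# Hypothetical intents: which entity types are required or optional.
INTENTS = {
    "WeatherIntent": {"require": ["WeatherKeyword"], "optional": ["Location"]},
    "TimeIntent": {"require": ["TimeKeyword"], "optional": ["Location"]},
}


def tag_entities(utterance):
    """Find the first known phrase of each entity type in the utterance."""
    text = utterance.lower()
    found = {}
    for entity, phrases in VOCAB.items():
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found[entity] = phrase
                break
    return found


def determine_intent(utterance):
    """Return the best matching intent as a dict, or None if no rule fits."""
    entities = tag_entities(utterance)
    total_words = len(utterance.split())
    best = None
    for name, rule in INTENTS.items():
        # A rule only fires when every required entity was found.
        if not all(e in entities for e in rule["require"]):
            continue
        used = {e: entities[e]
                for e in rule["require"] + rule["optional"] if e in entities}
        # Assumed confidence: fraction of the utterance explained by entities.
        matched_words = sum(len(phrase.split()) for phrase in used.values())
        confidence = matched_words / total_words if total_words else 0.0
        if best is None or confidence > best["confidence"]:
            best = {"intent": name, "entities": used, "confidence": confidence}
    return best


print(determine_intent("what is the weather in Kansas City"))
print(determine_intent("tell me a joke"))
```

A rules-based parser like this is predictable and needs no training data, but it only recognizes phrases it has been told about, which is where a deep learning parser running in parallel could help.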
Text to Speech (Mimic TTS)
Mimic is already good, but still not indistinguishable from a human. Yet.
- Build more voices in English with our partner VocaliD
- Support for global languages
- Enhance expressiveness (SSML), prosody, cadence and tone
- Performance: Phrase caching
- Performance: Pre-loaded Mimic (PyMimic)
Framework (aka mycroft-core)
- Enhance Skills API (much more on this below)
- Mechanism to run most of core in cloud for limited power systems
- Performance: Pre-process based on STT hypothesis
- Securing the framework: data isolation and preventing malicious activity
Mycroft Backend
- Skill submission/deployment system (Skill Store)
- Account/device/skill management and customization
- Generalized OAuth mechanism for Skills
- eCommerce support
- Access to alternative STT/TTS engines
- Self service API access for organizations/developers
Enclosures
The Enclosure is the embodiment of a Mycroft – the portal that lets you access your personal Mycroft. Each of these embodiments has unique capabilities – the knob on top of a Mycroft Mark 1, the screen of an Ubuntu desktop, the GPS and accelerometers of an Android phone. Enclosures should hide these differences for most Skills that don’t require special hardware, while allowing other Skills to exploit those unique capabilities.
Currently mycroft-core includes some assumptions about the enclosure. Efforts to support Android and Ubuntu have required changes to the core. These efforts need to be unified.
- Virtualize the concept of Enclosure
- Ubuntu/Fedora Desktop enclosure
- Android App enclosure
- iOS enclosure
- Web enclosure
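One way to picture a virtualized Enclosure is a common interface plus a declared set of capabilities that Skills can check against. The sketch below is a guess at what that could look like, not the mycroft-core design: every class name, method and capability string here is hypothetical.

```python
from abc import ABC, abstractmethod


class Enclosure(ABC):
    """Hypothetical common interface that every enclosure would implement."""

    capabilities = frozenset()

    @abstractmethod
    def speak(self, text):
        """Deliver spoken output in whatever way this enclosure can."""

    def supports(self, required):
        """True if this enclosure offers every capability in `required`."""
        return set(required) <= self.capabilities


class Mark1Enclosure(Enclosure):
    capabilities = frozenset({"audio", "knob"})

    def speak(self, text):
        return "[Mark 1 speaker] " + text


class DesktopEnclosure(Enclosure):
    capabilities = frozenset({"audio", "screen"})

    def speak(self, text):
        return "[desktop speaker] " + text


class AndroidEnclosure(Enclosure):
    capabilities = frozenset({"audio", "screen", "gps", "accelerometer"})

    def speak(self, text):
        return "[phone speaker] " + text


class Skill:
    """A Skill declares only the special hardware it cannot work without."""

    def __init__(self, name, required=()):
        self.name = name
        self.required = frozenset(required)


def runnable_skills(enclosure, skills):
    """Names of the Skills whose requirements this enclosure satisfies."""
    return [s.name for s in skills if enclosure.supports(s.required)]


skills = [Skill("time"), Skill("navigation", ["gps"]), Skill("volume_knob", ["knob"])]
for enclosure in (Mark1Enclosure(), DesktopEnclosure(), AndroidEnclosure()):
    print(type(enclosure).__name__, runnable_skills(enclosure, skills))
```

Skills with no special requirements run everywhere, while hardware-specific Skills are simply filtered out on enclosures that cannot support them, so the core needs no per-device assumptions.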
Skills Ecosystem
The key to the Mycroft system is the Skills, as its value will grow with every Skill added. Ultimately these Skills will be built by people who have yet to hear about Mycroft, and the process of building them needs to be as easy as possible while still giving them the power to build anything they can imagine.
- Skill API (see entire section below)
- More well documented examples
- Tools to help Skill creation (forums, Skill Ideation Hub)
- Automated testing/validation systems
Individual Skills
This section could be pages long. Here are some of the basics to get things started.
- Google Calendar
Research
These pieces aren’t strictly needed for the first stages of Mycroft. But work in these areas can quickly be applied to enhance the framework.
- Machine learning framework
- Emotional understanding (analyzing inflection, words) as context
- Integrating other input (camera, biometrics, networks) for context
Skill API
The importance of Skills to Mycroft cannot be overstated. The utility of the system is dictated by both the number and the quality of Skills. And the rate at which those are created is dictated by the elegance and power of the tools provided to the Skill authors.
- Skill manager to negotiate intent keyword overlap (disambiguation)
- Break intent handling into two stages: parsing with confidence levels; and performing action
- Support multiple STT interpretations with probabilities
- Groups and intra-group communication
- Provide access to context (user, location, history, environment)
- Bluetooth user recognition
- “Converse” stage for active Skills before normal intent parsing
- Conversation scripting/branching
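Two of the items above, splitting intent handling into a parsing stage and an action stage, and supporting multiple STT interpretations with probabilities, fit together naturally. Here is a minimal sketch of that combination. The `parse`/`act` method names, the confidence values, the scoring rule (probability times confidence) and the `min_score` cutoff are all assumptions for illustration, not an existing Mycroft API.

```python
class WeatherSkill:
    name = "weather"

    def parse(self, text):
        """Stage 1: return (confidence, data) without side effects, or None."""
        if "weather" in text.lower().split():
            return 0.9, {"topic": "weather"}
        return None

    def act(self, data):
        """Stage 2: perform the action for a parse that won the negotiation."""
        return "Looking up the weather"


class TimerSkill:
    name = "timer"

    def parse(self, text):
        if "timer" in text.lower().split():
            return 0.8, {"topic": "timer"}
        return None

    def act(self, data):
        return "Starting a timer"


def dispatch(hypotheses, skills, min_score=0.3):
    """Score every (STT hypothesis, Skill) pair, then act once on the best.

    `hypotheses` is a list of (text, probability) pairs from the STT engine.
    """
    best = None
    for text, probability in hypotheses:
        for skill in skills:
            parsed = skill.parse(text)
            if parsed is None:
                continue
            confidence, data = parsed
            score = probability * confidence
            if best is None or score > best[0]:
                best = (score, skill, data)
    if best is None or best[0] < min_score:
        return None  # nothing confident enough; the caller could ask the user
    _, skill, data = best
    return skill.name, skill.act(data)


skills = [WeatherSkill(), TimerSkill()]
hypotheses = [("set a time are", 0.5), ("set a timer", 0.4), ("what is the whether", 0.1)]
print(dispatch(hypotheses, skills))
```

In this example the most probable transcription matches no Skill, but the second one does, so keeping the alternatives around rescues the request. Because parsing has no side effects, every Skill can safely be asked before any single action is performed.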
General API Tools
- Alternative keyword definition (flexible, simpler, non-Regex)
- Easy state save/load
- Parsing tools: Date/time extraction, Location extraction, Number extraction
- Formatting tools: ‘Nice’ time/date formatting, spoken numbers
- Easy HTTP GET (with caching), POST and DELETE
- Built-in JSON and XML parsing tools
- Generalized monitoring of GPIOs (callbacks on state change)
- Callbacks for non-voice events (time-based, user arrival, etc.)
- Generalized time events
- Enclosure capabilities exploration/access
- Metadata description for skill option editing
- Required Enclosure capabilities system
- Constant listening for a few seconds after being woken up or any interaction
- Nearby screen association architecture
- “What is visible” context
- HTML serving framework
- Linux box (Openelec based?)
- Roku adaptor
- Chromecast adaptor
- Other adaptors
Longer Term Goals
Extensive long-term planning has limited value – unanticipated change is inevitable. But thinking about where you can go next in general terms is important.
As an open, auditable, trusted collective we can leverage all of our voice interactions to rapidly accumulate volumes of data. Unrecognized requests can feed back into the system, associated with the recognized requests that immediately follow them. Users can volunteer to read known phrases to assist in building a corpus of accents. And it all can be stored in an anonymous manner but made available to all for research and unexpected uses.
Using the above data, machine learning frameworks can feed back into the STT systems. TTS may also be able to leverage this corpus of known human pronunciations to create more natural-sounding speech.
As we move beyond support for just English, the TTS engines can be combined with the translations in the Skill vocabularies to begin automatic “translation” of commands. So a Skill that has been coded with only English in mind will still be able to handle commands spoken in German.
This truly just scratches the surface of possibilities.
Mycroft has been born and we take the nurturing of this technology seriously. They say it takes a village to raise a child, and this extraordinary entity is going to take much more than that. We will endeavor to build the relationships, frameworks, technologies and systems needed to allow public and private organizations, businesses, groups and individuals to participate in this effort.