Improving Mycroft through Metrics: The Mycroft Benchmark

Originally published at: https://mycroft.ai/blog/the-mycroft-benchmark/

Machine Learning requires data to improve. The best source of that data is our Community members who Opt-In to share data from their interactions with Mycroft. That allows us at Mycroft AI to build Open Datasets of tagged speech to improve Wake Word spotting, Speech to Text, and Text to Speech. But to improve the software that utilizes those engines, we need a different kind of data to analyze. How does Mycroft compare to other voice assistants and smart speakers? How does Mycroft itself improve over time? How can you help?

Benchmarking Mycroft

A benchmark is important for a number of reasons, first and foremost that it gives us a baseline of Mycroft’s performance on a given date that we can compare against once changes are in place. Then, as needed, we can compare different configurations of Mycroft, new platforms and hardware for Mycroft, and our competition.

Over the last couple of weeks, we’ve been preparing and conducting a repeatable benchmark of Mycroft against other voice assistants in the field. This will be a new addition to the Mycroft Open Dataset; not tagged speech or intent samples, but a standard process and set of metrics that anyone can use to measure Mycroft and other voice assistants quantitatively. Below, I’ll report the results of the first iteration, in which we compared a Mycroft Mark I to a first-generation Amazon Echo and a Google Home.

The Process

To conduct this benchmark, we had to put together a series of questions, which wasn’t as easy as it sounds. Voice assistants are an emerging technology, so no industry-standard benchmark exists yet. So who better to set that standard than the Open player? We prepared a starter set of 14 questions based on the observed usage of Skills by Opted-In Mycroft users (more on that later), taking into consideration industry reports on the most-used Skills from sources like Voicebot. That first run of questions was:
  1. How are you?
  2. What time is it?
  3. How is the weather?
  4. What is tomorrow’s forecast?
  5. Wikipedia search Abraham Lincoln
  6. Tell me a joke
  7. Tell me the news
  8. Say “Peter Piper picked a peck of pickled peppers”
  9. Set volume to 50 percent
  10. What is the height of the Eiffel Tower?
  11. Play Capital Cities Radio on Pandora
  12. Who is the lead singer of the Rolling Stones?
  13. Set a 2 minute timer
  14. Add eggs to my shopping list
That list has already evolved a bit to make the benchmark more objective. For one, this test was originally meant to check intent parsing along with response times; in the future, those will be split into separate benchmarks. As an example, Google does not return any response when asked to “Wikipedia search” a topic, and both Mycroft and Alexa only accept volume adjustments between 0 and 10. For next time, the list will probably look more like this:
  1. Tell me about Abraham Lincoln
  2. What is the height of the Eiffel Tower
  3. Who is the lead singer of the Rolling Stones?
  4. How is the weather?
  5. What is tomorrow’s forecast?
  6. Play Capital Cities Radio on Pandora
  7. Play Safe and Sound by Capital Cities on Spotify
  8. Set a 2-minute timer
  9. Set an alarm for tomorrow morning at 7:00
  10. What time is it?
  11. Tell me the news
  12. Add eggs to my shopping list
  13. Set volume to 5
  14. How are you?
  15. Tell me a joke
  16. Say/repeat/Simon says “[random sentence]”
To make sure we could properly check the timing of each response, I recorded the responses on video. I set the devices next to each other in the same room on the same Wi-Fi network and ran a network speed test on a laptop on that network for reference. Once all that was taken care of, the requests began.

Issuing all the requests to each assistant took about 45 minutes. To get the best idea of when requests ended and responses started, I imported the audio into Audacity and used the waveforms to determine five points:

  • The Wake Word
  • The end of the request
  • The beginning of the response
  • The start of 'real info'
  • The end of the response
About ‘Real Info’ - We wanted to see whether the other assistants in the field pad their responses with cached phrases to buy time to synthesize the real info of the response. This seems like an obvious way to improve the perception of responsiveness. Hearing “Right now in Kansas City” - which can easily be pre-generated, cached, and streamed at the start of a weather response - certainly doesn’t detract from the experience, though it does mean an extra second or so until you actually hear the temperature and weather. Deciding what is padding and when ‘real info’ starts is a subjective call right now, but we’ll be trying to define it well as things progress.
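
To make the arithmetic concrete, here is a minimal sketch of how those five timestamps turn into the metrics reported below. The field names and sample values are purely illustrative, not the actual dataset.

```python
# Minimal sketch: turn the five annotated timestamps (seconds from the start
# of the recording) into the timing metrics discussed below.
# The field names and sample values are illustrative, not the real dataset.
from dataclasses import dataclass

@dataclass
class Interaction:
    question: str
    wake_word: float        # Wake Word spoken
    request_end: float      # end of the request
    response_start: float   # beginning of the response
    real_info_start: float  # start of 'real info'
    response_end: float     # end of the response

    @property
    def time_to_response(self):
        return self.response_start - self.request_end

    @property
    def time_to_real_info(self):
        return self.real_info_start - self.request_end

    @property
    def padding(self):
        return self.real_info_start - self.response_start


# A hypothetical weather request, annotated from the Audacity waveform
sample = Interaction("How is the weather?", 0.0, 2.1, 4.0, 5.2, 11.8)
print(f"Time to Response:  {sample.time_to_response:.2f} s")
print(f"Time to Real Info: {sample.time_to_real_info:.2f} s")
print(f"Padding:           {sample.padding:.2f} s")
```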

The Results

Now to the good stuff, or in this case, the “room for improvement” stuff. Here are the results from the first Mycroft Benchmark.

Time to Response

One of the biggest points we wanted to track was the ‘Time to Response.’ In this context, that means the time from the end of the spoken request to the beginning of an audible response. We tracked that across the 14 questions using the new Mimic2 American Male voice. We found that Mycroft currently responds an average of 3.3x slower than Google and Amazon. On average for our sample, Alexa responded to requests in 1.66 seconds, Google Assistant responded in 1.45 seconds, and Mycroft in 5.03 seconds.

The Time to Response information from the first Mycroft Benchmark.

Time to Real Info

The next thing we decided to track was when the voice assistant’s response actually began answering the question it was asked. As mentioned above, this is a subjective decision for the time being, but it still offers some interesting data to look at. On average, Alexa started providing real info 3.02 seconds after the request finished. Google provided real info at 3.55 seconds. Mycroft started providing real info at 5.7 seconds.

The Time to Real Info chart from the first Mycroft Benchmark.

We can see that the spread is a good bit tighter here, and in one case, “Tell me the news,” Mycroft actually comes out on top. My presumption is that Mycroft’s competition is adding some phrasing to the beginning of responses that require API hits or pulling up a stream. This run also included the outlier of Google’s response to the news query: a nearly 16-second notification about being able to search for specific topics or news sources. I also took a quick look at the time between the response starting and when the assistant provided Real Info. On average, Alexa spoke for 1.36 seconds before providing Real Info. Google Assistant spoke for 2.1 seconds before Real Info. Mycroft spoke for 0.66 seconds before providing Real Info.

Where to go from here

This benchmark was especially helpful in comparing Mycroft objectively to Google and Amazon. Eventually, we’ll be able to broaden it to others in the space. Now the trick is figuring out how to improve the experience, then returning to this benchmark periodically to reassess.

For improvements to the experience, we have another source of metrics from which we’ll be able to get actionable information: the Mycroft Metrics Service.

Our Opted-In Community Members have timing information for their interactions with Mycroft anonymously uploaded to a database for analysis. This is how we determined the Mycroft Community’s most-used Skills (that is, the Opted-In users’ most-used Skills) for the 14 questions of the Benchmark. Beyond Skill usage, we have visibility into which steps are carried out in an interaction and how long each step takes. From there we can determine which steps of a Mycroft interaction take the longest, and work to speed them up or find creative improvements to the Voice User Experience.
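
As a rough illustration of the kind of analysis that enables, the sketch below ranks interaction steps by average duration. The record layout and step names are assumptions for the example, not the actual Metrics Service schema.

```python
# Sketch: find the slowest steps of an interaction from timing records.
# The record structure and step names here are assumed for illustration;
# they are not the actual Metrics Service schema.
from collections import defaultdict
from statistics import mean

records = [
    # (interaction_id, step_name, duration_in_seconds)
    ("abc123", "stt", 1.9),
    ("abc123", "intent_parsing", 0.4),
    ("abc123", "skill_handler", 0.8),
    ("abc123", "tts_synthesis", 1.7),
    ("def456", "stt", 2.2),
    ("def456", "intent_parsing", 0.3),
    ("def456", "skill_handler", 0.6),
    ("def456", "tts_synthesis", 1.9),
]

durations = defaultdict(list)
for _, step, seconds in records:
    durations[step].append(seconds)

# Rank steps by average duration to see where the time goes
for step, values in sorted(durations.items(), key=lambda kv: -mean(kv[1])):
    print(f"{step:15s} avg {mean(values):.2f} s over {len(values)} interactions")
```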

We’ll also revise the benchmark to be more explicit in comparing the timing of responses. It’s likely we’ll create one or more subjective measures for quality of response. As Skills expand, the number of questions will certainly expand too.

There’s also the question of where this information will live and be available to the community. The blog is a great place for explaining a new process but isn’t great for storing and displaying data. We’ve had some Skill data published on GitHub since May. A repo and/or GitHub Pages site will likely be the home of the data, graphs, and more regular updates on Mycroft Metrics and Benchmarking. That will make it free and available for anyone to use, whether you’re comparing the speed of your local system to others, planning an improvement to Mycroft Core to speed up interactions, or creating a visualization for research. This data is Open and yours to use. Since that will take some time to set up, here is a Google Sheet to give you immediate access to the first round of data.
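
Once the data is published, pulling it into your own analysis or visualization should only take a few lines. Here is a sketch assuming a CSV export; the filename and column names are hypothetical and will need to match whatever format we settle on.

```python
# Sketch: summarize exported benchmark data. The CSV filename and column
# names ("assistant", "question", "time_to_response") are hypothetical;
# adjust them to match the actual export.
import csv
from collections import defaultdict
from statistics import mean

times = defaultdict(list)
with open("mycroft_benchmark_round1.csv", newline="") as f:
    for row in csv.DictReader(f):
        times[row["assistant"]].append(float(row["time_to_response"]))

for assistant, values in times.items():
    print(f"{assistant}: average Time to Response {mean(values):.2f} s")
```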

How can you help?

I’m so glad you asked! As I mentioned, metrics only come back from Community Members who have Opted-In to the Open Dataset. So the best way to help is to Opt-In and use Mycroft! That way, we get a population of interactions that is as broad as possible. People on different networks, in different locations, using different devices, and interacting with Mycroft in different ways provide the best information for Mycroft and the community to make decisions on.

To Opt-In:

  • Go to home.mycroft.ai and Log In
  • In the top right corner of the screen, click your name
  • Select “Settings” from the menu. You’ll arrive at the Basic Settings page
  • Scroll to the bottom and once you’ve read about the Open Dataset, check “I agree” to Opt-In
  • That’s it!
Once you’ve done that, you’ll not only be providing the metrics from your interactions, but also helping build STT, Wake Word spotting, and Intent Parsing for Mycroft. We always want to thank those members of our community who have Opted-In to help make AI for Everyone.

Have an idea to improve Mycroft’s metrics and benchmarking? Maybe a question on the process? Let us know on the forum.

Would be curious about this done with Mimic (1), as well as what hardware you’re using for Mycroft.

Getting a round done with Mimic 1 is certainly part of the next benchmark, which I’ll be carrying out pretty soon. As I said, we immediately found some improvements to make to the process, but I didn’t want that to stop us from getting the word out and publishing some data.

The hardware was a Mark I; we’ll add Picroft and desktop in the future. Sometime in the next few months we’ll be able to check out a Mark II as well!

Mycroft QA (Quality Assurance) and Testing

What do you guys think about exposing a dozen instances of Mycroft to the community to field ‘any’ query (text only) and log the results?

Something interactive like the Precise and DeepSpeech training pages at mycroft.ai.

The community could stress test Mycroft and submit desired results for a query that didn’t match expectations.

Crowdsourcing the QA would polish the core skills quickly.

This approach would take STT/TTS out of the equation and concentrate on whether the results themselves are accurate. It would also be separate from Persona.
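
For what it’s worth, text-only queries can already be injected into a running Mycroft over the messagebus, so the hosted instances could be a thin wrapper around something like this sketch (assuming the default messagebus address and the websocket-client package; logging the results would mean also listening for 'speak' messages):

```python
# Sketch: inject a text-only query into a running Mycroft instance over the
# messagebus (a plain WebSocket, by default at ws://localhost:8181/core).
# Assumes the websocket-client package; host/port may differ in your setup.
import json
from websocket import create_connection  # pip install websocket-client

def send_utterance(text, host="localhost", port=8181):
    ws = create_connection(f"ws://{host}:{port}/core")
    message = {
        "type": "recognizer_loop:utterance",
        "data": {"utterances": [text], "lang": "en-us"},
        "context": {},
    }
    ws.send(json.dumps(message))
    ws.close()

send_utterance("what is the height of the eiffel tower")
```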

It could also be used for the Mycroft Voice UI; currently there is no response for…
What can I do?
How do I install skills?

Inspired by this thread: Disappointing start: a lot of skills are not properly working

Thoughts?

I am working on a “help” skill here that should take care of the “What can I do” question.
It’s still a work in progress, and I need to see what impact the 18.08 Skill Meta Editor will have.
Basically I am grabbing the answer to “What can I do” from the Meta Editor-created README.md file for each of the installed skills, then rattling off some examples.
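
A rough sketch of that idea is below; the skills directory path and the "## Examples" heading are assumptions about a typical install rather than exactly what the finished skill will do:

```python
# Sketch: collect example phrases from each installed skill's README.md.
# Assumes skills live under /opt/mycroft/skills and that the Meta Editor
# writes an "## Examples" section containing a bulleted list of phrases.
import os
import re

SKILLS_DIR = "/opt/mycroft/skills"

def example_phrases(skills_dir=SKILLS_DIR):
    phrases = []
    for skill in sorted(os.listdir(skills_dir)):
        readme = os.path.join(skills_dir, skill, "README.md")
        if not os.path.isfile(readme):
            continue
        with open(readme, encoding="utf-8") as f:
            text = f.read()
        # Grab the "## Examples" section and pull out its bullet points
        section = re.search(r"## Examples\s+(.*?)(\n##|\Z)", text, re.S)
        if section:
            phrases += re.findall(r'^\s*[*-]\s*"?([^"\n]+)"?',
                                  section.group(1), re.M)
    return phrases

for phrase in example_phrases()[:5]:
    print(phrase)
```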

Hey @pcwii - the new Meta Editor should make this task a lot easier. It will also generate a large JSON file of all Skills and their metadata, so you might be able to just consume the JSON rather than having to traverse all the READMEs. Let’s get 18.08 out and then we can take a look :thumbsup:
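
If it helps, consuming such a JSON file could look roughly like this sketch; the filename and field names are placeholders until the generated format is finalized:

```python
# Sketch: read example phrases from a skills metadata JSON file instead of
# traversing READMEs. The filename and field names ("examples") are
# placeholders; the file the Meta Editor generates may be structured differently.
import json

with open("skill-metadata.json", encoding="utf-8") as f:
    skills = json.load(f)

for name, meta in skills.items():
    for example in meta.get("examples", []):
        print(f"{name}: {example}")
```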

@KathyReid,
Looks like my Picroft updated to 18.08 today. I actually thought I was going to get prompted to upgrade as per the blog post, but no problems so far.
Can you tell me where I would find the Skills JSON file you referred to?

Right here :slight_smile: