@J_Montgomery_Mycroft I believe many were waiting for this post: https://mycroft.ai/blog/building-strong-ai-strategy/
A crowd-sourced, open AI might emerge from this strategy; I’m very happy to see it.
I am ready to contribute myself, starting with a single question: will the anonymity of user-contributed content be preserved?
Specifically, do you plan to offer technical measures that would prevent even you, the Mycroft team, from, for instance, linking user-contributed utterances to the other information on users that you already have (IPs, home.mycroft.ai profile preferences, usage patterns, and so on)?
I may have missed some information, though. What are your thoughts on it?
Privacy = Good
Tracking = Bad
We are only pushing data through the feedback loop from users who explicitly opt in, either through home.mycroft.ai or by setting the “Learn” feature of their Mark I device. Other users’ data will not be tracked and, if we do for some reason collect it in a log file or something, it will be purged just as soon as I become aware of it.
On the Speech-To-Text side of things we’ve put in place (or are putting in place) mechanisms to allow users to have their voice data removed from the online repository. Consumers of the data will be required, as part of their license, to refresh it periodically. Unfortunately, to allow removal of the data, by definition, we need to know which user goes with which utterance.
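The mechanism described above — utterances keyed to a user so a removal request can purge them, while data consumers only ever see an ID-stripped snapshot they must periodically re-fetch — could look roughly like this. This is a minimal illustrative sketch; the class and method names are hypothetical, not Mycroft’s actual implementation:

```python
class VoiceRepository:
    """Illustrative store: utterances keyed by an opaque user ID, so a
    deletion request can purge everything a given user contributed."""

    def __init__(self):
        # user_id -> list of utterances (or references to audio blobs)
        self._store = {}

    def add_utterance(self, user_id, utterance):
        self._store.setdefault(user_id, []).append(utterance)

    def delete_user_data(self, user_id):
        """Honor a removal request: drop every utterance tied to this user."""
        removed = self._store.pop(user_id, [])
        return len(removed)

    def export_snapshot(self):
        """What data consumers receive: utterances only, user IDs stripped.
        Licensees refresh this periodically, so deletions propagate."""
        return [u for utterances in self._store.values() for u in utterances]


repo = VoiceRepository()
repo.add_utterance("user-42", "set a timer for ten minutes")
repo.add_utterance("user-42", "what's the weather")
repo.add_utterance("user-7", "play some jazz")

repo.delete_user_data("user-42")          # user pulls the ripcord
print(repo.export_snapshot())             # only user-7's utterance remains
```

The trade-off is exactly the one stated: the internal store must keep the user-to-utterance mapping, but the published snapshot never includes it.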
We set it up this way because we were forced to strike a balance between securely anonymizing the data and having it live forever, or giving users control of their data so they can delete it. In light of our certainty that speaker identification software will be ubiquitous in 10 years (allowing companies to unmask speakers based on a small voice sample), we decided it is better to allow users to delete their data.
On the natural language understanding side of the house we don’t plan to tie the data to the user in any way. We will also put in place reasonable safeguards to prevent data like credit card numbers, social security numbers and phone numbers from entering the public data set. As much as possible data will be scrubbed of IP addresses and other personally identifiable information.
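A first-pass safeguard of the kind mentioned above is often pattern-based redaction before data enters the public set. The sketch below is purely illustrative — real scrubbing needs many more patterns plus validation (e.g. Luhn checks for card numbers), and these regexes and placeholder tokens are assumptions, not Mycroft’s actual pipeline:

```python
import re

# Illustrative patterns for the PII classes mentioned: credit card numbers,
# US social security numbers, US phone numbers, and IPv4 addresses.
PII_PATTERNS = [
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "<CARD>"),      # 13-16 digit runs
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),          # US SSN
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "<PHONE>"),  # US phone
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),     # IPv4 address
]


def scrub(text):
    """Replace likely PII with placeholder tokens before publication."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text


print(scrub("call me at 555-123-4567 from 192.168.0.1"))
# -> call me at <PHONE> from <IP>
```

Pattern order matters: the broad card-number pattern runs first so it is not pre-empted by the narrower phone pattern on overlapping digit runs.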
We don’t want your data. We are not interested in making money off of your data. We are not interested in tracking you or your behavior online. We don’t want to sell you paper towels. We are not in the advertising game and have zero interest in unloading billions of dollars worth of overpriced computers with a fruit stenciled on them.
Our sole purpose is to build an AI that runs anywhere and interacts exactly like a person. Period.
To do that we need to build a data set, but we intend to do it while collecting as little information as possible.
Does that answer your question?
That is what we all wanted to hear.
Agreed, it complements the blog post very nicely.
Personally, I trust in the purity of your intentions, and the above convinces me.
However, since you are a company with funding sources to secure, some may wonder how exactly, in detail, you address all of these privacy aspects. Implementing technical measures that guarantee zero knowledge on the provider side might be troublesome for many aspects of what you are doing, but such measures have, in my opinion, become the standard for how tools earn users’ trust (as with privacy-focused software for work or communication). I might have missed that information, but if it is not there yet, emphasized transparency on this point would probably prevent further doubts.
This is our first official version and it is unlikely to be the last. We have written it broadly enough to allow us to implement without having to publish a new version with every single code change, but I believe we have captured our intentions in a legally accurate way.
It is really difficult to capture the nuance of intention in a legal document without binding yourself so tightly that creativity is stifled. As we implement things we WANT the community to point out concerns so we can address them. The strength of what we are building is cemented by a bond of trust. Losing that trust would effectively cause the effort to crumble, as we have built a ripcord into our opt-in mechanism that lets everyone pull their data out of the shared dataset. So we always have a strong incentive to “do the right thing”.