How to get better/smarter fuzzy matches

danielquinn · June 22, 2020, 8:33pm

I’m writing a skill that digs through my browser bookmarks and pulls out relevant matches to a spoken phrase. So for example if I say “Search my bookmarks for chicken” I want it to find all the recipes I have stored for chicken.

I thought that the smart thing to use here was mycroft.util.parse.fuzzy_match(), but it’s not doing what I expected.

In a list of bookmarks with titles like Chicken Kiev, Chicken Soup with Garlic and Sour Cream, and Chicken Parm Lasagna, the most relevant according to this function is Arch Linux (score: 0.5). That soup recipe has a score of 0.26!

Now I know that I could just do a search for the keyword, but that would have its own problems: “Chicken Parm” wouldn’t match “Chicken Parmesean”, for example.

What’s the “right” way to do this?
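The scores quoted above are what a whole-string similarity ratio would produce. The sketch below assumes that mycroft.util.parse.fuzzy_match() behaves like difflib.SequenceMatcher.ratio() from the standard library; that is an assumption, not something stated in this thread, and whole_string_ratio is a hypothetical name. Because the ratio is 2 * matched characters / combined length, a long title that contains the query is penalised for every extra character, and the comparison is case-sensitive.

```python
from difflib import SequenceMatcher


def whole_string_ratio(query: str, title: str) -> float:
    """Similarity of the two complete strings, in the range 0.0 to 1.0.

    Assumption: this mirrors what fuzzy_match() does internally.
    """
    return SequenceMatcher(None, query, title).ratio()


titles = ["Arch Linux", "Chicken Soup with Garlic and Sour Cream"]

# Score each full title against the short query.
scores = {title: round(whole_string_ratio("chicken", title), 2) for title in titles}

for title, score in scores.items():
    print(f"{score:.2f}  {title}")
```

Under that assumption the soup title comes out at about 0.26, the figure reported above, while the much shorter "Arch Linux" scores higher simply because it has fewer unmatched characters.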

danielquinn · June 23, 2020, 9:14am

Never mind, I think I’ve found it. The fuzzywuzzy package seems to do what I need:

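from fuzzywuzzy import process

# Sample bookmark titles to search through
choices = (
  "Arch Linux",
  "Chicken Parmesean",
  "Gradma's Chicken Soup with Garlic and potato - somerecipesite.com",
  "This is almost chiken soup"
)
# Score every title against the query; returns (title, score) pairs, best first
process.extract("chicken", choices=choices)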

Result:

[('Chicken Parmesean', 90),
 ('This is almost chiken soup', 77),
 ("Gradma's Chicken Soup with Garlic and potato - somerecipesite.com",
  60),
 ('Arch Linux', 47)]
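One reason these results rank better is that the query is compared against parts of each title rather than the whole string. The sketch below illustrates that idea with the standard library only. It is not fuzzywuzzy's implementation (process.extract defaults to the WRatio scorer, which combines several ratios and applies length-based weights), and partial_score is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def partial_score(query: str, title: str) -> int:
    """Best similarity (0-100) between the shorter string and any
    same-length window of the longer one, ignoring case."""
    query, title = query.lower(), title.lower()
    if len(query) > len(title):
        query, title = title, query
    width = len(query)
    if width == 0:
        return 0
    best = 0.0
    # Slide a window of the query's length across the title.
    for start in range(len(title) - width + 1):
        window = title[start:start + width]
        best = max(best, SequenceMatcher(None, query, window).ratio())
    return round(best * 100)


choices = (
    "Arch Linux",
    "Chicken Parmesean",
    "Gradma's Chicken Soup with Garlic and potato - somerecipesite.com",
    "This is almost chiken soup",
)

# Rank titles by how well their best window matches the query.
ranked = sorted(choices, key=lambda title: partial_score("chicken", title), reverse=True)
```

With this scoring, any title that contains "chicken" gets the maximum score regardless of length, the misspelled "chiken" still scores well, and "Arch Linux" drops to the bottom. The absolute numbers will differ from fuzzywuzzy's.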

forslund · June 23, 2020, 10:07am

Yeah, fuzzywuzzy is a great choice for fuzzy matching. The implementation in Mycroft is a poor man’s version. I think the main reason it’s not used in mycroft-core was licensing issues.

danielquinn · June 23, 2020, 10:09am

Fuzzywuzzy is GPL-2. What’s Mycroft using that won’t play nice with that?

forslund · June 23, 2020, 10:31am

Mycroft itself is under Apache v2.

I’m not a lawyer, so I don’t know how or whether it would work, but that’s the reason I remember from way back.

Dominik · June 23, 2020, 12:51pm

You could try RapidFuzz, which produces matching results similar to FuzzyWuzzy’s but comes with an MIT licence…

maxbachmann · July 4, 2020, 12:53pm

We ran into this license issue with Rhasspy as well (and FuzzyWuzzy was notoriously slow), which is why we created RapidFuzz.