LLaMA: GPT-quality language model on commodity hardware (e.g. an RPi)

LLaMA makes it possible to run Large Language Models (of comparable quality to the Generative Pre-trained Transformer family) on commodity hardware, i.e. on a single GPU. Even more impressively, I believe pared-down versions such as LLaMA 7B have been run on an RPi and on a smartphone.

The Hackaday article is here, and is more informative than anything I'd write.

I presume we're all proponents of having some modest degree of autonomy when it comes to domestic software & hardware. Although it was developed by Meta/Facebook, this is a model trained on publicly available text, and both the code and the model weights can be obtained (see the Hackaday article for details).

As with all processing of inputs, there are huge security advantages to processing as much of them locally as practicable.

As with anything a lay reader might think of as AI, I'd like to append a note explaining why I'd describe this as machine learning rather than AI (and, at that, the word 'learning' is something of a lazy anthropomorphisation, i.e. pretending something is human. I know, I know, I probably sound like my grandmother telling me 'car' is a vulgar contraction of 'motor-car'…). The following text is from this web page:

How LLMs Work:
LLMs like GPT-3 are deep neural networks—that is, neural networks with many layers of “neurons” connected by billions of weighted links. Given an input text “prompt”, in essence what these systems do is compute a probability distribution over a “vocabulary”—the list of all words (or actually parts of words, or tokens) that the system knows about. The vocabulary is given to the system by the human designers. GPT-3, for example, has a vocabulary of about 50,000 tokens.
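(Not from the quoted article, but as a hand-rolled illustration of that last step: the conversion from logits to a probability distribution is a softmax over the vocabulary. The four-word "vocabulary" and the scores below are made up for the sketch, nothing like real GPT-3 values.)

```python
import math

def softmax(logits):
    """Turn raw scores ("logits") into a probability distribution."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-word "vocabulary" standing in for 50,000 tokens.
vocab = ["question", "answer", "banana", "the"]
logits = [5.1, 2.3, -1.0, 0.4]             # invented scores from an invented model
probs = softmax(logits)

for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{word:10s} {p:.3f}")
```

The probabilities sum to 1, and the word with the highest logit ("question") gets the highest probability.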

For simplicity, let’s forget about “tokens” and assume that the vocabulary consists of exactly 50,000 English words. Then, given a prompt, such as “To be or not to be, that is the”, the system encodes the words of the prompt as real-valued vectors, and then does a layer-by-layer series of computations, whose penultimate result is 50,000 real numbers, one for each vocabulary word. These numbers are (for obscure reasons) called “logits”. The system then turns these numbers into a probability distribution with 50,000 probabilities—each represents the probability that the corresponding word is the next one to come in the text. For the prompt “To be or not to be, that is the”, presumably the word “question” would have a high probability. That is because LLMs have learned to compute these probabilities by being shown massive amounts of human-generated text.

Once the LLM has generated the next word—say, “question”—it then adds that word to its initial prompt, and recomputes all the probabilities over the vocabulary. At this point, the word “Whether” would have very high probability, assuming that Hamlet, along with all quotes and references to that speech, was part of the LLM’s training data.
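(Again my own addition, not the article's: the generate-append-recompute loop described above can be sketched with a toy stand-in for the model. A real LLM computes the next-word distribution with a deep network over the whole prompt; here a lookup table of invented probabilities plays that role, and we decode greedily by always taking the most probable word.)

```python
# Toy next-word "model": for each context word, made-up probabilities
# over a tiny vocabulary. A real LLM conditions on the full prompt.
next_word_probs = {
    "the":      {"question": 0.9, "whether": 0.05, "slings": 0.05},
    "question": {"whether": 0.8, "the": 0.1, "question": 0.1},
    "whether":  {"tis": 0.9, "the": 0.05, "question": 0.05},
}

def generate(prompt_words, n_words):
    words = list(prompt_words)
    for _ in range(n_words):
        dist = next_word_probs.get(words[-1])
        if dist is None:                    # context unknown to our toy model
            break
        # Greedy decoding: append the most probable next word, then
        # recompute the distribution from the new, longer context.
        words.append(max(dist, key=dist.get))
    return words

out = generate(["to", "be", "or", "not", "to", "be", "that", "is", "the"], 3)
print(" ".join(out))
```

Real systems usually *sample* from the distribution (with a temperature) rather than always taking the argmax, which is why the same prompt can yield different continuations.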


I have been integrating this into OVOS as part of the persona sprint

you can follow progress here


Holy smoke. Thanks @JarbasAl ! I was just mentioning this in the abstract; I didn’t think anyone would remotely be onto turning this into reality in the context of voice assistants! You just blew my mind :smiley:


Hi all, new poster here. This looks awesome; I was hoping to try it out and potentially help develop some local LLM voice assistant action. :slight_smile:

I am still trying to figure out the basics of the Mycroft Mark II (which skill store to install skills from, and so on), but once I get that sorted out I was hoping to be able to talk to the Mark II and have responses generated on the GPU of a local stationary computer. Right now I am using a 7B model called WizardLM and think the conversations are pretty good.

Were you aiming to run the LLM actually on the Raspberry Pi as a proof of concept? Or are you also aiming to communicate with some local computer?

I also saw this YouTube video which inspired me. Maybe you have already seen it because it is pretty old.

How far have you gotten? All the best.


some work has been done already, mostly just exploring ideas

main issue tracking progress


Hi @JarbasAl , thanks for the quick reply! It looks very promising and ambitious. Should I install OVOS to try things out, or does it also work with Neon OS?


since Neon is built on top of OVOS, any components made for OVOS will also work in Neon

in this case there are many loose proofs of concept; those can be used, but they do not yet come together as a final product you can just install and be done with. prioritizing this work is a stretch goal of our ongoing fundraiser; right now updates only come whenever i work on this for fun or as a side effect of working on related code


As Jarbas said, you can work with either operating system, and often with just a little care skills & other projects can be compatible with both OVOS and Neon AI. :slight_smile:

Neon has a skill for talking to ChatGPT working in our beta version right now, which you might like to check out. With the Neon OS running on the Mark II, the commands are:

  1. Enable pre-release updates
  2. Check for updates
  3. Update my configuration
  4. Then say, “Chat with ChatGPT”

You’ll still need to either press the button on top of the Mark II or use the wake word for each sentence you want to say to ChatGPT. We’re considering how to make that smoother, perhaps by leaving the microphone open while the ChatGPT skill is active. Suggestions are welcome. :slight_smile:

Yes, Coqui is an excellent project! We’ve put some contributions in there, and feel we’re very close to enabling our own STT & TTS. :slight_smile: If it’s of interest, here’s our Coqui demo - Neon AI - Coqui AI TTS Plugin | Neon AI

Georgi Gerganov has another excellent repo here, llama.cpp, just as he did with whisper.cpp.

It’s a pretty easy install, but on an RPi, even with the amazing optimisation work, it’s still going to be excruciatingly slow. It’s sort of OK on an RK3588, which has roughly 5× the Pi 4’s performance, and even that maxes out the CPU.
ASR + LLaMA + TTS could make a really cutting-edge home assistant, but it needs some oomph!
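A minimal sketch of how such a chain might be glued together, assuming the whisper.cpp and llama.cpp example binaries are built locally: the binary names, model filenames, and paths below are illustrative assumptions (check each project's README for the actual ones on your system), so this only assembles and prints the commands rather than claiming to be a working assistant.

```python
import shlex

# Hypothetical paths/model names -- adjust to your own builds.
def stt_cmd(wav_path):
    # whisper.cpp example binary: -m selects the model, -f the input audio
    return ["./whisper-main", "-m", "ggml-base.en.bin", "-f", wav_path]

def llm_cmd(prompt):
    # llama.cpp example binary: -p is the prompt, -n caps generated tokens
    return ["./llama-main", "-m", "llama-7b.gguf", "-p", prompt, "-n", "64"]

def pipeline_cmds(wav_path):
    # In a real pipeline you would run stt_cmd via subprocess.run, capture
    # the transcript, feed it to llm_cmd, then hand the reply to a TTS
    # engine. Here a placeholder stands in for the captured transcript.
    transcript = "<output of STT step>"
    prompt = f"User said: {transcript}. Reply briefly."
    return [stt_cmd(wav_path), llm_cmd(prompt)]

for cmd in pipeline_cmds("question.wav"):
    print(shlex.join(cmd))
```

Even as a sketch it shows where the "oomph" goes: the LLM step dominates, so running that stage on a beefier local box while the Pi handles audio I/O is one plausible split.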