# Running an LLM on the ESP32

*(Images: "LLM on ESP32" demo, "LLM Output")*

## Summary

I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.

The "Large" Language Model used is actually quite small. It is a 260K parameter tinyllamas checkpoint trained on the tiny stories dataset.

The LLM implementation is based on llama2.c, with minor optimizations to make it run faster on the ESP32.

## Hardware

LLMs require a great deal of memory. Even this small one still requires 1 MB of RAM. I used the ESP32-S3FH4R2 because it has 2 MB of embedded PSRAM.
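
Since 1 MB of weights exceeds the chip's internal SRAM, they have to live in PSRAM. As a minimal sketch (not this repo's actual code; `n_params` is a hypothetical count), ESP-IDF's capability-aware allocator can place a buffer there explicitly:

```c
#include "esp_heap_caps.h"

// Illustrative: request the weight buffer from external PSRAM specifically.
float *weights = heap_caps_malloc(n_params * sizeof(float), MALLOC_CAP_SPIRAM);
if (weights == NULL) {
    // Allocation failed: the model does not fit in the 2 MB of PSRAM.
}
```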

## Optimizing llama2.c for the ESP32

With the following changes to llama2.c, I am able to achieve 19.13 tok/s:

  1. Utilizing both cores of the ESP32 during math-heavy operations (see the sketch after this list).
  2. Utilizing the ESP32-S3-specific dot-product functions from the ESP-DSP library, which take advantage of the few SIMD instructions the ESP32-S3 has (also shown in the sketch below).
  3. Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig fragment below).
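
As a rough illustration of points 1 and 2, here is a minimal sketch of how llama2.c's `matmul` could be split across both cores while using ESP-DSP's S3-optimized dot product. This is illustrative, not the repository's actual code; `matmul_worker` and the task parameters are assumptions:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"
#include "dsps_dotprod.h"  // ESP-DSP: dsps_dotprod_f32()

typedef struct {
    float *out, *x, *w;     // computes out[i] = dot(w[i*n .. i*n+n], x)
    int n, d_start, d_end;  // rows [d_start, d_end) handled by this task
    SemaphoreHandle_t done;
} matmul_args_t;

static void matmul_worker(void *p) {
    matmul_args_t *a = (matmul_args_t *)p;
    for (int i = a->d_start; i < a->d_end; i++) {
        // S3-optimized dot product (uses the S3's SIMD extensions)
        dsps_dotprod_f32(a->w + i * a->n, a->x, a->out + i, a->n);
    }
    xSemaphoreGive(a->done);
    vTaskDelete(NULL);
}

// out (d,) = W (d,n) @ x (n,): core 1 takes the upper rows, core 0 the lower.
static void matmul(float *out, float *x, float *w, int n, int d) {
    matmul_args_t upper = {
        .out = out, .x = x, .w = w, .n = n,
        .d_start = d / 2, .d_end = d,
        .done = xSemaphoreCreateBinary(),
    };
    xTaskCreatePinnedToCore(matmul_worker, "mm1", 4096, &upper,
                            tskIDLE_PRIORITY + 5, NULL, 1 /* core 1 */);
    for (int i = 0; i < d / 2; i++) {
        dsps_dotprod_f32(w + i * n, x, out + i, n);
    }
    xSemaphoreTake(upper.done, portMAX_DELAY);  // wait for core 1 to finish
    vSemaphoreDelete(upper.done);
}
```

In practice, spawning a task per call would add overhead; a persistent worker task woken through a semaphore or queue is the more likely shape of the real optimization.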

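For point 3, these settings live in the project's sdkconfig. An illustrative fragment follows; the exact Kconfig symbol names vary between ESP-IDF versions, so treat these as assumptions:

```
# Run the CPU at its maximum 240 MHz
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
# Enable external PSRAM and clock it at 80 MHz
CONFIG_SPIRAM=y
CONFIG_SPIRAM_SPEED_80M=y
# Use the larger 32 KB instruction cache
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
```
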
## Setup

This requires the ESP-IDF toolchain to be installed.

```sh
idf.py build
idf.py -p /dev/{DEVICE_PORT} flash
```
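
After flashing, the model's output can be watched over serial with the standard ESP-IDF monitor:

```sh
idf.py -p /dev/{DEVICE_PORT} monitor
```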