# Running an LLM on the ESP32

*(Images: "LLM on ESP32" demo, "LLM Output")*

## Summary

I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.

The "Large" Language Model used is actually quite small. It is a 260K parameter tinyllamas checkpoint trained on the tiny stories dataset.

The LLM implementation is based on llama2.c, with minor optimizations to make it run faster on the ESP32.

## Hardware

LLMs require a great deal of memory. Even this small one still requires 1 MB of RAM. I used the ESP32-S3FH4R2 because it has 2 MB of embedded PSRAM.
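
Since 1 MB of weights exceeds the chip's internal SRAM, they have to live in PSRAM. As a minimal sketch (not this repo's actual code; `n_params` is a hypothetical count), ESP-IDF's capability-aware allocator can place a buffer there explicitly:

```c
#include "esp_heap_caps.h"

// Illustrative: request the weight buffer from external PSRAM specifically.
float *weights = heap_caps_malloc(n_params * sizeof(float), MALLOC_CAP_SPIRAM);
if (weights == NULL) {
    // Allocation failed: the model does not fit in the 2 MB of PSRAM.
}
```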

## Optimizing llama2.c for the ESP32

With the following changes to llama2.c, I am able to achieve 19.13 tok/s:

  1. Utilizing both cores of the ESP32 during math-heavy operations (see the sketch after this list).
  2. Utilizing the ESP32-S3-specific dot-product functions from the ESP-DSP library, which take advantage of the few SIMD instructions the ESP32-S3 has (also shown in the sketch below).
  3. Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig fragment below).
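
As a rough illustration of points 1 and 2, here is a minimal sketch of how llama2.c's `matmul` could be split across both cores while using ESP-DSP's S3-optimized dot product. This is illustrative, not the repository's actual code; `matmul_worker` and the task parameters are assumptions:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"
#include "dsps_dotprod.h"  // ESP-DSP: dsps_dotprod_f32()

typedef struct {
    float *out, *x, *w;     // computes out[i] = dot(w[i*n .. i*n+n], x)
    int n, d_start, d_end;  // rows [d_start, d_end) handled by this task
    SemaphoreHandle_t done;
} matmul_args_t;

static void matmul_worker(void *p) {
    matmul_args_t *a = (matmul_args_t *)p;
    for (int i = a->d_start; i < a->d_end; i++) {
        // S3-optimized dot product (uses the S3's SIMD extensions)
        dsps_dotprod_f32(a->w + i * a->n, a->x, a->out + i, a->n);
    }
    xSemaphoreGive(a->done);
    vTaskDelete(NULL);
}

// out (d,) = W (d,n) @ x (n,): core 1 takes the upper rows, core 0 the lower.
static void matmul(float *out, float *x, float *w, int n, int d) {
    matmul_args_t upper = {
        .out = out, .x = x, .w = w, .n = n,
        .d_start = d / 2, .d_end = d,
        .done = xSemaphoreCreateBinary(),
    };
    xTaskCreatePinnedToCore(matmul_worker, "mm1", 4096, &upper,
                            tskIDLE_PRIORITY + 5, NULL, 1 /* core 1 */);
    for (int i = 0; i < d / 2; i++) {
        dsps_dotprod_f32(w + i * n, x, out + i, n);
    }
    xSemaphoreTake(upper.done, portMAX_DELAY);  // wait for core 1 to finish
    vSemaphoreDelete(upper.done);
}
```

In practice, spawning a task per call would add overhead; a persistent worker task woken through a semaphore or queue is the more likely shape of the real optimization.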

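For point 3, these settings live in the project's sdkconfig. An illustrative fragment follows; the exact Kconfig symbol names vary between ESP-IDF versions, so treat these as assumptions:

```
# Run the CPU at its maximum 240 MHz
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
# Enable external PSRAM and clock it at 80 MHz
CONFIG_SPIRAM=y
CONFIG_SPIRAM_SPEED_80M=y
# Use the larger 32 KB instruction cache
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
```
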
## Setup

This requires the ESP-IDF toolchain to be installed.

```sh
idf.py build
idf.py -p /dev/{DEVICE_PORT} flash
```
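
After flashing, the model's output can be watched over serial with the standard ESP-IDF monitor:

```sh
idf.py -p /dev/{DEVICE_PORT} monitor
```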