Installing local LLMs on Raspberry Pi CM5 and benchmarking performance
At blackdevice we constantly push small hardware platforms to see how far they can really go. This time we focused on our own Pi Hack carrier board paired with a Raspberry Pi Compute Module 5 (CM5), 8GB RAM and a 256GB NVMe SSD. The goal was simple: install Ollama, run several lightweight local language models (LLMs), and compare how they behave when fed the same prompts. The promise of running models locally is tempting — privacy, offline work, full control — but how does that promise translate into real-world responsiveness and usefulness on constrained hardware? Spoiler: some models are great; others quickly prove impractical. In this article we’ll walk you through the installation, the tests we ran, the raw stats the LLMs returned, and a model-by-model conclusion of what we actually experienced.
What is Ollama?
Ollama is a command-line–driven local LLM runner. Instead of relying on cloud APIs, we download models directly to the machine and interact with them through the terminal. For our workflow, this has two advantages:
- Everything stays local and offline.
- The entire installation, loading and inference process is transparent.
Given the experimental nature of our Pi Hack board, Ollama is a perfect match for quick, controlled model testing.
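For context, the day-to-day workflow is just a handful of terminal commands (standard Ollama CLI; the model tag below is only an example):

```bash
# Download a model to local storage (one-time per model)
ollama pull gemma3:1b

# Chat with it interactively from the terminal
ollama run gemma3:1b

# See which models are stored locally, and remove one when no longer needed
ollama list
ollama rm gemma3:1b
```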
Hardware & initial setup
Hardware used
- Pi Hack carrier board (for Raspberry Pi Compute Module 5)
- Compute Module 5 (CM5) — 8 GB RAM
- NVMe SSD — 256 GB, connected via M.2
- PoE for network + power
Imaging the disk
- Use `rpi-boot` to expose the eMMC/NVMe as an external mass-storage device to your host machine.
- Use Raspberry Pi Imager to flash the 64-bit Raspberry Pi OS onto the NVMe SSD so the Pi will boot from it.
Initial boot
- Connect monitor + keyboard and PoE cable for first boot.
- Once booted, note the local IP address and SSH in from your computer (a couple of quick checks are sketched below).
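These are the checks we find handy right after the first boot, before going headless (standard Linux commands; the username and IP are placeholders, replace them with your own):

```bash
# On the Pi: confirm the root filesystem really lives on the NVMe SSD
findmnt -n -o SOURCE /
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# On the Pi: print the local IP address(es)
hostname -I

# From your computer: connect over SSH (example user and address)
ssh pi@192.168.1.50
```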
Installing Ollama on our Pi Hack
Run these commands:
1. Update the system and install basic tools:
   `sudo apt update && sudo apt upgrade -y`
   `sudo apt install -y curl wget jq git ca-certificates`
2. Install Ollama:
   `curl -fsSL https://ollama.com/install.sh | sh`
3. Confirm the installation and the service status:
   `ollama --version`
   `sudo systemctl status ollama`
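With the service up, a quick smoke test is to pull the smallest model in our test set and send it a one-shot prompt; the --verbose flag already prints the timing summary we rely on throughout this article (the test sentence here is arbitrary):

```bash
# Download the smallest model in the test set and run a one-shot prompt
ollama pull gemma3:270m
ollama run gemma3:270m "Say hello in one short sentence." --verbose

# List everything stored locally so far
ollama list
```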
How we benchmarked: the experiment design
To compare models fairly, we defined a simple and repeatable testing procedure. Each model received the exact same prompts and was executed with --verbose so we could collect statistics.
- Prompt 1 (translation): "Translate to english the following sentence written in spanish: Probar modelos de inteligencia artificial en local nos permite compararlos y comprobar el rendimiento en diferentes dispositivos."
- Prompt 2 (historical milestones): "Choose three historical milestones in the history of science and arrange them from the oldest to the most recent."
These two tasks reveal essential differences: basic language understanding, structure, and generation speed. Every model was pulled and executed on the Raspberry Pi CM5 (8GB RAM + NVMe) under identical conditions. The tasks were designed to be simple enough for these light LLMs.
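For reference, here is a minimal sketch of how the runs can be scripted so every model receives exactly the same prompts. It assumes the models are already pulled, uses a shortened model list (extend it with the tags listed in the next section), and simply saves everything --verbose prints into per-run log files of our own naming:

```bash
#!/usr/bin/env bash
# Benchmark sketch: same two prompts for every model, one log file per run.
set -euo pipefail

MODELS=(gemma3:270m gemma3:1b tinyllama:1.1b)   # extend with the other tags tested

PROMPT1='Translate to english the following sentence written in spanish: Probar modelos de inteligencia artificial en local nos permite compararlos y comprobar el rendimiento en diferentes dispositivos.'
PROMPT2='Choose three historical milestones in the history of science and arrange them from the oldest to the most recent.'

mkdir -p logs
for model in "${MODELS[@]}"; do
  safe_name="${model//:/_}"   # e.g. gemma3:1b -> gemma3_1b, for file names
  i=1
  for prompt in "$PROMPT1" "$PROMPT2"; do
    # Capture stdout and stderr together so the --verbose timing summary
    # lands in the log regardless of which stream it is printed on.
    ollama run "$model" "$prompt" --verbose > "logs/${safe_name}_prompt${i}.txt" 2>&1
    i=$((i+1))
  done
done
```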
Models tested
We evaluated a range of lightweight models accessible through Ollama:
- TinyLlama 1.1B
- Deepseek R1 — 1.5B, 7B and 8B
- Gemma3 — 270M, 1B and 4B
- Phi-4 mini reasoning — 3.8B

For each one, we ran both prompts and analyzed the output and the raw performance metrics.
Models tested, with the exact run commands used:
- tinyllama (TinyLlama 1.1B): `ollama run tinyllama:1.1b --verbose`
- deepseek-r1:1.5b: `ollama run deepseek-r1:1.5b --verbose`
- deepseek-r1:7b: `ollama run deepseek-r1:7b --verbose`
- deepseek-r1:8b: `ollama run deepseek-r1:8b --verbose`
- gemma3:270m, gemma3:1b, gemma3:4b: `ollama run gemma3:<size> --verbose`
- phi4-mini-reasoning:3.8b: `ollama run phi4-mini-reasoning:3.8b --verbose`
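Pulling everything up front keeps the (sizeable) downloads out of the benchmark timings; a simple loop over the tags above does the job:

```bash
# Download all the test models in advance so pull time never skews the run stats
for m in tinyllama:1.1b deepseek-r1:1.5b deepseek-r1:7b deepseek-r1:8b \
         gemma3:270m gemma3:1b gemma3:4b phi4-mini-reasoning:3.8b; do
  ollama pull "$m"
done
```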
Model-by-model raw results and commentary
Deepseek R1: 1.5B
Quality: Fast but poor. The translation misunderstood “en local” and treated it as a physical location. The historical milestones returned were imprecise, with wrong ordering and dates.
Stats (prompt 1):
- total duration: 25.00 s
- load duration: 222.91 ms
- prompt eval count: 40 — prompt eval duration: 1.87 s (21.37 t/s)
- eval count: 261 — eval duration: 22.60 s (11.54 t/s)
Stats (prompt 2):
- total duration: 58.653 s
- load duration: 208.187 ms
- prompt eval count: 23 — prompt eval duration: 0.992 s (23.17 t/s)
- eval count: 641 — eval duration: 56.703 s (11.30 t/s)
Conclusion: Too inaccurate for practical use despite modest speed.
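As an aside, the throughput figures reported by --verbose are simply the token counts divided by the corresponding durations; for example, for prompt 2 above:

```bash
# eval rate = eval count / eval duration: 641 tokens over 56.703 s
awk 'BEGIN { printf "%.2f t/s\n", 641 / 56.703 }'   # prints 11.30 t/s, matching the reported rate
```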
Deepseek R1: 7B
Quality: Much slower and still gives poor or confused answers. The model spends a long time in reasoning loops, and the output quality doesn’t justify the wait.
Stats (prompt 1):
- total duration: 2 m 9.477 s (≈129.48 s)
- load duration: 227.324 ms
- prompt eval count: 40 — prompt eval duration: 8.961 s (4.46 t/s)
- eval count: 294 — eval duration: 1 m 59.887 s (≈119.887 s) (2.45 t/s)
Stats (prompt 2):
- total duration: 11 m 7.088 s (≈667.09 s)
- load duration: 224.595 ms
- prompt eval count: 23 — prompt eval duration: 4.734 s (4.86 t/s)
- eval count: 1473 — eval duration: 11 m 0.373 s (≈660.37 s) (2.23 t/s)
Conclusion: Too slow for the CM5 8GB; poor time-to-quality tradeoff.
Deepseek R1: 8B
Quality: Better than the 7B variant, but still very slow — the answers are acceptable, yet they arrive too slowly to be useful.
Stats (prompt 1):
- total duration: 2 m 35.038 s (≈155.04 s)
- load duration: 252.184 ms
- prompt eval count: 39 — prompt eval duration: 9.297 s (4.19 t/s)
- eval count: 295 — eval duration: 2 m 25.208 s (≈145.21 s) (2.03 t/s)
Stats (prompt 2):
- total duration: 6 m 42.290 s (≈402.29 s)
- load duration: 216.430 ms
- prompt eval count: 21 — prompt eval duration: 4.802 s (4.37 t/s)
- eval count: 769 — eval duration: 6 m 36.656 s (≈396.66 s) (1.94 t/s)
Conclusion: Slow; 8B can be run but it’s borderline practical — quality improved vs smaller Deepseek models, but still not great relative to time cost.
Gemma3: 270M
Quality: Very fast, literal but correct; good for very simple tasks. Excellent performance on CM5.
Stats (prompt 1):
- total duration: 1.416 s
- load duration: 264.957 ms
- prompt eval count: 40 — prompt eval duration: 0.161 s (≈248.5 t/s)
- eval count: 24 — eval duration: 0.898 s (≈26.7 t/s)
Stats (prompt 2):
- total duration: 12.205 s
- load duration: 258.755 ms
- prompt eval count: 28 — prompt eval duration: 0.104 s (≈269.0 t/s)
- eval count: 284 — eval duration: 11.295 s (≈25.1 t/s)
Conclusion: For tiny models the throughput is superb; answers are usable, at least for simple tasks.
Gemma3: 1B
Quality: Very good quality for size; returns multiple translation options and a strong recommendation. Good balance of speed and response richness.
Stats (prompt 1):
- total duration: 15.083 s
- load duration: 529.896 ms
- prompt eval count: 40 — prompt eval duration: 1.293 s (30.92 t/s)
- eval count: 151 — eval duration: 13.048 s (11.57 t/s)
Stats (prompt 2):
- total duration: 52.409 s
- load duration: 551.967 ms
- prompt eval count: 27 — prompt eval duration: 0.767 s (35.21 t/s)
- eval count: 563 — eval duration: 49.811 s (11.30 t/s)
Conclusion: Excellent performer on CM5 — high quality with acceptable latency for many local tasks.
Gemma3: 4B
Quality: Slower than 1B but still solid and more detailed. For our test prompts the 1B often sufficed and returned faster answers; 4B costs more time to load/generate.
Stats (prompt 1):
- total duration: 1 m 10.156 s (≈70.16 s)
- load duration: 536.921 ms
- prompt eval count: 40 — prompt eval duration: 4.529 s (8.83 t/s)
- eval count: 251 — eval duration: 1 m 4.728 s (≈64.73 s) (3.88 t/s)
Stats (prompt 2):
- total duration: 2 m 53.720 s (≈173.72 s)
- load duration: 537.524 ms
- prompt eval count: 28 — prompt eval duration: 2.750 s (10.18 t/s)
- eval count: 639 — eval duration: 2 m 49.515 s (≈169.51 s) (3.77 t/s)
Conclusion: Good quality but heavier than necessary for the simple prompts tested; 1B is often the sweet spot.
TinyLlama: 1.1B
Quality: Fast but unreliable. It failed the translation in one test, and the historical list was imprecise. Slightly better than the small Deepseek R1 models, but still inferior to Gemma3.
Stats (prompt 1):
- total duration: 5.285 s
- load duration: 83.223 ms
- prompt eval count: 77 — prompt eval duration: 2.880 s (26.73 t/s)
- eval count: 39 — eval duration: 2.298 s (16.97 t/s)
Stats (prompt 2):
- total duration: 15.123 s
- load duration: 93.980 ms
- prompt eval count: 57 — prompt eval duration: 1.240 s (45.95 t/s)
- eval count: 244 — eval duration: 13.706 s (17.80 t/s)
Conclusion: Quick but too noisy / inaccurate for reliable usage.
Phi-4 mini reasoning: 3.8B
Quality: The model attempts deep reasoning, but it is very slow and spends excessive time on our simple tasks (it even looped on the second prompt). The final answers can be correct, but the time cost is prohibitive.
Stats (prompt 1):
- total duration: 2 m 32.606 s (≈152.61 s)
- load duration: 277.632 ms
- prompt eval count: 48 — prompt eval duration: 5.185 s (9.26 t/s)
- eval count: 453 — eval duration: 2 m 26.460 s (≈146.46 s) (3.09 t/s)
Stats (prompt 2):
- total duration: 10 m 38.034 s (≈638.03 s)
- load duration: 267.611 ms
- prompt eval count: 36 — prompt eval duration: 2.339 s (15.39 t/s)
- eval count: 1783 — eval duration: 10 m 32.768 s (≈632.77 s) (2.82 t/s)
Conclusion: Not practical on CM5 8GB for these tasks.
AI model benchmark results on Raspberry Pi hardware
| Model (size) | Prompt 1 total | Prompt 2 total | Quality (summary) | Suitable on CM5? |
|---|---|---|---|---|
| Gemma3 (270M) | 1.42 s | 12.21 s | Fast, literal but correct | Yes (excellent) |
| Gemma3 (1B) | 15.08 s | 52.41 s | Very good quality, good balance | Yes (recommended) |
| Gemma3 (4B) | 70.16 s | 173.72 s | Good but slower | Maybe (if you accept latency) |
| Deepseek R1 (1.5B) | 25.00 s | 58.65 s | Fast but poor quality | No |
| Deepseek R1 (7B) | 129.48 s | 667.09 s | Very slow, poor | No |
| Deepseek R1 (8B) | 155.04 s | 402.29 s | Slow, better quality | No (too slow) |
| TinyLlama (1.1B) | 5.29 s | 15.12 s | Fast, unreliable | No (quality) |
| Phi-4 mini (3.8B) | 152.61 s | 638.03 s | Slow, loops | No |
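If the runs were captured with the script sketched earlier, the totals in this table can be pulled straight out of the logs. Assuming the verbose summaries contain the same "total duration" lines shown in the per-model stats above, a one-liner is enough:

```bash
# Print each log file name next to its reported total duration
grep -H "total duration" logs/*.txt
```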
Conclusion
In this post, we set out to evaluate how several lightweight language models perform on hardware that was never intended for AI workloads. The results were clear: while all models worked, Gemma3 delivered a level of speed and efficiency that genuinely surprised us. Running these models locally on a minimal setup helps us understand their real-world behavior, their limits, and their practical value for small, embedded systems.
This test is also an important baseline for future comparisons. By running the same models with the same prompts on different devices, we’ll be able to evaluate new hardware more objectively and measure how each platform really performs under identical conditions.
Thanks for reading! If you want to follow our tests, projects, and tutorials, make sure to visit our blog, subscribe to our mailing list for updates, and check out the video versions on our YouTube channel.


