Most people think running LLMs on a phone means "call an API." ToolNeuron runs them entirely on-device — no server, no internet, no API keys. Here's how.
The Problem
Running a 7B parameter model requires roughly 4-7GB of RAM depending on quantization. A typical Android phone has 6-8GB total, shared with the OS, other apps, and the GPU. You can't just load a model into memory and call it a day.
The real challenge isn't "can it run?" — it's "can it run without the OS killing your app, while maintaining acceptable token generation speed, on hardware that varies wildly across devices?"
Architecture Overview
ToolNeuron's inference stack has four layers:
┌─────────────────────────────────┐
│ Kotlin UI (Jetpack Compose) │ ← User sees this
├─────────────────────────────────┤
│ Kotlin Inference Manager │ ← Orchestrates everything
├─────────────────────────────────┤
│ JNI Bridge (C++ ↔ Kotlin) │ ← Crosses the language boundary
├─────────────────────────────────┤
│ Native C++ (llama.cpp/GGML) │ ← Actual tensor operations
└─────────────────────────────────┘
Each layer exists for a specific reason. Let me walk through them bottom-up.
Layer 1: Native C++ Inference (GGML + llama.cpp)
The foundation is llama.cpp — Georgi Gerganov's C/C++ implementation of LLM inference. It uses GGML, a tensor library designed for inference on consumer hardware.
Why not TensorFlow Lite or ONNX Runtime?
Both are good frameworks, but for LLM inference on Android:
- TFLite doesn't natively support the transformer architectures used by Llama, Qwen, Mistral, etc. You'd need to convert models, losing quantization quality.
- ONNX Runtime supports transformers but its Android footprint is large, and KV-cache management for autoregressive generation isn't as mature.
- llama.cpp/GGML was built specifically for this. It understands GGUF model format natively, handles KV-cache efficiently, and supports quantization schemes (Q4_K_M, Q5_K_S, Q8_0) that were designed for the exact memory/quality tradeoffs we need on mobile.
GGUF Model Loading
When ToolNeuron loads a model, here's what actually happens:
// Simplified — the actual code has error handling and progress callbacks
struct llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 0; // CPU-only on most Android devices
model_params.use_mmap = true; // Memory-map the file — critical for mobile
struct llama_model* model = llama_load_model_from_file(path, model_params);
use_mmap = true is critical. Memory-mapping means the OS loads model weights from disk on demand instead of reading the entire file into RAM. On a 4GB quantized model, this can mean the difference between 4GB resident memory and 1-2GB, because the OS only keeps recently-accessed pages in RAM.
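To make the residency effect concrete, here is a minimal sketch (not ToolNeuron's actual code) that maps a file read-only the same way llama.cpp does with use_mmap, and uses mincore() to count how many of the mapping's pages are actually in RAM:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <vector>

// Count how many pages of a mapped region are resident in RAM.
// mincore() fills one byte per page; bit 0 set means "in core".
static size_t resident_pages(void* addr, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    std::vector<unsigned char> vec(npages);
    if (mincore(addr, len, vec.data()) != 0) return 0;
    size_t n = 0;
    for (unsigned char b : vec) n += (b & 1u);
    return n;
}

// Map a file read-only. The OS pages the contents in on first access
// and can evict clean pages again under memory pressure.
static void* map_file(const char* path, size_t* out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping keeps its own reference to the file
    *out_len = (size_t)st.st_size;
    return p == MAP_FAILED ? nullptr : p;
}
```

On a real GGUF file you would watch resident pages grow as weights are touched during the first forward pass, and shrink again when the OS needs the memory elsewhere.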
Quantization: The Key Tradeoff
A 7B model in FP16 is ~14GB. That won't fit on any phone. Quantization compresses the weights:
| Format | Size (7B) | Quality | Speed on ARM |
|---|---|---|---|
| Q8_0 | ~7.5GB | Near-FP16 | Baseline |
| Q5_K_S | ~5.0GB | Good | 1.1x faster |
| Q4_K_M | ~4.1GB | Acceptable | 1.3x faster |
| Q3_K_M | ~3.3GB | Noticeable degradation | 1.5x faster |
ToolNeuron defaults to Q4_K_M for 7B models on devices with 6GB+ RAM. For 8GB+ devices, Q5_K_S gives a meaningful quality improvement at a small speed cost.
The quantization format isn't just about size — it affects which SIMD instructions can be used. ARM NEON can process Q4 weights using vld1q_u8 + shift operations that are extremely efficient. Q5 requires extra bit manipulation that's measurably slower.
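The sizes in the table fall out of simple arithmetic. A rough back-of-the-envelope, using approximate effective bits-per-weight figures (each scheme stores per-block scale metadata alongside the packed weights, and the exact rate varies slightly with the model's tensor mix):

```cpp
// Approximate effective bits per weight, including block-scale metadata.
// These are ballpark figures, not exact GGUF rates.
constexpr double bpw_f16    = 16.0;
constexpr double bpw_q8_0   = 8.5;  // 8-bit weights + fp16 scale per 32-weight block
constexpr double bpw_q5_k_s = 5.5;
constexpr double bpw_q4_k_m = 4.8;

// Rough model file size in GB from parameter count and bits per weight.
constexpr double model_gb(double params_billions, double bits_per_weight) {
    return params_billions * 1e9 * bits_per_weight / 8.0 / 1e9;
}
```

Plugging in 7B parameters reproduces the table within rounding: 16 bits gives ~14GB, ~8.5 bits gives ~7.4GB, and ~4.8 bits gives ~4.2GB.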
Layer 2: JNI Bridge
Kotlin can't call C++ directly. The Java Native Interface (JNI) is the bridge, and it's where most Android AI apps get it wrong.
The Naive Approach (Don't Do This)
// BAD: Allocating a new string for every token
external fun generateNextToken(): String
This creates a new Java String object for every generated token. At 20 tokens/second, that's 20 JNI boundary crossings and 20 heap allocations per second. The garbage collector will hate you.
What ToolNeuron Does Instead
// GOOD: Batch callback with direct byte buffer
external fun startGeneration(
    modelPtr: Long,
    prompt: String,
    params: Long,
    callback: Long // Native-side handle used to invoke the Kotlin callback
)
The native code runs the full generation loop in C++, calling back to Kotlin only when it has a batch of tokens or when the user needs to see output. This minimizes JNI crossings from O(tokens) to O(batches).
For the token data itself, we use DirectByteBuffer — a Java NIO buffer that lives outside the JVM heap, accessible from both C++ and Kotlin without copying:
// C++ side: write token directly into shared buffer
void write_token_to_buffer(JNIEnv* env, jobject buffer, const char* token, int len) {
char* buf = (char*)env->GetDirectBufferAddress(buffer);
memcpy(buf + offset, token, len);
}
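The batching half of the pattern can be sketched in a few lines. This is illustrative (the names are not ToolNeuron's actual API), with a std::function standing in for the JNI callback: tokens accumulate in a shared buffer and the expensive boundary crossing fires once per batch instead of once per token.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Accumulates tokens and invokes the (expensive) callback once per batch,
// turning O(tokens) boundary crossings into O(batches).
class TokenBatcher {
public:
    TokenBatcher(size_t batch_size, std::function<void(const char*, size_t)> on_batch)
        : batch_size_(batch_size), on_batch_(std::move(on_batch)) {}

    void push(const std::string& token) {
        buf_.insert(buf_.end(), token.begin(), token.end());
        if (++pending_ >= batch_size_) flush();
    }

    // Flush whatever is buffered: called at end of generation, or early
    // when the UI needs output immediately (time-to-first-token).
    void flush() {
        if (!buf_.empty()) on_batch_(buf_.data(), buf_.size());
        buf_.clear();
        pending_ = 0;
    }

private:
    size_t batch_size_;
    size_t pending_ = 0;
    std::vector<char> buf_;
    std::function<void(const char*, size_t)> on_batch_;
};
```

The batch size is the knob: larger batches mean fewer JNI crossings but chunkier UI updates.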
Memory Management Across the JNI Boundary
The model pointer (llama_model*) lives in C++ heap memory. Kotlin holds it as a Long. This means:
- Kotlin can pass the pointer to any JNI function
- The model isn't subject to garbage collection
- You MUST manually free it — there's no destructor
ToolNeuron uses a reference-counted wrapper that calls llama_free_model() when the last Kotlin reference is released. If the app crashes before cleanup, Android's process death handles it — the OS reclaims all memory including native allocations.
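The wrapper idea can be sketched as follows. This is an illustrative stand-in, not ToolNeuron's actual class; the deleter is injected so the sketch compiles without llama.cpp, where in the real app it would be llama_free_model():

```cpp
#include <atomic>
#include <functional>

// Reference-counted owner of a native pointer: the pointer is freed exactly
// once, when the last copy of the handle is destroyed.
template <typename T>
class NativeHandle {
public:
    NativeHandle(T* ptr, std::function<void(T*)> del)
        : ptr_(ptr), del_(std::move(del)), refs_(new std::atomic<int>(1)) {}

    NativeHandle(const NativeHandle& other)
        : ptr_(other.ptr_), del_(other.del_), refs_(other.refs_) {
        refs_->fetch_add(1);
    }

    NativeHandle& operator=(const NativeHandle&) = delete;

    ~NativeHandle() {
        // fetch_sub returns the previous value: 1 means we were the last holder
        if (refs_->fetch_sub(1) == 1) {
            del_(ptr_);
            delete refs_;
        }
    }

    T* get() const { return ptr_; } // the value Kotlin stores as a Long

private:
    T* ptr_;
    std::function<void(T*)> del_;
    std::atomic<int>* refs_;
};
```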
Layer 3: Kotlin Inference Manager
This layer handles everything the C++ layer shouldn't care about:
- Model discovery: Scanning storage for GGUF files, reading metadata
- Runtime model switching: Unloading one model, loading another without restarting
- Context management: Creating/destroying llama contexts for different conversations
- Parameter management: Temperature, top-p, top-k, repeat penalty — all configurable per-conversation
- Streaming output: Converting token callbacks into Kotlin Flows that the UI observes
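For a sense of what the sampling parameters do mechanically, here is a minimal temperature + top-k sampler (textbook softmax sampling, not llama.cpp's actual sampler chain): logits are scaled by 1/temperature, all but the k largest are discarded, and a token is drawn from the renormalized distribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Draw one token index from logits using top-k filtering and temperature.
int sample_top_k(const std::vector<float>& logits, int k,
                 float temperature, std::mt19937& rng) {
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;

    // Keep only the k highest-logit candidates.
    k = std::min<int>(k, (int)logits.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Temperature-scaled softmax over the survivors (subtracting the max
    // for numerical stability).
    double maxl = logits[idx[0]] / temperature;
    std::vector<double> probs(k);
    for (int i = 0; i < k; ++i)
        probs[i] = std::exp(logits[idx[i]] / temperature - maxl);

    // discrete_distribution normalizes the weights internally.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}
```

Lower temperature sharpens the distribution toward the top token; k=1 degenerates to greedy decoding. Top-p works the same way but cuts by cumulative probability mass instead of a fixed count.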
The Model Switch Problem
Switching models on mobile is expensive. You need to:
1. Finish or cancel the current generation
2. Free the context (KV-cache memory)
3. Free the model (weights memory)
4. Wait for memory to actually be reclaimed
5. Load the new model
6. Create a new context
Steps 3-4 are where Android gets tricky. llama_free_model() calls free(), but the allocator might not return memory to the OS immediately. ToolNeuron forces a malloc_trim(0) after freeing to hint that the freed memory should be returned.
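The teardown order matters: the context must go before the model it references, and the trim hint comes last. A compilable sketch with stand-in types (the llama.cpp calls replaced by deletes, so this is the shape of the sequence, not the real code):

```cpp
#if defined(__linux__)
#include <malloc.h>  // malloc_trim lives here on glibc and Android's bionic
#endif
#include <cassert>
#include <vector>

// Stand-ins for llama_model / llama_context: big heap allocations.
struct FakeModel   { std::vector<char> weights; };
struct FakeContext { std::vector<char> kv_cache; };

void switch_teardown(FakeContext*& ctx, FakeModel*& model) {
    delete ctx;    // 1. free the context first (its KV-cache refers to the model)
    ctx = nullptr;
    delete model;  // 2. then free the model weights
    model = nullptr;
#if defined(__linux__)
    malloc_trim(0); // 3. hint the allocator to return freed pages to the OS now
#endif
}
```

Only after this sequence completes does loading the replacement model begin, so the two models' working sets never overlap in RAM.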
Layer 4: UI
Jetpack Compose observes the Kotlin Flows from the Inference Manager. Token-by-token streaming is rendered using LazyColumn with incremental text updates. Nothing fancy here — the complexity is all below.
What I Learned Building This
mmap is non-negotiable on mobile. Without it, you're dead on devices with less than 8GB RAM.
JNI crossings are expensive. Batch your data. Use direct buffers. Minimize the number of JNI calls in hot paths.
Quantization choice depends on the device, not just the model. Q4_K_M decodes at different speeds on a Snapdragon 8 Gen 3 than on a Dimensity 9300 because of differences in NEON throughput and cache sizes, so the quality/speed tradeoff worth making shifts from device to device.
Android will kill your app. If you're using too much memory and the user switches to another app, Android's LMK (Low Memory Killer) will terminate your process. You need to be aggressive about memory management and graceful about being killed.
Users don't care about tokens/second. They care about "does it feel responsive?" Streaming the first token fast matters more than peak throughput. ToolNeuron prioritizes time-to-first-token by using smaller batch sizes for the initial prompt processing.
Numbers
On a Snapdragon 8 Gen 2 device with 8GB RAM, running Qwen 2.5 7B Q4_K_M:
- Model load time: ~3.2 seconds (mmap, cold start)
- Prompt processing: ~180 tokens/second
- Token generation: ~18 tokens/second
- Memory usage: ~2.1GB resident (4.1GB model, mmap keeps most on disk)
- Time to first token (for a 50-token prompt): ~280ms
These numbers vary significantly across devices. A Snapdragon 680 (budget phone) generates at ~4 tokens/second. A Snapdragon 8 Gen 3 hits ~24 tokens/second with the same model.
Open Source
ToolNeuron is Apache 2.0 licensed: github.com/Siddhesh2377/ToolNeuron
The native inference engine: github.com/Siddhesh2377/Ai-Systems-New
If you're building on-device AI for Android and want to talk architecture, find me on X/Twitter or LinkedIn.