On-Device AI Engineer. Building at RunAnywhere (YC W26).
I run large language models on phones: no server, no API, just native C++ and ARM silicon. Creator of ToolNeuron.
I wrote a custom GGUF inference engine that talks directly to the Hexagon cDSP via FastRPC, with HMX and HVX assembly kernels. Then I discovered the NPU was fused off on my test device.
I got frustrated with existing inference frameworks treating model execution as a monolith, so I built a DSL, a compiler, and a per-op dispatch runtime that fits in under 2 MB.
A technical deep dive into running large language models on Android using native C++ inference, JNI bindings, GGML, and llama.cpp, from model loading to token generation.