Siddhesh Sonar

writing

Bypassing QNN: Running LLMs Directly on Hexagon DSP
Feb 22, 2026 · 7 min

I wrote a custom GGUF inference engine that talks directly to the Hexagon cDSP via FastRPC, with HMX and HVX assembly kernels. Then I discovered the NPU was fused off on my test device.

Building a DSL for Edge Inference: Why I Wrote My Own Compiler
Jan 14, 2026 · 6 min

I got frustrated with existing inference frameworks that treat model execution as a monolithic step. So I built a DSL, a compiler, and a per-op dispatch runtime that fits in under 2 MB.

How ToolNeuron Runs LLMs on Android: Architecture Deep Dive
Dec 18, 2025 · 7 min

A technical deep dive into running large language models on Android using native C++ inference, JNI bindings, GGML, and llama.cpp, from model loading to token generation.