Bypassing QNN: Running LLMs Directly on Hexagon DSP
Here's what happens when your benchmark lies to you.
I built a custom GGUF inference engine that bypasses Qualcomm's QNN SDK entirely and dispatches matrix operations directly to the Hexagon DSP via FastRPC. Wrote the GEMM kernels in HMX assembly. Ran the benchmarks. Got 8 TFLOPS FP16 and 5 TOPS INT8.
Then I checked the output. All zeros.
I'll explain the architecture first, then the bug hunt, then why 8 TFLOPS of zeros taught me more about mobile silicon than any documentation ever did.
Why Bypass QNN
Qualcomm's QNN SDK is the official way to run compute on the Hexagon DSP. You define a graph of ops, QNN compiles it, and the runtime dispatches it to the DSP. It works, but it has constraints:
1. You submit whole graphs, not individual ops. If you want per-layer CPU/NPU dispatch decisions at runtime (say, based on thermal state), you're fighting the API.
2. QNN's op coverage has gaps. Custom attention variants, non-standard activations, anything QNN hasn't seen before falls back to CPU through QNN's own fallback path, which adds dispatch overhead.
3. You don't control the memory layout. QNN decides how tensors are laid out in VTCM (the DSP's tightly coupled memory). For GGUF models where weights are already in a specific quantized layout, this means an unnecessary repacking step.
I wanted to skip all of that and talk to the DSP directly.
The Architecture
Android App / CLI
| (JNI or direct exec)
v
n-llm-engine (C++)
| (GGUF loader, tokenizer, KV cache, scheduler)
v
libhtp_ops.so (ARM64 stub, FastRPC marshaling)
| (FastRPC over /dev/cdsprpc)
v
libhtp_ops_skel.so (Hexagon, runs on cDSP)
| (HMX/HVX instructions, VTCM scratchpad)
v
Hexagon DSP hardware
FastRPC is the IPC mechanism between the ARM CPU and the Hexagon DSP. It serializes function arguments, sends them over a character device (/dev/cdsprpc), and the DSP-side skeleton library receives them. No QNN in the path.
For buffer sharing, I use rpcmem: ION/dmabuf-backed allocations that are accessible from both CPU and DSP without copying. The CPU allocates a buffer, calls fastrpc_mmap with the FASTRPC_MAP_FD flag, and the DSP can access the same physical memory directly via HAP_mmap_get.
That last detail took half a night to figure out. The standard remote_register_buf function uses FASTRPC_MAP_STATIC, which does NOT enable HAP_mmap_get on the DSP side. You get a 0x80000600 error and no indication of why. The fix was buried in /usr/include/misc/fastrpc.h.
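Getting the map flag right is only half of the lifecycle: every mapping also has to be released, and the bug list later in this post includes a leak from exactly that. Here is a minimal sketch of one way to make the pairing hard to forget, using RAII. The `MapHooks` callbacks and the `DspMapping` class are hypothetical stand-ins for the real map and unmap calls; their signatures are simplified and do not match the SDK, and this is not the code from my engine.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical hooks standing in for the real map/unmap calls
// (simplified signatures, NOT the SDK's). Both return 0 on success.
struct MapHooks {
    std::function<int(int fd, void* addr, std::size_t len)> map;
    std::function<int(int fd, void* addr, std::size_t len)> unmap;
};

// Owns one DSP-side mapping: maps in the constructor, unmaps in the
// destructor, so freeing or resizing a buffer cannot skip the unmap.
class DspMapping {
public:
    DspMapping(const MapHooks& hooks, int fd, void* addr, std::size_t len)
        : hooks_(&hooks), fd_(fd), addr_(addr), len_(len) {
        assert(hooks_->map && hooks_->unmap);
        mapped_ = (hooks_->map(fd_, addr_, len_) == 0);
    }
    ~DspMapping() { release(); }

    DspMapping(const DspMapping&) = delete;
    DspMapping& operator=(const DspMapping&) = delete;

    // Moving transfers ownership; the moved-from object will not unmap.
    DspMapping(DspMapping&& other) noexcept
        : hooks_(other.hooks_), fd_(other.fd_), addr_(other.addr_),
          len_(other.len_), mapped_(std::exchange(other.mapped_, false)) {}

    bool ok() const { return mapped_; }

private:
    void release() {
        if (mapped_) {
            hooks_->unmap(fd_, addr_, len_);
            mapped_ = false;
        }
    }

    const MapHooks* hooks_;  // must outlive this object
    int fd_;
    void* addr_;
    std::size_t len_;
    bool mapped_ = false;
};
```

In a real build the two hooks would wrap the FastRPC map and unmap calls, with the map side passing the FD flag described above.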
Writing HMX Kernels
HMX is the matrix extension on Hexagon V73+. It has a systolic array that does 32x32 tile multiply-accumulate operations. The programming model is:
1. Clear the accumulator (mxclracc)
2. Load activation tile from VTCM into the activation register (activation.ub = mxmem)
3. Load weight tile from VTCM into the weight register (weight.b = mxmem)
4. The MAC happens implicitly after both loads
5. Store the result (mxmem = acc)
Sounds simple. The details are where it gets ugly.
The weight layout has to be in "Crouton" format, which is a specific tile permutation that matches HMX's internal data flow. You can't just point HMX at a row-major weight matrix. I wrote a weight packer that converts Q4_0 quantized weights into Crouton FP16 tiles at model load time.
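The first half of that packer is plain Q4_0 dequantization, which can be sketched on the CPU. This follows the Q4_0 block layout as ggml defines it (32 weights per block, one scale, 4-bit values offset by 8), with one simplification that is my assumption for this sketch: the scale is stored as a `float` rather than FP16. The Crouton permutation itself is not shown, and the names `BlockQ4_0` and `dequantize_q4_0` here are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kQ4BlockSize = 32;  // weights per Q4_0 block

// Simplified Q4_0 block. ggml stores the scale as FP16; a float is used
// here (an assumption for this sketch) to avoid a half-precision type.
struct BlockQ4_0 {
    float scale;
    std::uint8_t quants[kQ4BlockSize / 2];  // two 4-bit values per byte
};

// Expand Q4_0 blocks to floats. Each nibble q in [0, 15] decodes to
// (q - 8) * scale. Low nibbles hold elements 0..15 of the block,
// high nibbles hold elements 16..31.
std::vector<float> dequantize_q4_0(const std::vector<BlockQ4_0>& blocks) {
    std::vector<float> out(blocks.size() * kQ4BlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const BlockQ4_0& blk = blocks[b];
        float* dst = out.data() + b * kQ4BlockSize;
        for (int j = 0; j < kQ4BlockSize / 2; ++j) {
            const int lo = (blk.quants[j] & 0x0F) - 8;
            const int hi = (blk.quants[j] >> 4) - 8;
            dst[j] = static_cast<float>(lo) * blk.scale;
            dst[j + kQ4BlockSize / 2] = static_cast<float>(hi) * blk.scale;
        }
    }
    return out;
}
```

The packer would then convert these values to FP16 and write them out in tile order.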
For FP16 GEMM, the instruction sequence is:
mxclracc
{ activation.hf = mxmem(Ract, Rlimit)
weight.hf = mxmem(Rwgt, Rlimit) }
cvt.uh = acc(2):2x2
mxmem(Rout, 0) = cvt
The cvt step converts the accumulator to UINT16 output. On V73, you MUST use cvt.uh = acc(R):2x2 for FP16. The more intuitive cvt.hf = acc(R) (direct FP16 output) silently produces zeros on this silicon revision. I learned this the hard way.
The DSP kernel library includes: FP16 and INT8 GEMM (HMX), FlashAttention (HMX tiled), RMSNorm (HVX 1024-bit vector ops), and elementwise ops (HVX). Each kernel manages its own VTCM allocation and DMA staging.
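For checking kernel output, it helps to have a scalar reference that walks the matrices in the same tile order as the programming model above. The following is a CPU-only sketch under my own assumptions (row-major FP32 inputs, dimensions that are multiples of 32); the function name `tiled_gemm_reference` is illustrative and this is not the HMX kernel. Note where the accumulator is cleared: once per output tile, before the loop over K.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 32;  // tile edge, matching the 32x32 tiles above

// Row-major C[M x N] = A[M x K] * B[K x N], walked tile by tile.
// Dimensions must be multiples of kTile to keep the sketch short.
std::vector<float> tiled_gemm_reference(const std::vector<float>& a,
                                        const std::vector<float>& b,
                                        std::size_t m, std::size_t k,
                                        std::size_t n) {
    assert(m % kTile == 0 && k % kTile == 0 && n % kTile == 0);
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n);
    for (std::size_t tm = 0; tm < m; tm += kTile) {
        for (std::size_t tn = 0; tn < n; tn += kTile) {
            // Step 1: clear the accumulator once per OUTPUT tile.
            float acc[kTile][kTile] = {};
            // Steps 2-4: for each K tile, load both operands and accumulate.
            for (std::size_t tk = 0; tk < k; tk += kTile) {
                for (std::size_t i = 0; i < kTile; ++i) {
                    for (std::size_t p = 0; p < kTile; ++p) {
                        const float av = a[(tm + i) * k + tk + p];
                        for (std::size_t j = 0; j < kTile; ++j) {
                            acc[i][j] += av * b[(tk + p) * n + tn + j];
                        }
                    }
                }
            }
            // Step 5: store the finished tile.
            for (std::size_t i = 0; i < kTile; ++i) {
                for (std::size_t j = 0; j < kTile; ++j) {
                    c[(tm + i) * n + tn + j] = acc[i][j];
                }
            }
        }
    }
    return c;
}
```

Hoisting the `acc` declaration out of the two outer loops would carry stale sums from one output tile into the next.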
The Bug Hunt
Getting the first correct output from the DSP took about a week. Along the way I found and fixed 7 real bugs:
1. shared_free didn't call fastrpc_munmap. The DSP TLB entry leaked on every buffer resize. After enough resizes, the DSP threw a TLBMISS RW crash on the next matmul.
2. transfer_permuted_weight_chunk_fp16 used DMA for VTCM staging, but DMA doesn't handle HAP_mmap_get addresses correctly. Weights arrived as all-zero in VTCM. Switched to memcpy.
3. htp_ops_mat_mul_buf used QURT_MEM_CACHE_INVALIDATE after memcpy to VTCM. This drops the bytes you just wrote, because invalidate discards dirty cache lines. Changed to FLUSH.
4. HMX :deep mode on FP16 activations. Produces zeros on V73. Removed.
5. cvt.hf = acc(R) for FP16 output. Produces zeros on V73. Changed to cvt.uh = acc(R):2x2.
6. mxclracc.hf was called once for all output tiles instead of per-tile. The accumulator had stale values from the previous tile.
7. fastrpc_mmap requires FASTRPC_MAP_FD=2, not the FASTRPC_MAP_STATIC=0 that remote_register_buf uses. Without it, HAP_mmap_get fails on the DSP.
Each of these individually would produce wrong output or a crash, and they were all present simultaneously. Debugging on the DSP means reading FARF logs from logcat, grepping for CDSP0:[DU], and reasoning about what's happening on hardware you can't attach a debugger to.
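Bug 3 is the easiest of these to get backwards, so here is a toy model of the difference between flush and invalidate on a write-back cache. It illustrates the concept only: `ToyCache` is a hypothetical class I made up for this sketch, not the QuRT cache API, and real caches work on lines rather than single bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy write-back cache in front of a byte array. "memory" is what an
// observer that bypasses the cache would see.
class ToyCache {
public:
    explicit ToyCache(std::vector<std::uint8_t>& memory) : memory_(memory) {}

    // Writes land in the cache and stay dirty until flushed.
    void write(std::size_t addr, std::uint8_t value) {
        assert(addr < memory_.size());
        dirty_[addr] = value;
    }

    // Flush: write dirty bytes back to memory; nothing is dirty afterwards.
    void flush() {
        for (const auto& entry : dirty_) {
            memory_[entry.first] = entry.second;
        }
        dirty_.clear();
    }

    // Invalidate: drop cached bytes WITHOUT writing them back.
    void invalidate() { dirty_.clear(); }

private:
    std::vector<std::uint8_t>& memory_;
    std::unordered_map<std::size_t, std::uint8_t> dirty_;
};
```

Write then invalidate leaves memory untouched; write then flush makes the bytes visible, which is the behavior a copy into a staging buffer needs.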
The Fuse-Off Discovery
After fixing all 7 bugs, I wrote a known-input test. Fill activation and weight buffers with 1.0, run a 32x32 matmul, expect 32.0 in every output element.
Got all zeros.
Ran the op_tests benchmark suite. It reported 8 TFLOPS for FP16 HMX and 5 TOPS for INT8 HMX. Impressive numbers. But the benchmark only measures cycle throughput. It counts how fast the HMX instructions retire. It never checks the output bytes.
So I modified the benchmark to print the output after each run. Every element: 0x0000.
The HMX instructions execute without faulting. They retire in ~14 cycles (44.7 ns at 314 MHz cDSP). That's way too fast for a real 32x32 matmul. The instruction decodes and retires as a no-op.
HMX is fused off on the SM7635 (Snapdragon 7s Gen 3). The HMX_SUPPORT_DEPTH=0 flag in the QNN runtime is truthful, not misleading as I initially thought. Qualcomm ships the HMX code paths in libQnnHtpV73Skel.so for V73 SoCs that DO have HMX enabled. The SM7635 is a different silicon bin with HMX disabled.
The reference paper I was working from (arxiv:2509.23324) was developed on Snapdragon 8 Gen 2 (V73) where HMX is enabled. Same ISA, different silicon.
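A benchmark can make this failure loud instead of silent by classifying the output of a known-input run before it reports any throughput. Here is a minimal sketch, assuming the output has already been converted to `float` on the host; the `OutputCheck` enum and `check_uniform_output` function are names I made up for illustration, not part of op_tests.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class OutputCheck { Ok, AllZero, Mismatch };

// Classify a kernel's output against the value every element should hold.
// AllZero is reported separately because it is the signature of a unit
// that retires instructions without computing anything.
OutputCheck check_uniform_output(const std::vector<float>& out, float expected,
                                 float tolerance = 1e-3f) {
    assert(!out.empty());
    bool all_zero = true;
    bool all_match = true;
    for (float v : out) {
        if (v != 0.0f) all_zero = false;
        // Written as !(x <= tol) so that NaN counts as a mismatch.
        if (!(std::fabs(v - expected) <= tolerance)) all_match = false;
    }
    if (all_match) return OutputCheck::Ok;
    return all_zero ? OutputCheck::AllZero : OutputCheck::Mismatch;
}
```

For the all-ones 32x32 test above, the call would be `check_uniform_output(out, 32.0f)`, and anything other than `Ok` should fail the benchmark run rather than print a TFLOPS figure.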
What's Actually Usable
After the fuse-off discovery:
| Block | Status | What it gives you |
| --- | --- | --- |
| HMX (matrix) | fused off, writes zeros | none |
| HVX (vector) | works, 1024-bit SIMD | ~0.5 TOPS via vrmpy/vmpy |
| VTCM | works, 2 MB | data staging |
| FastRPC | works (after my 7 bug fixes) | host-to-DSP buffer passing |
| Adreno GPU | untested | ~1.5 TFLOPS FP16 via OpenCL |
| ARM CPU | works | 1.23 tok/s baseline |
The engine currently produces correct output at 1.23 tok/s on CPU with DSP dispatch gated off. The HVX-only path (no HMX) gets about 3-4x over the CPU baseline. The same codebase targets the flagship parts where HMX is enabled (V73 on 8 Gen 2, V75 on 8 Gen 3, V79 on 8 Elite), and there the 8 TFLOPS would be real.
What I Took Away
Don't trust benchmarks that don't verify output. This is obvious in hindsight, but when your benchmark reports 8 TFLOPS, you don't immediately think "what if those are 8 TFLOPS of nothing?"
Fuse-off is a thing. Mid-range SoCs share die designs with flagships but disable blocks to hit price/power targets. The ISA still decodes. The instructions still retire. They just don't compute.
The FastRPC plumbing works. Zero-copy buffer sharing between CPU and DSP is real and fast once you get the mmap flags right. The 7 bugs I fixed are all in my kernel library, not in FastRPC itself. Qualcomm's IPC stack is solid.
Writing DSP assembly is not that hard once you accept that the documentation won't tell you everything. The real knowledge lives in Qualcomm's own compiled skeleton libraries. Disassembling libQnnHtpV73Skel.so taught me more about HMX instruction patterns than any reference manual.
The code isn't public, but if you're doing similar work on Hexagon, I'm happy to compare notes: [siddheshsonar2377@gmail.com](mailto:siddheshsonar2377@gmail.com)