Building a DSL for Edge Inference: Why I Wrote My Own Compiler

Every inference framework I've used makes the same mistake: it treats the model as one blob and runs it on one backend. TFLite compiles your model for CPU or GPU. ONNX Runtime picks a single execution provider. QNN compiles the whole graph for the Hexagon NPU.

On paper that sounds fine. In practice, on a Snapdragon SoC, you have a CPU with big and little cores, an Adreno GPU, and a Hexagon NPU, and different ops perform best on different hardware. A depthwise conv might fly on the NPU, while a custom attention op runs faster on the CPU. Existing frameworks don't let you split the graph that way. You pick one backend and eat the cost on the ops that don't fit.

I wanted per-op dispatch. So I built it.

What Edge AI Studio Is

It's three things:

1. A DSL called .edge for defining model compute graphs
2. A compiler that turns .edge scripts into .egraph binaries
3. A runtime that reads .egraph files and dispatches each op to the right backend

The runtime is under 2MB. It runs Qwen3-0.6B on Snapdragon today.

The .edge Language

An .edge script describes a compute graph. Not a model file, not weights. The graph structure and the data flow between ops.

@model qwen3_0_6b
@precision fp16

node embed: embedding(vocab=151936, dim=1024)
node rms_0: rms_norm(dim=1024, eps=1e-6)

node q_proj: linear(in=1024, out=1024) [backend: npu]
node k_proj: linear(in=1024, out=256) [backend: npu]
node v_proj: linear(in=1024, out=256) [backend: npu]

node attn: attention(heads=16, kv_heads=4, dim=1024) [backend: cpu]
node ffn_gate: linear(in=1024, out=2816) [backend: npu]
node ffn_up: linear(in=1024, out=2816) [backend: npu]
node ffn_down: linear(in=2816, out=1024) [backend: npu]

flow: embed -> rms_0 -> q_proj, k_proj, v_proj -> attn -> ffn_gate, ffn_up -> silu_mul -> ffn_down

The [backend: npu] annotations are hints. The compiler doesn't blindly follow them. It checks each one against a hardware manifest (more on that below) and overrides any hint the target device can't honor. If you're compiling for a device without an NPU, all those hints get silently redirected to CPU or GPU.
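A minimal sketch of that hint check, in Python. This is not the actual compiler code: the function name resolve_backend, the plain-dict manifest, and the single "cpu" fallback are all assumptions made for illustration (the real compiler may also redirect to GPU).

```python
# Hypothetical manifest fragment; only the fields this check needs.
MANIFEST = {
    "cpu": {},  # assumed to be able to run every op
    "npu": {"supported_ops": ["linear", "conv2d", "depthwise_conv", "pool"]},
}


def resolve_backend(op, hint, manifest, fallback="cpu"):
    """Return the backend an op lands on, given an optional [backend: ...] hint."""
    if hint is None:
        return fallback
    backend = manifest.get(hint)
    if backend is None:
        return fallback  # the target device has no such backend
    supported = backend.get("supported_ops")
    if supported is not None and op not in supported:
        return fallback  # backend exists but cannot run this op
    return hint


print(resolve_backend("linear", "npu", MANIFEST))
print(resolve_backend("attention", "npu", MANIFEST))
print(resolve_backend("linear", "npu", {"cpu": {}}))
```

The first call keeps the hint, the second is overridden because the op is not in the NPU's supported list, and the third is overridden because the device has no NPU at all.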

Why a DSL and not just a config file? Because compute graphs have structure. Branching, merging, loops for recurrent models. Trying to express that in JSON or YAML gets unreadable fast. A dedicated syntax makes the graph structure visible.
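To make the branching and merging concrete, here is a small Python sketch that expands a flow line into explicit edges. It assumes that a comma-separated stage fans out from (or merges into) every node in the neighboring stage; that reading of the syntax is an assumption, and parse_flow is a hypothetical helper, not part of the real parser.

```python
def parse_flow(line):
    """Expand 'flow: a -> b, c -> d' into a list of (src, dst) edges."""
    body = line.split(":", 1)[1]
    stages = [
        [name.strip() for name in stage.split(",")]
        for stage in body.split("->")
    ]
    edges = []
    # Connect every node in one stage to every node in the next stage.
    for srcs, dsts in zip(stages, stages[1:]):
        for src in srcs:
            for dst in dsts:
                edges.append((src, dst))
    return edges


flow_edges = parse_flow("flow: embed -> rms_0 -> q_proj, k_proj, v_proj -> attn")
for src, dst in flow_edges:
    print(src, "->", dst)
```

One short flow line turns into seven edges here. In JSON or YAML each of those edges would have to be written out by hand, which is where readability falls apart.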

The Compiler

The pipeline is:

.edge source -> Lexer -> Parser -> AST -> Optimizer -> Backend Assigner -> .egraph binary

The optimizer does standard graph-level stuff. Op fusion (linear + bias + activation into a single fused kernel), dead node elimination, constant folding. Nothing unusual here.
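For readers who haven't written these passes before, here is a toy Python version of two of them. It is a sketch under simplifying assumptions, not the real optimizer: the graph is a plain edge list, fusion only looks at a flat op sequence, and the node name debug_probe is made up for the example.

```python
def eliminate_dead_nodes(nodes, edges, outputs):
    """Keep only the nodes that some graph output depends on."""
    preds = {name: [] for name in nodes}
    for src, dst in edges:
        preds[dst].append(src)
    live, stack = set(), list(outputs)
    while stack:  # walk backwards from the outputs
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(preds[name])
    return [name for name in nodes if name in live]


def fuse_linear_bias_act(ops):
    """Collapse each linear, bias, activation run into one fused op.

    Only valid when the intermediate results have no other consumers,
    which this toy version assumes rather than checks.
    """
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == ["linear", "bias", "activation"]:
            fused.append("fused_linear_bias_act")
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused


nodes = ["embed", "rms_0", "q_proj", "debug_probe"]
edges = [("embed", "rms_0"), ("rms_0", "q_proj"), ("rms_0", "debug_probe")]
live_nodes = eliminate_dead_nodes(nodes, edges, outputs=["q_proj"])
fused_ops = fuse_linear_bias_act(["rms_norm", "linear", "bias", "activation", "linear"])
print(live_nodes)
print(fused_ops)
```

The dead-node pass drops debug_probe because nothing downstream of it reaches an output, and the fusion pass replaces the three-op run with a single fused kernel identifier.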

The backend assigner is where it gets interesting. It reads a hardware manifest for the target SoC and assigns each op to a backend. The assignment considers:

- Does this backend support this op at this precision?
- What's the estimated throughput for this op shape on this backend?
- What's the memory transfer cost to move tensors between backends?

That last one matters a lot. If an op runs 2x faster on the NPU but requires copying 8MB of tensor data from CPU memory to NPU memory and back, the copy cost might eat the speedup. The compiler tries to minimize backend transitions by grouping adjacent ops on the same backend when the per-op gain doesn't justify the transfer.
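The trade-off can be sketched as a small dynamic program. The Python below is an illustration, not the actual backend assigner: it assumes the graph is a simple chain of ops, that per-op costs and transfer costs are already known in milliseconds, and that an op missing from a backend's cost table is unsupported there. All numbers are invented.

```python
def assign_backends(costs, transfer_ms):
    """Pick a backend per op on a linear chain, minimizing total time.

    costs[i]       maps backend name -> estimated ms for op i
                   (a missing backend means the op is unsupported there).
    transfer_ms[i] is the cost of moving op i's input tensor if op i-1
                   ran on a different backend (index 0 is unused).
    Returns (total_ms, [backend for each op]).
    """
    # best[b] = cheapest (total, path) so far that ends on backend b
    best = {b: (ms, [b]) for b, ms in costs[0].items()}
    for i in range(1, len(costs)):
        nxt = {}
        for b, ms in costs[i].items():
            candidates = []
            for prev, (total, path) in best.items():
                hop = 0.0 if prev == b else transfer_ms[i]
                candidates.append((total + hop + ms, path + [b]))
            nxt[b] = min(candidates)
        best = nxt
    return min(best.values())


# Middle op is 2x faster on the NPU, neighbors are CPU-only.
costs = [{"cpu": 1.0}, {"cpu": 2.0, "npu": 1.0}, {"cpu": 1.0}]

_, plan_expensive = assign_backends(costs, transfer_ms=[0.0, 0.8, 0.8])
_, plan_cheap = assign_backends(costs, transfer_ms=[0.0, 0.1, 0.1])
print(plan_expensive)
print(plan_cheap)
```

With expensive transfers the plan stays on CPU throughout, because two hops cost more than the NPU saves. With cheap transfers the middle op moves to the NPU. A real graph branches and merges, so the real search is harder than this chain version.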

Hardware Manifests

A hardware manifest is a JSON file that describes what a specific SoC can do.

{
  "soc": "sm8550",
  "name": "Snapdragon 8 Gen 2",
  "backends": {
    "cpu": {
      "cores": [4, 3, 1],
      "isa": ["neon", "i8mm", "dotprod"],
      "l2_cache_mb": 4,
      "max_threads": 8
    },
    "gpu": {
      "name": "adreno_740",
      "api": "opencl",
      "fp16_tflops": 2.4,
      "memory_bw_gbps": 51.2
    },
    "npu": {
      "name": "hexagon_v73",
      "api": "qnn",
      "supported_ops": ["linear", "conv2d", "depthwise_conv", "pool"],
      "unsupported_ops": ["attention", "rms_norm", "rope"],
      "max_tensor_mb": 64
    }
  }
}

The key piece is the unsupported_ops list. Every NPU has gaps. QNN on the Hexagon V73 doesn't support fused attention or rotary position embeddings. If you compile the whole model for QNN, those ops get executed on CPU anyway, but through QNN's fallback path, which adds overhead. With per-op dispatch, I just assign those ops to CPU directly and skip the QNN overhead entirely.

I maintain manifest files for 5 SoCs right now. Adding a new one takes about an hour of benchmarking per op type on the target device.

The .egraph Binary

The compiler output is a binary format. Not human readable, not meant to be. It contains:

- Op graph topology (adjacency list)
- Per-op backend assignment
- Tensor shape metadata
- Memory allocation plan (which tensors can share buffers)
- Fused kernel identifiers

The memory allocation plan is important. Inference has predictable memory access patterns. The compiler analyzes tensor lifetimes and assigns overlapping buffers to tensors that are never alive at the same time. On a 2GB RAM device this can cut peak memory usage by 30-40%.
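A simplified version of lifetime-based buffer sharing looks like this in Python. It is a greedy sketch, not the planner the compiler actually uses: the tuple layout, the step numbering, and the first-fit reuse rule are assumptions, and the tensor sizes are invented.

```python
def plan_buffers(tensors):
    """Assign tensors to shared buffers based on their lifetimes.

    tensors: list of (name, size_bytes, first_use, last_use), where the
    use steps are inclusive op indices in execution order.
    Returns (assignment, buffer_sizes).
    """
    buffers = []      # each entry: {"size": bytes, "free_at": last step in use}
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for idx, buf in enumerate(buffers):
            if buf["free_at"] < first:  # previous tenant is already dead
                buf["size"] = max(buf["size"], size)
                buf["free_at"] = last
                assignment[name] = idx
                break
        else:
            # No existing buffer is free for this lifetime, so allocate one.
            buffers.append({"size": size, "free_at": last})
            assignment[name] = len(buffers) - 1
    return assignment, [buf["size"] for buf in buffers]


MB = 1024 * 1024
# A chain of activations: each is written at one step and read at the next.
activations = [
    ("x0", 4 * MB, 0, 1),
    ("x1", 4 * MB, 1, 2),
    ("x2", 4 * MB, 2, 3),
    ("x3", 4 * MB, 3, 4),
]
assignment, sizes = plan_buffers(activations)
print(assignment)
print(sum(sizes) // MB, "MB planned vs", sum(t[1] for t in activations) // MB, "MB naive")
```

In this toy chain the four activations ping-pong between two buffers. The savings on a real model depend on its graph shape, so this example says nothing about the 30-40% figure above.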

The Runtime

Under 2MB stripped. Loads the .egraph file, memory-maps the weight file, and runs the graph.

Each backend is a shared library loaded at startup:

- CPU backend: GGML with NEON/i8mm optimizations
- CUDA backend: Custom CUDA kernels for desktop/server
- OpenCL backend: For Adreno and Mali GPUs on mobile
- QNN backend: Qualcomm NPU dispatch via QNN SDK

When the runtime hits a backend transition (say, CPU to NPU), it handles the tensor transfer. On SoCs with a unified memory architecture, this is basically free. On discrete GPU systems, it involves a real copy. The compiler already accounted for this cost when making the assignment.
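The dispatch loop itself is conceptually small. The Python below is a stand-in sketch: the real backends are shared libraries, while here each backend is just a dict of plain functions, and run_graph, the plan format, and the toy kernels are all hypothetical.

```python
def run_graph(plan, backends, x):
    """Execute ops in order, dispatching each to its assigned backend.

    plan:     list of (op_name, backend_name) in execution order
    backends: backend_name -> {op_name: callable}
    Returns (result, number_of_backend_transitions).
    """
    transfers = 0
    current = None
    for op, backend_name in plan:
        if current is not None and backend_name != current:
            # A real runtime would copy or remap the tensor here.
            transfers += 1
        current = backend_name
        kernel = backends[backend_name][op]
        x = kernel(x)
    return x, transfers


# Toy kernels standing in for real NPU and CPU implementations.
backends = {
    "npu": {"linear": lambda x: x * 2},
    "cpu": {"attention": lambda x: x + 1},
}
plan = [
    ("linear", "npu"),
    ("linear", "npu"),
    ("attention", "cpu"),
    ("linear", "npu"),
]
result, transfers = run_graph(plan, backends, 1)
print(result, transfers)
```

The two adjacent NPU ops share a backend and cost no transfer. Only the hop into attention on CPU and the hop back out count as transitions, which is exactly what the compiler's grouping tries to minimize.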

The LSP

I spent two weeks building Language Server Protocol (LSP) support for .edge files, with a CLion plugin and a VS Code extension. It provides diagnostics (red squiggles on undefined nodes and shape mismatches), completion (valid op names and parameters), hover (inferred shapes), and go-to-definition.

This sounds like a luxury but it's not. Without the LSP, I was spending 20 minutes debugging a typo in a node name that the compiler reported as "unknown op at line 47." With the LSP, the editor catches it as I type.

The LSP also validates shapes through the graph. If you connect a linear(out=1024) to an attention(dim=512), it flags the mismatch immediately. The compiler would catch it too, but shrinking the feedback loop from "write code, compile, read error, find line, fix" to "see a red squiggle as you type" is a meaningful productivity gain.
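The core of that diagnostic is a walk over the edges. Here is a rough Python sketch, assuming each node exposes a single input width and a single output width; the dict layout, the message format, and the node names proj and attn are invented for illustration and are not the LSP's real data model.

```python
def check_graph(nodes, edges):
    """Return diagnostics for undefined nodes and width mismatches.

    nodes: name -> {"in": int, "out": int}
    edges: list of (src, dst) pairs
    """
    errors = []
    for src, dst in edges:
        missing = [name for name in (src, dst) if name not in nodes]
        if missing:
            for name in missing:
                errors.append(f"undefined node: {name}")
            continue
        out_dim = nodes[src]["out"]
        in_dim = nodes[dst]["in"]
        if out_dim != in_dim:
            errors.append(f"{src} -> {dst}: produces {out_dim}, expects {in_dim}")
    return errors


nodes = {
    "proj": {"in": 1024, "out": 1024},  # linear(in=1024, out=1024)
    "attn": {"in": 512, "out": 512},    # attention(dim=512)
}
diagnostics = check_graph(nodes, [("proj", "attn"), ("attn", "ffn_dwn")])
for message in diagnostics:
    print(message)
```

The first edge reproduces the 1024-into-512 mismatch described above, and the second shows how a typo in a node name surfaces as an undefined-node diagnostic instead of a vague compiler error.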

Numbers

Running Qwen3-0.6B on Snapdragon 8 Gen 2 with the hybrid CPU+NPU dispatch:

Prefill: 310 tokens/second
Decode: 42 tokens/second
Peak memory: ~480MB (weights + KV cache + runtime)
Runtime binary: 1.8MB stripped

For comparison, running the same model purely on CPU through llama.cpp decodes at about 28 tokens/second, so the hybrid path is roughly 1.5x faster at decode. Most of that gain comes from offloading the linear projections to the NPU while keeping attention on CPU.

This is a private project. I'm not open-sourcing it, at least not yet. But I wanted to write about the architecture because the per-op dispatch idea is underexplored in the edge inference space, and I think more people should be thinking about it.

If you're working on similar problems, I'd like to hear about it: [siddheshsonar2377@gmail.com](mailto:siddheshsonar2377@gmail.com)