Crucible

Raw inference. Zero wrappers.

Grimvane's from-scratch LLM inference engine. Direct CUDA computation from GGUF model weights to generated text. No Ollama. No llama.cpp. No HuggingFace. Every layer is Grimvane code.

Architecture

Six modules. One pipeline.

GGUF in, text out. Each module is single-responsibility. No external dependencies beyond CuPy for CUDA compute primitives.

GGUF file
    │
    ▼
model_loader   parse binary, extract tensors, load to GPU
    │
    ▼
tokenizer      BPE-encode input text to token IDs
    │
    ▼
transformer    embed → [RMSNorm → Attention → FFN] × N → logits
    │
    ├──── kv_cache   store K/V per token, advance position
    │
    ▼
sampler        temperature → top-k → top-p → sample token
    │
    ▼
text out       decode token IDs back to text
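The sampler stage chains temperature scaling, top-k, and top-p before drawing a token. A minimal NumPy sketch of that chain (function name and defaults are illustrative, not Crucible's actual API):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Temperature -> top-k -> top-p -> sample, as in the pipeline above."""
    if rng is None:
        rng = np.random.default_rng()
    logits = logits / temperature
    # Top-k: mask out everything below the k-th highest logit
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix whose cumulative probability covers top_p
    order = np.argsort(probs)[::-1]
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))
```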

Models

Supported architectures.

Three model families. All share the same core transformer pattern with minor variations in attention, normalization, and activation.

Llama

LlamaForCausalLM

Llama 2, Llama 3, Mistral, Qwen. Grouped-query attention, RoPE positional encoding, SwiGLU activation.

GQA RoPE SwiGLU
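Grouped-query attention shares each K/V head across a group of query heads, shrinking the KV cache. A NumPy sketch of the idea (illustrative names and shapes, not Crucible's CUDA implementation):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention: n_heads query heads share n_kv_heads K/V heads.
    q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim)."""
    n_heads, seq, head_dim = q.shape
    n_rep = n_heads // n_kv_heads
    # Broadcast each K/V head to its group of query heads
    k = np.repeat(k, n_rep, axis=0)
    v = np.repeat(v, n_rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Causal mask: position i may only attend to positions j <= i
    scores = scores + np.triu(np.full((seq, seq), -np.inf), k=1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With n_kv_heads equal to n_heads this degenerates to standard multi-head attention; with n_kv_heads = 1 it is multi-query attention.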

Phi

Phi3ForCausalLM

Microsoft Phi-3 and Phi-4. Partial attention rotation, compact architecture tuned for high quality at small parameter counts.

Partial RoPE Dense
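Partial rotation applies RoPE to only a leading slice of each head dimension, leaving the rest unrotated. A NumPy sketch for a single head vector (rotary_frac here is an illustrative value, not the real Phi hyperparameter):

```python
import numpy as np

def partial_rope(x, pos, rotary_frac=0.4, base=10000.0):
    """Rotate only the first rotary_frac of the head dims; pass the rest through.
    x: (head_dim,) vector for one head at one position."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_frac) // 2 * 2  # rotated dims must be even
    half = rot_dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / rot_dim)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rot_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dim:]])
```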

Gemma

GemmaForCausalLM

Google Gemma. Multi-query attention, GeGLU activation, and a normalization placement that differs from the Llama layout.

MQA GeGLU
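GeGLU gates the FFN up-projection with a GELU-activated gate projection; SwiGLU (Llama) differs only in using SiLU for the gate. A NumPy sketch (weight names are illustrative):

```python
import numpy as np

def geglu_ffn(x, W_gate, W_up, W_down):
    """GeGLU feed-forward: GELU(x @ W_gate) elementwise-gates x @ W_up,
    then projects back down. Swap gelu for SiLU to get SwiGLU."""
    def gelu(z):  # tanh approximation of GELU
        return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down
```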

Performance

Quantization and memory.

Run on consumer NVIDIA GPUs. Quantization controls the quality/size tradeoff.

Quantization formats

Format   Bits   Description
F16      16     Half precision (baseline)
Q8_0      8     Simple 8-bit quantization
Q4_K_M    4     4-bit k-quant, medium (best quality/size)
Q4_0      4     Simple 4-bit (fastest)
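Q8_0 stores weights in blocks of 32, each block holding one f16 scale and 32 signed bytes; a weight is recovered as scale × q. A NumPy sketch of the round trip (layout per the GGUF quantization scheme; function names are illustrative):

```python
import numpy as np

def quantize_q8_0(weights):
    """Per block of 32: scale = max(|w|) / 127, q = round(w / scale)."""
    blocks = weights.reshape(-1, 32)
    scales = np.abs(blocks).max(axis=1) / 127.0
    qs = np.round(blocks / scales[:, None]).astype(np.int8)
    return scales.astype(np.float16), qs

def dequantize_q8_0(scales, qs):
    """Recover weights: each int8 value times its block's f16 scale."""
    return (scales[:, None].astype(np.float32) * qs.astype(np.float32)).ravel()
```

The k-quant formats (Q4_K_M) add a second level of per-superblock scales, which is why they beat the simple formats at equal bit width.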

VRAM budget (RTX 4070 Ti Super, 16 GB)

Model   Quant    VRAM      Status
7B      Q4_K_M   ~4.5 GB   fits
7B      F16      ~14 GB    tight
13B     Q4_K_M   ~8.5 GB   fits
13B     Q8_0     ~14 GB    tight
30B     Q4_K_M   ~18 GB    needs offload
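The figures above are roughly parameter count times bits per weight, plus headroom for the KV cache and activations. A back-of-the-envelope estimator (the overhead factor is a rough assumption, not a measurement):

```python
def estimate_vram_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate in GiB: weight bytes times an overhead factor
    covering KV cache, activations, and allocator slack."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 * overhead
```

For example, a 7B model at ~4.5 effective bits per weight lands in the 4-5 GB range, consistent with the table.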

Get started

Two ways to run.

# Command line
python -m crucible --model path/to/model.gguf --prompt "Hello"

# As a library
from crucible import Engine

engine = Engine("path/to/model.gguf")
response = engine.generate("Hello", max_tokens=128)
print(response)