Crucible

Raw inference. Zero wrappers.

Grimvane's from-scratch LLM inference engine. Direct CUDA computation from GGUF model weights to generated text. No Ollama. No llama.cpp. No HuggingFace. Every layer is Grimvane code.

Architecture

Six modules. One pipeline.

GGUF in, text out. Each module is single-responsibility. No external dependencies beyond CuPy for CUDA compute primitives.

GGUF file
    │
    ▼
model_loader   parse binary, extract tensors, load to GPU
    │
    ▼
tokenizer      BPE-encode input text to token IDs
    │
    ▼
transformer    embed → [RMSNorm → Attention → FFN] × N → logits
    │
    ├──── kv_cache   store K/V per token, advance position
    │
    ▼
sampler        temperature → top-k → top-p → sample token
    │
    ▼
text out       decode token IDs back to text
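The sampler stage chains temperature scaling, top-k, and top-p before drawing a token. A minimal NumPy sketch of that chain (function name and defaults are illustrative, not Crucible's actual API):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Temperature -> top-k -> top-p -> sample, as in the pipeline above."""
    if rng is None:
        rng = np.random.default_rng()
    logits = logits / temperature
    # Top-k: mask out everything below the k-th highest logit
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix whose cumulative probability covers top_p
    order = np.argsort(probs)[::-1]
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))
```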

Models

Supported architectures.

Three model families. All share the same core transformer pattern with minor variations in attention, normalization, and activation.

Llama

LlamaForCausalLM

Llama 2, Llama 3, Mistral, Qwen. Grouped-query attention, RoPE positional encoding, SwiGLU activation.

GQA RoPE SwiGLU
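Grouped-query attention shares each K/V head across a group of query heads, shrinking the KV cache. A NumPy sketch of the idea (illustrative names and shapes, not Crucible's CUDA implementation):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention: n_heads query heads share n_kv_heads K/V heads.
    q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim)."""
    n_heads, seq, head_dim = q.shape
    n_rep = n_heads // n_kv_heads
    # Broadcast each K/V head to its group of query heads
    k = np.repeat(k, n_rep, axis=0)
    v = np.repeat(v, n_rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Causal mask: position i may only attend to positions j <= i
    scores = scores + np.triu(np.full((seq, seq), -np.inf), k=1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With n_kv_heads equal to n_heads this degenerates to standard multi-head attention; with n_kv_heads = 1 it is multi-query attention.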

Phi

Phi3ForCausalLM

Microsoft Phi-3 and Phi-4. Partial attention rotation, compact architecture tuned for high quality at small parameter counts.

Partial RoPE Dense
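Partial rotation applies RoPE to only a leading slice of each head dimension, leaving the rest unrotated. A NumPy sketch for a single head vector (rotary_frac here is an illustrative value, not the real Phi hyperparameter):

```python
import numpy as np

def partial_rope(x, pos, rotary_frac=0.4, base=10000.0):
    """Rotate only the first rotary_frac of the head dims; pass the rest through.
    x: (head_dim,) vector for one head at one position."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_frac) // 2 * 2  # rotated dims must be even
    half = rot_dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / rot_dim)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rot_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dim:]])
```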

Gemma

GemmaForCausalLM

Google Gemma. Multi-query attention, GeGLU activation, and a normalization placement that differs from the Llama layout.

MQA GeGLU
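GeGLU gates the FFN up-projection with a GELU-activated gate projection; SwiGLU (Llama) differs only in using SiLU for the gate. A NumPy sketch (weight names are illustrative):

```python
import numpy as np

def geglu_ffn(x, W_gate, W_up, W_down):
    """GeGLU feed-forward: GELU(x @ W_gate) elementwise-gates x @ W_up,
    then projects back down. Swap gelu for SiLU to get SwiGLU."""
    def gelu(z):  # tanh approximation of GELU
        return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down
```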

Performance

Quantization and memory.

Run on consumer NVIDIA GPUs. Quantization controls the quality/size tradeoff.

Quantization formats

Format   Bits   Description
F16      16     Half precision (baseline)
Q8_0      8     Simple 8-bit quantization
Q4_K_M    4     4-bit k-quant, medium (best quality/size)
Q4_0      4     Simple 4-bit (fastest)
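Q8_0 stores weights in blocks of 32, each block holding one f16 scale and 32 signed bytes; a weight is recovered as scale × q. A NumPy sketch of the round trip (layout per the GGUF quantization scheme; function names are illustrative):

```python
import numpy as np

def quantize_q8_0(weights):
    """Per block of 32: scale = max(|w|) / 127, q = round(w / scale)."""
    blocks = weights.reshape(-1, 32)
    scales = np.abs(blocks).max(axis=1) / 127.0
    qs = np.round(blocks / scales[:, None]).astype(np.int8)
    return scales.astype(np.float16), qs

def dequantize_q8_0(scales, qs):
    """Recover weights: each int8 value times its block's f16 scale."""
    return (scales[:, None].astype(np.float32) * qs.astype(np.float32)).ravel()
```

The k-quant formats (Q4_K_M) add a second level of per-superblock scales, which is why they beat the simple formats at equal bit width.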

VRAM budget (RTX 4070 Ti Super, 16 GB)

Model   Quant    VRAM      Status
7B      Q4_K_M   ~4.5 GB   fits
7B      F16      ~14 GB    tight
13B     Q4_K_M   ~8.5 GB   fits
13B     Q8_0     ~14 GB    tight
30B     Q4_K_M   ~18 GB    needs offload
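The figures above are roughly parameter count times bits per weight, plus headroom for the KV cache and activations. A back-of-the-envelope estimator (the overhead factor is a rough assumption, not a measurement):

```python
def estimate_vram_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate in GiB: weight bytes times an overhead factor
    covering KV cache, activations, and allocator slack."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 * overhead
```

For example, a 7B model at ~4.5 effective bits per weight lands in the 4-5 GB range, consistent with the table.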

Get started

Two ways to run.

# Command line
python -m crucible --model path/to/model.gguf --prompt "Hello"

# As a library
from crucible import Engine

engine = Engine("path/to/model.gguf")
response = engine.generate("Hello", max_tokens=128)
print(response)