Crucible
Raw inference. Zero wrappers.
Grimvane's from-scratch LLM inference engine. Direct CUDA computation from GGUF model weights to generated text. No Ollama. No llama.cpp. No HuggingFace. Every layer is Grimvane code.
Architecture
Six modules. One pipeline.
GGUF in, text out. Each module is single-responsibility. No external dependencies beyond CuPy for CUDA compute primitives.
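As a concrete taste of one stage, the sampler's temperature → top-k → top-p chain can be sketched in plain NumPy. This is an illustrative sketch, not Crucible's actual code; the function name and defaults are assumptions.

```python
import numpy as np

def sample(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Temperature -> top-k -> top-p -> sample, as in the pipeline diagram below."""
    rng = rng or np.random.default_rng()
    # Temperature: >1 flattens the distribution, <1 sharpens it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    # Top-k: keep only the k most likely tokens.
    order = np.argsort(probs)[::-1][:top_k]
    probs_k = probs[order]
    # Top-p: truncate to the smallest prefix whose cumulative mass exceeds p.
    cum = np.cumsum(probs_k)
    cut = np.searchsorted(cum, top_p) + 1
    keep = order[:cut]
    p = probs_k[:cut] / probs_k[:cut].sum()  # renormalize over survivors
    return int(rng.choice(keep, p=p))
```

The real sampler runs on GPU over a full vocabulary of logits, but the filtering order is the same: temperature first, then top-k, then top-p.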
GGUF file
│
▼
model_loader parse binary, extract tensors, load to GPU
│
▼
tokenizer BPE encode input text to token IDs
│
▼
transformer embed → [RMSNorm → Attention → FFN] × N → logits
│
├──── kv_cache store K/V per token, advance position
│
▼
sampler temperature → top-k → top-p → sample token
│
▼
text out decode token IDs back to text

Models
Supported architectures.
Three model families. All share the same core transformer pattern with minor variations in attention, normalization, and activation.
Llama
LlamaForCausalLM
Llama 2, Llama 3, Mistral, Qwen. Grouped-query attention, RoPE positional encoding, SwiGLU activation.
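RoPE encodes position by rotating pairs of dimensions in each query/key vector by position-dependent angles. A minimal NumPy sketch for one head vector (illustrative, not Crucible's kernel; Phi's "partial rotation" applies the same idea to only a fraction of the head dimensions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary positional encoding for one head vector x of even length."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # one frequency per dim pair
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                     # each pair is a 2-D point
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin               # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

Because it is a pure rotation, position 0 leaves the vector unchanged and every position preserves its norm, which is why RoPE can be applied directly to Q and K without rescaling.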
Phi
Phi3ForCausalLM
Microsoft Phi-3 and Phi-4. Partial attention rotation, compact architecture tuned for high quality at small parameter counts.
Gemma
GemmaForCausalLM
Google Gemma. Multi-query attention, GeGLU activation, different normalization placement.
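All three families use the same gated FFN shape; the family-specific part is just which activation gates it. A hedged NumPy sketch (weight names and the tanh GELU approximation are illustrative assumptions, not Crucible's exact kernels):

```python
import numpy as np

def silu(x):
    """Swish/SiLU: x * sigmoid(x), the gate in SwiGLU (Llama, Mistral, Qwen)."""
    return x / (1.0 + np.exp(-x))

def gelu(x):
    """GELU (tanh approximation), the gate in GeGLU (Gemma)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu_ffn(x, w_gate, w_up, w_down, act):
    """Gated FFN: act(x @ W_gate) elementwise-gates (x @ W_up), then projects down."""
    return (act(x @ w_gate) * (x @ w_up)) @ w_down
```

Swapping `act` between `silu` and `gelu` is the entire SwiGLU/GeGLU difference; the matmul structure is identical.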
Performance
Quantization and memory.
Run on consumer NVIDIA GPUs. Quantization controls the quality/size tradeoff.
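The idea behind block quantization: store one float scale per small block of weights plus a low-bit integer per weight. A simplified Q4_0-style roundtrip in NumPy (illustrative; real Q4_0 packs two 4-bit values per byte with an fp16 scale, and the exact scale rule differs):

```python
import numpy as np

BLOCK = 32  # Q4_0 quantizes weights in blocks of 32

def q4_quantize(block):
    """One scale per block; each weight becomes a 4-bit integer in [0, 15]."""
    scale = max(np.abs(block).max() / 7.0, 1e-12)          # avoid div-by-zero
    q = np.clip(np.round(block / scale) + 8, 0, 15).astype(np.uint8)
    return scale, q

def q4_dequantize(scale, q):
    """Recover approximate weights: (q - 8) * scale."""
    return (q.astype(np.float32) - 8) * scale
```

At roughly 4.5 bits per weight (4-bit values plus per-block scales), a 7B model shrinks from ~14 GB in F16 to under 5 GB, at the cost of a bounded per-weight rounding error of at most half a scale step.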
Quantization formats
Format   Bits  Description
F16      16    Half precision (baseline)
Q8_0     8     Simple 8-bit quantization
Q4_K_M   4     4-bit k-quant medium (best quality/size)
Q4_0     4     Simple 4-bit (fastest)

VRAM budget
RTX 4070 Ti Super, 16GB
Model  Quant    VRAM     Status
7B     Q4_K_M   ~4.5 GB  fits
7B     F16      ~14 GB   tight
13B    Q4_K_M   ~8.5 GB  fits
13B    Q8_0     ~14 GB   tight
30B    Q4_K_M   ~18 GB   needs offload

Get started
Two ways to run.
# Command line
python -m crucible --model path/to/model.gguf --prompt "Hello"

# As a library
from crucible import Engine
engine = Engine("path/to/model.gguf")
response = engine.generate("Hello", max_tokens=128)
print(response)