Symbolic Index for LLM Knowledge — a neuro-symbolic search architecture.

Searches with 64-bit integers. No vector DB, no ANN graphs, no embedding models.

GitHub Repository

Core Thesis

Search was never a hard problem — it was a symptom of unstructured data. Assign 64-bit structure at write time, and the symptom disappears.

results = index[(index & mask) == pattern]

Sub-second search across 100 million Wikidata entities in 1.3 GB of memory. Python (NumPy) alone beats optimized C++/Rust vector DBs — an architectural win.
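The one-line search above can be sketched with NumPy. This is a toy example: the SIDX values are made up, and only the entity_type field (bits 53-48, per the SIDX layout) is used in the mask.

```python
import numpy as np

# Toy index of SIDX-like 64-bit values. Field positions follow the
# documented layout (entity_type in bits 53-48); attr bits are made up.
TYPE_SHIFT, TYPE_WIDTH = 48, 6
HUMAN, SETTLEMENT = 0, 28

index = np.array([
    (HUMAN << TYPE_SHIFT) | 0x123,
    (SETTLEMENT << TYPE_SHIFT) | 0x456,
    (HUMAN << TYPE_SHIFT) | 0x789,
], dtype=np.uint64)

# Mask isolates the entity_type field; pattern requires Human.
mask = np.uint64(((1 << TYPE_WIDTH) - 1) << TYPE_SHIFT)
pattern = np.uint64(HUMAN << TYPE_SHIFT)

results = index[(index & mask) == pattern]
print(results.size)  # 2
```

The same expression scales unchanged from three entries to 100 million: it is one vectorized AND plus one comparison over a flat uint64 array.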

Why Not Vector Embeddings

Vector embeddings crush structure.

"Admiral Yi Sun-sin was a military leader of the Joseon Dynasty"
 → A human instantly grasps: Korean, military, 16th century

Embedding model:
 [0.234, -0.891, 0.445, ..., 0.112] (384-dim float)
 → Structure destroyed. Can't tell if it's a person or a location.

To recover crushed structure:
 ANN graphs (HNSW, IVF-PQ), cross-encoders, rerankers, metadata filters...

SILK preserves structure.

SIDX: [Human / military / east_asia / early_modern]
 → Structure lives in the bits. Readable.
 → No recovery needed. It was never crushed.

The key is inverting the order.

Traditional: Write first → structure later (indexing)
SILK:        Structure at write time → search is free

SIDX Bit Layout

SIDX follows the GEUL Grammar Entity Node specification.

[prefix 7 | mode 3 | entity_type 6 | attrs 48]
 MSB(63)                                  LSB(0)
Field        Bits   Width  Description
prefix       63-57  7      GEUL protocol header (0001001 fixed, ignored during search)
mode         56-54  3      Quantification/number mode (registered entity=0, definite, universal, existential, etc.; 8 modes)
entity_type  53-48  6      64 top-level types (Human=0 to Election=62, unclassified=63)
attrs        47-0   48     Per-type attribute encoding (defined by codebooks)

Search target: entity_type 6 bits + attrs 48 bits = 54 bits. QIDs are not included in SIDX; they are stored in a separate array.
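A minimal sketch of assembling and decoding a SIDX from this layout. The field positions come from the table above; the helper names (`pack_sidx`, `entity_type_of`) are mine, not part of the spec.

```python
import numpy as np

PREFIX = 0b0001001  # fixed GEUL protocol header, bits 63-57

def pack_sidx(mode: int, entity_type: int, attrs: int) -> np.uint64:
    """[prefix 7 | mode 3 | entity_type 6 | attrs 48] -> one uint64."""
    assert 0 <= mode < 8 and 0 <= entity_type < 64 and 0 <= attrs < (1 << 48)
    return np.uint64((PREFIX << 57) | (mode << 54) | (entity_type << 48) | attrs)

def entity_type_of(sidx: np.uint64) -> int:
    """Extract the 6-bit entity_type field (bits 53-48)."""
    return (int(sidx) >> 48) & 0x3F

sidx = pack_sidx(mode=0, entity_type=0, attrs=0)  # a registered Human
print(entity_type_of(sidx))  # 0
```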

attrs 48 bits — Per-Type Schema

Human(0):       subclass 5 | occupation 6 | country 8 | era 4 | decade 4 | gender 2 | notability 3 | ...
Star(12):       constellation 7 | spectral_type 4 | luminosity 3 | magnitude 4 | ra_zone 4 | dec_zone 4 | ...
Settlement(28): country 8 | admin_level 4 | admin_code 8 | lat_zone 4 | lon_zone 4 | population 4 | ...
Organization(44): country 8 | org_type 4 | legal_form 6 | industry 8 | era 4 | size 4 | ...
Film(51):       country 8 | year 7 | genre 6 | language 8 | color 2 | duration 4 | ...

Currently 5 types have defined attribute bit layouts. The rest encode only entity_type.
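A sketch of a per-type attr packer for Human(0), using the field widths listed above. The packing order (top-down from bit 47) and the codebook values in the example are my assumptions; the trailing "..." fields are omitted.

```python
# Hypothetical Human(0) attr layout from the listed widths; order within
# the 48 attr bits is assumed, and remaining "..." fields are left out.
HUMAN_FIELDS = [("subclass", 5), ("occupation", 6), ("country", 8),
                ("era", 4), ("decade", 4), ("gender", 2), ("notability", 3)]

def pack_human_attrs(**values: int) -> int:
    """Pack named codebook values into the 48-bit attrs word."""
    attrs, shift = 0, 48
    for name, width in HUMAN_FIELDS:
        v = values.pop(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        shift -= width
        attrs |= v << shift
    assert not values, f"unknown fields: {values}"  # VALID-style rejection
    return attrs

attrs = pack_human_attrs(occupation=6, country=82)  # codebook values illustrative
```

Unset fields default to 0, so a partial tag still yields a searchable SIDX; an unknown field (say, a constellation on a Human) is rejected outright.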

Architecture

SILK has two pipelines: Encoding (at write time) and Search (at query time). Both follow the same principle: symbolic structure handles form, LLM handles meaning.

Encoding Pipeline (write time)
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  LLM Tagging │──►│  VALID Check │──►│Codebook Encode│
│  Doc → JSON  │   │ Codebook     │   │ JSON → SIDX  │
│  (Semantic   │   │ valid values │   │ (Bit assembly)│
│   classify)  │   │ Halluc → drop│   │              │
└──────────────┘   └──────────────┘   └──────────────┘
   LLM tags         Code validates    Codebook-based encoding
Search Pipeline (query time)
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Query Parse  │──►│ Bit AND Filter│──►│  LLM Judge   │
│ Codebook     │   │  NumPy SIMD  │   │ Few candidates│
│ lookup + LLM │   │100M → dozens │   │ Dozens → answer│
└──────────────┘   └──────────────┘   └──────────────┘
   Meaning extract   Broad filter      Narrow judgment

Core strategy: SIDX bit AND filters broadly without missing anything, then LLM judges only the few remaining candidates.

Each component does what it's best at:
 Human: Structure design (codebooks)
 LLM:   Semantic classification (tagging) + semantic judgment (search)
 Code:  Rule validation (VALID) + bit assembly
 CPU:   Bulk comparison (NumPy SIMD)

Encoding: LLM Tagging → VALID Validation

The LLM reads a document and tags it as JSON. VALID performs mechanical checks against codebooks. Values not in the codebook, cross-type consistency violations, and constraint violations physically cannot enter the index.

LLM tagging: {"type": "Human", "occupation": "military", "country": "korea"}
VALID:       occupation="military" ∈ codebook? ✓  country="korea" ∈ codebook? ✓
             Human with constellation field? ✗ Dropped
Encoding:    Codebook lookup "military"→6 bits, "korea"→8 bits → bit assembly → SIDX uint64

LLMs can hallucinate, but VALID acts as the gatekeeper: values outside the codebooks never reach the index. Only JSON that passes VALID is encoded into SIDX 64-bit integers via codebooks.
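The VALID step above can be sketched as a pure lookup against the codebooks. The codebook contents here are toy stand-ins for illustration; the real codebooks are defined by the GEUL spec.

```python
# Toy codebooks: per-type allowed fields and their allowed values.
CODEBOOKS = {
    "Human": {
        "occupation": {"military": 6, "business": 9},
        "country": {"korea": 82, "usa": 1},
    },
}

def valid(tag: dict) -> bool:
    """Mechanical check: type known, fields allowed, values in codebook."""
    fields = CODEBOOKS.get(tag.get("type"))
    if fields is None:
        return False
    for key, value in tag.items():
        if key == "type":
            continue
        if key not in fields:          # e.g. Human with a constellation field
            return False
        if value not in fields[key]:   # hallucinated value not in codebook
            return False
    return True

print(valid({"type": "Human", "occupation": "military", "country": "korea"}))  # True
print(valid({"type": "Human", "constellation": "orion"}))                      # False
```

Because the check is a dictionary membership test, it costs nothing and involves no model: a hallucinated value is dropped before encoding ever runs.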

Search: Filter Broadly, LLM Narrows

1. Query meaning extraction   Codebook lookup 80% / LLM assist 20%
2. Bitmask assembly           Deterministic — algorithm
3. NumPy bit AND              Deterministic — full scan of 100M in 20ms
4. LLM final judgment         Few candidates only — semantic precision
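Steps 2 and 3 can be sketched as mask assembly followed by one vectorized comparison. The entity_type position follows the SIDX layout; the attr field positions (occupation at bit 37, country at bit 29) and codebook values are illustrative assumptions.

```python
import numpy as np

def build_filter(conditions):
    """Assemble (mask, pattern) from [(shift, width, value), ...] triples."""
    mask = pattern = 0
    for shift, width, value in conditions:
        mask |= ((1 << width) - 1) << shift
        pattern |= value << shift
    return np.uint64(mask), np.uint64(pattern)

# "Korean military humans": entity_type=Human(0) at bit 48, plus
# hypothetical attr positions for occupation=6 and country=82.
mask, pattern = build_filter([(48, 6, 0), (37, 6, 6), (29, 8, 82)])

# Step 3 is then a single full-array scan:
# candidates = sidx_array[(sidx_array & mask) == pattern]
```

Only the surviving candidates (typically dozens) are handed to the LLM in step 4.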

80% of Searches Are Structural Queries

"Yi Sun-sin"     → Q484523 exact match. Bit AND. Done.
"Samsung news"   → org/company + doc_meta/news. Bit AND. Done.
"Biden-Xi summit" → Q6279 ∩ Q15031 ∩ meeting. Intersection. Done.
Structural queries (80%): Completed by SILK bit AND. No LLM needed.
Semantic queries  (15%): SILK narrows candidates → LLM judges 5-10.
Generative queries (5%): SILK identifies documents → LLM generates.

Multi-SIDX

A single document or event can carry multiple SIDXs.

News article "Samsung, NVIDIA, Hyundai CEOs Meet":

SIDX[0]: [Human / business / east_asia ]  Jay Y. Lee
SIDX[1]: [Org   / company / east_asia ]  Samsung Electronics
SIDX[2]: [Human / business / n_america]  Jensen Huang
SIDX[3]: [Org   / company / n_america]  NVIDIA
SIDX[4]: [Human / business / east_asia ]  Euisun Chung
SIDX[5]: [Org   / company / east_asia ]  Hyundai Motor
SIDX[6]: [Event / meeting / east_asia ]  Meeting

All the same 64-bit SIDX. Same index. Same bit AND search. The entity_type field distinguishes entities (Human, Org) from events (Event).

Index Structure

sidx_array = np.array([...], dtype=np.uint64)  # 108.8M × 8B = 870MB
qid_array  = np.array([...], dtype=np.uint32)  # 108.8M × 4B = 435MB
# Total ~1.3GB memory
Index construction:
 Elasticsearch: Tokenize → analyze → inverted index → segment merge
 Pinecone:      Embed → HNSW graph → clustering
 SILK:          sort

No data structures. Just one sorted array.
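Construction really is just a sort; the only subtlety is applying the same permutation to the parallel QID array. A toy sketch:

```python
import numpy as np

# Toy SIDX values with their parallel Wikidata QIDs.
sidx = np.array([7, 3, 5], dtype=np.uint64)
qids = np.array([100, 200, 300], dtype=np.uint32)

# Sort once; carry QIDs along with the same permutation.
order = np.argsort(sidx, kind="stable")
sidx_array, qid_array = sidx[order], qids[order]
print(sidx_array, qid_array)  # [3 5 7] [200 300 100]
```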

Vector DB Comparison

                       SILK                     Vector DB
Index size (1T entries)  12 TB                  1.5 PB (125x)
Index construction       sort                   HNSW (days)
Cold start               Open file, ready       Graph construction (hours)
Partitioned scan         Possible (same results)  Impossible (graph breaks)
Compound conditions      Intersection (exact)   Compressed into 1 vector (approximate)
Results                  Exact (set operations)  Approximate (similarity ranking)
Auditability             JSON (white box)       Impossible (black box)
Bit AND is order-independent, stateless, and mergeable.
Vector DB: The entire graph must be in memory. Partition = destruction.
SILK:      Cut anywhere, it still works. Partition = slower, same results.
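The partition claim is easy to verify: because the filter is stateless, scanning shards independently and concatenating the hits yields exactly the full-scan result. A sketch with toy data:

```python
import numpy as np

# Toy index; mask/pattern keep values whose low 2 bits are 01.
index = np.arange(100, dtype=np.uint64)
mask, pattern = np.uint64(0b11), np.uint64(0b01)

full = index[(index & mask) == pattern]

# Cut the array into 4 shards, scan each alone, merge the results.
shards = np.array_split(index, 4)
merged = np.concatenate([s[(s & mask) == pattern] for s in shards])

assert np.array_equal(full, merged)  # partition = same results
```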

Audit Pipeline

Vector embeddings are black boxes — unauditable. SIDX is JSON — fully auditable.

Stage 1 — Small LLM tagging    Llama 8B / GPT-4o-mini. Accuracy 85-90%.
Stage 2 — VALID mechanical check  Valid values, consistency, constraints. Cost $0.
Stage 3 — Large LLM audit      confidence=low only. Accuracy 99%+.

VALID is the gatekeeper: hallucinations outside codebook values physically cannot enter the index.

License

MIT — GitHub