Training an LLM on the Library of Babel

Training on the Library of Babel is a conceptual challenge. Since the Library contains every possible 1,312,000-character sequence over its 25-symbol alphabet, 99.9999...% of it is absolute gibberish. If you train a model on the raw Library, you simply end up with a random-character generator: averaged over the whole Library, every symbol is equally likely to follow every context, so the optimal model is the uniform distribution and the loss can never drop below log2(25) ≈ 4.64 bits per character.

To make this a "novelty LLM" that actually functions, you shouldn't just look for "truth"—you should look for structure.

The Preprocessing Pipeline: Filtering the Infinite

To find the "needles" without limiting yourself to dry facts, you need a multi-stage filtration system that separates Noise from Potential Meaning.

1. Shannon Entropy Thresholding

Gibberish has a specific mathematical signature. You can calculate the Information Entropy (H) of a page.

  • High Entropy: Pure random noise (e.g., "qjwv...").
  • Low Entropy: Repetitive sequences (e.g., "aaaa...").
  • The Sweet Spot: Human language sits in a middle band; English letter frequencies, for instance, yield roughly 4 bits per character, measurably below the uniform maximum of log2(25) ≈ 4.64.

The formula for the entropy of a discrete random variable X:

$$H(X) = -\sum_{i=1}^{n} P(x_i)\,\log_b P(x_i)$$

You would discard anything outside the "Linguistic Window." This keeps the "poetic nonsense" and "dream logic" while ditching the static.
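A minimal sketch of this filter in Python; the window bounds below are illustrative assumptions to be tuned on known-good text, not canonical values:

```python
import math
from collections import Counter

# Assumed "linguistic window" in bits per character; tune on real text.
LOW, HIGH = 3.5, 4.4

def shannon_entropy(page: str) -> float:
    """H(X) = -sum P(x) * log2 P(x) over the characters of the page."""
    if not page:
        return 0.0
    counts = Counter(page)
    total = len(page)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_entropy_filter(page: str) -> bool:
    """Keep pages whose entropy lands inside the linguistic window."""
    return LOW <= shannon_entropy(page) <= HIGH

# "aaaa..." scores ~0 bits and is dropped; uniform noise over 25 symbols
# scores ~4.64 bits and is dropped; English-like text lands near 4.1 and stays.
```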

2. The "Gibberish" Classifier (Markovian Filtering)

Instead of checking for facts, check for phonotactics. Does the sequence of characters follow the "rules" of any known or plausible language?

  • Filter: Use a simple N-gram model trained on multiple world languages.
  • Goal: If a "word" in the Library runs 50 characters without a vowel, it's discarded. If it "sounds" like it could be a word in an extinct or alien tongue, it stays (see the sketch after this list).
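A hedged sketch of such a filter using character bigrams with add-one smoothing; the reference corpus, the unseen-pair floor, and the -3.2 threshold are all illustrative assumptions to be calibrated:

```python
import math
from collections import defaultdict

def fit_bigram_model(corpus: str) -> dict:
    """Fit log2 P(next_char | char) on a reference corpus, add-one smoothed."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(corpus))
    logprob = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)  # +1 per symbol
        for b in alphabet:
            logprob[(a, b)] = math.log2((counts[a][b] + 1) / total)
    return logprob

def avg_log_likelihood(page: str, logprob: dict, floor: float = -10.0) -> float:
    """Mean bigram log-likelihood; pairs absent from training get a harsh floor."""
    pairs = list(zip(page, page[1:]))
    return sum(logprob.get(p, floor) for p in pairs) / len(pairs)

def looks_like_language(page: str, logprob: dict, threshold: float = -3.2) -> bool:
    # A vowel-free 50-character run accumulates floor scores and falls below
    # the cutoff; pronounceable "alien" words with plausible transitions pass.
    return avg_log_likelihood(page, logprob) >= threshold
```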

3. Semantic Density Scoring

Since you don't want just "verified facts," use a "Meaning Density" filter. You can use a small, lightweight embedding model (like a distilled BERT) to see if a string of text clusters near anything human-readable.

  • If a page's embedding sits far from every human-readable cluster (near-zero cosine similarity to everything, meaning it maps to no recognizable concept), toss it.
  • If it maps to "Dream-like/Surrealist" clusters, keep it. A sketch follows below.
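A hedged sketch using the sentence-transformers library; the model name, the anchor texts, and the 0.15 cutoff are illustrative assumptions rather than a prescribed setup:

```python
from sentence_transformers import SentenceTransformer, util

# A small distilled encoder; any lightweight embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical anchor concepts spanning plain prose and surrealist registers.
anchors = [
    "A man wanders through an endless hexagonal library.",
    "The dream dissolved into a corridor of mirrors.",
    "Ordinary descriptive prose about everyday life.",
]
anchor_vecs = model.encode(anchors, convert_to_tensor=True)

def semantic_density(page: str) -> float:
    """Max cosine similarity between the page and any anchor concept."""
    vec = model.encode(page, convert_to_tensor=True)
    return util.cos_sim(vec, anchor_vecs).max().item()

def keep_page(page: str, cutoff: float = 0.15) -> bool:
    # Below the cutoff the page maps to no recognizable concept: discard.
    return semantic_density(page) >= cutoff
```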

| Feature | Raw Library Training | Filtered Novelty Training |
| --- | --- | --- |
| Model Output | White noise / random characters | Surrealist, "Borgesian" prose |
| Training Efficiency | ≈0% (loss stuck at the uniform baseline) | High (learning structured patterns) |
| Vibe | Broken typewriter | The "Ghost in the Machine" |
| Goal | Total randomness | Discovering "The Lost Books" |

  
