LLM on Library of Babel
This revision is from 2026/02/15 06:26.
LLMs carefully choose the next word to produce coherent language, and they do it well. What LLMs suggest is that the "cure to cancer" is itself a communication: the correct selection of words. A cure is, after all, words on paper; all ideas are eventually communicated through language, and that means the careful selection of words that convey the idea. What if we skipped the process and focused only on the careful selection of words, without all the thought, time, and testing that normally makes those words mean something? The cure to cancer would then be an essay of carefully crafted words. If a computer plowed through enough combinations of words, it would eventually land on Shakespeare's Hamlet in full by chance, and keep right on going. The idea that every combination of words would eventually describe everything is the Library of Babel. The Library represents the entire state space; solving the Library means discerning value from meaningless junk, and finding the computing model that can perform that task.
As the transformer model suggests that there is a correct translation and a correct conversation, we contend with the reality of complexity beyond human capacity: high abstraction, f=ma, and the muse that remains within human reach. The proposal for dealing with super-complexity is to translate high abstraction into a systematic mathematical form, solve it, and have the result be a real solution to the original abstract problem. Perhaps this looks like deconstructing all things into their atomic or chemical form, then mathematizing them into the appropriate arithmetic form, so that the solution becomes logical rather than a search. The mathematical form could run pages long, but computers have little trouble with complexity that otherwise exceeds human capacity. Physics already translates into mathematics when the complexity is minor and solves via the chalkboard; the possibility is that everything is translatable into mathematical form. With enough nuance, the concrete is translated into its mathematical form and solved, and the solution remains real and valid when applied back to the re-abstracted form. Italians mathematized a small subset of physics to predict cannon trajectories; by the same logic, we would mathematize the entire universe so that, for example, curing a disease becomes a mathematical operation, and A.I. can solve the disease by doing math, regardless of complexity.
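The cannon-trajectory case above is the template: once the physics is written as an equation, the answer is evaluation, not search. A minimal sketch of that reduction, using the standard drag-free projectile range formula (the numbers here are illustrative, not from the source):

```python
import math

# Physics reduced to arithmetic: given muzzle speed and elevation,
# the range follows directly from the equations of motion.
def cannon_range(speed_m_s: float, angle_deg: float, g: float = 9.81) -> float:
    """Horizontal range of a projectile on flat ground, ignoring air drag."""
    return speed_m_s ** 2 * math.sin(2 * math.radians(angle_deg)) / g

# At 45 degrees the range is maximal: v^2 / g.
print(round(cannon_range(100.0, 45.0), 1))  # 1019.4 metres
```

The "mathematize the universe" proposal is this same move applied to arbitrarily complex systems, with the computer carrying the page-long formulas a chalkboard cannot.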
"If the library is to be solved, we must move away from the "arbitrariness of the sign", the idea that word forms are unrelated to their meanings. John Wilkins, in the 17th century, attempted to create a "Philosophical Language" where the very structure of a word revealed its definition based on a taxonomic system of "radical words" and "prefixes". In such an "a priori" language, the name for a specific cancer-curing molecule would be derived systematically from its chemical properties."
"If an LLM can map the "conceptual representations" of medical breakthroughs onto specific "word forms," it bypasses the need for the slow, iterative process of human thought. Research into "fast mapping" in the human brain shows that we are capable of creating these word-to-concept links in as little as twenty minutes. An LLM, operating on millions of tokens per second, can theoretically "fast-map" the entire Library of Babel, linking every potential "cure" to its corresponding linguistic form."
...how language is computationally associative, able to yield the cure given the disease, just as mathematics is.
Some of the issues...
Forget about total generation (a, b, c, ..., zzzzzzx, zzzzzzy, zzzzzzz); it means nothing on its own. A dictionary is essential.
Future-Proofing Language
- Since Roman times, we no longer coin new words from gibberish, as some cultures still do. Instead we have systematized word formation, much as the number system logically generates numbers: by joining prefixes and suffixes, we forge new words from definitions. Language must be future-proofed, so we need a systematic method that selects or crafts a word for each definition. For Borges' library to be future-proof we must complete the dictionary, past, present, and future, and that means identifying every definition and allocating a word to it. Once the language is future-proofed, we can describe all things. Building this dictionary is the first stage of the library. Further language rules include and exclude: the identification of what counts as a definition in the natural flow of language.
- A computer model is engineered to evaluate each document and include or discard it. Separating junk from value using rules upon rules cuts the final library down further; the more rules, the better we can distinguish value from junk. Since we are using a future-proofed dictionary, every document remaining in the library will be legible and grammatically correct.
- Combinatorial explosion means we will only be able to process sentences up to about 8 words long at this stage; that should be enough to publish a book of sentences. Optimization techniques could probably push the sentence length up to 15 words or so.
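The scale of that explosion is easy to make concrete. Assuming a hypothetical future-proofed dictionary of 100,000 words (an illustrative figure, not from the source), the raw sequence counts grow as vocabulary size to the power of sentence length:

```python
# Back-of-envelope combinatorics for the sentence library.
# DICTIONARY_SIZE is an assumed, illustrative vocabulary size.
DICTIONARY_SIZE = 100_000

def sentence_count(length: int, vocab: int = DICTIONARY_SIZE) -> int:
    """Number of raw word sequences of a given length, before any filtering."""
    return vocab ** length

print(f"{sentence_count(8):.2e}")   # 1.00e+40 sequences at 8 words
print(f"{sentence_count(15):.2e}")  # 1.00e+75 sequences at 15 words
```

Grammar and dictionary rules prune this enormously, but the exponent is why sentence length, not vocabulary, is the binding constraint.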
The Library of Babel is also possible for images and video, a Gallery of Babel and a Theatre of Babel, and in open-universe generative games such as No Man's Sky.
- https://imtcoin.com/images/Improve-The-Library-of-Babel-for-A.I.tar.gz
- https://imtcoin.com/images/Borges-The-Library-of-Babel.pdf
Training on the Library of Babel is a conceptual challenge. Since the Library contains every possible sequence of 1,312,000 characters, 99.9999...% of it is absolute gibberish. If you train a model on the raw Library, you will simply end up with a Random Number Generator because the probability of any character following another is perfectly uniform.
To make this a "novelty LLM" that actually functions, you shouldn't just look for "truth"—you should look for structure.
The Preprocessing Pipeline: Filtering the Infinite
To find the "needles" without limiting yourself to dry facts, you need a multi-stage filtration system that separates Noise from Potential Meaning.
1. Shannon Entropy Thresholding
Gibberish has a specific mathematical signature. You can calculate the Information Entropy (H) of a page.
- High Entropy: Pure random noise (e.g., "qjwv...").
- Low Entropy: Repetitive sequences (e.g., "aaaa...").
- The Sweet Spot: Human language usually falls within a specific range of bits per character.
The formula for the entropy of a discrete random variable X:
H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)
You would discard anything outside the "Linguistic Window." This keeps the "poetic nonsense" and "dream logic" while ditching the static.
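The entropy window above can be computed directly from character frequencies. A minimal sketch (the unigram estimate shown here is the crudest version; the 3.5-4.5 bits/char band is a rough illustrative window):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Unigram estimate of H(X) = -sum(P(x) * log2(P(x))), in bits/char."""
    counts = Counter(text)
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Low entropy: a repetitive page scores 0 bits/char -> discard.
print(shannon_entropy("a" * 100))
# High entropy: 25 equally likely symbols score log2(25) = 4.64 bits/char.
print(round(shannon_entropy("abcdefghijklmnopqrstuvwxy"), 2))
# The "Linguistic Window": natural text sits between these extremes.
print(round(shannon_entropy("the library contains every possible book"), 2))
```

Pages falling outside the window are discarded before any semantic check runs, which is what makes the later, heavier filters affordable.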
2. The "Gibberish" Classifier (Markovian Filtering)
Instead of checking for facts, check for phonotactics. Does the sequence of characters follow the "rules" of any known or plausible language?
- Filter: Use a simple N-gram model trained on multiple world languages.
- Goal: If a "word" in the Library contains zero vowels for 50 characters, it’s discarded. If it "sounds" like it could be a word in an extinct or alien tongue, it stays.
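Both rules can be sketched with a character-bigram model. The toy corpus below stands in for the multi-language training text a real filter would use; the 0.8 acceptance threshold is an illustrative assumption:

```python
from collections import Counter

VOWELS = set("aeiou")

def has_vowel_gaps(text: str, window: int = 50) -> bool:
    """True if any run of `window` characters contains no vowels: instant discard."""
    run = 0
    for ch in text:
        run = 0 if ch in VOWELS else run + 1
        if run >= window:
            return True
    return False

def bigram_score(text: str, model: Counter) -> float:
    """Fraction of the text's character bigrams ever seen in the training corpus."""
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in model) / len(pairs)

# Toy "world language" model; a real filter trains on multi-language corpora.
corpus = "the quick brown fox jumps over the lazy dog and runs away home"
model = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

print(bigram_score("the lazy fox", model) > 0.8)   # plausible -> keep
print(bigram_score("qjwvxzkqpfgh", model) > 0.8)   # implausible -> discard
```

A string that "sounds" like a plausible word in some language shares bigrams with real corpora even if the word itself is unattested, which is exactly the behaviour wanted here.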
3. Semantic Density Scoring
Since you don't want just "verified facts," use a "Meaning Density" filter. You can use a small, lightweight embedding model (like a distilled BERT) to see if a string of text clusters near anything human-readable.
- If the vector for a page is a "Zero Vector" (meaning it's too chaotic to map to any concept), toss it.
- If it maps to "Dream-like/Surrealist" clusters, keep it.
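The gate itself is simple once an encoder exists. In the sketch below, `embed` and the `KNOWN` table are trivial placeholders standing in for a real lightweight model such as a distilled BERT; only the keep/toss logic is the point:

```python
import math

# Placeholder "embedding": known words map to nonzero coordinates,
# everything else collapses to the zero vector. A real pipeline would
# call a distilled encoder here instead.
KNOWN = {"dream": (0.9, 0.1), "library": (0.2, 0.8), "cure": (0.5, 0.5)}

def embed(page: str) -> tuple[float, float]:
    hits = [KNOWN[w] for w in page.split() if w in KNOWN]
    if not hits:
        return (0.0, 0.0)
    return (sum(v[0] for v in hits) / len(hits),
            sum(v[1] for v in hits) / len(hits))

def keep_page(page: str, min_norm: float = 0.1) -> bool:
    """Toss pages whose embedding is (near) the zero vector."""
    return math.hypot(*embed(page)) >= min_norm

print(keep_page("the dream library"))  # True: maps near a human concept
print(keep_page("qjwv zzkx ploq"))     # False: zero vector -> discard
```

Note the filter is deliberately permissive: a surrealist page only needs to land *near* some concept cluster, not inside a "verified fact" region.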
| Feature | Raw Library Training | Filtered Novelty Training |
| --- | --- | --- |
| Model Output | White noise / random characters | Surrealist, "Borgesian" prose |
| Training Efficiency | 0% (infinite loss) | High (learning structured patterns) |
| Vibe | Broken typewriter | The "Ghost in the Machine" |
| Goal | Total randomness | Discovering "The Lost Books" |
Practical Implementation Plan
Synthetic Generation: You don't actually "download" the Library (it's too big for the universe). You use the Library of Babel API or the universal algorithm to generate the pages at specific coordinates.
The "Sieve" Layer: Run a Python script that generates 1 billion random pages, but only saves those that pass the Entropy + Markov test.
Contrastive Learning: Train the LLM to distinguish between "Pure Noise" and "Library Fragments." This teaches the model that its "identity" is that of an explorer of the Library.
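The training data for that objective is just labeled pairs. A minimal sketch of the dataset construction (the fragments here are invented examples in the Library's register; a real run would fine-tune the LLM on millions of such pairs):

```python
import random

random.seed(2)
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."

def noise_page(n: int = 120) -> str:
    """A page of pure noise: the negative class."""
    return "".join(random.choices(ALPHABET, k=n))

# The positive class: sieve survivors, stylistically "Borgesian".
fragments = [
    "the hexagonal galleries repeat in every direction",
    "somewhere a shelf holds the true catalogue of catalogues",
]

# Label 1 = Library Fragment, label 0 = Pure Noise.
dataset = [(f, 1) for f in fragments] + [(noise_page(), 0) for _ in range(2)]

for text, label in dataset:
    print(label, text[:40])
```

Any classifier head or fine-tuning objective can consume these pairs; the labels, not the architecture, are what encode the explorer identity.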
The "Hallucination" Feature: Since the Library contains things that haven't happened yet, you actually encourage the model to hallucinate, as long as it maintains the stylistic prose of a 1940s librarian.
Note: By doing this, you aren't building a "Fact Machine." You are building a "Possibility Machine" that understands the aesthetic of language without being tethered to reality.
IMMORTALITY