Library of Babel Puritan Strategies
Strategies for achieving puritans work in borges library of babel, what strategies can be used to only keep legible pages in library of babel, these are math problems like combinatorial explosion etc
The Puritan's Work
Strategies for the Library of Babel
The Library of Babel contains every possible page of 3,200 characters drawn from 25 symbols — a space so vast it dwarfs the observable universe. The Puritan's task is to cull this infinity down to only the legible pages. This is a problem of combinatorial explosion, information theory, and computational linguistics. Below are the seven principal strategies, arranged from cheapest to most powerful.
The Scale of the Problem
The Library uses 25 symbols (22 letters + space, comma, period). Each page has 40 lines × 80 characters = 3,200 characters. The total number of pages is:
If English has an entropy rate of ~1.5 bits/character versus the maximum of log2(25) ≈ 4.64 bits/char, then the number of English-like pages is roughly:
This means only ~1 in 103033 random pages is legible. No brute-force search can work. You need intelligent filtering strategies.
Strategy 1 — Entropy-Based Thresholding
Core idea: Natural language has much lower Shannon entropy than random text. A random page spreads probability uniformly over all 25 symbols; a legible page concentrates probability on a tiny subset of character sequences.
Compute the zero-order entropy of each page:
Filter out any page where H exceeds a threshold (e.g., 3.0 bits/char). This is cheap to compute and eliminates the vast majority of random pages since their entropy clusters near the maximum.
Limitation: This catches pages with any statistical structure, not just natural language. A page of "ABABABAB..." passes but isn't legible.
Strategy 2 — N-Gram Statistical Filtering
Core idea: Apply a cascading filter where each stage is more expensive but more selective. The power of n-gram analysis lies in the exponential growth of discriminability with n.
Stage 1 — Unigram filter:
Compare character frequency distribution against expected English using a χ² test. Eliminates ~99%+ of random pages cheaply. This is O(n) and requires only a single pass.
Stage 2 — Bigram filter:
Check pairs of consecutive characters. English has highly non-uniform bigram distributions (e.g., "TH" is common, "QZ" is not). Eliminates ~99.9%+ of remaining candidates. Only 625 possible bigrams to track.
Stage 3 — Trigram filter:
Three-character sequences are even more diagnostic. The space of valid English trigrams is a small fraction of 25³ = 15,625 possibilities. This stage begins to capture syntactic patterns.
Stage 4 — Word-level filter:
After segmenting by spaces, check: word lengths are plausible (1–15 characters), words appear in a dictionary or follow morphological patterns, and word frequency distributions follow Zipf's law.
Mathematical advantage: Each stage reduces the candidate set by orders of magnitude. If stage k eliminates fraction fk of candidates, the combined filter eliminates fraction:
Strategy 3 — Compression-Based Filtering
Core idea: A page is legible if it is compressible — if its Kolmogorov complexity is significantly less than its length. By the incompressibility theorem, random strings are incompressible with high probability. Any significant compression implies algorithmic structure.
Implementation: Compress each page with a standard algorithm (gzip, LZMA, PPM). If the compression ratio falls below a threshold (e.g., compressed size < 60% of original), the page likely contains structured information.
Practical advantage: Compression algorithms implicitly capture n-gram statistics, repetition, and higher-order patterns all at once — they serve as a "universal" statistical filter without requiring any explicit model of language.
Strategy 4 — Language Model Perplexity
Core idea: Use a language model to compute the perplexity of each page. Low perplexity indicates likely natural language. This is the most powerful single filter available.
A good language model captures long-range dependencies (subject-verb agreement across clauses), semantic coherence patterns, syntactic structure, and pragmatic conventions — all at once.
Hierarchical deployment: Use a cheap character-level RNN/transformer for first-pass filtering, then a larger model for refined scoring of survivors.
Strategy 5 — Typographical / Structural Constraints
Core idea: Before any statistical test, apply hard structural constraints that legible pages must satisfy. These are O(n) cheap to check and compound multiplicatively. A page failing any constraint is rejected instantly.
Strategy 6 — The Generative Inverse Approach
The key insight: Instead of filtering the Library, generate all legible pages directly. This reframes the problem from "search through 253200 pages" to "enumerate all strings of length 3,200 that are within distance δ of natural language under some model."
This is equivalent to channel coding: think of natural language as a code, and the Library as the space of all possible received signals. The set of legible pages is the "decoding" of the codebook.
Practical approach:
1. Train a language model on real text
2. Enumerate all strings where the model assigns probability > threshold
3. This set, while still huge, is guaranteed to contain only legible pages
Theoretical bound: If the language model has cross-entropy H, the number of generated pages is at most 2H×3200, which is exactly the right order of magnitude (~101443 for English).
Strategy 7 — Combinatorial Enumeration with Constraints
Core idea: For a more rigorous combinatorial approach, count the number of pages satisfying progressively tighter constraint sets, turning the problem into exact enumeration.
Level 0 (No constraints):
253200 ≈ 104476
Level 1 (Character frequency bounds):
Pages where each character appears between amin and amax times. By multinomial counting, this reduces the space by a factor of roughly 25! ≈ 1.55 × 1025.
Level 2 (Bigram constraints):
Further restrict to pages where each bigram appears a plausible number of times. This is a Markov chain counting problem — count the number of walks in a weighted graph of length 3,200.
Level 3 (Word-level constraints):
Model as a regular language (all strings over the 25 symbols that segment into dictionary words). The number of such strings of length n satisfies a linear recurrence and can be computed via the transfer matrix method.
Optimal Multi-Stage Architecture
The practical answer combines all strategies into a funnel. Each stage is progressively more expensive but applied to a dramatically smaller set. The total cost is dominated by the first cheap stage applied to all pages.
The Fundamental Impossibility Result
Despite all these strategies, there is a deep information-theoretic impossibility: the Puritan can never be sure they've found all legible pages.
1. Legibility is undecidable
Determining if a page contains a coherent argument requires solving problems equivalent to the halting problem. There is no algorithm that can, in general, distinguish profound but obscure text from well-structured nonsense.
2. The set is too large to enumerate
Even 101443 legible pages is beyond any physical process. The number of atoms in the observable universe is only ~1080. No physical computer could list, store, or even count these pages.
3. False positives are inevitable
Any statistical filter will admit some nonsensical pages that happen to have the right statistics. The boundary between "legible" and "illegible" is not sharp — it is a gradient that no threshold can perfectly separate.
The Puritan's work is, in the deepest sense,
asymptotically completable but never finished
— a fitting metaphor for the Library itself.
THE LIBRARY OF BABEL — COMBINATORIAL STRATEGIES
IMMORTALITY