Detecting definitions in a page of literary text

When scanning the Library of Babel, what are the most important things to look for? In images, it could be schematics, engineering plans, and CAD drawings. In text, it could be definitions: a word or words preceding an explanation of what they mean. Perhaps these are the most important things. Their truth and feasibility could then be assessed, or they could be compiled into a processable list.

🔍 1. Clarify What Counts as a "Definition" in Fiction

In novels, definitions typically appear as:

  • Explicit cue phrases: X means Y, X is defined as Y, known as X, called X
  • Appositives: The obsidian, a volcanic glass, glittered...
  • Colon/parenthetical glosses: Grok: to understand deeply... or He carried a kama (a short farming knife)...
  • Relative/explanatory clauses: A quidditch seeker, who flies fastest to catch the Snitch, ...
  • Dialogue explanations: "You mean a chronometer?" "Yes. A pocket watch, but precise."

🛠️ 2. Tiered Detection Strategies

✅ Tier 1: Lexical & Pattern Matching (Fast, Interpretable)

Use regular expressions or string matching for definitional cue phrases and punctuation structures.

Common cue phrases:

CUES = [
    r"\bmeans\b", r"\bis defined as\b", r"\bis known as\b", r"\bis called\b",
    r"\brefers to\b", r"\bin other words\b", r"\bthat is\b", r"\bi\.e\.",
    r"\bnamely\b", r"\bdescribed as\b", r"\bcan be defined as\b"
]
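As a rough illustration of how the cue list might be applied, the sketch below compiles the cues into a single case-insensitive pattern and flags the sentences that contain one. The function name flag_cue_sentences and the naive sentence splitter are assumptions made for this example; a real pipeline would likely use a proper sentence segmenter, since this splitter also breaks after abbreviations such as "i.e.".

```python
import re

# Cue list copied from above; compiled into one case-insensitive alternation.
CUES = [
    r"\bmeans\b", r"\bis defined as\b", r"\bis known as\b", r"\bis called\b",
    r"\brefers to\b", r"\bin other words\b", r"\bthat is\b", r"\bi\.e\.",
    r"\bnamely\b", r"\bdescribed as\b", r"\bcan be defined as\b",
]
CUE_RE = re.compile("|".join(CUES), re.IGNORECASE)

def flag_cue_sentences(text):
    """Return (sentence, matched_cue) pairs for sentences containing a cue phrase."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        match = CUE_RE.search(sentence)
        if match:
            flagged.append((sentence, match.group(0).lower()))
    return flagged

page = "The wind rose. A chronometer is defined as a precise timepiece. He slept."
print(flag_cue_sentences(page))
```

Flagged sentences are only candidates: cues such as "that is" and "means" also occur in ordinary narration, so a later filtering step is still needed.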

Structural patterns:

  • Term : explanation
  • Term (explanation)
  • Term, a/an [noun phrase], ... (appositive)
  • Term, which/that [clause explaining it], ...

Quick Python example:

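import re

def extract_pattern_definitions(text):
    # NOTE: the source listing was cut off after r'(?P'; the two patterns below are an
    # illustrative reconstruction based on the cue phrases and structural patterns above.
    patterns = [
        r'(?P<term>\b\w+) (?:means|is defined as|refers to) (?P<definition>[^.;]+)',  # X means Y
        r'(?P<term>\b\w+) \((?P<definition>[^)]+)\)',  # Term (explanation)
    ]
    return [m.groupdict() for p in patterns for m in re.finditer(p, text)]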

✅ Pros: Fast, transparent, zero training data.
❌ Cons: Misses implicit/contextual definitions, sensitive to style/punctuation.

🌲 Tier 2: Syntax & Dependency Parsing (More Robust)

Use an NLP parser to find equative or explanatory grammatical relationships.

Key spaCy dependency relations:

  • appos: apposition (The kama, a farming knife, ...)
  • attr/acomp: copular definitions (A chronometer is a precise timepiece.)
  • acl/relcl: relative clauses defining a noun
  • conj: coordinated explanatory phrases

spaCy example:

import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def phrase(token):
    # Text of the full phrase headed by `token` (its dependency subtree)
    return token.doc[token.left_edge.i : token.right_edge.i + 1].text

def extract_syntax_definitions(text):
    doc = nlp(text)
    defs = []
    for sent in doc.sents:
        for token in sent:
            # Copular definition: X is a Y (take the grammatical subject as the term)
            if token.dep_ == "attr" and token.pos_ in ("NOUN", "PROPN") and token.head.lemma_ == "be":
                subjects = [c for c in token.head.children if c.dep_ == "nsubj"]
                if subjects:
                    defs.append({"term": subjects[0].text, "definition": phrase(token)})
            # Apposition: Term, a Y, ...
            elif token.dep_ == "appos" and token.head.pos_ in ("NOUN", "PROPN"):
                defs.append({"term": token.head.text, "definition": phrase(token)})
    return defs

✅ Pros: Captures grammatical structure, less brittle than regex.
❌ Cons: Still misses semantic nuance, requires parsing overhead.

🤖 Tier 3: Machine Learning / LLM-Based Extraction (Highest Recall)

For literary text where definitions are implicit or stylistically varied, transformer models or LLMs usually recover more definitions than pattern-based or parser-based methods.

Option A: LLM Prompting (Zero/Few-Shot)

Extract any terms being defined or explained on the following page. 
Return a JSON list of {"term": "...", "definition": "..."} pairs. 
Only include explicit or clear contextual definitions. Ignore metaphors, examples, or general descriptions.
Text: {page_text}

Use structured output (JSON mode) and temperature ~0.2 for consistency.
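One way to wire this prompt into code, sketched without calling any particular LLM API: build the prompt, then parse and validate whatever JSON comes back. The helper names build_prompt and parse_definitions, the sample reply, and the rule "drop any term that does not literally appear on the page" are assumptions for illustration; the last is a cheap guard against hallucinated terms, not a full solution.

```python
import json

PROMPT_TEMPLATE = (
    "Extract any terms being defined or explained on the following page.\n"
    'Return a JSON list of {"term": "...", "definition": "..."} pairs.\n'
    "Only include explicit or clear contextual definitions. "
    "Ignore metaphors, examples, or general descriptions.\n"
    "Text: {page_text}"
)

def build_prompt(page_text):
    # str.format would choke on the literal JSON braces, so substitute by hand.
    return PROMPT_TEMPLATE.replace("{page_text}", page_text)

def parse_definitions(raw_response, page_text):
    """Keep only well-formed pairs whose term literally occurs on the page."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    page_lower = page_text.lower()
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        term, definition = item.get("term"), item.get("definition")
        if not (isinstance(term, str) and isinstance(definition, str)):
            continue
        term = term.strip()
        if term and term.lower() in page_lower:
            kept.append({"term": term, "definition": definition.strip()})
    return kept

page = "He carried a kama (a short farming knife) at his belt."
# Hypothetical model reply: the second item names a term that is not on the page.
fake_reply = (
    '[{"term": "kama", "definition": "a short farming knife"},'
    ' {"term": "katana", "definition": "a long sword"}]'
)
print(parse_definitions(fake_reply, page))
```

The string returned by build_prompt(page) would be sent to whichever model client is in use; only the parsing and validation step is shown here.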

Option B: Fine-Tuned Relation Extraction

  • Frame as term-definition relation extraction.
  • Use datasets like SemEval-2020 Task 6 (DeftEval, definition extraction), ACL Anthology glosses, or create a small literary annotation set.
  • Fine-tune a model like deberta-v3-base or roberta-large with token/span labeling or pair classification.
  • Tools: transformers, spacy train (spaCy's training CLI), prodigy (for annotation).
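For the token/span labeling framing, training data is commonly encoded as one BIO tag per token. The sketch below shows that encoding for a single annotated sentence; the label names TERM and DEF and the helper bio_tags are assumptions chosen for this example, not a fixed standard.

```python
def bio_tags(tokens, term_span, definition_span):
    """Label tokens with B-/I- tags for TERM and DEF spans; everything else is O.

    Spans are (start, end) token indices with an exclusive end.
    """
    tags = ["O"] * len(tokens)
    for label, (start, end) in (("TERM", term_span), ("DEF", definition_span)):
        for i in range(start, end):
            tags[i] = ("B-" if i == start else "I-") + label
    return tags

tokens = ["He", "carried", "a", "kama", "(", "a", "short", "farming", "knife", ")", "."]
tags = bio_tags(tokens, term_span=(3, 4), definition_span=(5, 9))
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token/tag pairs in this shape can then be fed to a token-classification model after aligning them with the model's own subword tokenization.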

✅ Pros: Handles implicit/contextual definitions, adapts to literary style.
❌ Cons: Requires compute, prompt engineering, or labeled data; LLMs may hallucinate.

📊 3. Practical Pipeline Recommendation

  1. Start simple: Run regex + spaCy patterns. Evaluate precision on a few pages.
  2. Filter false positives: Literary text uses metaphors (He was a lion in battle). Add POS filters (require NOUN/PROPN as term) and semantic similarity checks if needed.
  3. Add LLM fallback: Pass low-confidence or unmatched sentences to an LLM with strict JSON output.
  4. Human-in-the-loop: Randomly sample 10-20% of extractions for manual validation. Track precision/recall.
  5. Iterate: Expand cue lists, adjust dependency filters, or fine-tune if volume justifies it.
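The human-in-the-loop step can be sketched as two small helpers: one draws a reproducible random sample for manual review, the other scores extractions against annotator-approved pairs. The names sample_for_review and precision_recall, the 15% default, and the exact-match comparison of (term, definition) pairs are assumptions for illustration; in practice a looser match on definitions may be more appropriate.

```python
import random

def sample_for_review(extractions, fraction=0.15, seed=0):
    """Pick a reproducible random subset of extractions for manual checking."""
    if not extractions:
        return []
    k = max(1, round(len(extractions) * fraction))
    return random.Random(seed).sample(extractions, k)

def precision_recall(predicted, gold):
    """Compare sets of (term, definition) pairs from the extractor and from annotators."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

predicted = [("kama", "a short farming knife"), ("lion", "in battle")]
gold = [("kama", "a short farming knife"), ("chronometer", "a precise timepiece")]
print(precision_recall(predicted, gold))  # (0.5, 0.5)
```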

⚠️ Key Caveats for Novels

  • Low density: Most pages won't contain definitions. Optimize for precision over recall.
  • Style dependence: 19th-century prose tends to favor appositives; modern fiction tends to favor dialogue glosses.
  • Metaphor vs definition: She was a storm ≠ definition. Use POS, semantic plausibility, or LLM disambiguation.
  • Cross-sentence definitions: Sometimes the term appears in sentence A, definition in B. Requires coreference resolution or sentence-pair modeling.
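For the cross-sentence case, a simple starting point is to generate adjacent sentence pairs and hand each pair to a classifier or LLM. The sketch below only builds the pairs; the helper name sentence_pairs and the naive punctuation-based splitter are assumptions for this example, and it does not attempt coreference resolution.

```python
import re

def sentence_pairs(text):
    """Return (sentence_a, sentence_b) for each pair of adjacent sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return list(zip(sentences, sentences[1:]))

page = (
    "She handed him an astrolabe. "
    "It was an instrument for measuring the stars. "
    "He nodded."
)
for a, b in sentence_pairs(page):
    print(a, "|", b)
```

Here the first pair places the term ("astrolabe") and its explanation ("It was an instrument...") side by side, which is the input a sentence-pair model would need.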

📦 Tools & Libraries to Use

  • spacy + en_core_web_trf (for high-accuracy parsing)
  • regex or re (pattern matching)
  • openai, anthropic, or open-weight models via vllm/ollama (LLM extraction)
  • prodigy or doccano (if annotating custom data)
  • scikit-learn / transformers (if training a classifier/relation extractor)
If you share a sample page, I can run a quick extraction demo using one of these methods and show exactly what gets caught vs missed.

  
