Abstract

LLM-generated code fails in production not because models are wrong about syntax. They fail because the code makes implicit promises that no one verifies. A function named parseUserInput implicitly promises to handle null. An API endpoint implicitly promises to return the shape its caller expects. These claims are invisible to compilers, invisible to linters, and invisible to the model that wrote them.

This paper defines claim extraction as a neurosymbolic primitive. The neural layer reads code and surfaces every implicit claim — things the code promises but does not prove. The symbolic layer (a deterministic oracle) tests each claim and returns a binary verdict. The loop between them is the architecture. Neither layer alone solves the problem.

Under Kautz's taxonomy, this is Type 5 (Neural[Symbolic]): the neural system does what symbols cannot (understand semantic intent), and the symbolic system does what neural cannot (provide deterministic ground truth). We benchmark this architecture on HumanEval (164/164 pass@3) and SWE-bench (+32.7% relative improvement over baseline). Cost: $0.003 per verification.

The Verification Problem

Current AI coding tools have a structural gap. The generation layer (LLM) produces code. The review layer (human or static analysis) checks syntax, style, and obvious bugs. No layer systematically verifies that the code does what it claims to do.

Static analysis tools (ESLint, mypy, Clippy) enforce rules. They catch undefined variables. They flag type mismatches. They cannot evaluate semantic claims. A function that handles errors by returning an empty array instead of throwing passes every linter. It fails every caller that expects an exception.

LLM-as-judge approaches use a second model to review the first model's output. This compounds the problem. Both models share the same failure modes. When GPT-4 misses a null-handling edge case, GPT-4-as-reviewer misses it too. The correlation between model errors means the review adds noise, not signal.

Human code review works, but it scales with headcount. At $200-500/hour for senior engineers, reviewing every AI-generated function is economically unsustainable as generation velocity increases. Teams are already shipping AI code faster than they can review it.

The gap: we need a verification layer that is semantically aware (understands what code claims), deterministic (returns ground truth, not probability), and fast enough to run on every commit. Claim extraction as a neurosymbolic primitive fills that gap.

Neurosymbolic Taxonomy: Where LUCID Sits

Henry Kautz's 2020 taxonomy defines six types of neurosymbolic integration. Most AI systems in production are Type 1 or 2. LUCID is Type 5.

Type	Notation	Description	Example
1	Symbolic ∪ Neural	Separate modules, no shared reasoning	Search + LLM re-ranking
2	Symbolic[Neural]	Symbolic outer, neural subroutine	Symbolic planner with neural perception
3	Neural[Symbolic]	Neural outer, symbolic subroutine	LLM with calculator tool
4	Neural \| Symbolic	Compiled together, shared representation	Differentiable logic programs
5	Neural[Symbolic] (shared task)	Both contribute to same reasoning task	LUCID, AlphaGeometry
6	Neural ≡ Symbolic	Full mutual embedding	Theoretical; not yet built

Most “AI + tools” architectures are Type 3: an LLM calls a calculator or a database and uses the result. The symbolic component is a subroutine. The neural component drives.

Type 5 is different. Both components contribute to the same reasoning task. In AlphaGeometry, the LLM generates candidate geometric constructions. The formal deduction engine verifies them. Neither can solve the problem alone. Both are necessary for the solution.

LUCID works the same way. The neural component extracts claims the symbolic layer cannot identify on its own. The symbolic component verifies claims the neural layer cannot evaluate reliably on its own. Remove either layer and the system fails.

Architecture: Extract-Verify-Remediate-Regenerate

The LUCID pipeline has four stages. Two neural, two symbolic, interleaved.

Extract (Neural) — An LLM reads a code file and produces a structured list of implicit claims. Claims describe what the code promises: input contracts, output contracts, error handling behavior, side effect guarantees. This stage requires semantic understanding. No regex, AST parser, or static analysis tool can reliably surface “this function promises to return a non-empty array when given valid input.” The LLM can.
Verify (Symbolic) — A deterministic oracle tests each claim against the actual code. For code claims, the oracle writes targeted test cases, executes them in an isolated subprocess, and captures stdout, stderr, and exit codes. The oracle does not hallucinate. It runs the code and reports what happens. A claim either passes or fails. No probability distribution. No confidence score. A binary verdict.
Remediate (Neural) — When claims fail, the LLM receives the oracle's exact output: the failing test, the actual output, the expected output. It generates a targeted fix. This neural stage is constrained by symbolic evidence — the model cannot rationalize away a failing test. The ground truth is deterministic.
Regenerate (Symbolic) — The oracle re-runs against the patched code. If all claims pass, the loop terminates. If claims still fail, the loop continues (up to a configurable max iterations). The loop itself is deterministic. The reasoning inside it is hybrid.

The critical design insight: the symbolic layer is the ground truth anchor. Without it, the neural layer can rationalize failures. With it, failures are facts. The neural layer must respond to facts, not opinions.

The Pluggable Oracle Pattern

The architecture separates claim extraction from claim verification. This separation is intentional. It makes the system domain-agnostic.

The neural layer (claim extraction) is identical across domains. It reads content and surfaces implicit claims. The symbolic layer (oracle) is swapped per domain.

Domain	Oracle	Verdict Source	Status
Code (general)	Test execution subprocess	Exit code, stderr, stdout	Production
Legal documents	Policy-to-codebase audit	Compliance gap report	Production
API contracts	Live endpoint testing	HTTP response schema diff	Beta
Database migrations	Schema validation + rollback test	Migration dry-run output	Beta
Security claims	SAST + fuzzer	CVE match, input mutation result	Planned

The pluggable pattern means the neural layer improves once and benefits all domains. A better claim extractor (better prompt, better model) makes every oracle more effective. Each new oracle adds a domain without touching the extraction layer.

This also constrains what counts as a valid oracle. The oracle must return deterministic verdicts. A second LLM is not a valid oracle — it produces probability distributions, not ground truth. A formal verifier is. A test runner is. A schema validator is.

Benchmarks

Full benchmark data, methodology, and reproducible results are at /proof. Summary:

HumanEval: 164/164 pass@3 (100%). Baseline LLM alone: 67.0% pass@1. LUCID achieves 100% pass@3 by catching and fixing failures in the remediate loop.
SWE-bench: +32.7% relative improvement over baseline. LUCID finds issues missed by the base agent and patches them before the solution is submitted.
Cost: $0.003 per file verified. For a 1,000-file codebase, full verification costs $3.00.
Speed: Average 8 seconds per file. CI pipeline integration adds under 2 minutes to a typical build.

The benchmark improvements are not from a bigger model. They are from architecture. The same base LLM, constrained by a deterministic oracle, produces substantially better output.

Comparison Matrix

Property	LUCID	LLM-as-Judge	Static Analysis	Manual Review
Semantic claim extraction	✓	✓	✗	✓
Deterministic verdict	✓	✗	✓	Variable
Auto-remediation	✓	✗	✗	✗
Scales with codebase size	✓	✓	✓	✗
Correlated failure modes	No	Yes	No	Low
Cost per file (USD)	$0.003	~$0.005	~$0.00	~$2–5
Domain-agnostic	✓ (pluggable oracle)	✓	✗ (language-specific)	✓

LLM-as-judge is the closest competitor on semantic coverage, but it shares failure modes with the model it reviews. Two GPT-4 instances miss the same edge cases. LUCID uses a deterministic oracle with uncorrelated failure modes — the oracle fails differently than the LLM, which is what makes it useful.

Static analysis is deterministic but semantically blind. It cannot catch a function that handles null by returning an empty array when the caller expects an exception. LUCID extracts that claim and tests it explicitly.

Marcus's Four Steps

In “The Next Decade in AI” (2020), Marcus argued that progress requires four advances: hybrid architectures, robust knowledge representations, reasoning mechanisms, and cognitive models of the world. LUCID maps directly onto this framework.

Marcus Requirement	LUCID Implementation	Status
Hybrid architectures combining neural and symbolic	Extract-verify-remediate-regenerate loop	Delivered
Robust knowledge representations	Structured claim objects with domain, evidence, verdict fields	Delivered (code + legal domains)
Reasoning mechanisms beyond pattern matching	Iterative refinement constrained by deterministic feedback	Delivered
Cognitive models of the world	Domain-specific oracle networks modeling program semantics	In progress (code semantics only)

LUCID delivers steps 1, 2, and 3 in the code domain. Each oracle we add to step 2 extends the knowledge representation to a new domain. Step 4 — a general world model — is the long-term research direction. The architecture supports it. We are not there yet.

Limitations

The claim extractor can hallucinate. An LLM reading a function might identify a claim the function does not actually make, then pass or fail it for the wrong reasons. We mitigate this with extraction prompt calibration and by requiring the oracle to produce reproducible test cases — if a test cannot be written, the claim is marked unverifiable rather than passed.

The oracle is only as good as its test execution environment. Flaky tests produce flaky verdicts. We run each claim verification three times and require consistent results before recording a verdict. Persistent flakiness is reported as a separate finding (non-deterministic behavior is itself a claim violation for most code).

The loop can diverge. A model that fixes one claim and breaks another can cycle indefinitely. We cap iterations at 5 by default and report unfixed claims with their failure history. The system never blocks a deployment — it reports findings and returns an exit code for CI integration.

The architecture requires a deterministic oracle for each domain. For domains where no deterministic oracle exists (brand voice consistency, subjective design quality), LUCID cannot be applied in its current form. LLM-as-judge is the only option for those domains, and its limitations apply.

We have not yet evaluated whether the claim extractor generalizes across programming languages equally well. Python and TypeScript have the most validation data. Lower-resource languages may show reduced extraction quality.

References

Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
Kautz, H. (2020). The Third AI Summer. AAAI Robert S. Engelmore Memorial Award Lecture. (Neurosymbolic taxonomy, Types 1–6)
Garcez, A. & Lamb, L. (2023). Neurosymbolic AI: The 3rd Wave
Trinh, T., Wu, Y., Le, Q., He, H., & Luong, T. (2024). Solving olympiad geometry without human demonstrations (AlphaGeometry)
Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval)
Jimenez, C. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
LUCID benchmark data and live verification results: tryassay.ai/proof

Claim Extraction as a Neurosymbolic Primitive