A grammar and compile-fix loop for verifiable LLM code generation
What we found. A frontier coding agent's general-purpose code generation capability is both an asset and a liability for domain-specific tasks. The same fluency that lets it produce syntactically valid programs also lets it produce programs that run without raising exceptions yet silently compute incorrect answers. We observed three things while designing our approach:
1. The compiler encodes domain knowledge outside the weights. Actuarial rules like "a deductible factor applies only to COMP and COLL" and "tables must be total functions over their enum inputs" are not things a frontier model reliably learns during training. The RSL compiler externalizes this knowledge: each compiler check is a domain fact delivered as an error message at exactly the moment it's relevant. §2
2. Code generation ability transfers to constrained grammars. RSL is designed to be close enough to familiar programming patterns that the model writes mostly syntax-valid RSL without any RSL-specific training data. A carefully designed grammar is learnable from a single reference file, not a training corpus. §4
3. Programmatic access to the domain model enables symbolic reasoning over long-context tasks. Early in development, we observed that our agent's performance improved across all use cases when given a persistent code execution environment rather than a suite of tool schemas. We restructured the platform so every agent primarily operates a persistent Jupyter kernel where the rating algorithm (its AST, type system, and lookup tables) exists as Python objects the agent manipulates directly. Zhang et al.'s Recursive Language Model (2025) subsequently formalized this principle: give agents a REPL environment, make context available programmatically rather than feeding it directly into the context window, and let the model recursively invoke itself over programmatic slices of arbitrarily long prompts. Our architecture is a natural extension of this idea, going beyond text-based long-context evals to domain-specific code generation for enterprise tasks.
What we're exploring. LLM-assisted design of constraint languages for new domains, and transferability of this pattern beyond insurance rating. More on these directions below. §5
A rating algorithm computes an insurance premium by composing a small number of primitive operations: multiplicative chaining (P = base × f₁ × f₂ × … × fₙ), where each fᵢ is resolved via a table lookup (a keyed function from input attributes to a factor value); additive charges (flat dollar amounts added to the chain); clamping (min/max premium constraints, rate caps against prior-term premiums); conditional branching (the computation structure changes based on input values); aggregation across sub-entities (sum over vehicles, exposures, or locations); self-referential steps, where the premium depends on its own intermediate value (premium discounts, retrospective rating); and parallel indexed dimensions, where the same chain is evaluated independently per coverage line with different factor subsets. Production algorithms compose these primitives to increasing depth, and some products implement variations on them, such as continuous-valued interpolation between table lookup values.
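To make a few of these primitives concrete, here is a minimal Python sketch of a multiplicative chain with table lookups, an additive charge, clamping, and aggregation over vehicles. The table names, factor values, base rate, and premium bounds are all invented for illustration and do not come from any filing.

```python
from __future__ import annotations

from typing import Mapping

# Hypothetical lookup tables: keyed functions from an input attribute to a factor.
TERRITORY_FACTOR = {"T1": 1.10, "T2": 0.95}
DEDUCTIBLE_FACTOR = {"500": 0.95, "1000": 0.90}


def lookup(table: Mapping[str, float], key: str) -> float:
    """Strict table lookup: a missing key raises KeyError instead of defaulting."""
    return table[key]


def rate_coverage(base: float, factors: list[float], flat_charge: float,
                  min_premium: float, max_premium: float) -> float:
    """Multiplicative chain, then an additive charge, then clamping."""
    premium = base
    for factor in factors:
        premium *= factor
    premium += flat_charge
    return max(min_premium, min(premium, max_premium))


def rate_policy(vehicles: list[dict]) -> float:
    """Aggregation across sub-entities: sum the per-vehicle premiums."""
    total = 0.0
    for vehicle in vehicles:
        factors = [
            lookup(TERRITORY_FACTOR, vehicle["territory"]),
            lookup(DEDUCTIBLE_FACTOR, vehicle["deductible"]),
        ]
        total += rate_coverage(200.0, factors, flat_charge=5.0,
                               min_premium=50.0, max_premium=1000.0)
    return round(total, 2)


one_vehicle = rate_policy([{"territory": "T1", "deductible": "500"}])
```

For a single vehicle in territory `T1` with a `500` deductible, the chain is 200 × 1.10 × 0.95 plus the flat 5.00 charge, which stays inside the clamp bounds.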
RSL (Rating Specification Language) is a typed, compiled language purpose-built for insurance premium rating. An agent writes RSL instead of Python. Both compute the same rating algorithm. The difference is what happens when the implementation is wrong.
The RSL compiler validates domain constraints at compile time. The subset below is most relevant to the error modes observed in our experiments:
[c: {comp, coll}] applied to BI → compile error.
"25" not in {"0 15","0 25",...} → error.

RSL compiles to three output formats from a single source: Python (2,846 lines for the 43-step rater), Excel (14 MB actuarial review workbook with per-step worksheets), and JSON (API integration). This eliminates the reimplementation boundary between actuarial review and engineering deployment. Because the compiler controls code generation, it can apply deterministic optimizations (vectorized table lookups, precomputed index structures) that an LLM generating Python from scratch would not reliably produce. There is no transcription step where errors can enter.
rsl compile

RSLLib exposes the compiler's structural analysis as a query and mutation API. In a Python rater, determining which functions affect the premium requires tracing through arbitrary imperative code, typically function calls, dictionary lookups, and conditional branches. In RSL, it is a function call on the AST.
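RSLLib's actual API is not reproduced in this post, so the following is only a sketch of the underlying idea, assuming the program is available as a dependency graph. The step names and the `steps_affecting` helper are hypothetical; the query is a backward walk from the output node.

```python
from __future__ import annotations

# Toy dependency graph: each step maps to the steps it reads from.
# Step names are hypothetical and do not come from a real filing.
DEPENDS_ON = {
    "premium": ["after_round"],
    "after_round": ["chain"],
    "chain": ["base_rate", "territory_factor"],
    "base_rate": [],
    "territory_factor": [],
    "unused_correction": ["base_rate"],
}


def steps_affecting(target: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every step reachable by walking dependencies back from target."""
    seen: set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for dep in graph[node]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


affecting = steps_affecting("premium", DEPENDS_ON)
```

In this toy graph, `unused_correction` never appears in the result, because nothing on the path to `premium` reads from it.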
The API includes mutation functions that preserve comments, whitespace, and annotations through round-trip edits. Agents reason about rating structure at the semantic level (coverage chains, participation sets, stub inventories, etc.) rather than manipulating text.
RSL scenarios enable us to override natural-language stubs with known scalar values, hand-compute expected outputs, and assert results before any table data is extracted from the filing. This allows faster iteration to catch issues like operator errors (× vs. +), incorrect chain ordering, and coverage subset gaps at the skeleton stage.
If the agent mistakenly annotates deductible_factor as [c: {COMP, COLL, BI}], the BI assertion fails immediately: 200 × 1.10 × 0.95 = 209 ≠ 220. The agent gets a concrete failed assertion on a known value, not a mysterious premium discrepancy on real data with 43 interacting steps.
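The arithmetic above can be replayed in plain Python. This is not RSL scenario syntax; it is a sketch that assumes 200 is the BI base rate, 1.10 is a stubbed factor that applies to every coverage, and 0.95 is the stubbed deductible factor. The function name is hypothetical.

```python
from __future__ import annotations

# Stubbed scalar values, standing in for tables not yet extracted from the filing.
BASE_RATE = 200.0
GENERAL_FACTOR = 1.10      # assumed to apply to every coverage
DEDUCTIBLE_FACTOR = 0.95   # should apply only to COMP and COLL


def bi_premium(deductible_coverages: set[str]) -> float:
    """BI premium given the coverage subset annotated on deductible_factor."""
    premium = BASE_RATE * GENERAL_FACTOR
    if "BI" in deductible_coverages:
        premium *= DEDUCTIBLE_FACTOR
    return round(premium, 2)


expected_bi = 220.0  # hand-computed: 200 x 1.10
correct = bi_premium({"COMP", "COLL"})
mistaken = bi_premium({"COMP", "COLL", "BI"})
```

With the correct annotation the hand-computed value of 220 is reproduced; with BI wrongly included the result is 209, so an assertion against `expected_bi` fails at the skeleton stage.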
The RSL compiler's validation layer is a pluggable system of domain checks. Each check is a Python class that declares which AST node types it inspects. A runner walks the parsed program, dispatches to registered checks per node, and calls a finalization pass after the full traversal. New checks are added by subclassing Check and appending to the check list, with no changes to the compiler core.
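A minimal sketch of this kind of pluggable design is shown below. It is an illustration of the pattern, not the RSL compiler's code: the `Node` shape, the `visit` and `finalize` method names, and both example checks are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a kind, a name, and child nodes."""
    kind: str
    name: str
    children: list[Node] = field(default_factory=list)


class Check:
    """Base class: subclasses declare the node kinds they inspect."""
    node_kinds: tuple[str, ...] = ()

    def __init__(self) -> None:
        self.errors: list[str] = []

    def visit(self, node: Node) -> None:
        """Called once for each node whose kind is in node_kinds."""

    def finalize(self) -> None:
        """Called once after the full traversal."""


class DuplicateStepCheck(Check):
    node_kinds = ("step",)

    def __init__(self) -> None:
        super().__init__()
        self.seen: set[str] = set()

    def visit(self, node: Node) -> None:
        if node.name in self.seen:
            self.errors.append(f"duplicate step: {node.name}")
        self.seen.add(node.name)


class HasOutputCheck(Check):
    node_kinds = ("output",)

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def visit(self, node: Node) -> None:
        self.found = True

    def finalize(self) -> None:
        if not self.found:
            self.errors.append("program has no output")


def run_checks(root: Node, checks: list[Check]) -> list[str]:
    """Walk the tree, dispatch per node kind, then run finalization passes."""
    def walk(node: Node) -> None:
        for check in checks:
            if node.kind in check.node_kinds:
                check.visit(node)
        for child in node.children:
            walk(child)

    walk(root)
    errors: list[str] = []
    for check in checks:
        check.finalize()
        errors.extend(check.errors)
    return errors


program = Node("program", "p", [Node("step", "a"), Node("step", "a")])
errors = run_checks(program, [DuplicateStepCheck(), HasOutputCheck()])
```

In this sketch, adding a rule means writing one more `Check` subclass and appending an instance to the list passed to `run_checks`; the walker itself does not change.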
The extensibility mechanism is a key architectural property. Adding a new domain rule means adding a class. The current check suite spans six categories, from pure structural analysis to insurance-specific domain heuristics:
| Category | Representative rule |
|---|---|
| Structural integrity | Every assignment connects to a terminal output (DataFlowCheck) |
| Dimensional types | Premium-typed output requires a Rate-valued anchor; multiplying only Factors produces a ratio, not dollars (DimensionCheck) |
| Coverage participation | Indexed assignments only reference attributes covering their element subset (ParticipationFlowCheck) |
| Table domain coverage | Enum-typed lookup columns cover the full domain: tables must be total functions over their inputs (DomainCoverageCheck) |
| Data quality heuristics | String-typed table columns that match a declared enum are flagged for reannotation (StrColumnEnumSuggestionCheck) |
| Scenario validation | Test instances provide all required inputs with correct types and values within table range bounds (ScenarioInputCheck) |
The majority of checks go beyond what a general-purpose type system can express. For example, DomainCoverageCheck loads the actual CSV lookup tables at compile time and verifies that every enum member in the type declaration appears as a row in the data, a table-completeness property. CrossTableEnumCoverageCheck compares enum coverage across sibling tables: if the territory factor table has 12 territory codes but the base rate table has only 11, the discrepancy is flagged. These are actuarial completeness properties verified against real data, not type errors in the traditional sense.
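The table-completeness idea can be sketched with the standard library alone. This is not the compiler's DomainCoverageCheck; the enum, the inline CSV, and the `missing_enum_rows` helper are hypothetical stand-ins.

```python
import csv
import io
from enum import Enum


class Territory(Enum):
    """Hypothetical declared enum for a lookup column."""
    T1 = "T1"
    T2 = "T2"
    T3 = "T3"


# Hypothetical table data; a real check would read the CSV file from disk.
TERRITORY_TABLE_CSV = "territory,factor\nT1,1.10\nT2,0.95\n"


def missing_enum_rows(enum_cls, csv_text, column):
    """Return enum values that have no matching row in the table column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    present = {row[column] for row in rows}
    return sorted(m.value for m in enum_cls if m.value not in present)


missing = missing_enum_rows(Territory, TERRITORY_TABLE_CSV, "territory")
```

Here `T3` is declared in the enum but absent from the data, so the table is not a total function over its input domain and the check would report it.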
We measured the frontier coding agent's per-coverage accuracy across three task sizes derived from real carrier rate filings. The 43-step rater is deliberately less complex than production tasks; it represents a small niche auto carrier, chosen to isolate structural failure modes in a controlled setting. Larger filings span thousands to tens of thousands of pages across hundreds of PDF documents and Excel files.
At 17 steps (4 coverages, 68 factor applications), the agent implements the rater correctly: 40 of 40 per-coverage values match. At 43 steps (11 coverages, 473 factor applications), it produces errors that are structurally invisible without per-coverage validation. At 128 steps (15 coverages, 1,920 factor applications), two independent runs produce architecturally incompatible implementations: premiums diverge by 2-6× on the same test policies. The failure is in the structural decisions about which factors apply to which coverages.
Three runs of the 43-step task under identical conditions illustrate the run-to-run variance of frontier coding agent performance:
| Run | Mean Error | Tool Calls | Lines | Outcome |
|---|---|---|---|---|
| Run 1 | 45.8% | 131 | 1,271 | Completed (58 functions) |
| Run 2 | 45.6% | 164 | 1,400 | Completed (69 functions) |
| Run 3 | — | 240+ | 0 | Context exhaustion. 0 write operations. |
We evaluated the frontier coding agent against the RSL agent on the 43-step rater under controlled conditions: same base model, same lookup tables, same 6 test policies. The frontier agent writes Python and validates by running the code against expected totals. The RSL agent writes RSL and receives compiler diagnostics before execution. We tested the frontier agent in two configurations: without a specification (the agent receives only the table directory, test cases, and a rate order of calculation CSV, which is closer to the real-world task where detailed specifications beyond the filing documents do not exist) and with a detailed step-by-step specification. The comparison between these two configurations separates instruction-following from code generation: the specification removes the need to infer algorithmic structure from the data, isolating the agent's ability to translate a known algorithm into correct code.
| Configuration | Mean per-policy error | Per-coverage accuracy | Tool calls | Time |
|---|---|---|---|---|
| Frontier agent, no specification | 103.7% | 4 of 11 coverages structurally correct | 237 | 97 min |
| Frontier agent, with specification | 10.0%* | 8 of 11 coverages structurally correct | 160 | 65 min |
| RSL agent, with compiler | 0.0% | 11 of 11 coverages correct (66/66 values) | 36 | ~15 min |
Every error produced by the frontier agent maps to a specific RSL compiler check. Three general failure modes account for all observed errors:
Frontier models have a tendency to use dict.get(key, default), which silently returns a default when the key doesn't match the table's format. This is a particularly dangerous class of error in document understanding and code generation tasks. The program runs without error, but the computation is incorrect. In RSL, table lookup keys are typed against the CSV data at compile time: a format mismatch is a compile error, not a silent $0.
In this task: the frontier agent formats a PD limit lookup key as "25" when the table stores "0 25". The mismatch silently zeroes out the PD coverage line for all 6 test policies. The RSL compiler's DomainCoverageCheck catches this before the rater ever runs.
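The difference between a defaulting lookup and a strict one fits in a few lines of Python. The table keys follow the `"0 15"` style quoted in this post; the factor values and function names are invented for illustration.

```python
# Hypothetical PD limit table; only the key format is taken from the text.
PD_LIMIT_FACTOR = {"0 15": 0.90, "0 25": 1.00}


def lenient_lookup(key: str) -> float:
    """The pattern described above: a mismatched key silently becomes 0.0."""
    return PD_LIMIT_FACTOR.get(key, 0.0)


def strict_lookup(key: str) -> float:
    """Fail loudly, naming the bad key and the valid domain."""
    if key not in PD_LIMIT_FACTOR:
        raise KeyError(f"{key!r} not in {sorted(PD_LIMIT_FACTOR)}")
    return PD_LIMIT_FACTOR[key]


silent_factor = lenient_lookup("25")  # runs without error, zeroes the coverage
```

The strict version is only a runtime guard. The approach described here goes further by checking key formats against the table data before the rater runs at all.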
When enum codes have transparent semantics (territory integers, limit strings), the model maps them correctly. When codes are opaque single-letter identifiers whose meaning cannot be inferred from context, the model hallucinates, and hallucinates differently on each run: two independent runs agree on only 1 of 5 opaque ownership codes. The RSL compiler validates every enum value against the table's domain, so a hallucinated code is a compile error.
In this task: of 17 enum mappings in the 43-step rater, 13 are correct (all transparent codes). The 4 incorrect mappings are ownership duration codes, represented as opaque single-letter values with semantics the model cannot infer from context.
| Mapping | Agent Output | Correct Code |
|---|---|---|
| Ownership: under 1 year | C | E |
| Ownership: 1-2 years | B | F |
| Ownership: 3-4 years | F | B |
| Ownership: 4+ years | E | C |
A function that computes a correction factor but is never called in the main computation chain has no observable effect; it doesn't raise an error, and its absence produces no warning. The agent cannot distinguish "this function exists but doesn't affect the output" from "this function is correctly integrated." In RSL, every step must connect to the terminal output node; disconnected steps are compile errors.
In the 43-step rater (no specification): three functions implementing a 30-percentage-point correction factor are written but never called. The frontier agent spent 73 minutes on 115 Bash debug commands without identifying the disconnection (final error: 103.7%). The RSL compiler's DataFlowCheck produces a single error, "pvp_dev references after_round which is structurally absent for {key, mdd}", which is resolved in one round.
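For a Python rater, a rough analogue of this check can be built on the standard `ast` module. The sketch below flags functions that are defined but never called by name; the sample source and function names are hypothetical. It is much weaker than a data-flow check, since it ignores indirect calls and does not verify reachability from the output.

```python
import ast

# Hypothetical rater source: correction_factor is written but never wired in.
SOURCE = '''
def base_rate(policy):
    return 200.0

def correction_factor(policy):
    return 0.7

def premium(policy):
    return base_rate(policy) * 1.10
'''


def uncalled_functions(source: str, entry_point: str) -> list:
    """List functions that no call expression in the module refers to by name."""
    tree = ast.parse(source)
    defined = {
        node.name for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    called = {
        node.func.id for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # The entry point is expected to be called from outside the module.
    return sorted(defined - called - {entry_point})


dead = uncalled_functions(SOURCE, "premium")
```

Running this on the sample source reports `correction_factor`, the kind of disconnected step that otherwise produces no error and no warning.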
Three independent bugs produce offsetting premium effects: on some inputs the errors cancel to 0.1% total, making the premium appear correct while three coverage lines are individually incorrect. We report per-coverage accuracy rather than aggregate premium error because total-premium accuracy is not a reliable signal for structural correctness in multi-coverage rating.
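A small worked example shows how this masking can happen. The per-coverage dollar amounts below are invented for illustration; only the pattern (three wrong lines, roughly 0.1% total error) mirrors the observation above.

```python
# Hypothetical expected and computed premiums per coverage line, in dollars.
expected = {"BI": 300.0, "PD": 200.0, "COMP": 250.0, "COLL": 250.0}
computed = {"BI": 360.0, "PD": 140.0, "COMP": 251.0, "COLL": 250.0}


def pct_error(actual: float, target: float) -> float:
    """Absolute percentage error of actual relative to target."""
    return abs(actual - target) / target * 100


total_error = pct_error(sum(computed.values()), sum(expected.values()))
per_coverage_error = {c: pct_error(computed[c], expected[c]) for c in expected}
wrong_lines = [c for c in expected if round(computed[c], 2) != round(expected[c], 2)]
```

The totals are 1,001 against 1,000, an error of about 0.1%, while BI is off by 20%, PD by 30%, and COMP by 0.4%. Validating only the total would accept this rater.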
With the RSL compiler, the agent converges in 6 rounds. Each round addresses a single, located, described error. Without the compiler, the agent enters an open-ended diagnostic spiral.
| Round | Compiler Error | Fix Applied | Scope |
|---|---|---|---|
| 1 | Parse error (syntax) | Fixed RSL syntax | 1 line |
| 2 | Parser block error | Removed invalid import | 1 block |
| 3 | Structural absence: pvp chain for {key, mdd} | Fixed chain connectivity | 2 steps |
| 4 | Coverage participation: 17 tables wrong subsets | Corrected coverage annotations | 17 steps |
| 5 | Fold aggregation: sum includes wrong element | Excluded element from fold | 1 expression |
| 6 | 0 errors | Done | — |
The no-specification debug spiral: 237 tool calls, 97 minutes, 115 Bash debug commands, 1 edit applied, 103.7% final error. The agent investigated computation mismatches (using a product instead of an average), symbol lookups, chain ordering, and value clamping across multiple hypotheses. The debug surface is 473 factor applications; each print statement reveals one value at a time. The compiler addresses all structural errors in 6 messages.
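The control flow of a compile-fix loop is simple enough to sketch generically. In the code below, `compile_fn` and `fix_fn` are placeholders for the compiler and the agent; the toy implementations exist only to make the sketch runnable and have nothing to do with RSL.

```python
from __future__ import annotations

from typing import Callable


def compile_fix_loop(source: str,
                     compile_fn: Callable[[str], list[str]],
                     fix_fn: Callable[[str, str], str],
                     max_rounds: int = 10) -> tuple[str, int]:
    """Compile, fix the first reported error, and repeat until clean."""
    for round_no in range(1, max_rounds + 1):
        errors = compile_fn(source)
        if not errors:
            return source, round_no
        source = fix_fn(source, errors[0])
    raise RuntimeError(f"still failing after {max_rounds} rounds")


def toy_compile(source: str) -> list[str]:
    """Toy stand-in for a compiler: every BUG token is one located error."""
    return [f"error at token {i}"
            for i, token in enumerate(source.split()) if token == "BUG"]


def toy_fix(source: str, error: str) -> str:
    """Toy stand-in for the agent: repair the first BUG token."""
    return source.replace("BUG", "ok", 1)


fixed_source, rounds = compile_fix_loop("BUG BUG step", toy_compile, toy_fix)
```

The loop terminates on the first clean compile, so the final round reports zero errors, matching the shape of the six-round table above. The `max_rounds` bound is a design choice of this sketch to avoid an unbounded spiral.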
Encoding more domain knowledge into the language. RSL captures rating algorithm structure across coverage subsets, entity hierarchies, and factor chains. We are extending the language to build richer representations of insurance entities and the operations that compose them. Each extension encodes new domain invariants as types and compiler checks rather than as instructions in a prompt or knowledge in the model's weights. We are also designing systems for automatically learning domain invariants from production rating builds, so the compiler's validation surface grows with the data rather than only through manual rule authoring.
LLM-assisted constraint language design. If RSL works for insurance rating, the meta-question is whether language models can participate in the design of constraint languages for other structured domains. The process of designing RSL required identifying which domain invariants are verifiable (coverage subset membership), which error modes are structural (disconnected chain steps), and what grammar would catch them at compile time. We are studying whether parts of this design process, particularly the identification of verifiable invariants from domain-specific corpora, can be automated or accelerated with LLM assistance.
Transferability beyond insurance. Rating algorithms are one instance of a broader pattern: structured computation where correctness has a formal definition but the implementation is derived from natural-language specifications. Tax computation, financial modeling, regulatory compliance workflows, and clinical trial protocols share this structure. We are researching whether the constrained grammar approach (domain-specific types, compile-time verification, programmatic agent access) transfers to these domains, and what the design constraints are for grammars that must be learnable from a single reference file rather than a training corpus.
Below are the exact task prompts given to the frontier coding agent for each experimental configuration, with carrier names redacted. All prompts are unedited except for the redaction. Each configuration also received a directory of CSV lookup tables and a JSON file of test policies with expected premiums.
Full algorithm specification with step-by-step instructions, entity hierarchy, and table lookup rules. 16 CSV tables and 8 test cases provided. The agent achieves 100% accuracy on this task.
Minimal description. The agent receives the table directory, test cases, and the ROC CSV, but no step-by-step breakdown, no coverage subset rules, no lookup format hints. Mean error: 103.7%.
Detailed step-by-step algorithm description including three-phase structure, coverage subset rules, bracket formulas, HRF aggregation logic, and table lookup format notes (e.g., PD limit "25" → "0 25"). Mean error: 10.0% (true: 15.1%).
The RSL agent receives the same tables and test cases. Instead of a text specification, it writes RSL source code and receives compiler diagnostics. The compiler enforces coverage subsets, chain connectivity, table-data consistency, and dimensional types. Mean error: 0.0%. See Appendix B for a skeleton of the RSL agent's system prompt.
Same carrier, expanded to full coverage set. Two independent runs diverge by 2-6× on the same test policies.
Below is a redacted skeleton of the RSL agent's system prompt, showing its identity, tools, APIs, and validation protocol. The full prompt includes runnable RSL reference files (CI-validated), type system documentation, step implementation patterns, and scenario syntax — approximately 15,000 tokens total. Carrier-specific content is injected at runtime via the orchestrator's dispatch message, not the system prompt.
If you are interested in working with us to make agent-generated code trustworthy enough for regulated industries, we have open roles on our team and we'd love for you to apply.