Towards Self-Verifying Domain Agents

What we found. A frontier coding agent's general-purpose code generation capability is both an asset and a liability for domain-specific tasks. The same fluency that lets it produce syntactically valid programs also lets it produce programs that silently compute incorrect answers without raising exceptions. We observed three things while designing our approach:

1. The compiler encodes domain knowledge outside the weights. Actuarial rules like "a deductible factor applies only to COMP and COLL" and "tables must be total functions over their enum inputs" are not things a frontier model reliably learns during training. The RSL compiler externalizes this knowledge: each compiler check is a domain fact delivered as an error message at exactly the moment it's relevant. §2

2. Code generation ability transfers to constrained grammars. RSL is designed to be close enough to familiar programming patterns that the model writes mostly syntax-valid RSL without any RSL-specific training data. A carefully designed grammar is learnable from a single reference file, not a training corpus. §4

3. Programmatic access to the domain model enables symbolic reasoning over long-context tasks. Early in development, we observed that our agent's performance improved across all use cases when it was given a persistent code execution environment rather than a suite of tool schemas. We restructured the platform so every agent primarily operates a persistent Jupyter kernel where the rating algorithm (its AST, type system, and lookup tables) exists as Python objects the agent manipulates directly. Zhang et al.'s Recursive Language Model (2025) subsequently formalized this principle: give agents a REPL environment, make context available programmatically rather than feeding it directly into the context window, and let the model recursively invoke itself over programmatic slices of arbitrarily long prompts. Our architecture is a natural extension of this idea, going beyond text-based long-context evals to domain-specific code generation for enterprise tasks.

What we're exploring. LLM-assisted design of constraint languages for new domains, and transferability of this pattern beyond insurance rating. More on these directions below. §5

Section 1

Rating algorithm structure

A rating algorithm computes an insurance premium by composing a small number of primitive operations: multiplicative chaining (P = base × f₁ × f₂ × … × fₙ), where each fᵢ is resolved via a table lookup (a keyed function from input attributes to a factor value); additive charges (flat dollar amounts added to the chain); clamping (min/max premium constraints, rate caps against prior-term premiums); conditional branching (the computation structure changes based on input values); aggregation across sub-entities (sum over vehicles, exposures, or locations); self-referential steps, where the premium depends on its own intermediate value (premium discounts, retrospective rating); and parallel indexed dimensions, where the same chain is evaluated independently per coverage line with different factor subsets. Production algorithms compose these primitives in increasingly deep ways, and some products implement variations on these operations, such as continuous-valued interpolation between table lookup values.
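To make the primitives concrete, here is a minimal Python sketch of a toy rater that combines a multiplicative chain, a coverage-conditional factor, an additive charge, aggregation, and clamping. All table names, keys, and values are hypothetical and chosen only for illustration; this is not the carrier's algorithm or RSL's generated code.

```python
def lookup(table: dict, *keys):
    """Strict table lookup: a missing key raises KeyError instead of defaulting."""
    return table[keys]


def clamp(value: float, lo: float, hi: float) -> float:
    """Constrain a premium to the [lo, hi] range."""
    return max(lo, min(hi, value))


# Hypothetical tables keyed by input attributes (illustrative values only).
BASE_RATE = {("bi",): 200.0, ("comp",): 100.0}
TERRITORY = {(2,): 1.10}
DEDUCTIBLE = {(500,): 0.95}
POLICY_FEE = 25.0  # additive flat charge


def rate_coverage(coverage: str, territory: int, deductible: int) -> float:
    premium = lookup(BASE_RATE, coverage)            # base
    premium *= lookup(TERRITORY, territory)          # f1 applies to every coverage
    if coverage in {"comp", "coll"}:                 # f2 applies to a coverage subset
        premium *= lookup(DEDUCTIBLE, deductible)
    return round(premium, 2)


def rate_vehicle(territory: int, deductible: int,
                 min_premium: float = 50.0, max_premium: float = 10_000.0):
    # Parallel indexed dimension: the same chain is evaluated per coverage line.
    per_coverage = {c: rate_coverage(c, territory, deductible) for c in ("bi", "comp")}
    total = sum(per_coverage.values()) + POLICY_FEE  # aggregation plus additive charge
    return per_coverage, clamp(total, min_premium, max_premium)


per_coverage, total = rate_vehicle(territory=2, deductible=500)
print(per_coverage, total)  # {'bi': 220.0, 'comp': 104.5} 349.5
```

Even in this toy, the only thing tying the deductible factor to its coverage subset is an inline `if`; nothing checks that the subset is the right one.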

Figure 1. Coverage participation matrix for a simplified personal auto rater. Each column is a rating step; each row is a coverage line. The irregular sparsity is the source of structural errors: an agent must correctly assign each step to its coverage subset.

Section 2

RSL: a compiled language for rating

RSL (Rating Specification Language) is a typed, compiled language purpose-built for insurance premium rating. An agent writes RSL instead of Python. Both compute the same rating algorithm. The difference is what happens when the implementation is wrong.

RSL
// Coverage subset is typed and checked
step deductible_factor[c: {comp, coll}]:
    Rate = lookup(
        Table_09_Deductible,
        deductible[c], territory
    )
    [cite: rate_manual:p42]

Python
# Coverage subset is implicit
def get_deductible_factor(cov, ded, terr):
    return deductible_table.get(
        (ded, terr), 1.0
    )
# Incorrect coverage → returns 1.0 silently
# Wrong key format → returns 1.0 silently

The RSL compiler validates domain constraints at compile time. The subset below is most relevant to the error modes observed in our experiments:

✓ Coverage subsets: every factor declares which coverages it applies to. [c: {comp, coll}] applied to BI → compile error.
✓ Table-data consistency: lookup keys are validated against CSV data. "25" not in {"0 15","0 25",...} → error.
✓ Chain connectivity: every step must connect to the terminal output. Dead factors are flagged.
✓ Dimensional types: dollar-valued outputs must trace to dollar-valued inputs.
✓ Source citations: every step cites a page in the rate manual. Missing citation → compile error.

RSL compiles to three output formats from a single source: Python (2,846 lines for the 43-step rater), Excel (14 MB actuarial review workbook with per-step worksheets), and JSON (API integration). This eliminates the reimplementation boundary between actuarial review and engineering deployment. Because the compiler controls code generation, it can apply deterministic optimizations (vectorized table lookups, precomputed index structures) that an LLM generating Python from scratch would not reliably produce. There is no transcription step where errors can enter.

RSL Source → rsl compile → Python (2,846 lines), Excel (14 MB), JSON (API-ready)

RSLLib: structural introspection as a Python API

RSLLib exposes the compiler's structural analysis as a query and mutation API. In a Python rater, determining which functions affect the premium requires tracing through arbitrary imperative code: function calls, dictionary lookups, and conditional branches. In RSL, it is a function call on the AST:

# Which coverages does this step touch?
step_participation(tree, "deductible_factor")  # → {"coverage": {"comp", "coll"}}

# Which steps affect BI? (trace the full chain)
steps_for_element(tree, "coverage", "bi")    # → ["base_rate", "tier_factor", "limit_factor", ...]

# What's still unimplemented?
list_nl_placeholders(source)               # → [Vehicle.territory_factor, Vehicle.veh_age_factor]

# Surgical AST mutation — replace a stub with a resolved expression
source = replace_nl(source, step="territory_factor", entity="Vehicle",
                    field="terr", new_expr="lookup!(TerritoryTable, self.zip).factor[c]")

The API includes mutation functions that preserve comments, whitespace, and annotations through round-trip edits. Agents reason about rating structure at the semantic level (coverage chains, participation sets, stub inventories, etc.) rather than manipulating text.

Scenario testing: structural validation before data extraction

RSL scenarios enable us to override natural-language stubs with known scalar values, hand-compute expected outputs, and assert results before any table data is extracted from the filing. This allows faster iteration to catch issues like operator errors (× vs. +), incorrect chain ordering, and coverage subset gaps at the skeleton stage.

Scenario = {
    override Vehicle = {
        base_rate[c: {BI}]: Rate = 200.0,  base_rate[c: {PD}]: Rate = 150.0,
        base_rate[c: {COMP}]: Rate = 100.0, base_rate[c: {COLL}]: Rate = 80.0,
        territory_factor[c]: Factor = 1.10,
        deductible_factor[c: {COMP, COLL}]: Factor = 0.95
    }
    instances = { Vehicle v1 = { territory: Num = 2 } }
    // Hand-computed: BI=200×1.10=220, PD=150×1.10=165
    // COMP=100×1.10×0.95=104.50, COLL=80×1.10×0.95=83.60
    assert = { v1.BI == 220.0, v1.PD == 165.0, v1.COMP == 104.50, v1.COLL == 83.60 }
}

If the agent mistakenly annotates deductible_factor as [c: {COMP, COLL, BI}], the BI assertion fails immediately: 200 × 1.10 × 0.95 = 209 ≠ 220. The agent gets a concrete failed assertion on a known value, not a mysterious premium discrepancy on real data with 43 interacting steps.
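The arithmetic of this scenario can be reproduced in a few lines of plain Python. The sketch below is not RSL and does not use the RSL toolchain; it only replays the overridden scalar values from the scenario to show how a wrong coverage subset surfaces as a single failed assertion. The function names `rate` and `failed_assertions` are hypothetical.

```python
import math

# Scalar overrides and hand-computed expectations copied from the scenario above.
BASE = {"BI": 200.0, "PD": 150.0, "COMP": 100.0, "COLL": 80.0}
TERRITORY_FACTOR = 1.10
DEDUCTIBLE_FACTOR = 0.95
EXPECTED = {"BI": 220.0, "PD": 165.0, "COMP": 104.50, "COLL": 83.60}


def rate(deductible_subset: set) -> dict:
    """Rate each coverage; the deductible factor applies only to the given subset."""
    out = {}
    for cov, base in BASE.items():
        premium = base * TERRITORY_FACTOR
        if cov in deductible_subset:
            premium *= DEDUCTIBLE_FACTOR
        out[cov] = round(premium, 2)
    return out


def failed_assertions(actual: dict) -> dict:
    """Map each failing coverage to its (actual, expected) pair."""
    return {
        cov: (actual[cov], EXPECTED[cov])
        for cov in EXPECTED
        if not math.isclose(actual[cov], EXPECTED[cov], abs_tol=0.005)
    }


print(failed_assertions(rate({"COMP", "COLL"})))        # {}
print(failed_assertions(rate({"COMP", "COLL", "BI"})))  # {'BI': (209.0, 220.0)}
```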

The compiler as knowledge repository

The RSL compiler's validation layer is a pluggable system of domain checks. Each check is a Python class that declares which AST node types it inspects. A runner walks the parsed program, dispatches to registered checks per node, and calls a finalization pass after the full traversal. New checks are added by subclassing Check and appending to the check list, with no changes to the compiler core.

class Check:
    """Base class for all validation checks."""
    fires_on: ClassVar[tuple[type, ...]] = ()

    def check(self, node, ctx: CheckContext) -> None:
        """Called per AST node matching fires_on."""

    def finalize(self, ctx: CheckContext) -> None:
        """Called once after full AST walk — cross-cutting analysis."""

# Adding a new domain rule = adding a class
def default_checks() -> list[Check]:
    return [
        StepStructureCheck(),       # steps have assignments
        DimensionCheck(),           # Premium requires Rate anchor
        ParticipationFlowCheck(),   # coverage subset consistency
        DomainCoverageCheck(),      # tables are total functions
        DataFlowCheck(),            # all paths reach terminal output
        CrossTableEnumCoverageCheck(),
        ScenarioInputCheck(),
        ... # 28 checks total
    ]
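The runner itself is not shown here, so the following is a minimal sketch of how such a dispatch loop could look, assuming a tree of nodes with a `children` attribute. The node classes `Step` and `Program`, the `CitationCheck` rule, and the `run_checks` function are hypothetical stand-ins, not the RSL compiler's actual types.

```python
from dataclasses import dataclass, field
from typing import ClassVar, Optional


@dataclass
class Step:  # hypothetical AST node
    name: str
    citation: Optional[str] = None
    children: list = field(default_factory=list)


@dataclass
class Program:  # hypothetical AST root
    children: list = field(default_factory=list)


@dataclass
class CheckContext:
    errors: list = field(default_factory=list)

    def emit(self, message: str) -> None:
        self.errors.append(message)


class Check:
    fires_on: ClassVar[tuple] = ()

    def check(self, node, ctx: CheckContext) -> None: ...
    def finalize(self, ctx: CheckContext) -> None: ...


class CitationCheck(Check):
    """Toy rule: every step must carry a citation."""
    fires_on = (Step,)

    def check(self, node, ctx):
        if node.citation is None:
            ctx.emit(f"step '{node.name}' is missing a citation")


def walk(node):
    """Depth-first traversal over a node and its descendants."""
    yield node
    for child in getattr(node, "children", []):
        yield from walk(child)


def run_checks(root, checks) -> list:
    ctx = CheckContext()
    for node in walk(root):
        for chk in checks:
            if isinstance(node, chk.fires_on):  # dispatch by declared node types
                chk.check(node, ctx)
    for chk in checks:
        chk.finalize(ctx)  # cross-cutting pass after the full walk
    return ctx.errors


program = Program([Step("base_rate", "rate_manual:p12"), Step("deductible_factor")])
print(run_checks(program, [CitationCheck()]))
# ["step 'deductible_factor' is missing a citation"]
```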

The extensibility mechanism is a key architectural property. Adding a new domain rule means adding a class. The current check suite spans six categories, from pure structural analysis to insurance-specific domain heuristics:

Each check corresponds to a failure mode observed in production. The suite grows with every new carrier build.
Category	Representative rule
Structural integrity	Every assignment connects to a terminal output (`DataFlowCheck`)
Dimensional types	Premium-typed output requires a Rate-valued anchor; multiplying only Factors produces a ratio, not dollars (`DimensionCheck`)
Coverage participation	Indexed assignments only reference attributes covering their element subset (`ParticipationFlowCheck`)
Table domain coverage	Enum-typed lookup columns cover the full domain: tables must be total functions over their inputs (`DomainCoverageCheck`)
Data quality heuristics	String-typed table columns that match a declared enum are flagged for reannotation (`StrColumnEnumSuggestionCheck`)
Scenario validation	Test instances provide all required inputs with correct types and values within table range bounds (`ScenarioInputCheck`)

The majority of checks go beyond what a general-purpose type system can express. For example, DomainCoverageCheck loads the actual CSV lookup tables at compile time and verifies that every enum member in the type declaration appears as a row in the data, a table-completeness property. CrossTableEnumCoverageCheck compares enum coverage across sibling tables: if the territory factor table has 12 territory codes but the base rate table has only 11, the discrepancy is flagged. These are actuarial completeness properties verified against real data, not type errors in the traditional sense.

# Real check: tables must be total functions over their enum inputs
class DomainCoverageCheck(Check):
    """Verify enum-typed table columns cover the full enum domain."""

    def finalize(self, ctx):
        for table in ctx.program.tables.values():
            # Resolve each enum-typed input column
            enum_keys = resolve_enum_columns(table, ctx.symtab)

            # Load the CSV, extract actual values
            csv_values = load_csv_domain(table, ctx.tables_dir)

            # Single-key: direct set difference
            # Multi-key: cross-product coverage with wildcard handling
            missing = check_coverage(enum_keys, csv_values)

            if missing:
                ctx.emit(
                    f"Table '{table.name}' missing rows for: {missing}. "
                    f"Table must be a total function over its enum inputs."
                )
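The `check_coverage` helper is not shown above. One plausible shape for it, sketched under assumptions, is a cross-product scan: enumerate every combination of enum members and report those that no CSV row covers. The signature, the row representation (a list of dicts), and the `"*"` wildcard convention below are all assumptions made for illustration.

```python
from itertools import product

WILDCARD = "*"  # assumed convention: a row value of "*" matches any enum member


def check_coverage(enum_keys: dict, csv_rows: list) -> list:
    """Return the enum key combinations that no CSV row covers.

    enum_keys maps column name -> set of enum members.
    csv_rows is a list of dicts mapping column name -> cell value.
    """
    columns = sorted(enum_keys)
    missing = []
    for combo in product(*(sorted(enum_keys[c]) for c in columns)):
        covered = any(
            all(row[c] in (value, WILDCARD) for c, value in zip(columns, combo))
            for row in csv_rows
        )
        if not covered:
            missing.append(combo)
    return missing


enum_keys = {"coverage": {"comp", "coll"}, "ownership": {"B", "C"}}
rows = [
    {"coverage": "comp", "ownership": "*"},  # wildcard covers both ownership codes
    {"coverage": "coll", "ownership": "B"},
]
print(check_coverage(enum_keys, rows))  # [('coll', 'C')]
```

With a single enum column, the same function reduces to a set difference between the declared enum members and the values present in the data.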

Section 3

Experiment 1: accuracy vs. task complexity

We measured the frontier coding agent's per-coverage accuracy across three task sizes derived from real carrier rate filings. The 43-step rater is deliberately less complex than production tasks; it represents a small niche auto carrier, chosen to isolate structural failure modes in a controlled setting. Larger filings span thousands to tens of thousands of pages across hundreds of PDF documents and Excel files.

Per-coverage accuracy vs. task complexity (steps × coverages)

At 17 steps (4 coverages, 68 factor applications), the agent implements the rater correctly: 40 of 40 per-coverage values match. At 43 steps (11 coverages, 473 factor applications), it produces errors that are structurally invisible without per-coverage validation. At 128 steps (15 coverages, 1,920 factor applications), two independent runs produce architecturally incompatible implementations: premiums diverge by 2-6× on the same test policies. The failure is in the structural decisions about which factors apply to which coverages.

Three runs of the same 43-step task illustrate the stochasticity of the frontier coding agent's performance:

Run	Mean Error	Tool Calls	Lines	Outcome
Run 1	45.8%	131	1,271	Completed (58 functions)
Run 2	45.6%	164	1,400	Completed (69 functions)
Run 3	—	240+	0	Context exhaustion. 0 write operations.

Section 3b

Experiment 2: frontier agent vs. RSL agent

We evaluated the frontier coding agent against the RSL agent on the 43-step rater under controlled conditions: same base model, same lookup tables, same 6 test policies. The frontier agent writes Python and validates by running the code against expected totals. The RSL agent writes RSL and receives compiler diagnostics before execution. We tested the frontier agent in two configurations: without a specification (the agent receives only the table directory, test cases, and a rate order of calculation CSV, which is closer to the real-world task where detailed specifications beyond the filing documents do not exist) and with a detailed step-by-step specification. The comparison between these two configurations separates instruction-following from code generation: the specification removes the need to infer algorithmic structure from the data, isolating the agent's ability to translate a known algorithm into correct code.

All three configurations use the same base LLM. *The 10.0% reported error masks 15.1% true error: three bugs cancel across coverage lines (see error decomposition below).
Configuration	Mean per-policy error	Per-coverage accuracy	Tool calls	Time
Frontier agent, no specification	103.7%	4 of 11 coverages structurally correct	237	97 min
Frontier agent, with specification	10.0%*	8 of 11 coverages structurally correct	160	65 min
RSL agent, with compiler	0.0%	11 of 11 coverages correct (66/66 values)	36	~15 min

Every error produced by the frontier agent maps to a specific RSL compiler check. Three general failure modes account for all observed errors:

Silent default returns: lookup mismatches produce incorrect values instead of errors

Frontier models have a tendency to use dict.get(key, default), which silently returns a default when the key doesn't match the table's format. This is a particularly dangerous class of error in document understanding and code generation tasks. The program runs without error, but the computation is incorrect. In RSL, table lookup keys are typed against the CSV data at compile time: a format mismatch is a compile error, not a silent $0.

In this task: the frontier agent formats a PD limit lookup key as "25" when the table stores "0 25". The mismatch silently zeroes out the PD coverage line for all 6 test policies. The RSL compiler's DomainCoverageCheck catches this before the rater ever runs.
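The difference between the two behaviors fits in a few lines of Python. In this sketch the table contents and factor values are made up, and the `0.0` default is chosen to mirror the zeroed-out coverage line described above; only the key-format mismatch itself comes from the experiment.

```python
# Hypothetical PD limit factor table; keys use the table's own padded format.
PD_LIMIT_FACTORS = {"0 15": 0.90, "0 25": 1.00, "0 50": 1.12}


def lenient_lookup(limit: str) -> float:
    """Silent-default pattern: an unmatched key quietly yields the default."""
    return PD_LIMIT_FACTORS.get(limit, 0.0)


def strict_lookup(limit: str) -> float:
    """Fail loudly when the key is outside the table's domain."""
    if limit not in PD_LIMIT_FACTORS:
        raise KeyError(
            f"PD limit key {limit!r} not in table domain {sorted(PD_LIMIT_FACTORS)}"
        )
    return PD_LIMIT_FACTORS[limit]


print(lenient_lookup("25"))   # 0.0, and the program keeps running
print(strict_lookup("0 25"))  # 1.0
# strict_lookup("25") raises KeyError naming the offending key and the valid domain
```

A strict lookup only moves the failure to run time. The compile-time check goes further by rejecting the mismatched key before any policy is rated.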

Per-coverage error heatmap (with specification): 4 active coverages × 6 policies. Legend: 0% error; 5-20%; 20-50%; 100% (= $0).

Stochastic hallucinations: opaque enum codes are invented differently on every run

When enum codes have transparent semantics (territory integers, limit strings), the model maps them correctly. When codes are opaque single-letter identifiers whose meaning cannot be inferred from context, the model hallucinates, and hallucinates differently on each run. Below, we illustrate results from two independent runs which agree on only 1 of 5 opaque ownership codes. The RSL compiler validates every enum value against the table's domain: a hallucinated code is a compile error.

In this task: of 17 enum mappings in the 43-step rater, 13 are correct (all transparent codes). The 4 incorrect mappings are ownership duration codes, represented as opaque single-letter values with semantics the model cannot infer from context.

The ownership mapping is a mirror reversal: the agent maps E↔C and F↔B symmetrically. `DomainCoverageCheck` catches all four.
Mapping	Agent Output	Correct Code
Ownership: under 1 year	C	E
Ownership: 1-2 years	B	F
Ownership: 3-4 years	F	B
Ownership: 4+ years	E	C

Dead code in computation graphs: implemented functions never connected to output

A function that computes a correction factor but is never called in the main computation chain has no observable effect; it doesn't raise an error, and its absence produces no warning. The agent cannot distinguish "this function exists but doesn't affect the output" from "this function is correctly integrated." In RSL, every step must connect to the terminal output node; disconnected steps are compile errors.

In the 43-step rater (no specification): three functions implementing a 30-percentage-point correction factor are written but never called. The frontier agent spent 73 minutes on 115 Bash debug commands without identifying the disconnection (final error: 103.7%). The RSL compiler's DataFlowCheck produces a single error, "pvp_dev references after_round which is structurally absent for {key, mdd}", which is resolved in one round.
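Connectivity to the terminal output is a graph reachability question. The sketch below assumes the rater has already been reduced to a dependency map from each step to the steps it reads; the map, the step names, and the `dead_steps` function are hypothetical and do not reflect how DataFlowCheck is actually implemented.

```python
def dead_steps(depends_on: dict, terminal: str) -> set:
    """Return steps that never feed the terminal output.

    depends_on maps each step name to the set of steps it reads from.
    """
    live, stack = set(), [terminal]
    while stack:
        step = stack.pop()
        if step in live:
            continue
        live.add(step)
        stack.extend(depends_on.get(step, ()))  # walk backwards from the output
    return set(depends_on) - live


graph = {
    "premium": {"after_round"},
    "after_round": {"base_rate", "tier_factor"},
    "base_rate": set(),
    "tier_factor": set(),
    "correction_factor": {"base_rate"},  # computed, but nothing reads it
}
print(dead_steps(graph, "premium"))  # {'correction_factor'}
```

In hand-written Python this dependency map does not exist as data, which is why the disconnection has to be found by reading code or printing values.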

Error decomposition: Policy 5029 ($2,174.48 expected)

Three independent bugs produce offsetting premium effects: on some inputs the errors cancel to 0.1% total, making the premium appear correct while three coverage lines are individually incorrect. We report per-coverage accuracy rather than aggregate premium error because total-premium accuracy is not a reliable signal for structural correctness in multi-coverage rating.
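The cancellation effect is easy to reproduce with made-up numbers. The premiums below are hypothetical and are not the Policy 5029 values; they are chosen so that three coverage lines are wrong while the total is off by only 0.1%.

```python
# Hypothetical per-coverage premiums (not the values from the experiment).
expected = {"BI": 400.0, "PD": 300.0, "COMP": 200.0, "COLL": 100.0}
actual = {"BI": 460.0, "PD": 240.0, "COMP": 201.0, "COLL": 100.0}


def total_error(expected: dict, actual: dict) -> float:
    """Relative error of the summed premium."""
    exp_total = sum(expected.values())
    return abs(sum(actual.values()) - exp_total) / exp_total


def per_coverage_errors(expected: dict, actual: dict) -> dict:
    """Relative error of each coverage line taken on its own."""
    return {c: abs(actual[c] - expected[c]) / expected[c] for c in expected}


errors = per_coverage_errors(expected, actual)
wrong = [c for c, e in errors.items() if e > 0]

print(f"total error: {total_error(expected, actual):.1%}")  # total error: 0.1%
print(f"incorrect coverages: {wrong}")  # incorrect coverages: ['BI', 'PD', 'COMP']
```

A validation loop that only compares totals would accept this output, which is why the per-coverage breakdown is the more reliable signal.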

Section 4

The compile-fix loop vs. the debug spiral

With the RSL compiler, the agent converges in 6 rounds. Each round addresses a single, located, described error. Without the compiler, the agent enters an open-ended diagnostic spiral.

Timeline comparison: compile-fix (top) vs. debug spiral (bottom)

Each error is located (step name, line), described (what violated what), and actionable (change X to Y). Total: 36 tool calls, ~15 minutes.
Round	Compiler Error	Fix Applied	Scope
1	Parse error (syntax)	Fixed RSL syntax	1 line
2	Parser block error	Removed invalid import	1 block
3	Structural absence: pvp chain for {key, mdd}	Fixed chain connectivity	2 steps
4	Coverage participation: 17 tables wrong subsets	Corrected coverage annotations	17 steps
5	Fold aggregation: sum includes wrong element	Excluded element from fold	1 expression
6	0 errors	Done	—

The no-specification debug spiral: 237 tool calls, 97 minutes, 115 Bash debug commands, 1 edit applied, 103.7% final error. The agent investigated computation mismatches (using a product instead of an average), symbol lookups, chain ordering, and value clamping across multiple hypotheses. The debug surface is 473 factor applications; each print statement reveals one value at a time. The compiler addresses all structural errors in 6 messages.
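The control flow of the compile-fix loop can be sketched generically. In the sketch below, `compile_fn` and `fix_fn` are hypothetical callables standing in for the compiler and the agent's edit step, and the toy "source" is just a set of outstanding defects; none of this is the actual agent harness.

```python
def compile_fix_loop(source, compile_fn, fix_fn, max_rounds: int = 10):
    """Compile, fix the first reported error, and repeat until no errors remain."""
    history = []
    for round_no in range(1, max_rounds + 1):
        errors = compile_fn(source)
        history.append((round_no, errors[0] if errors else "0 errors"))
        if not errors:
            return source, history
        source = fix_fn(source, errors[0])  # address one located error per round
    raise RuntimeError(f"did not converge in {max_rounds} rounds")


# Toy stand-ins: the "compiler" lists outstanding defects, the "fix" removes one.
def toy_compile(src: dict) -> list:
    return sorted(src["defects"])


def toy_fix(src: dict, error: str) -> dict:
    return {"defects": src["defects"] - {error}}


start = {"defects": {"coverage subset mismatch", "disconnected step"}}
final, history = compile_fix_loop(start, toy_compile, toy_fix)
print(history)
# [(1, 'coverage subset mismatch'), (2, 'disconnected step'), (3, '0 errors')]
```

The loop terminates because each round removes a specific, named error. The debug spiral has no equivalent stopping condition beyond the total premium happening to match.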

Section 5

Where we're going

Encoding more domain knowledge into the language. RSL captures rating algorithm structure across coverage subsets, entity hierarchies, and factor chains. We are extending the language to build richer representations of insurance entities and the operations that compose them. Each extension encodes new domain invariants as types and compiler checks rather than as instructions in a prompt or knowledge in the model's weights. We are also designing systems for automatically learning domain invariants from production rating builds, so the compiler's validation surface grows with the data rather than only through manual rule authoring.

LLM-assisted constraint language design. If RSL works for insurance rating, the meta-question is whether language models can participate in the design of constraint languages for other structured domains. The process of designing RSL required identifying which domain invariants are verifiable (coverage subset membership), which error modes are structural (disconnected chain steps), and what grammar would catch them at compile time. We are studying whether parts of this design process, particularly the identification of verifiable invariants from domain-specific corpora, can be automated or accelerated with LLM assistance.

Transferability beyond insurance. Rating algorithms are one instance of a broader pattern: structured computation where correctness has a formal definition but the implementation is derived from natural-language specifications. Tax computation, financial modeling, regulatory compliance workflows, and clinical trial protocols share this structure. We are researching whether the constrained grammar approach (domain-specific types, compile-time verification, programmatic agent access) transfers to these domains, and what the design constraints are for grammars that must be learnable from a single reference file rather than a training corpus.

Appendix

A. Task prompts

Below are the exact task prompts given to the frontier coding agent for each experimental configuration, with carrier names redacted. All prompts are unedited except for the redaction. Each configuration also received a directory of CSV lookup tables and a JSON file of test policies with expected premiums.

17-step rater (4 coverages) — 205 lines

Full algorithm specification with step-by-step instructions, entity hierarchy, and table lookup rules. 16 CSV tables and 8 test cases provided. The agent achieves 100% accuracy on this task.

# [Carrier] PPA Rating Engine

## Task

Build a Python rating engine (`rating.py`) for a personal auto insurance product.
The engine rates 4 coverages per vehicle (BI, PD, COMP, COLL) plus an Uninsured Motorist (UM) 
premium at the policy level.

Your `rating.py` must:
1. Read the 16 CSV lookup tables from the `tables/` directory
2. Accept a JSON test case (schema described below) 
3. Return per-coverage developed premiums for each vehicle, plus UM premium and policy total
4. When run as `python rating.py cases.json`, output results for all test cases

## Rating Algorithm

The rating follows a multiplicative chain across three entity levels: Driver → Vehicle → Policy.

### Entity Hierarchy
- **Policy**: top-level; has attributes like tier, homeowner status, etc.
- **Driver**: belongs to a policy; multiple drivers possible per policy
- **Vehicle**: belongs to a policy; multiple vehicles possible per policy

### Coverages
The four **indexed coverages** are: `BI`, `PD`, `COMP`, `COLL`. Each is computed separately 
per vehicle. Factors looked up from tables often have separate columns for each coverage.

**UM (Uninsured Motorist)** is computed once at the policy level — it is NOT a per-vehicle 
coverage.

### Driver Rating (Steps 1–4)

**Step 1a — Years Licensed Factor (YLF):**
Look up from `SA_D02_A_Years_Licensed_Factor.csv` using driver's age and years_licensed.
Returns a factor for each of the 4 coverages: BI, PD, COMP, COLL.

**Step 1b — Driving Record Points Factor (DRP):**
Look up from `SA_D02_B_Driving_Record_Points_Factor.csv` using driver's driving_record_points.
**IMPORTANT: This factor applies to BI, PD, and COLL only — NOT to COMP.** The DRP table 
has columns for BI, PD, COLL only (no COMP column).

**Step 1 — Driver Classification Factor (DCF):**
This is a bracket computation combining YLF and DRP:
- For **BI, PD, COLL**: `DCF = YLF + DRP - 1.0`
- For **COMP**: `DCF = YLF` (driving record points do NOT apply to COMP)

**Step 2 — Youthful Driver Discount:**
Look up from `SA_D03_Youthful_Driver_Discount.csv` using driver's age and clean_driver status.
Returns a discount percentage for each coverage.
`after_youth = DCF × (1 - discount)`

**Steps 3–4 — Household Member Factor → Developed Driver Factor:**
Look up from `SA_D04_Household_Member_Factor.csv` using the *policy's* vehicle_count, 
the *policy's* hh_member_count, and the *driver's* age.

**Note:** The table has columns `age_bracket_min` and `age_bracket_max` (not `age_min`/`age_max`).
These define the driver age bracket for the lookup.

`dev_driver = after_youth × hm_factor` (per coverage)

### Vehicle Rating (Steps 5–12)

**Step 5 — Household Risk Factor (HRF):**
This aggregates driver factors at the vehicle level. The HRF is computed as:
1. Rank ALL drivers on the policy by their `dev_driver[BI]` value (descending)
2. Select the top N drivers, where N = policy's `vehicle_count`
3. For EACH coverage, HRF = average of the selected drivers' `dev_driver[coverage]`

Note: The same top-N selection (based on BI ranking) applies to ALL coverages. You don't 
re-rank per coverage — you select by BI, then average across each coverage independently.

**Step 6 — Base Rate:**
Look up from `SA_V02_Base_Rate.csv` using vehicle's territory.
`after_base = HRF × base_rate` (per coverage)

**Step 7 — Tier Factor:**
Look up from `SA_V03_Tier_Factor.csv` using policy's tier.
`after_tier = after_base × tier_factor` (per coverage)

**Step 8a — Limit Factor (BI and PD only):**
- BI: look up from `SA_V04_A_BI_Limit_Factor.csv` using vehicle's limit_bi
- PD: look up from `SA_V04_A_PD_Limit_Factor.csv` using vehicle's limit_pd
`after_limit[BI] = after_tier[BI] × bi_limit_factor`
`after_limit[PD] = after_tier[PD] × pd_limit_factor`

**Step 8b — Deductible Factor (COMP and COLL only):**
- COMP: look up from `SA_V04_B_COMP_Deductible_Factor.csv` using vehicle's deductible_comp
- COLL: look up from `SA_V04_B_COLL_Deductible_Factor.csv` using vehicle's deductible_coll
`after_ded[COMP] = after_tier[COMP] × comp_ded_factor`
`after_ded[COLL] = after_tier[COLL] × coll_ded_factor`

**Step 9 — Garaging Territory Factor (MERGE POINT):**
Look up from `SA_V05_Garaging_Factor.csv` using vehicle's territory.
This is where BI/PD (from limit step) and COMP/COLL (from deductible step) merge:
- `after_terr[BI] = after_limit[BI] × garaging[BI]`
- `after_terr[PD] = after_limit[PD] × garaging[PD]`
- `after_terr[COMP] = after_ded[COMP] × garaging[COMP]`
- `after_terr[COLL] = after_ded[COLL] × garaging[COLL]`

**Step 10a — Homeowner/Multi-Car Discount:**
Look up from `SA_V06_A_Homeowner_Discount.csv` using policy's homeowner and multi_car.
`after_ho = after_terr × (1 - discount)` (per coverage)

**Step 10b — Safe Driver Discount:**
Look up from `SA_V06_B_Safe_Driver_Discount.csv` using policy's three_year_clean.
`after_sd = after_ho × (1 - discount)` (per coverage)

**Step 11 — Developed Premium:**
`dev_prem = round(after_sd, 2)` (per coverage, round to 2 decimal places)

**Step 12 — Vehicle Total Premium:**
`vehicle_total = sum of dev_prem across all 4 coverages`

### Policy Rating — UM (Steps 13–17)

**Step 13 — UM Base Rate:**
For EACH vehicle on the policy, look up the UM base rate from `SA_V02_UM_Base_Rate.csv` 
using that vehicle's territory. Then average across all vehicles:
`um_base = average(UM base rates for all vehicles)`

**Step 14 — UM Driver Count Factor:**
Look up from `SA_P03_UM_Driver_Count_Factor.csv` using policy's driver_count.
`after_um_drv = um_base × um_driver_count_factor`

**Step 15 — UM Average Garaging Factor:**
For EACH vehicle, look up from `SA_P04_UM_Garaging_Factor.csv` using that vehicle's territory.
Average the factors across all vehicles:
`um_avg_gar = average(UM garaging factors for all vehicles)`
`after_um_gar = after_um_drv × um_avg_gar`

**Step 16 — UM Developed Premium:**
`um_dev_prem = round(after_um_gar, 2)`

**Step 17 — Total Policy Premium:**
`policy_total = sum(all vehicle totals) + um_dev_prem`

## Table Lookup Rules

All table lookups use **first-match** semantics:
- For range columns (e.g., `age_min`/`age_max`), the input must fall within the range (inclusive)
- For exact columns (e.g., `territory`, `tier`), the input must match exactly
- Scan rows top to bottom; return the first matching row

## Input Format (cases.json)

```json
[
  {
    "scenario": 1,
    "policy": {
      "tier": "standard",
      "homeowner": "N",
      "multi_car": "N",
      "three_year_clean": "Y",
      "vehicle_count": 1,
      "hh_member_count": 1,
      "driver_count": 1
    },
    "drivers": [
      {"age": 35, "years_licensed": 17, "driving_record_points": 0, "clean_driver": "Y"}
    ],
    "vehicles": [
      {"territory": 4, "limit_bi": "50/100", "limit_pd": "50", "deductible_comp": 250, "deductible_coll": 500}
    ],
    "expected_totals": {
      "vehicle_totals": [835.11],
      "um_premium": 62.00,
      "policy_total": 897.11
    }
  }
]
```

## Required Output Format

For each test case, output a JSON object:

```json
{
  "scenario": 1,
  "vehicles": [
    {"bi": 247.94, "pd": 166.80, "comp": 136.37, "coll": 284.00, "vehicle_total": 835.11}
  ],
  "um_premium": 62.00,
  "policy_total": 897.11,
  "matches_expected": true
}
```

When run as `python rating.py cases.json`, print the JSON results array to stdout.

## Test Cases

The file `cases.json` contains 8 test cases. The `expected_totals` field provides the 
correct total premium per vehicle, UM premium, and policy total. Use these to validate 
your implementation.

Your output must include per-coverage breakdowns (`bi`, `pd`, `comp`, `coll`) for each 
vehicle — these are NOT provided in the test cases but are required in your output.

## Files Available

- `tables/` — 16 CSV lookup tables
- `cases.json` — 8 test cases with expected totals
- This file (`TASK.md`) — algorithm specification

43-step rater, no specification — 56 lines

Minimal description. The agent receives the table directory, test cases, and the ROC CSV, but no step-by-step breakdown, no coverage subset rules, no lookup format hints. Mean error: 103.7%.

# Task: Implement [Carrier] PPA Rating Algorithm

Implement a Python rating algorithm for [Carrier] PPA (Colorado Private Passenger Auto).

## Requirements

Write a file called `rating.py` that contains a function:

```python
def rate_policy(policy: dict, drivers: list, vehicles: list) -> dict:
```

The function must return a dict with:
- `total_policy_premium`: float — the total premium for the policy
- `per_coverage`: dict mapping coverage name to its premium amount

The 11 coverages are: BI, PD, COMP, COLL, LOAN, MED, UMUIM, UMPD, RENT, TOW, ACPE

## Input Data

- **Rating tables** are CSV files in: `./tables/`
- **Test cases** with expected total premiums are in: `./cases.json`
- **Rate Order of Calculation** is at: `./tables/ROC_Rate_Order_of_Calculation.csv`

## Validation

After implementing, run your rating function against ALL 6 test cases in cases.json.
Print each policy name, expected premium, actual premium, and percentage error.
Also print the per-coverage breakdown for each policy.

Save the per-coverage results to a file called `coverage_results.json` with this structure:
```json
{
  "policies": [
    {
      "name": "Policy XXXX",
      "expected_total": 1234.56,
      "actual_total": 1234.56,
      "per_coverage": {
        "BI": 123.45,
        "PD": 67.89,
        ...
      }
    }
  ]
}
```

## Important Notes

- Read ALL CSV tables carefully — column names and value formats matter
- The ROC (Rate Order of Calculation) CSV defines the step-by-step rating procedure
- Each coverage has its own base rate and factor chain
- Pay attention to rounding rules (typically round to 2 decimal places)
- Some coverages may not apply to all vehicles (check coverage limits/deductibles in the test cases)

43-step rater, with specification — 147 lines

Detailed step-by-step algorithm description, including the three-phase structure, coverage subset rules, bracket formulas, Household Risk Factor (HRF) aggregation logic, and table lookup format notes (e.g., PD limit "25" → "0 25"). Mean error: 10.0% (true: 15.1%).

# Task: Implement [Carrier] PPA Rating Algorithm

Build a Python rating engine (`rating.py`) that computes auto insurance premiums for [Carrier]'s PPA (Private Passenger Auto) program.

## What You Have

In your working directory:
- `tables/` — 43 CSV lookup tables containing all rating factors
- `cases.json` — 6 test policies with expected total premiums
- `ROC_Rate_Order_of_Calculation.csv` — The master algorithm specification (in `tables/`)

## Requirements

1. Create `rating.py` with a function: `rate_policy(policy, drivers, vehicles) -> dict`
   - Returns `{"total_policy_premium": float}`
   - Input format matches `cases.json` (see structure below)

2. Your output must pass validation: for each test case, compute the total premium and compare to `expected_premium`. Target: **< 5% mean absolute error** across all 6 cases.

## Algorithm Overview

The rating algorithm is a **57-step multiplicative factor chain** applied per-coverage, per-vehicle.
The Rate Order of Calculation (ROC) CSV in `tables/ROC_Rate_Order_of_Calculation.csv` defines every step.

### Coverage Types
There are 11 coverages: BI, PD, COMP, COLL, LOAN, MED, UMUIM, UMPD, RENT, TOW, ACPE.
Not every step applies to every coverage — the ROC CSV shows which steps apply to which coverages
(`x` = multiply, `+` = add, `-1` = subtract 1, `=` = result, blank = skip).

### Three Phases

**Phase 1: Driver Factors (Steps 1-11)** — Per-driver, per-coverage
- Steps 1-4: `[Driver Classification × Years Licensed + Driving Record Points - 1]`
  - This is a bracket: `(DCF × YLF) + DRP - 1.0`
- Steps 5-10: Multiply by additional driver factors (youthful discount, senior discount, household member, etc.)
- Step 11: The result is the **Developed Driver Factor (DDF)** for each driver × coverage
- **UMUIM and ACPE do not participate in driver factors** (they enter the chain later)

**Phase 2: Vehicle/Policy Factors (Steps 12-44)** — Per-vehicle, per-coverage
- **Step 12: Household Risk Factor (HRF)** = Average of the top-N Developed Driver Factors, where N = vehicle count
  - Sort all DDFs for this coverage descending, take the top N, average them
  - Example: 2 drivers, 1 vehicle → HRF = max(DDF₁, DDF₂) (top-1 average)
  - Example: 2 drivers, 2 vehicles → HRF = (DDF₁ + DDF₂) / 2 (top-2 average)
- Step 13: Base Rate (from `11_Base_Rate.csv`)
- Steps 14-43: Multiply by factors from their respective tables
  - Each factor has a corresponding CSV file (numbered to match the step)
  - Some steps are discounts: `(1 - discount_value)` — the ROC item name tells you
  - Some steps are surcharges: `(1 + surcharge_value)`
  - Step 18 (Full Coverage Factor): Only applies to BI, PD, MED
  - Step 27 (Limit Factor): Applies to all coverages EXCEPT COMP and COLL
  - Step 28 (Deductible Factor): Applies ONLY to COMP and COLL
  - Step 43 (CO Driver Count Factor): Applies ONLY to UMUIM
  - **UMUIM enters the chain at Step 15** (FR Tier Factor), not at Step 13
- Step 44: Round to 0.01 → this is the **Developed Premium**

**Phase 3: Premium Split (Steps 45-55)** — Per-vehicle, per-coverage
This splits the Developed Premium into Fixed + Variable (mileage-based) components:
- Step 45-47: Fixed Premium = `round(Developed_Premium × Fixed_Portion_Factor, 0.01)`
  - For BI only: add Expense Load = `round(Developed_Premium × expense_load_pct, 0.01)`
  - `expense_load_pct` = -3.1% (from actuarial support, for expense_load_line = 6)
- Steps 48-52: Per-Mile Variable Premium
  - `per_mile_rate = Developed_Premium × (1 - Fixed_Portion_Factor) / Base_Miles_Divisor`
  - Round per_mile_rate to 3 decimal places, minimum $0.001
  - `variable_premium = per_mile_rate × actual_miles_driven`
- **Actual Miles Driven** = `annual_mileage × policy_term / 12`
  - For a 6-month policy with annual_mileage=8000: actual_miles = 8000 × 6/12 = 4000
- Step 55: Vehicle Premium = Fixed + Variable (round to 0.01)

### Policy Premium
Sum all vehicle premiums across all coverages. A coverage is "purchased" if:
- For limit-based coverages (BI, PD, LOAN, MED, UMUIM, UMPD, RENT, TOW, ACPE): limit ≠ "NONE"
- For deductible-based coverages (COMP, COLL): deductible ≠ "NONE"

### Key Derivations
- **Prior Insurance Classification**: All test cases have no lapse → classification = "A"
- **Full Coverage Status**: Policy-level determination. "A" if ALL vehicles have both COMP and COLL deductibles. "S" if SOME but not all. "N" if none.
- **Multi-Car**: "Y" if vehicle_count > 1, else "N"
- **Trend Months**: `months_between(effective_date, "2025-02-01")` — for all test cases with effective_date 2025-03-01, this equals 1
- **Actual Miles Driven**: `annual_mileage × policy_term / 12` (NOT the raw annual mileage)
- **Continuous Insurance Discount Level**: Derived from months_continuous_coverage and prior_carrier_is_mileauto
- **Household Member Factor**: Two source tables exist:
  - `06_Household_Member_1_Factor.csv` — for the Primary Named Insured (PNI, first rated driver). Keys: Vehicle Count, Household Member Count
  - `07_Household_Member_2_Factor.csv` — for all other rated drivers. Keys: Vehicle Count, Household Member Count, Driver Age

### Table Lookup Notes
- Many tables use **integer range columns** (e.g., "21 ... 125" means 21 through 125 inclusive). Parse these as `[lo, hi]` and check if the input falls within the range.
- The `*` wildcard in a table cell matches any value.
- **PD Limit values** in `26_Limit_Factor.csv` use a special format: the value for PD limit 25 is stored as `"0 25"` in the table (not just `"25"`). Map input pd_limit "25" → lookup key "0 25".
- **Deductible values** in `27_Deductible_Factor.csv` use formatted strings like "1,000 DED", "500 DED". Map input deductible integers accordingly.
- The **Vehicle Symbol Factor** table (`44_Vehicle_Symbol_Factor.csv`) uses make/model/style codes. Test cases use placeholder values (make=69, model=69, style=69) that may not match any row — default to factor 1.0 if no match.
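
For readers, the lookup conventions in the notes above can be sketched in a few lines of Python. This illustration is not part of the task file the agent received. The helper names (`cell_matches`, `lookup_factor`, `format_deductible`) and the `Factor` column name are assumptions made for the example, and rows are assumed to be dicts such as those produced by `csv.DictReader`.

```python
from typing import Dict, Iterable, Optional


def cell_matches(cell: str, value) -> bool:
    """Return True if a table cell matches an input value.

    Handles the three cell forms described in the notes: the "*" wildcard,
    inclusive integer ranges written as "lo ... hi", and exact strings.
    """
    cell = cell.strip()
    if cell == "*":
        return True
    if "..." in cell:
        lo, hi = (int(part) for part in cell.split("..."))
        try:
            return lo <= int(value) <= hi
        except (TypeError, ValueError):
            return False  # non-numeric input can never fall inside a range
    return cell == str(value).strip()


def lookup_factor(rows: Iterable[Dict[str, str]], keys: Dict[str, object],
                  factor_column: str = "Factor",
                  default: Optional[float] = None) -> float:
    """Return the factor from the first row whose key cells all match.

    Raises KeyError when nothing matches and no default is given, so a
    missing row fails loudly instead of silently contributing a factor.
    """
    for row in rows:
        if all(cell_matches(row[col], val) for col, val in keys.items()):
            return float(row[factor_column])
    if default is None:
        raise KeyError(f"no row matches {keys}")
    return default


def format_deductible(deductible) -> str:
    """Map an input deductible such as "1000" to the table key "1,000 DED"."""
    return f"{int(deductible):,} DED"
```

Leaving `default` unset everywhere except the Vehicle Symbol Factor lookup (where the notes ask for 1.0) is a deliberate choice in this sketch: an unmatched key in any other table then surfaces as an exception rather than as a plausible but wrong premium.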

## Input Format (from cases.json)

```json
{
  "cases": [
    {
      "name": "Policy 5029",
      "expected_premium": 2174.48,
      "policy": {
        "household_member_count": 2,
        "financial_responsibility_tier": "M1",
        "tier": "2B",
        "advance_shop_days": 21,
        "policy_term": 6,
        "effective_date": "2025-03-01",
        "prior_bi_limit_text": "50/100",
        "months_continuous_coverage": 24,
        ...
      },
      "drivers": [
        {"gender": "M", "marital_status": "S", "years_licensed": 20, ...},
        ...
      ],
      "vehicles": [
        {"bi_limit": "50/100", "pd_limit": "25", "comp_deductible": "1000", ...},
        ...
      ]
    }
  ]
}
```

## Validation

After creating `rating.py`, validate it:

```python
import json
from rating import rate_policy

with open("cases.json") as f:
    data = json.load(f)

total_err = 0
for case in data["cases"]:
    result = rate_policy(case["policy"], case["drivers"], case["vehicles"])
    computed = result["total_policy_premium"]
    expected = case["expected_premium"]
    err = abs(computed - expected) / expected * 100
    total_err += err
    print(f"{case['name']}: expected={expected}, computed={computed}, error={err:.1f}%")

print(f"\nMean error: {total_err / len(data['cases']):.2f}%")
```

Target: Mean error < 5%.
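
As an aside for readers, and not part of the task file reproduced above, the bracket formula (Steps 1-4) and the Household Risk Factor rule (Step 12) from that specification can be sketched as follows. The function names are hypothetical. The behavior when a policy has more vehicles than rated drivers (averaging over the DDFs that exist) is an assumption, since the specification does not address that case.

```python
from typing import Iterable, Sequence


def bracket(dcf: float, ylf: float, drp: float) -> float:
    """Steps 1-4: (Driver Classification x Years Licensed) + DRP - 1.0."""
    return dcf * ylf + drp - 1.0


def developed_driver_factor(dcf: float, ylf: float, drp: float,
                            other_factors: Iterable[float] = ()) -> float:
    """Steps 1-11: the bracket result times the remaining driver factors."""
    ddf = bracket(dcf, ylf, drp)
    for factor in other_factors:
        ddf *= factor
    return ddf


def household_risk_factor(ddfs: Sequence[float], vehicle_count: int) -> float:
    """Step 12: average of the top-N DDFs for one coverage, N = vehicle count.

    Assumption: if there are fewer DDFs than vehicles, average the DDFs
    that exist rather than padding the list.
    """
    if not ddfs or vehicle_count < 1:
        raise ValueError("need at least one DDF and one vehicle")
    top = sorted(ddfs, reverse=True)[:vehicle_count]
    return sum(top) / len(top)
```

With two drivers whose DDFs are 1.2 and 0.8, one vehicle gives an HRF of 1.2 (the maximum) and two vehicles give the plain average, matching the two examples in the specification.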

43-step rater, with RSL compiler

The RSL agent receives the same tables and test cases. Instead of a text specification, it writes RSL source code and receives compiler diagnostics. The compiler enforces coverage subsets, chain connectivity, table-data consistency, and dimensional types. Mean error: 0.0%. See Appendix B for a skeleton of the RSL agent's system prompt.
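
To make the idea of a compiler check as a domain fact concrete, here is a minimal Python sketch of two such checks: coverage subset enforcement and table totality. It is an illustration only and does not show the RSL compiler's implementation, its diagnostics format, or its data model. The function names, the `ALLOWED_COVERAGES` mapping, and the message wording are all hypothetical, while the three subset rules themselves are taken from the 43-step specification above.

```python
from typing import Dict, Iterable, List, Set

# Subset rules quoted from the 43-step specification (Steps 18, 28, 43).
ALLOWED_COVERAGES: Dict[str, Set[str]] = {
    "Full Coverage Factor": {"BI", "PD", "MED"},
    "Deductible Factor": {"COMP", "COLL"},
    "CO Driver Count Factor": {"UMUIM"},
}


def check_coverage_subsets(steps: Dict[str, Set[str]]) -> List[str]:
    """Report every step that is applied to a coverage outside its subset."""
    diagnostics = []
    for step, coverages in steps.items():
        allowed = ALLOWED_COVERAGES.get(step)
        if allowed is None:
            continue  # no subset rule is known for this step
        extra = sorted(coverages - allowed)
        if extra:
            diagnostics.append(
                f"error: step '{step}' is applied to {extra}, "
                f"but it participates only in {sorted(allowed)}"
            )
    return diagnostics


def check_table_total(table: str, keys_present: Iterable[str],
                      enum_values: Iterable[str]) -> List[str]:
    """Report every enum value that has no row in the lookup table."""
    missing = sorted(set(enum_values) - set(keys_present))
    return [f"error: table '{table}' has no row for {value!r}"
            for value in missing]
```

In a Python implementation, applying the Deductible Factor to BI raises no exception and simply yields a different premium. In this sketch the same mistake produces a diagnostic that names the violated rule at the point where it occurs.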

128-step rater (15 coverages) — 151 lines

Same carrier, expanded to the full coverage set. Two independent runs diverge by 2–6× on the same test policies.

# Task: Implement [Carrier] PPA Rating Algorithm

Implement a Python rating algorithm for [Carrier] Private Passenger Auto.

## Your Working Directory Contains

- `tables/` — 43 CSV rate factor tables (numbered 01-44, plus reference tables)
- `tables/ROC_Rate_Order_of_Calculation.csv` — the official Rate Order of Calculation
  showing which factor applies to which coverage at each step
- `cases.json` — 6 test policies with expected total premiums

## Requirements

Write a file called `rating.py` that contains a function:

```python
def rate_policy(policy: dict, drivers: list[dict], vehicles: list[dict]) -> dict:
```

The function must return a dict with:
- `total_policy_premium`: float — the total premium for the policy
- `vehicles`: list of dicts, each with:
  - `vehicle_total_premium`: float
  - `coverages`: dict mapping coverage name → premium (float)

The 11 coverages are: BI, PD, COMP, COLL, LOAN, MED, UMUIM, UMPD, RENT, TOW, ACPE

## Algorithm Overview

The ROC CSV (`tables/ROC_Rate_Order_of_Calculation.csv`) is the authoritative guide.
Each row is a rating step. Columns BI through ACPE show which coverages participate
in that step (marked with 'x', '+', '-1', '÷', or '=').

### Phase 1: Developed Driver Factor (Steps 1-11, per driver per coverage)

For each rated driver, compute a per-coverage Developed Driver Factor (DDF):

```
Steps 1-3: [Classification Factor × Years Licensed Factor + DRP Factor - 1]
Step 5:    × (1 - Youthful Driver Discount)
Step 6:    × (1 - CO Senior Safe Driver Discount)
Step 7:    × Household Member Factor
Step 8:    × Driver Age Point Factor
Step 9:    × Financial Responsibility by Clean Factor
Step 10:   × Driver's License Type Factor
Step 11:   = Developed Driver Factor
```

Note: Steps 1-4 use a BRACKET formula: (Class × YearsLicensed + DRP - 1).
The bracket formula result is then multiplied by subsequent factors.

UMUIM and ACPE have NO driver-level factors (empty columns in ROC for steps 1-11).

### Phase 2: Vehicle Premium (Steps 12-55, per vehicle per coverage)

```
Step 12: Household Risk Factor = average of top-N Developed Driver Factors
         where N = number of vehicles on the policy
         (For each coverage, take the N highest DDFs and average them)
Step 13: × Base Rate (from table 11_Base_Rate.csv)
Step 14-43: × each factor in sequence per the ROC
Step 44: = Developed Premium (round to $0.01)
Steps 45-52: Split into Fixed + Variable (per-mile) portions
Step 55: Vehicle Premium = Fixed + Variable (round to $0.01)
```

### Coverage Entry Points (CRITICAL)

Not all coverages enter the chain at Step 12:

- **BI, PD, COMP, COLL, LOAN, MED, UMPD, RENT, TOW**: Full chain from Step 12
  Formula: HRF × Base Rate × [Step 14 through 43 factors]

- **UMUIM**: Enters at Step 15. No HRF (Step 12), no Base Rate × HRF multiplication,
  no Tier Factor (Step 14). Instead: Base Rate × FR Tier Factor (Step 15) × [Step 16+]

- **ACPE**: Enters at Step 13 with Base Rate only. No HRF multiplication.
  Most steps are pass-through (factor = 1.0) for ACPE per the ROC.

### Per-Mile Premium Split (Steps 45-55)

After computing the Developed Premium (Step 44):
1. Fixed Portion = Developed Premium × Fixed Portion Factor (table 42)
2. For BI only: add Expense Load = Developed Premium × expense_load_pct (table S2D2)
3. Variable per-mile = Developed Premium × (1 - Fixed Portion Factor) ÷ Base Mile Divisor (table 43)
4. Variable per-mile = max(rounded to 3 decimals, $0.001)
5. Variable Premium = per-mile rate × actual_miles_driven
6. Vehicle Premium = Fixed Portion [+ Expense Load for BI] + Variable Premium

### Actual Miles Driven

actual_miles_driven = annual_mileage × policy_term / 12

### Coverage Selection

Each vehicle has per-coverage limit fields (e.g., `bi_limit`, `pd_limit`, etc.):
- If a coverage limit is "NONE" (or absent), that coverage premium = $0
- For COMP/COLL: check `comp_deductible`/`coll_deductible` — "NONE" means not selected
- The Limit Factor (table 26) and Deductible Factor (table 27) convert limits/deductibles
  to factors in the rating chain

### Key Table Lookup Notes

- Many tables use range columns (e.g., `driver_age_min`, `driver_age_max`).
  Find the row where input falls within [min, max] inclusive.
- Boolean columns use various formats: "Y"/"N", "True"/"False", etc.
- Some tables have per-coverage output columns (BI, PD, COMP, ...).
  Others have a single Factor column that applies to all coverages.
- The Limit Factor table (26) has columns: Coverage, Prior_Insurance, Limit, Factor.
  Match coverage type AND the exact limit string from the vehicle.
- Rounding: use banker's rounding (round half to even) throughout.

### Discount Steps

Steps marked with `(1 - ...)` in the ROC are DISCOUNTS. Apply as:
```
running_premium × (1 - discount_percentage)
```

Steps marked with `(1 + ...)` are SURCHARGES. Apply as:
```
running_premium × (1 + surcharge_percentage)
```

## Validation

After implementing, validate against ALL 6 test cases in cases.json:
1. Call `rate_policy(case['policy'], case['drivers'], case['vehicles'])`
2. Compare `total_policy_premium` to `case['expected_premium']`
3. Print: policy name, expected premium, actual premium, percentage error
4. Print the per-coverage breakdown for each vehicle

Save results to `coverage_results.json`:
```json
{
  "policies": [
    {
      "name": "Policy XXXX",
      "expected_total": 1234.56,
      "actual_total": 1234.56,
      "error_pct": 0.5,
      "vehicles": [
        {
          "coverages": {"BI": 123.45, "PD": 67.89, ...},
          "vehicle_total": 456.78
        }
      ]
    }
  ]
}
```
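
As an aside for readers, and not part of the task file reproduced above, the six-step per-mile premium split and the banker's rounding rule from that prompt can be sketched with Python's `decimal` module. The function and parameter names are hypothetical, and the factor values in any call are placeholders rather than values from the carrier's tables. Only the per-mile rate and the final vehicle premium are rounded here, which is what this prompt states explicitly.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value: Decimal, quantum: str) -> Decimal:
    """Banker's rounding to the given quantum, e.g. "0.01" or "0.001"."""
    return value.quantize(Decimal(quantum), rounding=ROUND_HALF_EVEN)


def vehicle_premium(developed_premium, fixed_portion_factor, base_mile_divisor,
                    annual_mileage, policy_term_months,
                    expense_load_pct=0) -> Decimal:
    """Split a Developed Premium into fixed and per-mile parts (Steps 45-55)."""
    dp = Decimal(str(developed_premium))
    fpf = Decimal(str(fixed_portion_factor))
    divisor = Decimal(str(base_mile_divisor))

    # 1. Fixed portion.
    fixed = dp * fpf
    # 2. Expense load: pass a non-zero percentage for BI only.
    expense_load = dp * Decimal(str(expense_load_pct))
    # 3-4. Per-mile rate, rounded to 3 decimals with a $0.001 floor.
    per_mile = round_half_even(dp * (1 - fpf) / divisor, "0.001")
    per_mile = max(per_mile, Decimal("0.001"))
    # 5. Variable premium over the miles actually driven in the term.
    actual_miles = (Decimal(str(annual_mileage))
                    * Decimal(str(policy_term_months)) / Decimal(12))
    variable = per_mile * actual_miles
    # 6. Vehicle premium, rounded to $0.01.
    return round_half_even(fixed + expense_load + variable, "0.01")
```

Using `Decimal` rather than `float` keeps the half-to-even rule exact: a value such as 2.675 is not exactly representable in binary floating point, so the built-in `round` can resolve such ties in a way that disagrees with the decimal arithmetic a filed rating manual assumes.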

B. RSL agent system prompt skeleton

Below is a redacted skeleton of the RSL agent's system prompt, showing its identity, tools, APIs, and validation protocol. The full prompt, approximately 15,000 tokens in total, also includes runnable, CI-validated RSL reference files, type system documentation, step implementation patterns, and scenario syntax. Carrier-specific content is injected at runtime via the orchestrator's dispatch message, not through the system prompt.

# RSL Editor Agent — System Prompt Skeleton

## Identity
You are an RSL Editor — a specialized agent for creating and updating
RSL (Rating Specification Language) files from rate filing specifications.

## Tools
| Tool | Description |
|------|-------------|
| python_notebook | Persistent Jupyter kernel — RSLLib queries, compilation, AST analysis |
| bash | Shell — rsl compile, rsl fmt, rsl test, rsl export, file operations |
| text_editor | Create / edit RSL source files with str_replace, insert, create |

## RSLLib Python API (available in python_notebook)
| Function | Returns |
|----------|---------|
| step_participation(tree, step) | {axis: {elements}} — which coverages a step covers |
| steps_for_element(tree, axis, elem) | [step_names] — which steps affect a coverage |
| list_nl_placeholders(source) | [placeholders] — unimplemented stubs |
| get_structure_map(source) | StructureMap — entities, steps, skip status |
| replace_nl(source, step, entity, ...) | Updated source — surgical AST mutation |
| add_step / remove_step / reorder_step | AST mutations preserving comments & annotations |
| apply_mutations(source, [mutations]) | Batch mutation application |

## Validation Protocol
1. After every edit: run `rsl compile` — fix all errors before proceeding
2. After completing a step: run `rsl eval` if scenario exists
3. Coverage participation: verify step participates in correct coverage subset
4. Chain connectivity: verify step connects to terminal output
5. Table-data consistency: verify lookup keys exist in CSV domain

## Knowledge Sections
[RSL language reference, scenario reference, type system documentation,
 error pattern guide, step implementation patterns — injected at construction time,
 ~15,000 tokens total. Content is CI-validated and carrier-agnostic.]