Project controls environment

A Primavera P6 replica with an agent-native scheduling environment: a from-scratch CPM engine brute-forced to P6 parity, where the agent edits by writing XER directly.

This post is still being written · Posted May 2026

What it is

The project-controls environment is a Primavera P6 replica wired up so an agent can work inside a real operational domain instead of a generic chat task. Construction controls has structured files, high-cost mistakes, explainable calculations, and a natural review loop, which makes it a good place to measure capability and, eventually, to generate training data. Acting as a scheduler also maps directly onto an agent's ability to direct and manage 3-to-5-year projects, some of the longest-horizon tasks the labs are after. The hosted demo uses the Claude Agent SDK for agent operations and is capped at 100 turns per session.

Why project controls

Schedules are both formal and messy. CPM gives you verifiable structure, while real workflows still require judgment: changed calendars, bad logic ties, missing constraints, delayed procurement, and arguments about what actually moved the critical path. It's a domain where you can grade an answer and still need a real one. Project-controls professionals also routinely cross-reference change orders against contractual stipulations, monitor systems so that work activities can be scheduled into plant maintenance windows, and much more.

Brute-forced to P6 parity

Rather than wrap a black box, the environment ships a from-scratch CPM engine, and it was brute-forced to numerical parity with P6 rather than optimized for speed. It covers the forward and backward pass, calendar-aware working-hour arithmetic, lag semantics across predecessor / successor / project calendars, constraints, total float, and critical path. Across a 60-schedule benchmark (up to ~23k activities) it matches P6 on early/late dates and lands around 98% on total float; when total float is recomputed on a common basis, it matches throughout.
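For readers who haven't seen CPM spelled out, here is a minimal sketch of the forward and backward pass. It is not the environment's engine: it assumes finish-to-start links only, integer durations, a project start at 0, and no calendars or constraints, which is exactly the part that makes P6 parity hard. The function name `cpm` and its input shapes are my own illustration.

```python
from collections import defaultdict, deque

def cpm(durations, links):
    """Toy CPM. durations: {activity: int}; links: [(pred, succ, lag)], finish-to-start only."""
    succs, preds = defaultdict(list), defaultdict(list)
    indegree = {a: 0 for a in durations}
    for p, s, lag in links:
        succs[p].append((s, lag))
        preds[s].append((p, lag))
        indegree[s] += 1

    # Topological order (Kahn's algorithm); a leftover node means a logic loop.
    queue = deque(a for a, d in indegree.items() if d == 0)
    order = []
    while queue:
        a = queue.popleft()
        order.append(a)
        for s, _ in succs[a]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if len(order) != len(durations):
        raise ValueError("logic loop detected")

    # Forward pass: early start is the latest predecessor early finish plus lag.
    es, ef = {}, {}
    for a in order:
        es[a] = max((ef[p] + lag for p, lag in preds[a]), default=0)
        ef[a] = es[a] + durations[a]
    project_finish = max(ef.values(), default=0)

    # Backward pass: late finish is the earliest successor late start minus lag.
    ls, lf = {}, {}
    for a in reversed(order):
        lf[a] = min((ls[s] - lag for s, lag in succs[a]), default=project_finish)
        ls[a] = lf[a] - durations[a]

    total_float = {a: ls[a] - es[a] for a in durations}
    critical = [a for a in order if total_float[a] == 0]
    return es, ef, ls, lf, total_float, critical


durations = {"A": 3, "B": 2, "C": 4, "D": 1}
links = [("A", "B", 0), ("A", "C", 0), ("B", "D", 0), ("C", "D", 0)]
es, ef, ls, lf, total_float, critical = cpm(durations, links)
```

In this toy network B carries two units of float and the critical path runs A, C, D. The real engine has to reproduce the same quantities in working hours across multiple calendars, which this sketch deliberately leaves out.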

Agents write XER

Same posture as the rest of my tools: keep the surface low-level so you're testing the model, not a convenience API. The agent gets rich read tools (CPM, critical path, delay impact, float distribution, health) and the CPM oracle, but there is no fleet of helper functions for mutation. To change the schedule, the agent writes XER text directly and reloads it. That deliberately makes the task harder, and it isolates what's being measured: can the model actually manipulate the real schedule format and keep it valid, rather than call set_duration() and hope.
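To make "writes XER text directly" concrete, here is a rough sketch of what a single-field edit amounts to at the text level. XER is tab-delimited, with `%T` naming a table, `%F` listing its fields, `%R` carrying a row, and `%E` closing the file. Everything else here is an assumption for illustration: the helper `set_field` is hypothetical and is not something the environment exposes (the agent has no such function), the sample is a trimmed fragment rather than a loadable file, and the sketch assumes `\n` line endings.

```python
def set_field(xer_text, table, key_field, key_value, field, new_value):
    """Return xer_text with one field changed on the single %R row that matches."""
    out, current, fields, hits = [], None, [], 0
    for line in xer_text.split("\n"):
        parts = line.split("\t")
        if parts[0] == "%T":
            # New table begins; forget the previous table's field list.
            current, fields = parts[1], []
        elif parts[0] == "%F" and current == table:
            fields = parts[1:]
        elif parts[0] == "%R" and current == table and fields:
            row = parts[1:]
            if row[fields.index(key_field)] == key_value:
                row[fields.index(field)] = new_value
                line = "\t".join(["%R"] + row)
                hits += 1
        out.append(line)
    if hits != 1:
        raise ValueError(f"expected exactly one matching row, found {hits}")
    return "\n".join(out)


# Illustrative fragment only: a real export has a header line and many more tables and fields.
SAMPLE = "\n".join([
    "%T\tTASK",
    "%F\ttask_id\ttask_code\ttarget_drtn_hr_cnt",
    "%R\t1001\tA1000\t40",
    "%R\t1002\tA1010\t80",
    "%E",
])

edited = set_field(SAMPLE, "TASK", "task_code", "A1010", "target_drtn_hr_cnt", "120")
```

Even this small edit shows where validity can slip: column positions come from the `%F` line, every other row has to survive byte-for-byte, and a real duration change usually touches more than one field. That bookkeeping is what the agent is being tested on.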

The episode

The useful unit is an episode. The agent receives a schedule and a task, takes actions, and produces a memo or correction. The whole run is captured as a structured artifact: the tool trajectory, per-turn reasoning, before/after schedule hashes, the CPM summary, verification signals, and cost. Whether the schedule actually changed is derived deterministically by hashing the graph, independent of what the agent claims. The same environment is ready for on-policy RL training, which is planned as a future extension.
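The deterministic change check can be sketched as a canonical hash over the schedule graph. This is an assumption-heavy illustration, not the environment's implementation: which fields count as "the graph" (here, activity durations plus typed relationships with lag), the function name `graph_hash`, and the JSON canonicalization are all my own choices for the example.

```python
import hashlib
import json

def graph_hash(activities, relationships):
    """Order-independent SHA-256 over activities and logic ties.

    activities: {activity_code: duration_hours}
    relationships: iterable of (pred, succ, link_type, lag_hours)
    """
    canonical = {
        # Sort both collections so row order in the file cannot change the hash.
        "activities": sorted([code, dur] for code, dur in activities.items()),
        "relationships": sorted([p, s, t, lag] for p, s, t, lag in relationships),
    }
    blob = json.dumps(canonical, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


before = graph_hash(
    {"A1000": 40, "A1010": 80},
    [("A1000", "A1010", "FS", 0)],
)
# Same content, different insertion order: the hash should not move.
reordered = graph_hash(
    {"A1010": 80, "A1000": 40},
    [("A1000", "A1010", "FS", 0)],
)
# One duration changed: the hash should move.
after = graph_hash(
    {"A1000": 40, "A1010": 120},
    [("A1000", "A1010", "FS", 0)],
)
changed = before != after
```

Comparing the before and after hashes gives a yes/no answer that does not depend on the agent's memo, so a claimed fix that left the schedule untouched shows up immediately.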

What comes next

I'm currently working with a select group of project-controls professionals to accumulate additional traces, building toward a corpus large enough to evaluate models on real controls work.