elidukes.xyz

[ note ] // capability indexing

Capability to eval feed

A work-in-progress baseline: how eval-based workflow assessment, joined to company-specific data, can underwrite M&A, durability, and theoretical opex reduction.

01 Decompose

Split a firm into operating segments and score each across two axes: cognitive (judgment, planning, R&D) and physical (robotics, fulfillment, field ops).

axes: cog · phys
unit: segment

02 Assess

Map each workflow to public eval evidence (real frontier benchmark scores) and separate genuine coverage from proxy.

public: eval rows
line: CT 0.70

03 Anticipate

Layer in human-data vendor priorities (where the labs are buying training data) as a forward signal of where capability is about to move.

signal: vendor demand
horizon: forward

This is a WIP baseline · Posted May 2026

What it is

This is an early, deliberately conservative baseline for one question: can eval-based workflow assessment, joined to company-specific data, be turned into an underwriting input? The use cases I care about are pricing M&A, judging the durability of a business as AI re-prices its work, and estimating theoretical opex reduction segment by segment. The capability index (VCI) decomposes a public company into its operating segments and scores each across cognitive and physical axes, so the analysis is per-segment, not a single company-level guess.
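A minimal sketch of what the per-segment decomposition could look like as data. The class name, field names, the 0 to 1 scale, and the example company and scores are all hypothetical placeholders, not values from the index; the point is only that each segment carries its own cognitive and physical score and nothing is rolled up to the company level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentScore:
    """One operating segment scored on both axes (hypothetical schema)."""
    company: str
    segment: str
    cognitive: float  # score on the cognitive axis (judgment, planning, R&D)
    physical: float   # score on the physical axis (robotics, fulfillment, field ops)


def decompose(company: str, segments: dict[str, tuple[float, float]]) -> list[SegmentScore]:
    """Turn {segment: (cognitive, physical)} into one record per segment."""
    return [
        SegmentScore(company, name, cog, phys)
        for name, (cog, phys) in segments.items()
    ]


# Placeholder inputs for illustration only.
rows = decompose("ExampleCo", {"design": (0.8, 0.1), "field_ops": (0.2, 0.7)})
for row in rows:
    print(row.segment, row.cognitive, row.physical)
```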

Two inputs to the same estimate

The workflow-to-capability index is only one part of the equation. It answers "given today's frontier models, how much of this segment's work is actually within reach, on the evidence?" The second part is forward-looking: human-data vendor priorities (what the labs are paying to collect), a leading indicator of where model progress is heading next. Together they give you a snapshot and a trajectory; underwriting needs both.

From job to evidence

The public-facing front of this is a search box. You type a job (underwriting renewal, nurse triage, production-bug triage, drawing review) and the query resolves to an operating surface (sector × workflow family × capability domain). The result answers three things: which public benchmarks are relevant, how closely they match the real workflow, and whether there is an active recruiting signal for human data in the same area. Benchmark rows carry real frontier scores (SWE-bench, LiveCodeBench, tau-bench, CUAD, AEC-Bench and the like) normalized to 0–100, with a default clearance threshold of 0.70 (70 on that scale) separating "covered" from "still a gap."
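As a rough illustration of the clearance line, the sketch below labels benchmark rows as covered, gap, or proxy. The row schema, the `classify` helper, and the benchmark names and scores are placeholders rather than real results, and dividing the 0–100 score by 100 before comparing it to 0.70 is an assumption about how the two scales meet.

```python
from dataclasses import dataclass

CLEARANCE_THRESHOLD = 0.70  # default line from the note, read here as a 0..1 fraction


@dataclass(frozen=True)
class EvalRow:
    benchmark: str       # placeholder names below, not real benchmark results
    score_0_100: float   # frontier score normalized to 0..100
    is_proxy: bool       # True when the benchmark only approximates the workflow


def classify(row: EvalRow, threshold: float = CLEARANCE_THRESHOLD) -> str:
    """Label a row 'proxy', 'covered', or 'gap'."""
    if row.is_proxy:
        return "proxy"  # kept apart from genuine coverage, whatever the score
    # Assumption: the 0..100 score is rescaled to 0..1 before the comparison.
    return "covered" if row.score_0_100 / 100.0 >= threshold else "gap"


rows = [
    EvalRow("bench_a", 82.0, is_proxy=False),
    EvalRow("bench_b", 41.0, is_proxy=False),
    EvalRow("bench_c", 90.0, is_proxy=True),
]
labels = {r.benchmark: classify(r) for r in rows}
print(labels)
```

Treating proxy rows as their own label, instead of letting a high proxy score count as coverage, is one way to keep genuine coverage and proxy evidence separate.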

What counts as a signal

"Active" means a human-data vendor appears to be recruiting for a role that matches the surface. It is not proof that any specific lab is training on that task. It's a demand signal, and it belongs next to eval coverage, not above it. Domains are tiered: the high-pressure ones (AEC, energy, aerospace, biopharma, medical) are weighted most heavily, because that's where the labs are buying training data.
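One way to keep the demand signal beside eval coverage, not blended into it, is sketched below. The tier weights, the posting counts, and the function and field names are hypothetical; only the list of high-pressure domains comes from the note.

```python
HIGH_PRESSURE = {"aec", "energy", "aerospace", "biopharma", "medical"}  # from the note
TIER_WEIGHT = {"high": 1.0, "standard": 0.5}  # hypothetical weights


def demand_signal(domain: str, active_postings: int) -> dict:
    """Summarize vendor recruiting for one domain (a demand signal, not proof of training)."""
    tier = "high" if domain.lower() in HIGH_PRESSURE else "standard"
    return {
        "tier": tier,
        "active": active_postings > 0,
        "weighted": TIER_WEIGHT[tier] * active_postings,
    }


def surface_view(domain: str, eval_coverage: float, active_postings: int) -> dict:
    """Report coverage and demand side by side; they are never summed into one score."""
    return {
        "domain": domain,
        "eval_coverage": eval_coverage,
        "demand": demand_signal(domain, active_postings),
    }


view = surface_view("AEC", eval_coverage=0.3, active_postings=4)
print(view)
```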

The honest part

It's a baseline, and it shows its seams on purpose. The construction deep-dive makes the point: of ~246 workflows, roughly half have no real eval coverage and only ~32% clear the threshold. So the index publishes its components separately (ceiling, captured, residual, coverage) rather than collapsing them into one confident number that hides the uncertainty. The durable version needs citations on every row, source freshness, and a clean line between public benchmark scores and private workflow traces before any of this should anchor a real underwrite.
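The two shares quoted above (workflows with no real eval coverage, workflows clearing the threshold) can be kept as separate fields in a small report. The sketch below uses toy data and hypothetical names, and it does not reproduce the index's own ceiling, captured, and residual definitions, which the note does not spell out.

```python
from typing import Optional

THRESHOLD = 0.70  # default clearance line from the note


def coverage_report(scores: dict[str, Optional[float]],
                    threshold: float = THRESHOLD) -> dict[str, float]:
    """scores maps workflow -> best genuine eval score in 0..1, or None if no real eval exists."""
    total = len(scores)
    if total == 0:
        raise ValueError("no workflows to report on")
    uncovered = sum(1 for s in scores.values() if s is None)
    cleared = sum(1 for s in scores.values() if s is not None and s >= threshold)
    # Each share is reported on its own; nothing is collapsed into a single number.
    return {
        "workflows": total,
        "share_without_eval": uncovered / total,
        "share_clearing": cleared / total,
        "share_covered_below_line": (total - uncovered - cleared) / total,
    }


# Toy data for illustration only.
toy = {"w1": None, "w2": None, "w3": 0.9, "w4": 0.5}
report = coverage_report(toy)
print(report)
```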
