How benchmark.ai works

A precise account of the data, scoring rubric, and inference process behind every report — and an honest map of where the method is strong and where it is not.

Data sources and what they measure

Live web search (Tavily)

Tavily is a search API built for language models; it retrieves and ranks recent web content on demand. At report time benchmark.ai queries it per company and collects up to 20 signals — news, earnings commentary, official announcements, and analyst coverage — from the trailing 12 months. It supplies the live, company-specific evidence that the static sector benchmarks cannot.

Limitation: Coverage skews toward English-language, digitally visible organisations, and retrieval reflects how a company is discussed publicly rather than what it does internally.
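The two limits stated above (at most 20 signals, trailing 12 months) can be illustrated with a small filter. This is a minimal sketch, not benchmark.ai's actual pipeline: the `Signal` record, its `relevance` field, and the 365-day approximation of "12 months" are all assumptions made for illustration, and no Tavily call is shown.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_SIGNALS = 20   # cap stated in the text
WINDOW_DAYS = 365  # "trailing 12 months", approximated as 365 days (assumption)


@dataclass(frozen=True)
class Signal:
    # Hypothetical record shape; the real fields are not documented here.
    url: str
    published: date
    relevance: float  # assumed retrieval ranking score, higher is better


def select_signals(signals, today):
    """Keep signals inside the trailing window, best-ranked first, capped at MAX_SIGNALS."""
    cutoff = today - timedelta(days=WINDOW_DAYS)
    recent = [s for s in signals if cutoff <= s.published <= today]
    recent.sort(key=lambda s: s.relevance, reverse=True)
    return recent[:MAX_SIGNALS]
```

Sorting before truncating means that when more than 20 recent items exist, the lowest-ranked ones are the ones dropped.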

IMD Future Readiness Indicator (2026)

Published annually by IMD Business School, this index scores 67 economies across four pillars — technology, future readiness, connectivity, and knowledge — derived from enterprise survey data and public investment metrics. benchmark.ai reads its sector-level readiness aggregates. Selected for its rigorous multi-pillar construction and consistent year-on-year methodology.

Limitation: Scores reflect national sector averages and cannot be disaggregated to individual company performance.

Stanford HAI AI Index (2025)

Stanford’s Institute for Human-Centered AI publishes this annual report synthesising AI investment, model performance, adoption, and labour-demand data from academic, governmental, and commercial datasets. benchmark.ai uses its sector-level investment and adoption indicators. Selected for the breadth and independence of its underlying data.

Limitation: It measures aggregate ecosystem and sector trends, not the maturity of any individual firm.

McKinsey State of AI (2025)

McKinsey’s annual global executive survey reports AI and generative-AI adoption rates and self-assessed value capture by function and industry. It contributes a practitioner view of where AI is deployed and where it is reported to create value. Selected for its large recurring sample and functional granularity.

Limitation: Responses are self-reported and subject to selection and response bias; results describe industries, not named companies.

OECD AI and Work (2024)

The OECD’s work on AI and labour markets analyses adoption, automation exposure, and workforce impact across member economies using official statistics and employer surveys. It anchors the workforce-transformation dimension at sector level. Selected for its methodological independence and labour-market focus.

Limitation: Updated less frequently than the other sources, and reports occupational and sectoral exposure rather than firm-level workforce change.

Scoring dimensions and behavioural anchors

Each company is scored 0–10 on three dimensions. The anchors below define what distinguishes one band from the next, so a 6 and a 7 reflect materially different evidence rather than impression.

AI Integration Depth

0–2: No publicly disclosed AI initiatives beyond general technology investment.
3–4: Announced AI pilots or proof-of-concept deployments in isolated functions.
5–6: AI deployed across multiple business units with documented operational integration.
7–8: AI embedded in core product or service delivery with measurable process change.
9–10: AI is a stated strategic differentiator with documented competitive advantage and ongoing investment.

Quantified Business Impact

0–2: No quantified outcomes disclosed; AI value described only in aspirational terms.
3–4: Qualitative benefit claims (efficiency, experience) without supporting figures.
5–6: Partial metrics reported for specific use cases or functions.
7–8: Material gains attributed to AI in defined areas, reported in earnings or audited disclosures.
9–10: Sustained enterprise-level revenue, margin, or cost impact explicitly attributed to AI across periods.

Workforce Transformation

0–2: No disclosed change to roles, skills, or operating model arising from AI.
3–4: Isolated reskilling or hiring signals, or stated intent without evidence of execution.
5–6: Structured reskilling programmes or process redesign underway in some functions.
7–8: Documented role redesign and operating-model change across multiple functions.
9–10: Enterprise-wide workforce restructuring around AI with sustained investment and governance.
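The band structure shared by all three dimensions can be written as a lookup. The sketch below assumes scores are whole numbers, which the text does not state explicitly; `band_for` and `same_band` are illustrative names, not part of benchmark.ai.

```python
# Band edges copied from the anchor lists above (inclusive on both ends).
BANDS = [(0, 2), (3, 4), (5, 6), (7, 8), (9, 10)]


def band_for(score: int) -> str:
    """Return the anchor band, e.g. '5-6', that an integer 0-10 score falls in."""
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    for low, high in BANDS:
        if low <= score <= high:
            return f"{low}-{high}"
    raise ValueError("score must be between 0 and 10")


def same_band(a: int, b: int) -> bool:
    """True when two scores are described by the same behavioural anchor."""
    return band_for(a) == band_for(b)
```

For example, `same_band(6, 7)` is false because 6 and 7 sit under different anchors, while `same_band(7, 8)` is true.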

From signals to scores

For each company the model receives up to 20 retrieved web signals alongside the relevant sector benchmark figures. It assigns a 0–10 score on each dimension by matching the available evidence to the behavioural anchors above, and is required to cite the specific signals that justify each score.

Scores reflect the weight and specificity of evidence, not its volume alone: a company with three corroborating earnings-call references scores higher than one supported by a single press release. The model is instructed to score conservatively where evidence is thin, defaulting to 3 or below when a dimension is unsupported, rather than inflate sparse signals. These scores are not validated against any ground truth; they are a structured reading of public information, repeatable only to within the run-to-run variation described under Limitations, not a measurement.
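The rules in this section (a 0–10 range, mandatory citations, and a ceiling of 3 for unsupported dimensions) lend themselves to a post-hoc check on the model's output. The following is a sketch under assumptions: the `DimensionScore` shape and the idea of validating after generation are hypothetical, since the text only says the model is instructed to follow these rules.

```python
from dataclasses import dataclass, field

UNSUPPORTED_CAP = 3  # from the text: 3 or below when a dimension is unsupported


@dataclass
class DimensionScore:
    # Hypothetical output record; field names are illustrative.
    dimension: str
    score: int
    cited_signal_ids: list = field(default_factory=list)


def check_score(ds: DimensionScore, known_signal_ids: set) -> list:
    """Return a list of rule violations; an empty list means the score passes."""
    problems = []
    if not 0 <= ds.score <= 10:
        problems.append("score outside 0-10")
    unknown = [c for c in ds.cited_signal_ids if c not in known_signal_ids]
    if unknown:
        problems.append(f"cites unknown signals: {unknown}")
    if not ds.cited_signal_ids and ds.score > UNSUPPORTED_CAP:
        problems.append("no citations but score above 3")
    return problems
```

A score that fails these checks could be regenerated or flagged for review; which of those is appropriate is a design choice this sketch leaves open.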

Reading the strategic positioning chart

Each company is plotted on two fixed axes — AI Integration Depth (horizontal) and Quantified Business Impact (vertical) — with dot size encoding the third dimension, Workforce Transformation. The quadrants are diagnostic rather than evaluative: a position describes a posture, not a verdict.

Leaders

Deep integration converting into measurable business impact.

Optimisers

Measurable impact from focused, less extensive deployment.

Experimenters

Broad AI activity not yet translated into measurable returns.

Laggards

Limited integration and little quantified impact to date.
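The four quadrants follow from splitting each axis into a lower and an upper half. The text does not say where the dividing lines sit, so the sketch below assumes the scale midpoint of 5 on both axes, with a score of exactly 5 counted as the upper half; both choices are assumptions.

```python
MIDPOINT = 5.0  # assumed quadrant boundary; not specified in the text


def quadrant(integration: float, impact: float, midpoint: float = MIDPOINT) -> str:
    """Map AI Integration Depth (x) and Quantified Business Impact (y) to a quadrant."""
    deep = integration >= midpoint
    high_impact = impact >= midpoint
    if deep and high_impact:
        return "Leaders"
    if high_impact:
        return "Optimisers"      # measurable impact, less extensive deployment
    if deep:
        return "Experimenters"   # broad activity, returns not yet measurable
    return "Laggards"
```

Workforce Transformation does not affect the quadrant here, which matches its role in the chart as dot size only.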

Evidence weighting and confidence thresholds

Every company carries a confidence rating that signals how much weight its scores should bear.

High: Three or more independent, recent (within 12 months) sources containing specific quantitative or operational data.
Medium: One or two sources, or sources that are indirect, dated, or qualitative rather than quantitative.
Low: No company-specific sources found; score inferred from sector benchmark positioning only.
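The three thresholds can be expressed as a small decision rule. This sketch simplifies each source to three yes/no flags, which is an assumption: in practice, judging independence or specificity takes more than a boolean, and the `Source` record is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    # Hypothetical flags summarising one company-specific source.
    independent: bool  # not a restatement of another counted source
    recent: bool       # published within the last 12 months
    specific: bool     # contains quantitative or operational data


def confidence(sources) -> str:
    """Return 'High', 'Medium', or 'Low' following the thresholds above."""
    if not sources:
        return "Low"  # nothing company-specific was found
    strong = sum(1 for s in sources if s.independent and s.recent and s.specific)
    return "High" if strong >= 3 else "Medium"
```

Note that many weak sources still yield Medium: only sources meeting all three criteria count toward High.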

Limitations and interpretive caveats

1. Benchmark granularity: All four institutional benchmarks publish sector-level data only; individual company scores are not available from them. Company positioning within a sector is inferred from live signal analysis, not directly measured.
2. Evidence asymmetry: Listed companies with active investor-relations programmes generate more signals than private firms or those with limited English-language digital presence, creating a systematic bias toward larger, more communicative organisations.
3. Temporal inconsistency: Live signals reflect the past 12 months while the benchmark sources span 2024 to 2026, so cross-source comparisons should account for this gap.
4. LLM non-determinism: Model outputs are probabilistic: two reports on identical inputs may differ by ±1–2 points. Reports should not be used to draw fine-grained distinctions between companies with similar scores.
5. Signal-volume constraint: Each report draws on at most 20 web signals per company, which is sufficient for orientation and hypothesis generation but not a comprehensive audit.
6. Self-reported data: Many signals originate from press releases, earnings calls, and newsrooms, which carry promotional bias. The model weights third-party and analyst sources more heavily but cannot fully eliminate it.
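The non-determinism caveat above can be turned into a simple guard when comparing two companies on one dimension. This is one possible heuristic, not a rule stated by the authors: it treats any gap of 2 points or less (the upper end of the stated ±1–2 variation) as within noise, and the function name and default are illustrative.

```python
RUN_TO_RUN_TOLERANCE = 2  # upper end of the stated ±1–2 point variation


def meaningfully_different(score_a: float, score_b: float,
                           tolerance: float = RUN_TO_RUN_TOLERANCE) -> bool:
    """True only when the gap between two scores exceeds the run-to-run tolerance."""
    return abs(score_a - score_b) > tolerance
```

Under this heuristic a 6 and a 7 are not treated as meaningfully different across companies, whereas a 3 and a 7 are. A stricter reader might double the tolerance, since both scores being compared are subject to the same variation.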

Reports are generated by Claude Sonnet 4.6 (Anthropic, 2025), a large language model with a knowledge cutoff of early 2025. Web signals are retrieved in real time via the Tavily Search API. Sector benchmark data is sourced from static datasets, most of them updated annually. This page describes benchmark.ai v1. The authors recommend treating outputs as structured research orientation rather than definitive competitive intelligence. All findings should be independently verified before informing strategic decisions.