How benchmarkai works

A precise account of the data, scoring rubric, and inference process behind every report — and an honest map of where the method is strong and where it is not.

Data sources and what they measure

Live web search (Tavily)

Tavily is a search API built for language models; it retrieves and ranks recent web content on demand. At report time benchmarkai queries it per company and collects up to 20 signals — news, earnings commentary, official announcements, and analyst coverage — from the trailing 12 months. It supplies the live, company-specific evidence that the static sector benchmarks cannot.

Limitation: Coverage skews toward English-language, digitally visible organisations, and retrieval reflects how a company is discussed publicly rather than what it does internally.

IMD Future Readiness Indicator (2026)

Published annually by IMD Business School, this index scores 67 economies across four pillars — technology, future readiness, connectivity, and knowledge — derived from enterprise survey data and public investment metrics. benchmarkai reads its sector-level readiness aggregates. Selected for its rigorous multi-pillar construction and consistent year-on-year methodology.

Limitation: Scores reflect national sector averages and cannot be disaggregated to individual company performance.

Stanford HAI AI Index (2025)

Stanford’s Institute for Human-Centered AI publishes this annual report synthesising AI investment, model performance, adoption, and labour-demand data from academic, governmental, and commercial datasets. benchmarkai uses its sector-level investment and adoption indicators. Selected for the breadth and independence of its underlying data.

Limitation: It measures aggregate ecosystem and sector trends, not the maturity of any individual firm.

McKinsey State of AI (2025)

McKinsey’s annual global executive survey reports AI and generative-AI adoption rates and self-assessed value capture by function and industry. It contributes a practitioner view of where AI is deployed and where it is reported to create value. Selected for its large recurring sample and functional granularity.

Limitation: Responses are self-reported and subject to selection and response bias; results describe industries, not named companies.

OECD AI and Work (2024)

The OECD’s work on AI and labour markets analyses adoption, automation exposure, and workforce impact across member economies using official statistics and employer surveys. It anchors the workforce-transformation dimension at sector level. Selected for its methodological independence and labour-market focus.

Limitation: Updated less frequently than the other sources, and reports occupational and sectoral exposure rather than firm-level workforce change.

Scoring dimensions and behavioural anchors

Each company is scored 0–10 on three dimensions. The anchors below define what distinguishes one band from the next, so a 6 and a 7 reflect materially different evidence rather than impression.

AI Integration Depth

0–2: No publicly disclosed AI initiatives beyond general technology investment.

3–4: Announced AI pilots or proof-of-concept deployments in isolated functions.

5–6: AI deployed across multiple business units with documented operational integration.

7–8: AI embedded in core product or service delivery with measurable process change.

9–10: AI is a stated strategic differentiator with documented competitive advantage and ongoing investment.

Quantified Business Impact

0–2: No quantified outcomes disclosed; AI value described only in aspirational terms.

3–4: Qualitative benefit claims (efficiency, experience) without supporting figures.

5–6: Partial metrics reported for specific use cases or functions.

7–8: Material gains attributed to AI in defined areas, reported in earnings or audited disclosures.

9–10: Sustained enterprise-level revenue, margin, or cost impact explicitly attributed to AI across periods.

Workforce Transformation

0–2: No disclosed change to roles, skills, or operating model arising from AI.

3–4: Isolated reskilling or hiring signals, or stated intent without evidence of execution.

5–6: Structured reskilling programmes or process redesign underway in some functions.

7–8: Documented role redesign and operating-model change across multiple functions.

9–10: Enterprise-wide workforce restructuring around AI with sustained investment and governance.
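The band structure shared by all three dimensions can be expressed as a small lookup. This is a minimal sketch, not benchmarkai's own code: it assumes scores are whole numbers, which the rubric implies but does not state, and the names `BANDS`, `band_for`, and `same_band` are hypothetical.

```python
# Score bands as listed in the rubric; all three dimensions share them.
BANDS = [(0, 2), (3, 4), (5, 6), (7, 8), (9, 10)]


def band_for(score: int) -> tuple[int, int]:
    """Return the (low, high) rubric band that contains a whole-number score."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    for low, high in BANDS:
        if low <= score <= high:
            return (low, high)
    # Reached only for non-integer input such as 2.5 (assumption: scores are integers).
    raise ValueError(f"score {score} does not fall in any band")


def same_band(a: int, b: int) -> bool:
    """True when two scores are backed by the same behavioural anchor."""
    return band_for(a) == band_for(b)
```

Under this reading, `same_band(6, 7)` is false because the two scores sit under different anchors, while `same_band(5, 6)` is true.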

From signals to scores

For each company the model receives up to 20 retrieved web signals alongside the relevant sector benchmark figures. It assigns a 0–10 score on each dimension by matching the available evidence to the behavioural anchors above, and is required to cite the specific signals that justify each score.

Scores reflect the weight and specificity of evidence, not its volume alone: a company with three corroborating earnings-call references scores higher than one supported by a single press release. The model is instructed to score conservatively where evidence is thin — defaulting to 3 or below when a dimension is unsupported — rather than inflate sparse signals. These scores are not validated against any ground truth; they are a structured, reproducible reading of public information, not a measurement.
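The conservative-scoring instruction lends itself to a simple post-hoc check on model output. The sketch below is illustrative only: the `DimensionScore` record and its field names are hypothetical, and it assumes "unsupported" means that no signals were cited for the dimension, which is one possible reading of the rule.

```python
from dataclasses import dataclass, field

# Ceiling stated in the text: 3 or below when a dimension is unsupported.
THIN_EVIDENCE_CEILING = 3


@dataclass
class DimensionScore:
    """Hypothetical record of one scored dimension for one company."""
    dimension: str
    score: int
    cited_signals: list[str] = field(default_factory=list)


def violates_conservative_rule(item: DimensionScore) -> bool:
    """Flag a score above the ceiling that cites no supporting signals."""
    # Assumption: "unsupported" is read as "no cited signals".
    return not item.cited_signals and item.score > THIN_EVIDENCE_CEILING
```

A score of 6 with an empty citation list would be flagged for review, while a score of 3 with no citations, or a 6 with at least one cited signal, would pass.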

Reading the AI Strategic Positioning Matrix

Each company is plotted on two fixed 0–10 axes — AI Integration Depth runs horizontally (left: Experimental → right: Embedded) and Quantified Business Impact runs vertically (bottom: Anecdotal → top: Measurable). Dot size encodes the third dimension, Workforce Transformation: small marks a score of 0–3, medium 4–6, and large 7–10. The quadrants are diagnostic rather than evaluative: a position describes a posture, not a verdict.

The grid below mirrors the matrix: the top row carries higher business impact, and the right column carries deeper integration.

Optimisers

Top-left: measurable impact from focused, less extensive deployment.

Leaders

Top-right: deep integration converting into measurable business impact.

Laggards

Bottom-left: limited integration and little quantified impact to date.

Experimenters

Bottom-right: broad AI activity not yet translated into measurable returns.
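The placement logic described above can be sketched as follows. The dot-size bands come directly from the text; the quadrant boundary is not stated anywhere in this methodology, so the midpoint of 5 used here is an assumption, as are the function names.

```python
# Assumption: quadrants split at the axis midpoint. The methodology does not
# state the boundary, so treat this value as a placeholder.
MIDPOINT = 5.0


def quadrant(integration: float, impact: float, midpoint: float = MIDPOINT) -> str:
    """Name the matrix quadrant for a pair of 0-10 scores."""
    deep = integration >= midpoint   # right column: deeper integration
    high = impact >= midpoint        # top row: higher business impact
    if high and deep:
        return "Leaders"
    if high:
        return "Optimisers"
    if deep:
        return "Experimenters"
    return "Laggards"


def dot_size(workforce: int) -> str:
    """Map a Workforce Transformation score to the dot size given in the text."""
    if workforce <= 3:
        return "small"
    if workforce <= 6:
        return "medium"
    return "large"
```

For example, a company scoring 8 on integration and 2 on impact lands in Experimenters, and a Workforce Transformation score of 7 draws a large dot.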

Evidence weighting and confidence thresholds

Every company carries a confidence rating that signals how much weight its scores should bear.

High: Three or more independent, recent (within 12 months) sources containing specific quantitative or operational data.

Medium: One or two sources, or sources that are indirect, dated, or qualitative rather than quantitative.

Low: No company-specific sources found — score inferred from sector benchmark positioning only.
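Read as a decision rule, the three thresholds might look like the sketch below. The `Source` record and its boolean flags are hypothetical; the methodology does not say how independence or data specificity is determined, so those judgements are taken as given inputs here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical summary of one company-specific source."""
    independent: bool        # not produced by the company itself
    months_old: int          # age of the source at report time
    has_specific_data: bool  # contains quantitative or operational detail


def confidence(sources: list[Source]) -> str:
    """Apply the High / Medium / Low thresholds to a company's sources."""
    if not sources:
        return "Low"  # no company-specific sources found
    strong = [
        s for s in sources
        if s.independent and s.months_old <= 12 and s.has_specific_data
    ]
    # Three or more strong sources qualify as High; anything else is Medium.
    return "High" if len(strong) >= 3 else "Medium"
```

Note that under this reading five dated or qualitative sources still yield Medium, because the count alone does not satisfy the High threshold.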

Reading the Signal Strength bars

The Signal Strength bars at the top of every report show, per company, how much independent public evidence the analysis found — a measure of how well-covered a company is, not of its AI maturity. A longer, greener bar means the report's read on that company rests on more, and more credible, sources; a short red bar means the company is thinly covered, so its scores should be treated as tentative.

Each bar's length is the number of distinct sources drawn on for that company, scaled relative to the best-covered company in the same report: the company with the most sources fills the bar, the others are shown in proportion, and the raw source count is printed alongside each bar. The bar's colour is that company's overall confidence rating (green High, amber Medium, red Low), assigned under the thresholds above. That rating weighs not just the number of sources but their recency, independence, and quality, so length and colour can diverge: a company supported by several weak or dated sources can still show a long amber or red bar. Because the scale is relative to the strongest-covered peer, Signal Strength compares evidence coverage within a single report and is not an absolute count across reports.
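The relative scaling of bar length amounts to dividing each company's source count by the largest count in the report. A minimal sketch, with a hypothetical function name and an assumed convention that every bar is empty when no company has any sources:

```python
def bar_fractions(source_counts: dict[str, int]) -> dict[str, float]:
    """Scale each company's source count against the best-covered company."""
    best = max(source_counts.values(), default=0)
    if best == 0:
        # Assumption: with no sources anywhere in the report, every bar is empty.
        return {name: 0.0 for name in source_counts}
    return {name: count / best for name, count in source_counts.items()}
```

With counts of 10, 5, and 0, the first company fills the bar, the second reaches half length, and the third is empty. The same company could show a different fraction in another report with a different peer set, which is why the bars are not comparable across reports.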

Limitations and interpretive caveats

1. Benchmark granularity: All four institutional benchmarks publish sector-level data only; individual company scores are not available from them. Company positioning within a sector is inferred from live signal analysis, not directly measured.
2. Evidence asymmetry: Listed companies with active investor-relations programmes generate more signals than private firms or those with limited English-language digital presence, creating a systematic bias toward larger, more communicative organisations.
3. Temporal inconsistency: Live signals reflect the past 12 months while the benchmark sources span 2024 to 2026, so cross-source comparisons should account for this gap.
4. LLM non-determinism: Model outputs are probabilistic: two reports on identical inputs may differ by ±1–2 points. Reports should not be used to draw fine-grained distinctions between companies with similar scores.
5. Signal-volume constraint: Each report draws on at most 20 web signals per company — sufficient for orientation and hypothesis generation, but not a comprehensive audit.
6. Self-reported data: Many signals originate from press releases, earnings calls, and newsrooms, which carry promotional bias. The model weights third-party and analyst sources more heavily but cannot fully eliminate it.
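The non-determinism caveat suggests a simple guard when comparing two companies on the same dimension. The sketch below is a rough heuristic rather than a statistical test: it assumes a gap is worth interpreting only when it exceeds the upper end of the stated 1–2 point run-to-run variation, and the names are hypothetical.

```python
# Upper end of the run-to-run variation stated in the non-determinism caveat.
MAX_RUN_TO_RUN_DRIFT = 2


def is_meaningful_gap(score_a: int, score_b: int,
                      tolerance: int = MAX_RUN_TO_RUN_DRIFT) -> bool:
    """True when two scores differ by more than the expected rerun variation."""
    return abs(score_a - score_b) > tolerance
```

Scores of 6 and 7 would not count as a meaningful gap under this rule, while 4 and 8 would. A stricter tolerance may be appropriate, since both scores being compared are subject to variation.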

Reports are generated by Claude Sonnet 4.6 (Anthropic, 2026), a large language model with a knowledge cutoff of early 2026. Web signals are retrieved in real time via the Tavily Search API. Sector benchmark data is sourced from static datasets updated annually. benchmarkai v1. The authors recommend treating outputs as structured research orientation rather than definitive competitive intelligence. All findings should be independently verified before informing strategic decisions.