
ComputeCosts.nl

arXiv-based soft sensor.

How it works

The arXiv archive functions as a soft sensor for how compute infrastructure is described in scientific writing.

Metadata and abstracts of recently published arXiv papers are collected deterministically and treated as indirect measurements of what researchers explicitly mention when summarising their computational work.

The archive is not intended to measure deployment adoption. It measures vocabulary signals in titles and abstracts, including references to cloud computing, self-hosted execution, local inference, operational constraints, and workstation-grade hardware such as consumer GPUs. In this sense the dataset acts as a sensor for infrastructure language within the research ecosystem.
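A minimal sketch of how such vocabulary signals could be detected is given here. The category names follow the text, but the regular expressions, the SIGNAL_PATTERNS table and the detect_signals function are illustrative assumptions, not the project's actual term list or code.

```python
import re

# Hypothetical vocabulary. The categories mirror the ones named in the text;
# the exact patterns are assumptions made for illustration only.
SIGNAL_PATTERNS = {
    # Bare "cloud" is avoided on purpose: it would also match "point cloud".
    "cloud": re.compile(
        r"\bcloud[- ]based\b|\bcloud (?:computing|infrastructure|providers?|services?)\b",
        re.IGNORECASE,
    ),
    "self_hosted": re.compile(r"\bself[- ]hosted\b", re.IGNORECASE),
    "local_inference": re.compile(r"\b(?:local|on[- ]device) inference\b", re.IGNORECASE),
    "consumer_gpu": re.compile(
        r"\b(?:RTX|GTX) ?\d{4}\b|\bconsumer(?:[- ]grade)? GPUs?\b", re.IGNORECASE
    ),
}


def detect_signals(title: str, abstract: str) -> set[str]:
    """Return the names of all signal categories found in a title and abstract."""
    text = f"{title}\n{abstract}"
    return {name for name, pattern in SIGNAL_PATTERNS.items() if pattern.search(text)}


if __name__ == "__main__":
    found = detect_signals(
        "Fast local inference on a single RTX 4090",
        "We avoid cloud computing by using a self-hosted setup.",
    )
    print(sorted(found))
```

Matching on fixed patterns keeps the measurement deterministic: the same title and abstract always yield the same set of categories, which is what allows the result to be treated as a sensor reading rather than an interpretation.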

The archive is accompanied by a complete local PDF collection for the full dataset. The present baseline focuses on abstract-level signals, but depending on the results those PDFs can be used for full-paper analysis, context reconstruction and evidence snippets, enabling more detailed conclusions about the circumstances under which local and cloud computing are performed.

Epoch 1 observations

Observations are collected deterministically, deduplicated into canonical records and stored in an append-only dataset. Dataset state hashes are anchored on the Algorand blockchain through the COSTS asset so that the archive can be independently verified. The public anchor can be inspected at https://explorer.perawallet.app/asset/3455286633/
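The deduplication and hashing step can be sketched as follows. The source does not specify the record schema, the serialisation or the hash function, so the arxiv_id key, the JSON canonical form and SHA-256 are all assumptions; anchoring the resulting hash on Algorand is outside the scope of this sketch.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialise a record deterministically (sorted keys, fixed separators)."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record per identifier and return them in sorted order.

    The 'arxiv_id' key is an assumed field name, not the project's schema.
    """
    seen: dict[str, dict] = {}
    for record in records:
        seen.setdefault(record["arxiv_id"], record)
    return [seen[key] for key in sorted(seen)]


def state_hash(records: list[dict]) -> str:
    """Hash the deduplicated dataset; the result is independent of input order."""
    digest = hashlib.sha256()
    for record in deduplicate(records):
        digest.update(hashlib.sha256(canonical_bytes(record)).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    observations = [
        {"arxiv_id": "2501.00002", "title": "B"},
        {"arxiv_id": "2501.00001", "title": "A"},
        {"title": "B", "arxiv_id": "2501.00002"},  # duplicate observation
    ]
    print(len(deduplicate(observations)), state_hash(observations))
```

Anyone holding a copy of the dataset can recompute such a hash and compare it with the publicly anchored value, which is the property that makes independent verification possible.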

The first observation window, epoch 1, spans 1 January 2025 00:00 UTC to 19 February 2026 and contains 6690 unique papers. The analysis collapses the entire period into a robust baseline snapshot rather than attempting to interpret short-term variation.

Abstracts already contain measurable operational signals, even though detailed deployment narratives are not typically expected in scientific papers. Mentions of self-hosted and local execution vocabulary indicate that local compute setups are explicitly reported in a measurable subset of the dataset. Hardware references to consumer GPUs, specifically NVIDIA RTX-class devices including the RTX 4090, demonstrate that workstation-grade hardware is used for scientific computing in published research.

One dominant signal in this dataset is privacy-related terminology. In this environment privacy language frequently refers to algorithmic frameworks such as differential privacy and privacy-preserving training, rather than to infrastructure-level concerns. This makes the signal informative but ambiguous, and it is therefore a strong candidate for deeper inspection using full-paper PDFs to distinguish meanings, find concrete operational contexts and extract evidence-level details.
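One possible way to separate the two meanings of privacy language is a cue-based classifier over individual sentences, sketched here. The cue lists and the classify_privacy_sentence function are hypothetical; the algorithmic cues come from the examples named in the text, while the infrastructure cues are assumptions chosen for illustration.

```python
# Cues for algorithmic privacy, taken from the examples in the text.
ALGORITHMIC_CUES = (
    "differential privacy",
    "differentially private",
    "privacy-preserving",
    "privacy preserving",
)

# Cues for infrastructure-level privacy. These are illustrative assumptions.
INFRASTRUCTURE_CUES = (
    "self-hosted",
    "self hosted",
    "on-premises",
    "on-premise",
    "on-device",
    "air-gapped",
)


def classify_privacy_sentence(sentence: str) -> str:
    """Label a sentence as 'algorithmic', 'infrastructure', 'ambiguous' or 'none'."""
    lowered = sentence.lower()
    if "privacy" not in lowered and "private" not in lowered:
        return "none"
    algorithmic = any(cue in lowered for cue in ALGORITHMIC_CUES)
    infrastructure = any(cue in lowered for cue in INFRASTRUCTURE_CUES)
    if algorithmic and not infrastructure:
        return "algorithmic"
    if infrastructure and not algorithmic:
        return "infrastructure"
    # Either both kinds of cue are present or neither is: leave it for review.
    return "ambiguous"


if __name__ == "__main__":
    for example in (
        "We train with differential privacy guarantees.",
        "For privacy reasons all inference runs on self-hosted servers.",
        "Privacy is a growing concern.",
    ):
        print(classify_privacy_sentence(example), "|", example)
```

Sentences labelled ambiguous are the ones where surrounding context from the full paper would be needed before drawing any conclusion.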

These abstract-level signals are sufficiently promising that a deeper analysis of the full PDFs is justified. If practices appear in an abstract, they are usually not marginal implementation details. Full-text analysis can therefore be expected to recover additional operational detail about compute environments, limitations, constraints and deployment choices that is not visible at abstract level.

Reporting model

To keep the analysis reproducible, the project publishes reports in discrete epochs. Each epoch freezes the dataset at a specific moment and produces a deterministic analysis snapshot that is anchored on-chain.
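A frozen epoch can be pictured as an immutable manifest derived from the records that fall inside the observation window. The sketch below is an assumption-laden illustration: the EpochSnapshot class, the freeze_epoch function, the arxiv_id and published field names, the half-open window and the use of SHA-256 are all hypothetical and not taken from the project's code.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EpochSnapshot:
    """Immutable description of one epoch (all field names are assumptions)."""
    epoch: int
    start: datetime
    end: datetime
    paper_count: int
    state_hash: str


def freeze_epoch(epoch: int, records: list[dict], start: datetime, end: datetime) -> EpochSnapshot:
    """Select records in the half-open window [start, end) and hash them.

    Each record is assumed to carry an 'arxiv_id' string and a timezone-aware
    'published' datetime. Sorting by identifier makes the result independent
    of the order in which records were collected.
    """
    selected = sorted(
        (r for r in records if start <= r["published"] < end),
        key=lambda r: r["arxiv_id"],
    )
    payload = json.dumps(
        [{"arxiv_id": r["arxiv_id"], "published": r["published"].isoformat()} for r in selected],
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return EpochSnapshot(epoch, start, end, len(selected), hashlib.sha256(payload).hexdigest())


if __name__ == "__main__":
    utc = timezone.utc
    # Placeholder identifiers and dates, not real dataset entries.
    demo = [
        {"arxiv_id": "b", "published": datetime(2025, 3, 1, tzinfo=utc)},
        {"arxiv_id": "a", "published": datetime(2025, 1, 1, tzinfo=utc)},
        {"arxiv_id": "c", "published": datetime(2024, 12, 31, tzinfo=utc)},
    ]
    print(freeze_epoch(1, demo, datetime(2025, 1, 1, tzinfo=utc), datetime(2025, 6, 1, tzinfo=utc)))
```

Because the snapshot is immutable and depends only on the selected records, re-running the freeze on the same data reproduces the same manifest, which is the behaviour a reproducible epoch report relies on.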

Epoch 1 is only the starting point. The initial signals from the arXiv archive are strong enough to justify a rapid sequence of follow-up analyses on the same epoch 1 dataset, including deeper inspection of the full PDF collection and additional infrastructure measurements.

In parallel, the agents continue mining new material, so later epochs will appear naturally over time and will form a year-over-year record of how compute infrastructure language evolves.

Epoch 1 report

The full methodology, dataset description and analysis code are documented in the epoch 1 report.

Download the report: computecosts_observatory_report_002.pdf