The compute arms race has entered a new phase where infrastructure control determines market position, and the companies that can lock in power, compress hardware deployment timelines, and absorb regulatory complexity are already separating from those that cannot. Meta's commitment to ten natural gas plants powering its Hyperion data center and Cognichip's $60 million raise on claims of 75% cost reduction and 50% timeline compression in chip design are not marginal improvements but structural bets that the next cycle belongs to whoever can compress the cost and time to deploy compute at scale. Poolside's stalled Texas project and forced negotiations with Google and other cloud providers make the calculus explicit: independent infrastructure plays are losing leverage to companies that can absorb the capital requirements and regulatory complexity of massive data center deployments. The economics have shifted decisively.
This consolidation around compute is matched by a bifurcation in how companies are deploying AI. OpenAI's Gradient Labs is shipping concrete product: AI account managers in banking that use GPT-4.1 and purpose-built smaller models for latency-sensitive workflows. This is the kind of deployment that generates revenue and defensible customer relationships. GitHub's /fleet feature for parallel agent dispatch and Hugging Face's Holo3 announcement signal the same momentum: builders are moving past single-agent orchestration into systems that distribute work across multiple models and processes. By contrast, IBM and AMD are playing a different game, announcing FedRAMP authorizations, decade-long research initiatives, and detailed MLPerf submissions that establish them as trusted intermediaries in regulated environments and foundational research. Consumer-facing applications are racing ahead through product velocity and direct customer relationships. Infrastructure players are establishing credibility through partnerships, compliance achievements, and benchmark credentials.
The technical frontier has shifted from model architecture to the systems that deploy and secure them, and that shift exposes unresolved tensions. Anthropic's accidental mass takedown of GitHub repositories containing leaked Claude Code source, followed by the discovery that Claude Code can uncover zero-day exploits in Vim and GNU Emacs in seconds, reveals that the same tool making security research trivial also makes security risk trivial to create. Anthropic's response was operational damage control, not a technical solution. Meanwhile, Slack's repositioning of Slackbot as an orchestration layer for agentic workflows, Asana's emphasis on multiplayer AI agents, and Meta's semi-formal reasoning technique for code review achieving 93% accuracy show enterprises moving past chatbots toward systems that make decisions and execute tasks across multiple applications and teams. What remains unresolved is whether the tools that make this possible can be secured, and whether the companies deploying them have any meaningful way to audit what those agents are actually doing once they are turned loose on company infrastructure.
Benchmark volatility and developer momentum toward agentic tools underscore the stakes. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, a 12.3-point gain from its prior ranking, yet the two major benchmarks diverge materially in their orderings, implying that they capture different failure modes in code generation or apply stricter evaluation criteria around execution correctness. Reasoning-focused variants like Kimi K2.5 are closing gaps on general-purpose models in software engineering tasks. On GitHub, Claude Code and related agentic tooling have moved from research artifact to daily driver, while infrastructure projects like Pixeltable, Rerun, and Qdrant represent a category shift toward unsexy but essential data plumbing for multimodal pipelines. Simultaneously, projects like PicoLLM signal a countercurrent toward running models locally without cloud dependencies. The field is scaling inference infrastructure while constraining it, building both the pipes and the guardrails.
Grant Calloway
Traditional cryptography, rooted in problems such as integer factorisation or the discrete logarithm, is inevitably vulnerable to a fully operational quantum computer. Although such a computer remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, modern quantum-safe cryptography relies on the Shortest Vector Problem (SVP) as its backbone. We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating domain-informed SVP representation and crossover while naturally extending application to module lattices.
This paper presents a survey and taxonomy of LLM fingerprinting and watermarking for identity, ownership verification, provenance, and generated-content attribution. Large language models (LLMs) require substantial investments in data, computation, and expertise, and are increasingly deployed in high-stakes settings, making it critical to protect LLM-related assets and trace their origins. Existing work has rapidly expanded across dataset provenance, model ownership, and generated-content detection, but the field remains fragmented: fingerprinting and watermarking are often used inconsistently, and methods are typically studied within isolated asset-specific settings. To address this gap, we introduce implicit identity as a unifying abstraction for verifiable but not directly observable identity signals in LLM systems. We distinguish fingerprinting as non-intrusive identity derived from intrinsic characteristics, and watermarking as intrusive identity deliberately embedded into data, models, or generated content. We then propose a lifecycle-based taxonomy that organises techniques across datasets, models, and generated content, and further separates them by verification semantics: similarity-based attribution and keyed verification. Finally, we establish an evaluation framework centred on identifiability, robustness, and deployability, summarising representative metrics under realistic access and transformation regimes. By unifying terminology, lifecycle stages, and evaluation objectives, this survey provides a structured foundation for studying LLM identity technologies and for developing more reliable mechanisms for asset protection and provenance.
LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce Neutral Prompting Attack (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both Hallucination ASR and Pip Install ASR, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.
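To make the "post-hoc package validation" idea mentioned in this abstract concrete, here is a minimal Python sketch, not the paper's method or any of the defenses it evaluates. It assumes a hypothetical vetted allowlist (`KNOWN_PACKAGES`) and a hypothetical hallucinated name (`fastjsonix`), and it only handles simple `pip install name[==version]` commands; flags that take arguments (such as `-r requirements.txt`) are not parsed. The abstract reports that NPA evades existing defenses, so a check like this should be read as illustrating the problem rather than solving it.

```python
import re
import shlex
from typing import Iterable, List

# Hypothetical allowlist; in practice this might come from a vetted internal index.
KNOWN_PACKAGES = {"requests", "numpy", "pandas"}


def normalize(name: str) -> str:
    """Normalize a package name the way PEP 503 does (runs of -, _, . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def extract_packages(command: str) -> List[str]:
    """Pull requirement names out of a simple `pip install ...` command."""
    tokens = shlex.split(command)
    if "install" not in tokens:
        return []
    names = []
    for tok in tokens[tokens.index("install") + 1:]:
        if tok.startswith("-"):
            continue  # skip flags such as --upgrade (flag arguments are not handled)
        # Drop version specifiers, extras, and environment markers.
        name = re.split(r"[<>=!~\[;]", tok, maxsplit=1)[0]
        if name:
            names.append(normalize(name))
    return names


def unknown_packages(command: str, known: Iterable[str]) -> List[str]:
    """Return the requested packages that are not on the allowlist."""
    allowed = {normalize(k) for k in known}
    return [p for p in extract_packages(command) if p not in allowed]


if __name__ == "__main__":
    cmd = "pip install --upgrade requests==2.31.0 fastjsonix"
    print(unknown_packages(cmd, KNOWN_PACKAGES))
```

An allowlist is deliberately stricter than asking a public registry whether a name exists: once an attacker registers a hallucinated name, an existence check passes.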
Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
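The multi-candidate alignment idea in this abstract can be illustrated with a small Python sketch. This is not AliMark's actual algorithm: it assumes, purely for illustration, that alignment cost is plain edit distance between bit strings, that each restructured text variant has already been reduced to an extracted bit string, and that a hypothetical `max_cost_ratio` threshold decides detection.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two bit strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_alignment(candidates: Sequence[str], secret: str) -> Tuple[int, int]:
    """Return (index, cost) of the candidate closest to the secret bit sequence.

    `candidates` must be non-empty.
    """
    costs = [edit_distance(c, secret) for c in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


def is_watermarked(candidates: Sequence[str], secret: str,
                   max_cost_ratio: float = 0.25) -> bool:
    """Flag the text if any variant aligns within the (assumed) cost budget."""
    _, cost = best_alignment(candidates, secret)
    return cost <= max_cost_ratio * len(secret)


if __name__ == "__main__":
    secret = "10110010"
    # One variant lost a bit (as a sentence merge might cause); one is unrelated.
    variants = ["0000", "1011010"]
    print(best_alignment(variants, secret))
    print(is_watermarked(variants, secret))
```

Taking the minimum over several variants is what gives tolerance to splits and merges in this toy version: a single restructuring that restores the original sentence boundaries is enough to bring the cost down.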
Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of research (RCR) norms or help undermine them. We introduce SciIntBench, an adversarial benchmark of 810 prompts across ten RCR categories and three scientific domains. Each scenario appears as an Overt Adversarial, Covert Adversarial, and Benign version, allowing us to jointly measure framing-sensitive refusal of misconduct and helpfulness on legitimate requests. We evaluate 16 commercial and open-weight LLMs from six providers (2024-2026), producing 12,960 responses. We find that scientific integrity alignment is strongly framing-sensitive: models refuse explicit misconduct far more reliably than covert violations, especially failing when misconduct is presented as a pressure-driven shortcut. Refusals vary by RCR category, with weaker boundaries around transparency, plagiarism, and fabrication.
Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a claimed endpoint is actually serving the advertised model. We introduce KBF, a low-cost black-box auditing protocol that fingerprints model APIs using stable numerical recall near the knowledge boundary. Across 16 production LLM endpoints, KBF flags all 155 economically relevant substitutions without rejecting any same-model controls, remains stable under deployment variation, detects high-separation mixed-routing attacks when only 5-10% of traffic is substituted, and finds that 7 of 27 platform model cells in a six-platform shadow API audit are statistically inconsistent with their reference endpoints, with inconsistencies concentrated on premium Claude endpoints.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 75 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 117 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 67 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 54 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.
Open-Source Frontier Voice AI
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.
A visual, example-driven guide to Claude Code — from basic concepts to advanced agents, with copy-paste templates that bring immediate value.
Promise based HTTP client for the browser and node.js
Data Infrastructure providing a declarative, incremental approach for multimodal AI workloads.
A lightweight, developer-focused database management tool. Supports MySQL, PostgreSQL and SQLite. Hackable with plugins. Built for speed, security, and aesthetics.
Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc)
On-device LLM Inference Powered by X-Bit Quantization
Synthetic Patient Population Simulator