    Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

    By Emily Turner, November 8, 2025

    The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

    The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

    With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

    Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

    “Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

    Higher Bar, Cleaner Data

    Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

    However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

    Version 2.0 addresses these issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

    A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

    “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

    Harbor: Unified Rollouts at Scale

    Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

    Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

    Designed to generalize across agent architectures, Harbor supports:

    • Evaluation of any container-installable agent

    • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

    • Custom benchmark creation and deployment

    • Full integration with Terminal-Bench 2.0

    Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It’s now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

    Early Results: GPT-5 Leads in Task Success

    Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

    Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

    Top 5 Agent Results (Terminal-Bench 2.0):

    1. Codex CLI (GPT-5) — 49.6%

    2. Codex CLI (GPT-5-Codex) — 44.3%

    3. OpenHands (GPT-5) — 43.8%

    4. Terminus 2 (GPT-5-Codex) — 43.4%

    5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

    The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
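
    To make the leaderboard arithmetic concrete, the short Python sketch below takes the five scores listed above and the 89-task count and computes the gap between first and fifth place, along with a rough task-count equivalent for each score. The names used here are illustrative, and the task counts are estimates only: the article does not say how scores are aggregated over attempts, so multiplying a rate by 89 is an assumption.

```python
# Scores as reported in the article (percent of tasks solved).
LEADERBOARD = [
    ("Codex CLI (GPT-5)", 49.6),
    ("Codex CLI (GPT-5-Codex)", 44.3),
    ("OpenHands (GPT-5)", 43.8),
    ("Terminus 2 (GPT-5-Codex)", 43.4),
    ("Terminus 2 (Claude Sonnet 4.5)", 42.8),
]
TOTAL_TASKS = 89  # task count of Terminal-Bench 2.0, per the article


def spread(entries):
    """Percentage-point gap between the best and worst listed score."""
    scores = [score for _, score in entries]
    return round(max(scores) - min(scores), 1)


def approx_tasks_solved(score_percent, total=TOTAL_TASKS):
    """Rough estimate only: assumes the rate maps directly onto the task count."""
    return score_percent / 100 * total


if __name__ == "__main__":
    print(f"Top-5 spread: {spread(LEADERBOARD)} percentage points")
    for name, score in LEADERBOARD:
        print(f"{name}: ~{approx_tasks_solved(score):.1f} of {TOTAL_TASKS} tasks")
```

    Under that assumption, even the leading score works out to roughly 44 of 89 tasks, which is consistent with the observation that no agent clears the halfway mark.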

    Submission and Use

    To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
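
    For scripted runs, the same invocation can be assembled in Python. The sketch below only builds the argument list from the flags shown in the command above (-d, -m, -a, --n-attempts, --jobs-dir). The helper name build_harbor_command and the placeholder values are hypothetical, and actually executing the command assumes the harbor CLI is installed and on the PATH.

```python
import shlex
import subprocess


def build_harbor_command(model, agent, jobs_dir, n_attempts=5,
                         dataset="terminal-bench@2.0"):
    """Return the argv list for the `harbor run` command quoted in the article.

    Hypothetical helper: only the flags come from the article; the function
    name and parameters are illustrative.
    """
    return [
        "harbor", "run",
        "-d", dataset,
        "-m", model,
        "-a", agent,
        "--n-attempts", str(n_attempts),
        "--jobs-dir", str(jobs_dir),
    ]


if __name__ == "__main__":
    # Placeholder values; substitute a real model and agent identifier.
    cmd = build_harbor_command("<model>", "<agent>", "path/to/output")
    print(shlex.join(cmd))  # shell-safe rendering for logs
    # Uncomment to run for real (requires Harbor to be installed):
    # subprocess.run(cmd, check=True)
```

    Passing the arguments as a list rather than a single string avoids shell quoting problems when a model or agent name contains spaces or special characters.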

    Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

    Aiming for Standardization

    The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

    These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
