Woolworths trumpeted eight consecutive quarters of price declines. Here’s why that claim doesn’t pass the pub test

February 11, 2026 · Xu Li · Source: tutorial热线

Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

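Even the grandest narrative cannot withstand one offhand sigh inside a direct-sales showroom: "The car is nice, but the dealer next door is throwing in lifetime maintenance with every order placed today."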

Block or Guide? Four Questions on Internet Addiction Among Minors

Trump announces the launch of the first US oil refinery in half a century (08:51)

Now one of the world’s leading producers of interceptors, Ukraine is offering that expertise to the United States and its Gulf partners for the war in the Middle East, hoping to receive in return the high-end weaponry it can’t manufacture at home.

San Francisco, CA