arXiv:2603.04408v1 Announce Type: new Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, jointly ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced …
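As an illustration of the gap the abstract describes, a toy sketch (hypothetical data, not from the paper) shows how two models with identical aggregate accuracy can behave entirely differently at the item level, which a single overall score cannot capture:

```python
# Toy illustration (hypothetical data): per-item correctness vectors
# for two models over the same eight benchmark items.
model_a = [1, 1, 1, 1, 0, 0, 0, 0]  # correct on the first half
model_b = [0, 0, 0, 0, 1, 1, 1, 1]  # correct on the second half

# Aggregate accuracy is identical for both models.
acc_a = sum(model_a) / len(model_a)
acc_b = sum(model_b) / len(model_b)
print(acc_a, acc_b)  # both 0.5

# Item-level comparison reveals the models never agree on any item,
# a distinction that the aggregate score erases.
agreement = sum(a == b for a, b in zip(model_a, model_b)) / len(model_a)
print(agreement)  # 0.0
```

This is the sense in which accuracy alone yields a "coarse description": it collapses a population of item-level behaviors into one number.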
This paper introduces a conceptual framework, treating LLMs as composed of memes, to address the coarse-description problem in LLM benchmarking, offering a finer-grained, scalable alternative to summarizing models by aggregate scores alone.