Author: Vladisav Jovanović
Status: Preprint
Version: Latest archived (Feb 2026)
Abstract: Large language models can produce responses that are coherent, persuasive, and stylistically appropriate even when they are weakly grounded, poorly constrained, or factually unreliable. This creates a persistent evaluation problem: fluency is easy to mistake for quality. Existing evaluation practices often emphasize correctness, harmlessness, user preference, or instruction-following, but they do not always capture a deeper distinction between outputs that merely sound complete and outputs that remain answerable to evidence, uncertainty, correction, and practical consequence. This paper proposes a human-centered framework for evaluating LLM outputs beyond fluency, organized around three primary dimensions: grounding, answerability, and reliability. The goal is not to replace existing benchmarks but to add a missing evaluative layer: an account of what makes an AI response not merely readable, but responsibly usable.
Keywords: artificial intelligence; large language models; LLM evaluation; grounding; answerability; reliability; hallucination; trustworthy AI; human–AI interaction; AI ethics; epistemology