Navigating Goodhart’s Law: A Balanced Approach to Evaluating LLM Outputs

Last Updated October 2, 2024

Use dynamic, human-centered frameworks for evaluating LLM outputs, instead of narrowly optimizing based on limited metrics. 

Written by Summa Linguae CTO Gert Van Assche

Goodhart’s Law states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” In plainer terms: when a measure becomes a target, it ceases to be a good measure.

Applied to the evaluation of LLM (Large Language Model) outputs, this principle suggests that if you optimize solely for specific evaluation metrics or benchmarks, the quality of the outputs could degrade in unintended ways.

To avoid falling into this trap when evaluating LLM outputs, here are a few considerations to keep in mind.

Avoid Over-Reliance on a Single Metric

If you focus solely on metrics like BLEU, TER, or BERTScore, or even just perplexity, models may overfit to those metrics without truly improving in understanding (input) or utility (output).

  • Use a combination of metrics (e.g., fluency, coherence, factual accuracy, user satisfaction).
  • Incorporate human evaluation to assess nuanced aspects like tone or appropriateness, which automated metrics may not capture.
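
To make the first point concrete, here is a minimal sketch of scoring the same outputs with several automatic metrics side by side instead of leaning on one number. It assumes the sacrebleu and bert-score Python packages are installed; the example sentences are illustrative only.

```python
# Minimal sketch: score candidate outputs against references with several
# automatic metrics rather than relying on any single one.
# Assumes the sacrebleu and bert-score packages are installed.
import sacrebleu
from bert_score import score as bert_score

def multi_metric_eval(candidates, references):
    # Corpus-level BLEU and TER from sacrebleu
    bleu = sacrebleu.corpus_bleu(candidates, [references]).score
    ter = sacrebleu.corpus_ter(candidates, [references]).score
    # BERTScore F1, averaged over the corpus (scaled to 0-100 for readability)
    _, _, f1 = bert_score(candidates, references, lang="en")
    bertscore_f1 = f1.mean().item() * 100
    return {"bleu": bleu, "ter": ter, "bertscore_f1": bertscore_f1}

scores = multi_metric_eval(
    ["The cat sat on the mat."],
    ["A cat was sitting on the mat."],
)
print(scores)  # Report the metrics side by side; don't collapse them blindly.
```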

Maintain Human-Centric Evaluation

Optimizing LLM outputs for measurable goals (e.g., speed, simplicity) may sacrifice other important qualities such as creativity, adaptability, or usefulness.

  • Regularly update evaluation frameworks based on human feedback.
  • Prioritize real-world user satisfaction over abstract measures of success, ensuring the system meets the needs of its end-users in practical applications.
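
A lightweight way to fold human judgment into the loop is to collect ratings on a few human-centric dimensions and watch where annotators disagree. The dimensions and the 1-to-5 scale in the sketch below are assumptions for illustration:

```python
# Minimal sketch of aggregating human ratings alongside automatic metrics.
# The dimension names and the 1-5 rating scale are illustrative assumptions.
from statistics import mean, stdev

# Each list holds one annotator rating per element for a single model output.
ratings = {
    "fluency":         [5, 4, 5],
    "tone":            [3, 4, 3],
    "appropriateness": [4, 4, 5],
}

for dimension, scores in ratings.items():
    spread = stdev(scores)  # high spread = annotators disagree; revisit the guideline
    print(f"{dimension}: mean={mean(scores):.2f}, spread={spread:.2f}")
```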

Monitor for Gaming of the System

LLMs can learn to “game” specific evaluation metrics if those metrics are optimized too narrowly.

  • Advise customers to introduce adversarial tests where the model is evaluated in unexpected contexts or asked novel questions.
  • Track model robustness by varying inputs and assessing how the model handles different situations.
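
One simple robustness probe is to ask the same question in several different ways and check whether the answers stay consistent. In the sketch below, query_model is a hypothetical stand-in for whatever LLM client you actually use:

```python
# Minimal robustness probe: ask semantically equivalent variants of the same
# question and check whether the model's answers stay consistent.

def query_model(prompt: str) -> str:
    # Hypothetical stub; replace with a call to your actual LLM client.
    return "1969"

variants = [
    "What year did the Apollo 11 mission land on the Moon?",
    "In which year did Apollo 11 touch down on the lunar surface?",
    "apollo 11 moon landing year??",   # noisy, informal phrasing
]

answers = [query_model(v) for v in variants]
consistent = len({a.strip().lower() for a in answers}) == 1
print("Consistent across variants:", consistent)
```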

Keep the Goals Fluid

Static targets invite collapse of performance due to over-optimization.

  • Make sure evaluation criteria evolve over time as new challenges, user needs, or technologies emerge.
  • Use contextual evaluations that account for diverse use cases, avoiding one-size-fits-all solutions.

This requires our project managers and linguists to stay current, and they consistently do.
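
One way to keep criteria fluid without losing track of them is to treat the evaluation rubric as an explicit, versioned artifact that can be revised as use cases and user needs change. The field names and weights below are purely illustrative:

```python
# Minimal sketch of a versioned evaluation rubric that can evolve over time.
# Dimension names, weights, and use cases are illustrative assumptions.
EVAL_CRITERIA = {
    "version": "2024-10",
    "dimensions": ["fluency", "factual_accuracy", "tone", "user_satisfaction"],
    "weights": {"fluency": 0.2, "factual_accuracy": 0.4,
                "tone": 0.2, "user_satisfaction": 0.2},
    "use_cases": ["customer_support", "marketing_copy"],
}

def weighted_score(scores: dict) -> float:
    # Combine per-dimension scores according to the current rubric version.
    w = EVAL_CRITERIA["weights"]
    return sum(scores[d] * w[d] for d in EVAL_CRITERIA["dimensions"])

print(weighted_score({"fluency": 4.5, "factual_accuracy": 4.0,
                      "tone": 4.2, "user_satisfaction": 3.8}))
```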

Encourage a Broader Scope of Generalization

Targeting highly specific outcomes can limit a model’s generality.

  • Make sure models are tested on out-of-distribution data, not just data similar to the training set. Back in the SMT/NMT days this was much easier than it is today: you now need to fuzzy-compare test and training data, and given the size of modern training sets, that isn’t easy.
  • Evaluate models for general adaptability rather than relying only on narrow benchmarks. Narrow benchmarks are important, but they should never be the only measure.
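
A rough but useful fuzzy-comparison approach is to flag test items whose character n-gram overlap with the training data is suspiciously high. The 5-gram size and 0.6 threshold below are illustrative assumptions; for very large corpora you would want an approximate method such as MinHash rather than this brute-force loop.

```python
# Minimal sketch of a fuzzy overlap check between test and training data,
# used to flag test items that sit too close to the training set.
# The n-gram size and threshold are illustrative assumptions.

def char_ngrams(text: str, n: int = 5) -> set:
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def flag_near_duplicates(test_items, train_items, threshold=0.6):
    train_grams = [char_ngrams(t) for t in train_items]
    flagged = []
    for item in test_items:
        grams = char_ngrams(item)
        if any(jaccard(grams, tg) >= threshold for tg in train_grams):
            flagged.append(item)
    return flagged  # review or replace these items before trusting benchmark scores
```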

Incorporate Long-Term Monitoring

Immediate metrics may look good, but it is just as important to evaluate over the longer term whether the LLM remains consistently useful or starts to exhibit regressions.

  • Continuously track performance in live environments.
  • Measure downstream effects on tasks or users over time, not just short-term goals.

Employ feedback loops and performance-tracking systems that help ensure models stay effective over time, even as new data and use cases emerge.
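
As a sketch of what such a feedback loop might look like, the snippet below logs scores from live evaluations and flags drops against a rolling baseline. The window size and tolerance are illustrative assumptions:

```python
# Minimal sketch of long-term tracking: record scores from live evaluations
# and flag regressions against a rolling baseline.
from collections import deque
from statistics import mean

class ScoreTracker:
    def __init__(self, window: int = 30, tolerance: float = 0.05):
        self.history = deque(maxlen=window)  # rolling window of recent scores
        self.tolerance = tolerance           # allowed drop below the baseline

    def add(self, score: float) -> bool:
        """Record a new score; return True if it falls below the rolling baseline."""
        regression = bool(self.history) and score < mean(self.history) * (1 - self.tolerance)
        self.history.append(score)
        return regression

tracker = ScoreTracker()
for daily_score in [0.82, 0.83, 0.81, 0.84, 0.70]:  # e.g., daily quality ratings
    if tracker.add(daily_score):
        print("Possible regression detected:", daily_score)
```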

In essence, to account for Goodhart’s Law, use multi-faceted, evolving, and human-centered evaluation frameworks, rather than narrowly focusing on optimizing LLM outputs based on a few measurable targets. As the LLMs evolve, so should the methods you use to measure their performance, robustness, and safety.

Ensuring the lasting performance of LLMs, or the systems in which they are used, requires us to be as adaptable as the models our customers develop.
