Evaluate AI

Evaluating Model Performance

[Chart: Evaluation criteria for models in use — Reliability 68%, Performance 67%, Security 62%, Safety 54%, N/A 6%]

As foundation models grow in capability and impact, comprehensive model evaluation has become paramount, whether you are building or applying models. In contrast to common headlines, assessing foundation models is not just about safety. In fact, performance, reliability, and security were indicated as the top three reasons survey respondents evaluate models, with safety ranking as a lower priority.

Despite this focus on evaluation, developing robust evaluation frameworks is an evolving challenge. Models must be assessed holistically, accounting for performance on real-world use cases as well as potential risks. Traditional academic benchmarks are generally not representative of production scenarios, and models have been overfitted to these existing benchmarks due to their presence in the public domain. Leading organizations are moving towards comprehensive private test suites that probe model behavior across diverse domains and capabilities. Universally agreed-upon third-party benchmarks are crucial for objectively evaluating and comparing the performance of large language models. Researchers, developers, and users can select models based on standardized, transparent metrics.
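As an illustration of the private test suites described above, a minimal harness might run a model over test cases grouped by domain and report per-domain accuracy. This is only a sketch: the model function, the test cases, and the substring-match pass criterion are hypothetical stand-ins for an organization's own domains and scoring rules.

```python
# Minimal sketch of a private evaluation suite (illustrative only).
# The model function, test cases, and pass criterion are hypothetical.

def evaluate(model_fn, suite):
    """Run each test case through the model and score per domain."""
    scores = {}
    for case in suite:
        domain = case["domain"]
        # Simplistic pass criterion: expected answer appears in the output.
        passed = case["expected"] in model_fn(case["prompt"])
        total, correct = scores.get(domain, (0, 0))
        scores[domain] = (total + 1, correct + int(passed))
    # Convert per-domain tallies into accuracy scores.
    return {d: correct / total for d, (total, correct) in scores.items()}

# Hypothetical stand-in for a deployed model.
def toy_model(prompt):
    return "Paris" if "capital of France" in prompt else "unsure"

suite = [
    {"domain": "geography",
     "prompt": "What is the capital of France?", "expected": "Paris"},
    {"domain": "math",
     "prompt": "What is 2 + 2?", "expected": "4"},
]

print(evaluate(toy_model, suite))  # → {'geography': 1.0, 'math': 0.0}
```

In practice the pass criterion would be richer (exact match, model-graded rubrics, or task-specific checks), but the structure, a suite of domain-tagged cases scored into per-domain metrics, stays the same.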
