From MMLU to GLUE, the AI world suffers no dearth of LLM benchmarks. These tools are designed to rigorously evaluate AI models like GPT-4 and Claude and determine which generates more accurate outputs for a given task. Typically, that task is fairly narrow, like solving grade-school math problems or writing Python code. While these tests yield valuable performance metrics used to rank LLMs, they are not particularly illuminating for business users who simply need to know whether an AI tool can handle real-world, day-to-day work.
At Salesforce AI Research, we recognized this shortfall as a serious obstacle for businesses navigating their adoption of enterprise AI. To bridge this critical gap, we collaborated with the AI Frontier team, led by Clara Shih, to develop the world's first LLM benchmark purpose-built for generative AI applications in CRM. Simply put, this benchmark is a first for the industry.
Read the full article on the Salesforce.org blog.