These innovations are fueling AI breakthroughs that tackle today’s most pressing challenges for CIOs and IT leaders.
From improving data quality to setting new standards in measuring agentic performance, Salesforce AI Research is giving businesses the trust and tools they need to evolve into agentic enterprises – organizations that embrace digital labor and use AI to work alongside humans.
Simulating enterprise environments with CRMArena-Pro
Pilots don't learn to fly in a storm; they train in flight simulators that push them to prepare for the most extreme challenges.
Similarly, surgeons test their skills in high-risk procedures on synthetic models and cadavers before ever operating on a human, and athletes perfect their plays in drills and scrimmages ahead of a big game.
In every high-stakes field, skill and consistency are honed not through live action but through deliberate preparation in a space where failure is a learning tool, not a costly mistake.
AI agents also benefit from simulation testing and training, preparing them to handle the unpredictability of daily business scenarios before they’re deployed.
Building on the original CRMArena, which focused on single-turn B2C service tasks, Salesforce AI Research launched CRMArena-Pro, which tests agent performance in complex, multi-turn, multi-agent scenarios, such as sales forecasting, service case triage, and configure, price, quote (CPQ) processes.
CRMArena-Pro provides a rigorous, context-rich simulated enterprise environment built on synthetic data, where it can safely evaluate an agent’s API calls to relevant systems and its ability to safeguard personally identifiable information (PII).
Here, businesses can test an agent’s accuracy, efficiency, and consistency at scale across enterprise-specific use cases.
Acting much like digital twins of a business, these environments go beyond simple test beds to capture the full complexity of enterprise operations.
Salesforce AI Research is advancing AI agent training with these simulations, enabling businesses to test agents in scenarios such as customer service escalations or supply chain disruptions before the agents go live.
By incorporating real-world “noise” into the test environment, enterprises can better evaluate performance, strengthen resilience against edge cases, and bridge the gap between training and live operations.
The result is AI agents that are capable, consistent, trustworthy, and ready for the agentic enterprise.
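To make this concrete, here is a minimal sketch of what a simulation-based evaluation loop can look like. The scenario schema, noise injection, and PII check below are illustrative assumptions, not CRMArena-Pro’s actual interfaces, and any string-to-string callable can stand in for the agent.

```python
from dataclasses import dataclass, field
import random

@dataclass
class Scenario:
    """One synthetic test case (hypothetical schema)."""
    prompt: str
    expected_outcome: str                            # substring a correct reply should contain
    seeded_pii: list = field(default_factory=list)   # strings that must never appear in a reply

def add_noise(text: str, rng: random.Random) -> str:
    """Inject real-world 'noise': random casing and truncation."""
    if rng.random() < 0.3:
        text = text.upper()
    if rng.random() < 0.3:
        text = text[: max(1, len(text) - 3)]
    return text

def evaluate(agent, scenarios, seed: int = 0) -> dict:
    """Run the agent on every noisy scenario; score accuracy and count PII leaks."""
    rng = random.Random(seed)
    passed = leaks = 0
    for s in scenarios:
        reply = agent(add_noise(s.prompt, rng))      # agent: any str -> str callable
        passed += s.expected_outcome.lower() in reply.lower()
        leaks += any(pii in reply for pii in s.seeded_pii)
    return {"accuracy": passed / len(scenarios), "pii_leaks": leaks}

# A trivial stand-in agent evaluated on one synthetic service scenario.
scenarios = [Scenario("Customer 555-0100 asks for their refund status.",
                      expected_outcome="refund", seeded_pii=["555-0100"])]
print(evaluate(lambda prompt: "Your refund is processing.", scenarios))
# -> {'accuracy': 1.0, 'pii_leaks': 0}
```

Seeding each scenario with PII that must never surface in a response turns privacy from an abstract requirement into an automatically checkable assertion, and the injected noise mimics the messy inputs agents face in live operations.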
Measuring agent readiness with the agentic benchmark for CRM
With new AI models and updates emerging daily, enterprises face a growing dilemma of which model — or combination of models — is best suited to help power agents in real-world business settings.
The answer can’t come from hype cycles or raw size alone; it requires a rigorous way to measure how agents perform within specific business workflows.
This imperative led Salesforce to introduce the new Agentic Benchmark for CRM, the first benchmarking tool designed to evaluate AI agents not on generic capabilities, but in the contexts that matter most to businesses, including customer service, field service, marketing, and sales.
The benchmark measures agents across five essential enterprise metrics — accuracy, cost, speed, trust and safety, and sustainability — which together build a comprehensive, data-driven assessment of their readiness for real-world deployment.
Sustainability, the newest metric in the agentic measurement tool, is especially important to track.
This measure highlights the relative environmental impact of AI systems, which can demand significant computational resources.
By aligning model size with the level of intelligence a task actually requires, businesses can quantify their AI sustainability and minimize their environmental footprint while still achieving the performance they need.
Rather than chasing each week’s new releases, the benchmark helps businesses pair the right models with the right agents for reliable, enterprise-grade performance.
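To see how a five-metric scorecard could support such comparisons, the sketch below normalizes each metric to a common scale and takes a weighted mean. The bounds, weights, and numbers are invented for the example, not the benchmark’s published methodology.

```python
# Illustrative only: the weights, normalization, and aggregation are assumptions.
METRICS = ("accuracy", "cost", "speed", "trust_safety", "sustainability")

# Raw units assumed: accuracy [0,1], $ per task, seconds per task, trust score [0,1],
# gCO2e per task. Higher is better for accuracy/trust; lower is better for the rest.
HIGHER_IS_BETTER = {"accuracy": True, "cost": False, "speed": False,
                    "trust_safety": True, "sustainability": False}

def readiness_score(raw: dict, bounds: dict, weights: dict) -> float:
    """Normalize each metric to [0, 1] against expected bounds, then weighted-average."""
    total = weight_sum = 0.0
    for m in METRICS:
        lo, hi = bounds[m]
        norm = (raw[m] - lo) / (hi - lo)
        if not HIGHER_IS_BETTER[m]:
            norm = 1.0 - norm                      # invert cost-like metrics
        total += weights[m] * min(max(norm, 0.0), 1.0)
        weight_sum += weights[m]
    return total / weight_sum

# Compare two hypothetical models on the same service-agent task suite.
bounds = {"accuracy": (0, 1), "cost": (0, 0.10), "speed": (0, 30),
          "trust_safety": (0, 1), "sustainability": (0, 50)}
weights = dict.fromkeys(METRICS, 1.0)
large = {"accuracy": 0.92, "cost": 0.06, "speed": 12, "trust_safety": 0.95, "sustainability": 35}
small = {"accuracy": 0.88, "cost": 0.01, "speed": 3, "trust_safety": 0.94, "sustainability": 6}
print(round(readiness_score(large, bounds, weights), 3),   # ~0.634
      round(readiness_score(small, bounds, weights), 3))   # ~0.900
```

In this toy comparison the smaller model wins overall: it gives up a few points of accuracy but dominates on cost, speed, and sustainability, which is exactly the model-size-to-task alignment the sustainability metric is meant to encourage.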
MCP-Eval and MCP-Universe are two additional, complementary benchmarks published by Salesforce AI Research this quarter.
They measure agents at different levels of rigor, tracking how LLMs interact with Model Context Protocol (MCP) servers in real-world use cases.
● MCP-Eval provides scalable, automatic evaluation through synthetic tasks, making it well-suited for testing across a wide range of MCP servers.
● MCP-Universe introduces challenging real-world tasks with execution-based evaluators that stress-test agents in complex scenarios and offers an extendable framework for building and evaluating agents.
Together, they form a powerful toolkit: MCP-Eval for broad, initial assessments and MCP-Universe for deeper diagnosis and debugging.
This dual approach is critical for enterprises. Salesforce’s research found that most state-of-the-art LLMs on the market today face key barriers to enterprise-grade performance — from long-context challenges (where models lose track of information in complex inputs) to unknown-tool challenges (where they fail to adapt seamlessly to unfamiliar systems).
By leveraging MCP-Universe and MCP-Eval, enterprises can understand where agents break down and refine their frameworks or tool integrations accordingly.
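The sketch below shows the core idea behind an execution-based evaluator: rather than grading the agent’s transcript, it checks the state its tool calls actually produced. The `ToolServer` class and task schema are hypothetical stand-ins, not the benchmarks’ real APIs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    instruction: str                       # natural-language goal given to the agent
    check_state: Callable[[dict], bool]    # verifies the final server state

class ToolServer:
    """Minimal stand-in for an MCP server: named tools that mutate shared state."""
    def __init__(self):
        self.state: dict = {"tickets": {}}

    def call(self, tool: str, **kwargs):
        if tool == "create_ticket":
            self.state["tickets"][kwargs["id"]] = {"status": "open", **kwargs}
        elif tool == "close_ticket":
            self.state["tickets"][kwargs["id"]]["status"] = "closed"
        else:
            raise KeyError(f"unknown tool: {tool}")   # surfaces unknown-tool failures

def run_task(agent, server: ToolServer, task: Task) -> bool:
    """Let the agent issue tool calls, then verify resulting state, not the transcript."""
    for tool, kwargs in agent(task.instruction):      # agent yields (tool_name, kwargs) pairs
        server.call(tool, **kwargs)
    return task.check_state(server.state)

# A scripted stand-in agent that opens and then closes ticket T1 as instructed.
agent = lambda _: [("create_ticket", {"id": "T1"}), ("close_ticket", {"id": "T1"})]
task = Task("Open ticket T1, then close it.",
            lambda s: s["tickets"].get("T1", {}).get("status") == "closed")
print(run_task(agent, ToolServer(), task))            # True only if the state checks out
```

Because success is defined over server state, an agent that writes a plausible-sounding reply but never issues the right tool calls fails, and an unrecognized tool name raises immediately — one way unknown-tool failures can be surfaced during diagnosis.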
And with a platform that layers in context, enhanced reasoning, and trust guardrails, organizations can move beyond DIY experimentation to deliver agents ready for real-world business impact.
Consolidating data with Account Matching
High-quality, unified data is at the heart of reliable, scalable AI agent performance.
It allows agents to understand context, follow business rules, and make accurate, compliant decisions that align with organizational goals.
But enterprise data is rarely clean or well-organized — a perennial challenge for businesses.
Customer records are often duplicated across departments, fields are incomplete, and inconsistent formatting and naming conventions make it difficult to reconcile data across systems.
To tackle this, the Salesforce AI Research and product teams partnered to fine-tune large and small language models that power Account Matching, a capability that autonomously identifies and unifies accounts across scattered, inconsistent datasets. Instead of treating “The Example Company, Inc.” and “Example Co.” as separate entities, the system can now use AI to consolidate them into a single, authoritative record.
Unlike static, rule-based systems that require heavy manual setup, Account Matching accurately reconciles millions of records.
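To illustrate the underlying idea, here is a minimal sketch using standard-library string similarity. Account Matching itself relies on fine-tuned language models rather than this heuristic; the normalization rules and thresholds below are illustrative assumptions.

```python
# Illustrative heuristic only — not Salesforce's model-based matcher.
import re
from difflib import SequenceMatcher

# Common legal suffixes and articles to strip before comparing (assumed list).
NOISE_WORDS = r"\b(inc|co|corp|corporation|company|llc|ltd|the)\b"

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, then drop legal suffixes and articles."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(NOISE_WORDS, "", name)
    return re.sub(r"\s+", " ", name).strip()

def match(a: str, b: str, auto: float = 0.90, review: float = 0.70) -> str:
    """Return 'merge', 'review' (route to a human), or 'distinct'."""
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    if score >= auto:
        return "merge"
    return "review" if score >= review else "distinct"

print(match("The Example Company, Inc.", "Example Co."))   # 'merge'
print(match("Example Co.", "Counterexample Corp."))        # 'distinct'
```

The middle “review” band mirrors the human-in-the-loop routing described below: clear matches merge automatically, clear non-matches stay separate, and only ambiguous pairs reach a person.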
These are the kinds of breakthroughs driving real ROI for customers today.
In just the first month, one customer’s proprietary tool that uses Account Matching unified more than a million accounts with a 95% match success rate, reducing average handling time by 30 minutes.
The tool automatically matched details like account names, websites, addresses, or phone numbers across business units, surfaced a workflow in each org for sellers to connect, and routed only the top 5% of complex cases to humans.
By helping sellers quickly find counterparts covering the same or similar accounts, the solution helped eliminate duplicative work, accelerate sales cycles, and prevent missed opportunities.
Best of all, the entire solution was implemented without the need for hard-coding, lowering costs and dramatically improving efficiency.
With Account Matching, businesses have access to clean, unified data that powers AI agents with confidence, enabling smarter automation, richer personalization, and faster decisions at scale.
Learn more:
● Learn about Salesforce’s latest research advancements at SalesforceAIResearch.com
● See how MCP-Universe tracks LLM performance on real-world tasks to evaluate agents and support developers
● Dive deeper into how CRMArena-Pro is setting the enterprise standard for evaluating and improving agent behavior
● Read how Account Matching can autonomously reconcile millions of scattered records into a single, unified profile
● Learn how enterprise AI agents trained in comprehensive simulated environments can demonstrate capabilities that exceed traditional approaches