The comprehensive solution gives developers the tools and insights needed to optimize agent performance and ensure readiness for real-world deployment
SAN FRANCISCO, Jan. 23, 2025 /PRNewswire/ -- Galileo, the leading AI Evaluation Platform, today unveiled Agentic Evaluations, a transformative solution for evaluating the performance of AI agents powered by large language models (LLMs). With Agentic Evaluations, developers gain the tools and insights needed to optimize agent performance and reliability at every step—ensuring readiness for real-world deployment.
"AI agents are unlocking a new era of innovation, but their complexity has made it difficult for developers to understand where failures occur and why," said Vikram Chatterji, CEO and co-founder of Galileo. "With LLMs driving decision-making, teams need tools to pinpoint and understand an agent's failure modes. Agentic Evaluations delivers unprecedented visibility into every action, across entire workflows, empowering developers to build, ship, and scale reliable, trustworthy AI solutions."
The Age of AI Agents
AI agents—autonomous systems that use LLM-driven planning to perform a wide range of tasks—are reshaping industries by automating complex, multi-step workflows. They are rapidly gaining traction for their ability to drive material ROI across sectors such as customer service, education, and telecommunications. A recent study shows that nearly half of companies have adopted AI agents, with another 33% actively exploring solutions. Companies like Twilio, ServiceTitan, and Chegg are leveraging agents to create dynamic, multi-step interactions that drive measurable value.
However, building and evaluating agents introduces novel challenges for developers, which existing evaluation tools fail to address:
- Non-deterministic paths: LLM planners can choose among multiple sequences of actions to respond to a user request, a complexity that traditional LLM-as-a-Judge frameworks cannot capture.
- Increased failure points: Complex workflows require visibility across multi-step and parallel processes, with holistic evaluation of entire sessions.
- Cost management: With agents relying on multiple calls to different LLMs, balancing performance with cost efficiency is a critical priority.
As agents take on more complex and impactful workflows, the stakes—and the potential impact of errors—grow significantly.
The Solution: Galileo's Agentic Evaluations
Galileo's Agentic Evaluations is an end-to-end framework that provides both system-level and step-by-step evaluation, enabling developers to build reliable, resilient, and high-performing AI agents.
Key capabilities include:
- Complete Visibility into Agent Workflows: Gain a clear view of entire multi-step agent completions, from input to final action, with comprehensive tracing and simple visualizations that help developers quickly pinpoint inefficiencies and errors in agent sessions.
- Agent-Specific Metrics: Measure agent performance with proprietary, research-backed metrics that evaluate agents at multiple levels (see the sketch after this list):
  - LLM Planner: Assess the quality of tool selection and whether the right instructions are passed to each tool.
  - Tool Calls: Assess errors in individual tool completions.
  - Overall Session Success: Measure task completion and the success of the full agentic interaction.
- Granular Cost and Latency Tracking: Optimize the cost-effectiveness of agents with aggregate tracking for cost, latency, and errors across sessions and spans.
- Seamless Integrations: Support for popular AI frameworks like LangGraph and CrewAI.
- Proactive Insights: Alerts and dashboards help developers identify systemic issues, such as failed tool calls or misalignment between the final action and the initial instructions, and surface actionable insights for continuous improvement.
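For a concrete picture of the levels these metrics describe, the following is a minimal, hypothetical Python sketch (it is not Galileo's SDK; every class, field, and function name below is an illustrative assumption) of how a multi-step agent session could be represented so that planner decisions, individual tool calls, and the overall session outcome can each be scored, with cost and latency aggregated across spans.

```python
# Illustrative sketch only: these dataclasses are hypothetical and do not
# represent Galileo's SDK. They model the levels described above:
# planner decisions, individual tool calls, and the overall session.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolCall:
    tool_name: str
    succeeded: bool          # did the tool completion return without error?
    latency_ms: float
    cost_usd: float


@dataclass
class PlannerStep:
    chosen_tool: str         # which tool the LLM planner selected
    instructions: str        # instructions the planner passed to that tool
    tool_calls: List[ToolCall] = field(default_factory=list)


@dataclass
class AgentSession:
    user_request: str
    steps: List[PlannerStep] = field(default_factory=list)
    task_completed: bool = False

    def total_cost(self) -> float:
        # Aggregate cost across every span in the session (cf. "Granular Cost and Latency Tracking").
        return sum(c.cost_usd for s in self.steps for c in s.tool_calls)

    def failed_tool_calls(self) -> List[ToolCall]:
        # Surface the kind of systemic issue the "Proactive Insights" bullet describes.
        return [c for s in self.steps for c in s.tool_calls if not c.succeeded]


# Example: one session with a single planner step and one failed tool call.
session = AgentSession(user_request="Reschedule my flight")
step = PlannerStep(chosen_tool="flight_api", instructions="Find flights after 6pm")
step.tool_calls.append(ToolCall("flight_api", succeeded=False, latency_ms=820.0, cost_usd=0.004))
session.steps.append(step)

print(session.total_cost(), len(session.failed_tool_calls()), session.task_completed)
```

In an integration with frameworks like LangGraph or CrewAI, trace data of this shape would typically be captured automatically through the frameworks' hooks; the sketch only illustrates the structure being evaluated, not how Galileo collects it.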
Accelerating Industry Adoption
With Agentic Evaluations, Galileo's enterprise and startup partners are already seeing transformative results. "Launching AI agents without proper measurement is risky for any organization," said Vijoy Pandey, SVP/GP of Outshift at Cisco. "This important work Galileo has done gives developers the tools to measure agent behavior, optimize performance, and ensure reliable operations, helping teams move to production faster and with more confidence."
"End-to-end visibility into agent completions is a game changer," said Surojit Chatterjee, Co-founder and CEO of Ema. "With agents taking multiple steps and paths, this feature makes debugging and improving them faster and easier. Developers know that AI agents need to be tested and refined over time. Galileo makes that easier and faster with end-to-end visibility and agent-specific evaluation metrics."
Availability
Agentic Evaluations is now available to all Galileo users. Learn more or request a demo at www.galileo.ai.
About Galileo
San Francisco-based Galileo is the leading platform for enterprise GenAI evaluation and observability. Powered by Evaluation Foundation Models (EFMs), Galileo's platform supports AI teams across the development lifecycle—from building and iterating to monitoring and protection—with powerful, research-backed metrics. Galileo is used by AI teams from startups to Fortune 500 companies to accelerate AI development. Visit galileo.ai to learn more about the Galileo Evaluation Platform.
SOURCE Galileo