Observability and Evaluation
Evaluation
A closed loop of running, evaluating, and tuning that helps make your Agent smarter
VeADK provides a complete automated evaluation (Evaluation) workflow. Its main capabilities include:
- Runtime data collection: automatically captures the Agent's runtime data
- Eval set file generation: after the Agent runs, its runtime data is automatically exported to a local evaluation set (Eval Set) file
- Evaluation: scores the Agent with one of several evaluators (Evaluator)
- Feedback-driven optimization: automatically optimizes prompts based on the evaluation results (e.g. scores and reason analysis)
Runtime data collection
After an Agent run finishes, call runner.save_eval_set() to save the evaluation set file built from the runtime data to the default path.
```python
import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather

agent = Agent(tools=[get_city_weather])

session_id = "session_id"
runner = Runner(agent=agent, short_term_memory=ShortTermMemory())

prompt = "How is the weather like in Beijing? Besides, tell me which tool you invoked."
asyncio.run(runner.run(messages=prompt, session_id=session_id))

# Collect runtime data into an evaluation set file
dump_path = asyncio.run(runner.save_eval_set(session_id=session_id))
print(f"Evaluation file path: {dump_path}")
```
Evaluation set file
The evaluation set file format is compatible with the Google Evaluation standard; see the evaluation set file format documentation for details. When the evaluation set is saved locally, all sessions are taken into account.
File structure:
- eval_cases: all conversation turns
- conversation: a single conversation turn
- user_content: the user's input
- final_response: the Agent's final output
- intermediate_data: intermediate data
- tool_uses: the tools invoked by the Agent
- intermediate_responses: the Agent's intermediate responses
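As a minimal sketch of this layout, the structure above can be modeled and traversed as plain Python data. The field names follow the list above; the values are placeholders, not the output of a real run, and a real dump may contain additional fields:

```python
# Illustrative eval set matching the documented fields; values are placeholders.
eval_set = {
    "eval_cases": [
        {
            "conversation": [
                {
                    "user_content": "How is the weather like in Beijing?",
                    "final_response": "It is sunny in Beijing today.",
                    "intermediate_data": {
                        "tool_uses": [
                            {"name": "get_city_weather", "args": {"city": "Beijing"}}
                        ],
                        "intermediate_responses": [],
                    },
                }
            ]
        }
    ]
}

# Collect every tool name invoked across all cases and turns
tool_names = [
    tool["name"]
    for case in eval_set["eval_cases"]
    for turn in case["conversation"]
    for tool in turn["intermediate_data"]["tool_uses"]
]
print(tool_names)  # ['get_city_weather']
```

Walking the file this way is useful for spot-checking what was captured before handing the eval set to an evaluator.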
Running an evaluation
VeADK currently supports the DeepEval evaluator and ADKEval.
DeepEval evaluator
```python
import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather
from veadk.prompts.prompt_evaluator import eval_principle_prompt
from veadk.evaluation.deepeval_evaluator import DeepevalEvaluator

from deepeval.metrics import GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCaseParams

app_name = "veadk_playground_app"
user_id = "veadk_playground_user"
session_id = "veadk_playground_session"

agent = Agent(tools=[get_city_weather])
runner = Runner(agent=agent, short_term_memory=ShortTermMemory(), app_name=app_name, user_id=user_id)

asyncio.run(runner.run(messages="How is the weather like in Beijing?", session_id=session_id))
eval_set_path = asyncio.run(runner.save_eval_set(session_id=session_id))

# Evaluate with the DeepEval evaluator
evaluator = DeepevalEvaluator(agent=agent)
judge_model = evaluator.judge_model

metrics = [
    GEval(
        threshold=0.8,
        name="Base Evaluation",
        criteria=eval_principle_prompt,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        model=judge_model,
    ),
    ToolCorrectnessMetric(threshold=0.5),
]

asyncio.run(evaluator.evaluate(eval_set_file_path=eval_set_path, metrics=metrics))
```
ADKEval evaluator (Google Evaluation)
```python
import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather
from veadk.evaluation.adk_evaluator import ADKEvaluator

app_name = "veadk_playground_app"
user_id = "veadk_playground_user"
session_id = "veadk_playground_session"

agent = Agent(tools=[get_city_weather])
runner = Runner(agent=agent, short_term_memory=ShortTermMemory(), app_name=app_name, user_id=user_id)

asyncio.run(runner.run(messages="How is the weather like in Beijing?", session_id=session_id))
eval_set_path = asyncio.run(runner.save_eval_set(session_id=session_id))

# Evaluate with the ADKEval evaluator
evaluator = ADKEvaluator(agent=agent)
asyncio.run(evaluator.evaluate(
    eval_set_file_path=eval_set_path,
    response_match_score_threshold=0.8,
    tool_score_threshold=0.5
))
```
Data reporting
Evaluation results can be automatically reported to Volcengine Managed Prometheus (VMP). Simply pass the relevant Prometheus Pushgateway parameters when constructing the evaluator; they can be configured in config.yaml and are then read automatically from environment variables:
```python
from veadk.evaluation.utils.prometheus import PrometheusPushgatewayConfig

# Load Prometheus configuration (can be read from environment variables)
prometheus_config = PrometheusPushgatewayConfig()

# Pass the config into the evaluator
evaluator = DeepevalEvaluator(
    ...,
    prometheus_config=prometheus_config,
)
```
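Since PrometheusPushgatewayConfig is described as reading its values from config.yaml (via environment variables), the configuration might look roughly like the fragment below. The key names here are purely illustrative assumptions, not VeADK's actual schema; consult the VeADK configuration reference for the real field names:

```yaml
# Hypothetical sketch only -- real key names may differ.
observability:
  prometheus:
    pushgateway_url: http://localhost:9091
    username: your-username
    password: your-password
```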