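Observability and Evaluation

Evaluation

Closed-loop support for running, evaluating, and tuning, helping Agents become smarter

VeADK provides a complete automated evaluation workflow. Its main capabilities are:

Runtime data collection: automatically captures the Agent's runtime data
Eval set file generation: after the Agent runs, automatically exports the runtime data to a local file as an evaluation dataset (Eval Set)
Evaluation: scores the Agent using one of several evaluators
Feedback-driven optimization: automatically optimizes prompts based on the evaluation results (such as scores and reason analysis)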

Runtime data collection

After the Agent finishes running, call runner.save_eval_set() to save the runtime data as an eval set file at the default path.

import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather

agent = Agent(tools=[get_city_weather])

session_id = "session_id"

runner = Runner(agent=agent, short_term_memory=ShortTermMemory())

prompt = "What is the weather like in Beijing? Besides, tell me which tool you invoked."

asyncio.run(runner.run(messages=prompt, session_id=session_id))

# Collect the runtime data and export it as an eval set file
dump_path = asyncio.run(runner.save_eval_set(session_id=session_id))

print(f"Evaluation file path: {dump_path}")

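Eval set file

The eval set file format is compatible with the Google Evaluation standard; for details, see the eval set file format documentation. When the eval set is saved locally, all sessions are taken into account.

File structure:

eval_cases: all conversation turns
conversation: a single conversation turn
- user_content: the user's input
- final_response: the Agent's final output
- intermediate_data: intermediate data
  - tool_uses: the tools the Agent used
  - intermediate_responses: the Agent's intermediate responses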
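
The structure above can be walked with a few lines of standard-library Python. The following is a minimal sketch, not VeADK's own loader. It assumes the file is JSON, that eval_cases is a list whose items each carry a conversation list, and that any field may be absent. The helper name summarize_eval_set and the sample values are hypothetical and exist only for illustration.

```python
import json


def summarize_eval_set(eval_set: dict) -> list[dict]:
    """Return one summary row per conversation turn in an eval set dict."""
    rows = []
    for case in eval_set.get("eval_cases", []):
        for turn in case.get("conversation", []):
            # intermediate_data may be missing or null, so fall back to {}.
            intermediate = turn.get("intermediate_data") or {}
            rows.append(
                {
                    "has_user_content": turn.get("user_content") is not None,
                    "has_final_response": turn.get("final_response") is not None,
                    "tool_use_count": len(intermediate.get("tool_uses") or []),
                    "intermediate_response_count": len(
                        intermediate.get("intermediate_responses") or []
                    ),
                }
            )
    return rows


# Hypothetical sample; real files hold richer objects in each field.
SAMPLE_JSON = """
{
  "eval_cases": [
    {
      "conversation": [
        {
          "user_content": "placeholder user input",
          "final_response": "placeholder agent output",
          "intermediate_data": {
            "tool_uses": [{"name": "get_city_weather"}],
            "intermediate_responses": []
          }
        }
      ]
    }
  ]
}
"""

summary = summarize_eval_set(json.loads(SAMPLE_JSON))
print(summary)
```

To inspect a real file, load the path returned by runner.save_eval_set() with json.load instead of parsing the sample string.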

Running evaluations

VeADK currently supports two evaluators: DeepEval and ADKEval.

DeepEval evaluator

import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather
from veadk.prompts.prompt_evaluator import eval_principle_prompt
from veadk.evaluation.deepeval_evaluator import DeepevalEvaluator

from deepeval.metrics import GEval, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCaseParams

app_name = "veadk_playground_app"
user_id = "veadk_playground_user"
session_id = "veadk_playground_session"

agent = Agent(tools=[get_city_weather])

runner = Runner(agent=agent, short_term_memory=ShortTermMemory(), app_name=app_name, user_id=user_id)

asyncio.run(runner.run(messages="What is the weather like in Beijing?", session_id=session_id))

eval_set_path = asyncio.run(runner.save_eval_set(session_id=session_id))

# Use the DeepEval evaluator
evaluator = DeepevalEvaluator(agent=agent)

judge_model = evaluator.judge_model

metrics = [
    GEval(
        threshold=0.8,
        name="Base Evaluation",
        criteria=eval_principle_prompt,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        model=judge_model,
    ),
    ToolCorrectnessMetric(threshold=0.5),
]

asyncio.run(evaluator.evaluate(eval_set_file_path=eval_set_path, metrics=metrics))

Google Evaluation (ADKEval) evaluator

import asyncio

from veadk import Agent, Runner
from veadk.memory.short_term_memory import ShortTermMemory
from veadk.tools.demo_tools import get_city_weather
from veadk.evaluation.adk_evaluator import ADKEvaluator

app_name = "veadk_playground_app"
user_id = "veadk_playground_user"
session_id = "veadk_playground_session"

agent = Agent(tools=[get_city_weather])

runner = Runner(agent=agent, short_term_memory=ShortTermMemory(), app_name=app_name, user_id=user_id)

asyncio.run(runner.run(messages="What is the weather like in Beijing?", session_id=session_id))

eval_set_path = asyncio.run(runner.save_eval_set(session_id=session_id))

# Use the ADKEval evaluator
evaluator = ADKEvaluator(agent=agent)

asyncio.run(evaluator.evaluate(
    eval_set_file_path=eval_set_path,
    response_match_score_threshold=0.8,
    tool_score_threshold=0.5
))
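
Both evaluators are configured with thresholds: threshold on each DeepEval metric, and response_match_score_threshold and tool_score_threshold for ADKEval. As a rough illustration of how such thresholds are usually read, the sketch below treats a score as passing when it is greater than or equal to its threshold. The helper check_thresholds and the example scores are hypothetical. The evaluators compute and compare scores internally, and their exact comparison rules may differ.

```python
def check_thresholds(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, bool]:
    """Map each metric name to True when its score meets the threshold."""
    results = {}
    for name, threshold in thresholds.items():
        # A metric with no recorded score is treated as a failure.
        score = scores.get(name)
        results[name] = score is not None and score >= threshold
    return results


# Thresholds mirror the values used in the ADKEval example; scores are made up.
thresholds = {"response_match_score": 0.8, "tool_score": 0.5}
scores = {"response_match_score": 0.75, "tool_score": 1.0}

outcome = check_thresholds(scores, thresholds)
overall_pass = all(outcome.values())
print(outcome)
print("overall pass:", overall_pass)
```

With these made-up scores the tool check passes but the response match does not, so the run as a whole would count as failed under this reading.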

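Data reporting

Evaluation results can be reported automatically to Volcengine's Managed Prometheus (VMP) platform. To enable this, pass the Prometheus Pushgateway parameters when constructing the evaluator. These parameters can be set in config.yaml and are then read automatically from environment variables: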

from veadk.evaluation.deepeval_evaluator import DeepevalEvaluator
from veadk.evaluation.utils.prometheus import PrometheusPushgatewayConfig

# Load the Prometheus configuration (values can be read from environment variables)
prometheus_config = PrometheusPushgatewayConfig()

# Pass the configuration to the evaluator; "..." stands for the other arguments
evaluator = DeepevalEvaluator(
    ...,
    prometheus_config=prometheus_config,
)
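
For readers unfamiliar with the Pushgateway, the sketch below shows roughly what a push looks like at the protocol level: samples are serialized in the Prometheus text exposition format and sent to a /metrics/job/<job> path. It does not reproduce VeADK's reporting code. The metric names, the job name, the gateway address, and both helper functions are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import quote


def build_push_url(gateway_url: str, job: str) -> str:
    """Build the Pushgateway endpoint for a job grouping key.

    Job names containing "/" need the Pushgateway's base64 form,
    which this sketch does not handle.
    """
    return f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"


def build_payload(scores: dict[str, float], labels: dict[str, str]) -> str:
    """Serialize gauge samples in the Prometheus text exposition format.

    Label values are inserted as-is; escaping of quotes, backslashes,
    and newlines is left out to keep the sketch short.
    """
    label_text = ",".join(
        f'{key}="{value}"' for key, value in sorted(labels.items())
    )
    lines = []
    for name, value in sorted(scores.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_text}}} {value}")
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical gateway address, job name, and metric names.
url = build_push_url("http://localhost:9091", "veadk_eval")
payload = build_payload(
    {"eval_response_match_score": 0.85, "eval_tool_score": 1.0},
    {"app_name": "veadk_playground_app"},
)
print(url)
print(payload, end="")
```

In practice the evaluator handles this step once prometheus_config is supplied, so hand-built payloads like this are mainly useful for debugging what arrives at the gateway.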

