agents or workflows.
Test Types
At present, we support a single type of test, `type: consistency`, which measures the consistency between two results. Within agents, this can be implemented as follows:
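A sketch of what such a test might look like in an `.agent.yml` file, assuming the tests live under a top-level `tests` key (the key name and the task wording are illustrative):

```yaml
# Illustrative consistency test inside an .agent.yml file.
# The surrounding `tests:` structure is an assumption; the
# type, task_description, and n fields are described below.
tests:
  - type: consistency
    task_description: "What were total sales last quarter?"
    n: 10
```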
The `task_description` field is the question you want to test the LLM's performance on (note: we don't call this `prompt`, because the `task_description` is nested inside a separate prompt that runs the evaluation, so `prompt` would be ambiguous here). `n` indicates the number of times to run the agent to produce a response to the `task_description` request.
For workflows, `task_description` is not required; instead, a `task_ref` value should be provided, as shown below:
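A sketch of the workflow variant, under the same assumed `tests` structure (the task name is hypothetical):

```yaml
# Illustrative consistency test inside a .workflow.yml file.
# task_ref points at an existing task in the workflow.
tests:
  - type: consistency
    task_ref: my_task
    n: 10
```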
The `task_ref` field indicates the name of the task to be tested. No `task_description` is required because the given prompt will be used for evaluation.
Running Tests
Basic Usage
These tests can be run by invoking the `oxy test` command on either an agent file or a workflow file.
Output Formats
The `oxy test` command supports two output formats for flexibility in different environments:
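Both formats are driven by the same command; for example (file names hypothetical):

```shell
# Agent, default (pretty) output
oxy test my_agent.agent.yml

# Workflow, machine-readable JSON output
oxy test my_workflow.workflow.yml --format json
```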
Pretty Format (Default)
The default format provides colored, human-readable output with detailed metrics.
JSON Format (CI/CD)
For continuous integration and automated pipelines, use the `--format json` flag to get machine-readable output, which is well suited for:
- CI/CD pipelines
- Automated quality gates
- Parsing with tools like `jq`
Accuracy Thresholds
You can enforce minimum accuracy requirements using the `--min-accuracy` flag. This is useful in CI/CD pipelines to prevent regressions. With a threshold of 0.8, the command will:
- Exit with code 0 if accuracy meets or exceeds 80%
- Exit with code 1 if accuracy falls below 80%
- Output results to stdout regardless of pass/fail
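A sketch of such a gate (file name hypothetical):

```shell
# Fails the CI step (exit code 1) if accuracy drops below 0.8
oxy test my_agent.agent.yml --min-accuracy 0.8
```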
Threshold Modes for Multiple Tests
When your test file contains multiple tests, you can control how the threshold is evaluated.
Average Mode (Default)
Checks whether the average of all test accuracies meets the threshold. For example, accuracies of [0.85, 0.92, 0.78] average to 0.85, which passes a 0.8 threshold.
All Mode
Requires every individual test to meet the threshold. With the same accuracies, [0.85, 0.92, 0.78] would fail because Test 3 (0.78) is below the threshold.
Error output:
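The mode is selected with the `--threshold-mode` flag alongside `--min-accuracy` (file name hypothetical):

```shell
# Every test must individually reach 0.8
oxy test my_tests.agent.yml --min-accuracy 0.8 --threshold-mode all
```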
Quiet Mode
Suppress progress bars and detailed output during test execution. This is useful for:
- Clean CI logs
- Parsing output programmatically
- Reducing noise in automated environments
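For instance, quiet mode pairs naturally with JSON output in automation (file name hypothetical):

```shell
oxy test my_agent.agent.yml --quiet --format json
```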
CLI Reference
oxy test Command
Syntax:
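The general shape, with the flags listed below, is:

```shell
oxy test <file> [flags]
```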
`<file>` - Path to the `.agent.yml` or `.workflow.yml` file to test (required)
| Flag | Short | Description | Default |
|---|---|---|---|
| `--format <format>` | | Output format: `pretty` or `json` | `pretty` |
| `--min-accuracy <threshold>` | | Minimum accuracy threshold (0.0-1.0). Exit code 1 if below threshold | None |
| `--threshold-mode <mode>` | | Threshold evaluation mode: `average` or `all` | `average` |
| `--quiet` | `-q` | Suppress detailed output and show only results summary | `false` |
CI/CD Integration Examples
GitHub Actions
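A hypothetical workflow sketch, assuming the `oxy` CLI is already available on the runner (job and file names are illustrative):

```yaml
# Hypothetical GitHub Actions workflow; installation of oxy is out of scope here.
name: Agent tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent tests
        run: oxy test my_agent.agent.yml --format json --min-accuracy 0.8 --quiet
```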
GitLab CI
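An equivalent hypothetical GitLab CI job, assuming an image with the `oxy` CLI installed:

```yaml
# Hypothetical .gitlab-ci.yml job; job and file names are illustrative.
agent-tests:
  script:
    - oxy test my_agent.agent.yml --format json --min-accuracy 0.8 --quiet
```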
Docker
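A hypothetical containerized run, where `<oxy-image>` stands in for an image that ships the `oxy` CLI:

```shell
# Mount the project so oxy can see the test file; image name is a placeholder.
docker run --rm -v "$PWD":/work -w /work <oxy-image> \
  oxy test my_agent.agent.yml --format json --min-accuracy 0.8
```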
Parsing JSON Output
Extract specific metrics using `jq`:
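For example, pulling per-test accuracies out of the JSON. The JSON shape below is hypothetical (field names depend on your `oxy` version); in practice you would pipe `oxy test <file> --format json` into `jq` instead of the `echo`:

```shell
# Stand-in for: oxy test my_agent.agent.yml --format json | jq '.tests[].accuracy'
echo '{"tests":[{"name":"t1","accuracy":0.85},{"name":"t2","accuracy":0.92}]}' \
  | jq '.tests[].accuracy'
```

This prints each test's accuracy on its own line, ready for further shell processing.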
Best Practices
- Multiple Tests: Write multiple tests to cover different aspects of your agent's behavior
- Threshold Mode: Use `--threshold-mode all` for critical quality gates, `average` for overall performance monitoring
- Version Control: Commit your test files (`.agent.yml`, `.workflow.yml`) to track test definitions
- CI Integration: Always use `--format json` in CI pipelines for reliable parsing
- Quiet Mode: Combine `--quiet` with `--format json` in automated environments for clean logs
Error Handling
- Execution Errors: If tests fail to run (e.g., connection issues), they are written to stderr and don't affect the JSON output on stdout
- Threshold Failures: The command exits with code 1 only when `--min-accuracy` is specified and the threshold isn't met
- Missing Metrics: If no accuracy metrics are found but `--min-accuracy` is specified, a warning is displayed but the command succeeds (exit code 0)