Tests are a core part of oxy. Tests can be written as part of either agents or workflows.

Test Types

At present, we support a single type of test, type: consistency, which measures how consistent the generated results are with one another across repeated runs. Within agents, this can be implemented as follows:
tests:
  - type: consistency
    n: 5  # number of runs to test
    task_description: "how many users do we have?"
The task_description field is the question you want to test the LLM's performance on (note: we don't call this prompt because we nest this task_description within a separate prompt that runs the evaluation, so prompt would be ambiguous here). n indicates the number of times to run the agent against the task_description request. For workflows, task_description is not required; instead, a task_ref value should be provided, as shown below:
tests:
  - type: consistency
    task_ref: task_name
    n: 5  # number of runs to test
The task_ref field indicates the name of the task to be tested. No task_description is required because the referenced task's own prompt is used for the evaluation.
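The tests key accepts a list, so a single agent file can define several consistency tests; a minimal sketch using only the fields described above (the task descriptions are illustrative placeholders):
tests:
  - type: consistency
    n: 5
    task_description: "how many users do we have?"
  - type: consistency
    n: 5
    task_description: "which country has the most users?"
When a file defines multiple tests like this, each test is evaluated and reported separately, which is what produces the array-style JSON output described later on this page.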

Running Tests

Basic Usage

These tests can be run with the oxy test command. For an agent:
oxy test agent-name.agent.yml
Or, for a workflow:
oxy test workflow-name.workflow.yml

Output Formats

The oxy test command supports two output formats for flexibility in different environments:

Pretty Format (Default)

The default format provides colored, human-readable output with detailed metrics:
oxy test agent.yml
Output:
✅Eval finished with metrics:
Accuracy: 85.50%
Recall: 72.30%

JSON Format (CI/CD)

For continuous integration and automated pipelines, use the --format json flag to get machine-readable output:
oxy test agent.yml --format json
Output:
{"accuracy": 0.855, "recall": 0.723}
When running multiple tests in a single file, the JSON output contains arrays:
{"accuracy": [0.85, 0.92, 0.78]}
This format is ideal for:
  • CI/CD pipelines
  • Automated quality gates
  • Parsing with tools like jq
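For example, the array form shown above can be reduced directly in the shell; a minimal sketch using jq, assuming the file defines multiple tests so that accuracy is an array (the file name is a placeholder):
# Capture the JSON once, then inspect it with jq
RESULTS=$(oxy test agent.yml --format json)
echo "$RESULTS" | jq '.accuracy | add / length'   # average accuracy across tests
echo "$RESULTS" | jq -r '.accuracy[]'             # one accuracy value per line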

Accuracy Thresholds

You can enforce minimum accuracy requirements using the --min-accuracy flag. This is useful for CI/CD pipelines to prevent regressions:
oxy test agent.yml --format json --min-accuracy 0.8
This command will:
  • Exit with code 0 if accuracy meets or exceeds 80%
  • Exit with code 1 if accuracy falls below 80%
  • Output results to stdout regardless of pass/fail
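Because the pass/fail result is signalled through the exit code, a script can gate on the command directly; a minimal sketch (the file names are placeholders):
# Fail this step if average accuracy drops below 80%
if ! oxy test agent.yml --format json --min-accuracy 0.8 > results.json; then
  echo "Accuracy gate failed; see results.json" >&2
  exit 1
fi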

Threshold Modes for Multiple Tests

When your test file contains multiple tests, you can control how the threshold is evaluated:

Average Mode (Default)

Checks if the average of all test accuracies meets the threshold:
oxy test agent.yml --min-accuracy 0.8 --threshold-mode average
Example: Tests with accuracies [0.85, 0.92, 0.78] average to 0.85, which passes the 0.8 threshold.

All Mode

Requires every individual test to meet the threshold:
oxy test agent.yml --min-accuracy 0.8 --threshold-mode all
Example: Tests with accuracies [0.85, 0.92, 0.78] would fail because Test 3 (0.78) is below the threshold. Error output:
1 test(s) below threshold 0.8000: Test 3: 0.7800

Quiet Mode

Suppress progress bars and detailed output during test execution:
oxy test agent.yml --quiet --format json
This is useful for:
  • Clean CI logs
  • Parsing output programmatically
  • Reducing noise in automated environments
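With the two flags combined, stdout should contain only the metrics object, so it can be captured and parsed in one step; a minimal sketch (the file name is a placeholder):
# Quiet JSON run: capture the metrics object and pretty-print it
METRICS=$(oxy test agent.yml --quiet --format json)
echo "$METRICS" | jq .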

CLI Reference

oxy test Command

Syntax:
oxy test <file> [OPTIONS]
Arguments:
  • <file> - Path to the .agent.yml or .workflow.yml file to test (required)
Options:
  • --format <format> - Output format: pretty or json (default: pretty)
  • --min-accuracy <threshold> - Minimum accuracy threshold (0.0-1.0); exit code 1 if the result falls below it (default: none)
  • --threshold-mode <mode> - Threshold evaluation mode: average or all (default: average)
  • --quiet, -q - Suppress detailed output and show only the results summary (default: false)

CI/CD Integration Examples

GitHub Actions

- name: Run Tests
  run: |
    oxy test agent.yml --format json --min-accuracy 0.8 > results.json

- name: Extract Accuracy
  run: |
    ACCURACY=$(jq -r '.accuracy' results.json)
    echo "Test accuracy: $ACCURACY"

GitLab CI

test:
  script:
    - oxy test agent.yml --format json --min-accuracy 0.85 > results.json
  artifacts:
    paths:
      - results.json

Docker

RUN oxy test agent.yml --format json --quiet --min-accuracy 0.8

Parsing JSON Output

Extract specific metrics using jq:
# Get accuracy score
oxy test agent.yml --format json | jq .accuracy

# Get all metrics
oxy test agent.yml --format json | jq .

# Check if accuracy > 80%
ACCURACY=$(oxy test agent.yml --format json | jq -r '.accuracy')
if (( $(echo "$ACCURACY < 0.8" | bc -l) )); then
  echo "Accuracy too low: $ACCURACY"
  exit 1
fi

Best Practices

  1. Multiple Tests: Write multiple tests to cover different aspects of your agent’s behavior
  2. Threshold Mode: Use --threshold-mode all for critical quality gates, average for overall performance monitoring
  3. Version Control: Commit your test files (.agent.yml, .workflow.yml) to track test definitions
  4. CI Integration: Always use --format json in CI pipelines for reliable parsing
  5. Quiet Mode: Combine --quiet with --format json in automated environments for clean logs
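A short script that combines several of these practices (JSON output, quiet mode, a strict per-file gate) might look like the following sketch; the agents/ directory layout is an assumption, not something oxy requires:
#!/usr/bin/env bash
set -euo pipefail

# Gate every agent test file in the repo; any file below 80% fails the run
for f in agents/*.agent.yml; do
  oxy test "$f" --format json --quiet --min-accuracy 0.8 --threshold-mode all
done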

Error Handling

  • Execution Errors: If tests fail to run (e.g., connection issues), they are written to stderr and don’t affect the JSON output on stdout
  • Threshold Failures: Only exit with code 1 when --min-accuracy is specified and the threshold isn’t met
  • Missing Metrics: If no accuracy metrics are found but --min-accuracy is specified, a warning is displayed but the command succeeds (exit code 0)
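Since execution errors go to stderr while the JSON result stays on stdout, the two streams can be captured separately; a minimal sketch (the file names are placeholders):
# Keep machine-readable results and execution errors in separate files
oxy test agent.yml --format json --quiet > results.json 2> errors.log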

Examples

Local Development

# Run tests with pretty output
oxy test my-agent.agent.yml

# Test with quiet mode for cleaner output
oxy test my-agent.agent.yml --quiet

CI/CD Pipeline

# Fail build if average accuracy < 85%
oxy test my-agent.agent.yml \
  --format json \
  --min-accuracy 0.85 \
  --threshold-mode average

# Require all tests to pass 80% threshold
oxy test my-agent.agent.yml \
  --format json \
  --min-accuracy 0.8 \
  --threshold-mode all \
  --quiet

Monitoring

# Run tests and save results for tracking
oxy test agent.yml --format json > results-$(date +%Y%m%d).json

# Compare against baseline and flag regressions
BASELINE=0.85
CURRENT=$(oxy test agent.yml --format json | jq -r '.accuracy')
echo "Current: $CURRENT, Baseline: $BASELINE"
if (( $(echo "$CURRENT < $BASELINE" | bc -l) )); then
  echo "Warning: accuracy fell below baseline"
fi