# AI Workflow Builder Evaluations
This module provides an evaluation framework for testing the AI Workflow Builder's ability to generate correct n8n workflows from natural language prompts.
## Architecture Overview
The evaluation system is split into two distinct modes:
1. **CLI Evaluation** - Runs predefined test cases locally with progress tracking
2. **Langsmith Evaluation** - Integrates with Langsmith for dataset-based evaluation and experiment tracking
### Directory Structure
```
evaluations/
├── cli/                          # CLI evaluation implementation
│   ├── runner.ts                 # Main CLI evaluation orchestrator
│   └── display.ts                # Console output and progress tracking
├── langsmith/                    # Langsmith integration
│   ├── evaluator.ts              # Langsmith-compatible evaluator function
│   └── runner.ts                 # Langsmith evaluation orchestrator
├── core/                         # Shared evaluation logic
│   ├── environment.ts            # Test environment setup and configuration
│   └── test-runner.ts            # Core test execution logic
├── types/                        # Type definitions
│   ├── evaluation.ts             # Evaluation result schemas
│   ├── test-result.ts            # Test result interfaces
│   └── langsmith.ts              # Langsmith-specific types and guards
├── chains/                       # LLM evaluation chains
│   ├── test-case-generator.ts    # Dynamic test case generation
│   └── workflow-evaluator.ts     # LLM-based workflow evaluation
├── utils/                        # Utility functions
│   ├── evaluation-calculator.ts  # Metrics calculation
│   ├── evaluation-helpers.ts     # Common helper functions
│   └── evaluation-reporter.ts    # Report generation
└── index.ts                      # Main entry point
```
## Implementation Details
### Core Components
#### 1. Test Runner (`core/test-runner.ts`)
The core test runner handles individual test execution:
- Generates workflows using the WorkflowBuilderAgent
- Validates generated workflows using type guards
- Evaluates workflows against test criteria
- Returns structured test results with error handling
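A minimal sketch of this flow (names such as `runSingleTest`, `generate`, and `evaluateWorkflow` are illustrative, not the actual exports of `core/test-runner.ts`):
```typescript
// Illustrative sketch only - the real implementation lives in core/test-runner.ts.
interface TestCase {
  id: string;
  name: string;
  prompt: string;
  referenceWorkflow?: unknown;
}

interface TestResult {
  testCase: TestCase;
  score?: number;
  error?: string;
}

// The agent, evaluator, and type guard are injected so the sketch stays self-contained.
type GenerateFn = (prompt: string) => Promise<unknown>;
type EvaluateFn = (workflow: object, testCase: TestCase) => Promise<{ overallScore: number }>;
type WorkflowGuard = (value: unknown) => value is object;

async function runSingleTest(
  generate: GenerateFn,
  evaluateWorkflow: EvaluateFn,
  isValidWorkflow: WorkflowGuard,
  testCase: TestCase,
): Promise<TestResult> {
  try {
    // 1. Generate a workflow from the natural language prompt
    const workflow = await generate(testCase.prompt);
    // 2. Validate the output with a type guard before evaluating it
    if (!isValidWorkflow(workflow)) {
      return { testCase, error: 'Agent did not return a valid workflow' };
    }
    // 3. Evaluate the validated workflow against the test criteria
    const { overallScore } = await evaluateWorkflow(workflow, testCase);
    return { testCase, score: overallScore };
  } catch (error) {
    // 4. Failures become structured results instead of aborting the whole run
    return { testCase, error: error instanceof Error ? error.message : String(error) };
  }
}
```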
#### 2. Environment Setup (`core/environment.ts`)
Centralizes environment configuration:
- LLM initialization with API key validation
- Langsmith client setup
- Node types loading
- Concurrency and test generation settings
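A hedged sketch of what this setup might look like (option names mirror the environment variables documented below; the actual implementation in `core/environment.ts` may differ):
```typescript
// Illustrative sketch only - the real logic lives in core/environment.ts.
interface EvaluationEnvironment {
  apiKey: string;
  concurrency: number;
  generateTestCases: boolean;
  useLangsmith: boolean;
}

function setupEnvironment(): EvaluationEnvironment {
  // Fail fast if the LLM API key is missing
  const apiKey = process.env.N8N_AI_ANTHROPIC_KEY;
  if (!apiKey) {
    throw new Error('N8N_AI_ANTHROPIC_KEY is required to run evaluations');
  }
  return {
    apiKey,
    // Number of tests executed in parallel (default: 5)
    concurrency: Number(process.env.EVALUATION_CONCURRENCY ?? 5),
    // Whether to generate additional test cases with the LLM
    generateTestCases: process.env.GENERATE_TEST_CASES === 'true',
    // Whether to run in Langsmith mode instead of CLI mode
    useLangsmith: process.env.USE_LANGSMITH_EVAL === 'true',
  };
}
```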
#### 3. Langsmith Integration
The Langsmith integration provides two key components:
**Evaluator (`langsmith/evaluator.ts`):**
- Converts Langsmith Run objects to evaluation inputs
- Validates all data using type guards before processing
- Safely extracts usage metadata without type coercion
- Returns structured evaluation results
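A rough sketch of the shape such an evaluator takes (minimal local types are used here instead of the Langsmith SDK imports; the concrete logic in `langsmith/evaluator.ts` is more complete):
```typescript
// Illustrative sketch only - not the actual langsmith/evaluator.ts implementation.
interface RunLike {
  inputs: Record<string, unknown>;
  outputs?: Record<string, unknown>;
}

interface EvaluationScore {
  key: string;
  score: number; // normalized 0..1
  comment?: string;
}

// Type guard: accept only outputs that look like a generated workflow
function isGeneratedWorkflow(value: unknown): value is { nodes: unknown[]; connections: object } {
  return (
    typeof value === 'object' &&
    value !== null &&
    Array.isArray((value as { nodes?: unknown }).nodes) &&
    typeof (value as { connections?: unknown }).connections === 'object'
  );
}

function evaluateRun(run: RunLike): EvaluationScore {
  // Validate before processing - no casts, no coercion
  const workflow = run.outputs?.workflow;
  if (!isGeneratedWorkflow(workflow)) {
    return { key: 'overall', score: 0, comment: 'Run output did not contain a valid workflow' };
  }
  // A real implementation hands the validated workflow to the LLM-based evaluator
  // and maps its category scores into { key, score } records.
  return { key: 'overall', score: 1 };
}
```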
**Runner (`langsmith/runner.ts`):**
- Creates workflow generation functions compatible with Langsmith
- Validates message content before processing
- Extracts usage metrics safely from message metadata
- Handles dataset verification and error reporting
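As a rough illustration, handing a generation function to Langsmith via the SDK's generic `evaluate()` helper could look like this (the actual wiring in `langsmith/runner.ts`, including dataset verification and usage-metric extraction, is more involved):
```typescript
// Illustrative sketch only - assumes the evaluate() helper from the Langsmith JS SDK.
import { evaluate } from 'langsmith/evaluation';

// Stand-in for the real agent call made inside langsmith/runner.ts
async function generateWorkflow(prompt: string): Promise<object> {
  return { nodes: [], connections: {}, prompt };
}

async function runLangsmithEvaluation() {
  // Wrap workflow generation as a target function Langsmith invokes per dataset example
  const target = async (inputs: Record<string, unknown>) => {
    const prompt = typeof inputs.prompt === 'string' ? inputs.prompt : '';
    return { workflow: await generateWorkflow(prompt) };
  };

  await evaluate(target, {
    data: process.env.LANGSMITH_DATASET_NAME ?? 'workflow-builder-dataset',
    evaluators: [], // plug in the evaluator from langsmith/evaluator.ts here
    experimentPrefix: 'workflow-builder-evaluation',
    maxConcurrency: Number(process.env.EVALUATION_CONCURRENCY ?? 5),
  });
}
```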
#### 4. CLI Evaluation
The CLI evaluation provides local testing capabilities:
**Runner (`cli/runner.ts`):**
- Orchestrates parallel test execution with concurrency control
- Manages test case generation when enabled
- Generates detailed reports and saves results
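A simplified sketch of concurrency-limited execution (batching by `EVALUATION_CONCURRENCY`; the real runner also drives the progress display and report generation):
```typescript
// Illustrative sketch only - runs tests in batches no larger than the concurrency limit.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  run: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += limit) {
    // Each batch runs in parallel; batches run one after another
    const batch = items.slice(i, i + limit);
    results.push(...(await Promise.all(batch.map(run))));
  }
  return results;
}
// Usage (hypothetical): await runWithConcurrency(testCases, concurrency, runSingleTest);
```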
**Display (`cli/display.ts`):**
- Progress bar management for real-time feedback
- Console output formatting
- Error display and reporting
### Evaluation Metrics
The system evaluates workflows across five categories:
1. **Functionality** (30% weight)
   - Does the workflow achieve the intended goal?
   - Are the right nodes selected?
2. **Connections** (25% weight)
   - Are nodes properly connected?
   - Is data flow logical?
3. **Expressions** (20% weight)
   - Are n8n expressions syntactically correct?
   - Do they reference valid data paths?
4. **Node Configuration** (15% weight)
   - Are node parameters properly set?
   - Are required fields populated?
5. **Structural Similarity** (10% weight, optional)
   - How closely does the structure match a reference workflow?
   - Only evaluated when a reference workflow is provided
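As an illustration of how the category weights might combine into an overall score (assumed shape; the actual computation, including how the optional structural-similarity weight is handled, lives in `utils/evaluation-calculator.ts`):
```typescript
// Illustrative sketch only - weighted combination of per-category scores (each 0..100).
interface CategoryScores {
  functionality: number;
  connections: number;
  expressions: number;
  nodeConfiguration: number;
  structuralSimilarity?: number; // only present when a reference workflow exists
}

function calculateOverallScore(scores: CategoryScores): number {
  const weighted: Array<[number, number]> = [
    [scores.functionality, 0.3],
    [scores.connections, 0.25],
    [scores.expressions, 0.2],
    [scores.nodeConfiguration, 0.15],
  ];
  if (scores.structuralSimilarity !== undefined) {
    weighted.push([scores.structuralSimilarity, 0.1]);
  }
  // Re-normalize so the result stays on a 0..100 scale when the optional category is absent
  const totalWeight = weighted.reduce((sum, [, weight]) => sum + weight, 0);
  return weighted.reduce((sum, [score, weight]) => sum + score * weight, 0) / totalWeight;
}
```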
### Violation Severity Levels
Violations are categorized by severity:
- **Critical** (-40 to -50 points): Workflow-breaking issues
- **Major** (-15 to -25 points): Significant problems affecting functionality
- **Minor** (-5 to -15 points): Non-critical issues or inefficiencies
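A small sketch of how violations might translate into a category score (point deductions from 100, clamped at zero; the exact penalty handling is defined by the evaluator and calculator modules):
```typescript
// Illustrative sketch only - deduct violation points from a 100-point category score.
type Severity = 'critical' | 'major' | 'minor';

interface Violation {
  severity: Severity;
  description: string;
  pointsDeducted: number; // e.g. critical: 40-50, major: 15-25, minor: 5-15
}

function applyViolations(violations: Violation[]): number {
  const deducted = violations.reduce((sum, violation) => sum + violation.pointsDeducted, 0);
  return Math.max(0, 100 - deducted);
}
```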
## Running Evaluations
### CLI Evaluation
```bash
# Run with default settings
pnpm eval

# With additional generated test cases
GENERATE_TEST_CASES=true pnpm eval

# With custom concurrency
EVALUATION_CONCURRENCY=10 pnpm eval
```
### Langsmith Evaluation
```bash
# Set required environment variables
export LANGSMITH_API_KEY=your_api_key

# Optionally specify dataset
export LANGSMITH_DATASET_NAME=your_dataset_name

# Run evaluation
pnpm eval:langsmith
```
## Configuration
### Required Files
#### nodes.json
**IMPORTANT**: The evaluation framework requires a `nodes.json` file in the evaluations root directory (`evaluations/nodes.json`).
This file contains all n8n node type definitions and is used by the AI Workflow Builder agent to:
- Know what nodes are available in n8n
- Understand node parameters and their schemas
- Generate valid workflows with proper node configurations
**Why is this required?**
The AI Workflow Builder agent needs access to node definitions to generate workflows. In a normal n8n runtime, these definitions are loaded automatically. However, since the evaluation framework instantiates the agent without a running n8n instance, we must provide the node definitions manually via `nodes.json`.
**How to generate nodes.json:**
1. Run your n8n instance
2. Download the node definitions from the locally running instance (http://localhost:5678/types/nodes.json)
3. Save them to `evaluations/nodes.json`, for example with `curl -o evaluations/nodes.json http://localhost:5678/types/nodes.json`
The evaluation will fail with a clear error message if `nodes.json` is missing.
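For reference, consuming the file could look roughly like this (the actual loading and validation happens in `core/environment.ts`):
```typescript
// Illustrative sketch only - read node definitions from evaluations/nodes.json.
import { readFileSync } from 'node:fs';
import { join } from 'node:path';

function loadNodeTypes(evaluationsDir: string): unknown {
  const nodesPath = join(evaluationsDir, 'nodes.json');
  try {
    return JSON.parse(readFileSync(nodesPath, 'utf-8'));
  } catch {
    // Matches the documented behaviour: fail with a clear message when the file is missing
    throw new Error(`Could not load ${nodesPath}. Generate it from a running n8n instance first.`);
  }
}
```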
### Environment Variables
- `N8N_AI_ANTHROPIC_KEY` - Required for LLM access
- `LANGSMITH_API_KEY` - Required for Langsmith evaluation
- `USE_LANGSMITH_EVAL` - Set to "true" to use Langsmith mode
- `LANGSMITH_DATASET_NAME` - Override default dataset name
- `EVALUATION_CONCURRENCY` - Number of parallel test executions (default: 5)
- `GENERATE_TEST_CASES` - Set to "true" to generate additional test cases
- `LLM_MODEL` - Model identifier for metadata tracking
## Output
### CLI Evaluation Output
- **Console Display**: Real-time progress, test results, and summary statistics
- **Markdown Report**: `results/evaluation-report-[timestamp].md`
- **JSON Results**: `results/evaluation-results-[timestamp].json`
### Langsmith Evaluation Output
- Results are stored in the Langsmith dashboard
- Experiment name format: `workflow-builder-evaluation-[date]`
- Includes detailed metrics for each evaluation category
## Adding New Test Cases
Test cases are defined in `chains/test-case-generator.ts`. Each test case requires:
- `id`: Unique identifier
- `name`: Descriptive name
- `prompt`: Natural language description of the workflow to generate
- `referenceWorkflow` (optional): Expected workflow structure for comparison
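For example, a new entry might look like this (shape inferred from the fields above; the exact interface is defined alongside the generator):
```typescript
// Illustrative test case - field names follow the list above.
const newTestCase = {
  id: 'http-to-slack',
  name: 'HTTP request to Slack notification',
  prompt: 'Fetch open issues from the GitHub API and post a summary message to a Slack channel',
  // referenceWorkflow: { nodes: [/* ... */], connections: {} }, // optional expected structure
};
```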
## Extending the Framework
To add new evaluation metrics:
1. Update the `EvaluationResult` schema in `types/evaluation.ts`
2. Modify the evaluation logic in `chains/workflow-evaluator.ts`
3. Update the evaluator in `langsmith/evaluator.ts` to include new metrics
4. Adjust weight calculations in `utils/evaluation-calculator.ts`
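For step 1, assuming the result schemas in `types/evaluation.ts` are defined with Zod, adding a metric could look like this (the schema and field names below are illustrative):
```typescript
// Illustrative sketch only - extend the evaluation result schema with a new category.
import { z } from 'zod';

const categoryScoreSchema = z.object({
  score: z.number().min(0).max(100),
  violations: z.array(
    z.object({
      severity: z.enum(['critical', 'major', 'minor']),
      description: z.string(),
      pointsDeducted: z.number(),
    }),
  ),
});

export const evaluationResultSchema = z.object({
  functionality: categoryScoreSchema,
  connections: categoryScoreSchema,
  expressions: categoryScoreSchema,
  nodeConfiguration: categoryScoreSchema,
  structuralSimilarity: categoryScoreSchema.optional(),
  // New metric added here; give it a weight in utils/evaluation-calculator.ts as well
  errorHandling: categoryScoreSchema,
});
```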