AI Workflow Builder Evaluations
This module provides an evaluation framework for testing the AI Workflow Builder's ability to generate correct n8n workflows from natural language prompts.
Architecture Overview
The evaluation system is split into two distinct modes:
- CLI Evaluation - Runs predefined test cases locally with progress tracking
- Langsmith Evaluation - Integrates with Langsmith for dataset-based evaluation and experiment tracking
Directory Structure
evaluations/
├── cli/ # CLI evaluation implementation
│ ├── runner.ts # Main CLI evaluation orchestrator
│ └── display.ts # Console output and progress tracking
├── langsmith/ # Langsmith integration
│ ├── evaluator.ts # Langsmith-compatible evaluator function
│ └── runner.ts # Langsmith evaluation orchestrator
├── core/ # Shared evaluation logic
│ ├── environment.ts # Test environment setup and configuration
│ └── test-runner.ts # Core test execution logic
├── types/ # Type definitions
│ ├── evaluation.ts # Evaluation result schemas
│ ├── test-result.ts # Test result interfaces
│ └── langsmith.ts # Langsmith-specific types and guards
├── chains/ # LLM evaluation chains
│ ├── test-case-generator.ts # Dynamic test case generation
│ └── workflow-evaluator.ts # LLM-based workflow evaluation
├── utils/ # Utility functions
│ ├── evaluation-calculator.ts # Metrics calculation
│ ├── evaluation-helpers.ts # Common helper functions
│ └── evaluation-reporter.ts # Report generation
└── index.ts # Main entry point
Implementation Details
Core Components
1. Test Runner (core/test-runner.ts)
The core test runner handles individual test execution:
- Generates workflows using the WorkflowBuilderAgent
- Validates generated workflows using type guards
- Evaluates workflows against test criteria
- Returns structured test results with error handling
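For illustration, here is a minimal sketch of what that execution path could look like. The names (`runTestCase`, `generateWorkflow`, `evaluateWorkflow`) and shapes are assumptions, not the module's actual exports; the real `core/test-runner.ts` may differ.

```typescript
// Hypothetical sketch of a single-test execution path.
interface TestCase {
  id: string;
  name: string;
  prompt: string;
  referenceWorkflow?: unknown;
}

interface TestResult {
  testCase: TestCase;
  generatedWorkflow?: object;
  score?: number;
  error?: string;
}

async function runTestCase(
  testCase: TestCase,
  generateWorkflow: (prompt: string) => Promise<unknown>,
  evaluateWorkflow: (workflow: object, testCase: TestCase) => Promise<number>,
): Promise<TestResult> {
  try {
    // 1. Generate a workflow from the natural-language prompt
    const generated = await generateWorkflow(testCase.prompt);

    // 2. Validate the shape before evaluating (stand-in for the real type guards)
    if (typeof generated !== 'object' || generated === null) {
      return { testCase, error: 'Agent output is not a workflow object' };
    }

    // 3. Evaluate the workflow against the test criteria
    const score = await evaluateWorkflow(generated, testCase);
    return { testCase, generatedWorkflow: generated, score };
  } catch (error) {
    // 4. Return structured errors instead of throwing
    return { testCase, error: error instanceof Error ? error.message : String(error) };
  }
}
```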
2. Environment Setup (core/environment.ts)
Centralizes environment configuration:
- LLM initialization with API key validation
- Langsmith client setup
- Node types loading
- Concurrency and test generation settings
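A rough sketch of the kind of setup this centralizes. The variable names follow the Environment Variables section below; the function name and return shape are assumptions.

```typescript
// Hypothetical sketch of centralized environment configuration.
interface EvaluationConfig {
  anthropicApiKey: string;
  langsmithApiKey?: string;
  concurrency: number;
  generateTestCases: boolean;
}

function loadEvaluationConfig(): EvaluationConfig {
  const anthropicApiKey = process.env.N8N_AI_ANTHROPIC_KEY;
  if (!anthropicApiKey) {
    throw new Error('N8N_AI_ANTHROPIC_KEY is required for LLM access');
  }

  return {
    anthropicApiKey,
    langsmithApiKey: process.env.LANGSMITH_API_KEY,
    concurrency: Number(process.env.EVALUATION_CONCURRENCY ?? 5),
    generateTestCases: process.env.GENERATE_TEST_CASES === 'true',
  };
}
```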
3. Langsmith Integration
The Langsmith integration provides two key components:
Evaluator (langsmith/evaluator.ts):
- Converts Langsmith Run objects to evaluation inputs
- Validates all data using type guards before processing
- Safely extracts usage metadata without type coercion
- Returns structured evaluation results
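A minimal sketch of that evaluator shape, assuming the Langsmith JS SDK's run/example evaluator signature and its `Run`/`Example` types from `langsmith/schemas`. The output field names and the `scoreWorkflow` helper are placeholders, not the actual implementation.

```typescript
// Hypothetical evaluator sketch; scoring is delegated to a stand-in function.
import type { Example, Run } from 'langsmith/schemas';

interface EvaluatorScore {
  key: string;
  score: number;
  comment?: string;
}

async function workflowEvaluator(run: Run, example?: Example): Promise<EvaluatorScore> {
  const workflow = run.outputs?.workflow;

  // Validate before processing instead of coercing types
  if (typeof workflow !== 'object' || workflow === null) {
    return { key: 'workflow_quality', score: 0, comment: 'Run produced no valid workflow' };
  }

  // Delegate to the LLM-based evaluation chain (stand-in for chains/workflow-evaluator.ts)
  const score = await scoreWorkflow(workflow, example?.outputs?.referenceWorkflow);
  return { key: 'workflow_quality', score };
}

// Stand-in for the LLM-based scoring chain
declare function scoreWorkflow(workflow: object, reference?: unknown): Promise<number>;
```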
Runner (langsmith/runner.ts):
- Creates workflow generation functions compatible with Langsmith
- Validates message content before processing
- Extracts usage metrics safely from message metadata
- Handles dataset verification and error reporting
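Putting the two together, a hedged sketch of how the runner might wire a generation target and the evaluator into an experiment, assuming the SDK's `evaluate` helper from `langsmith/evaluation`. The dataset fallback name and the stand-in functions are placeholders.

```typescript
// Hypothetical wiring of a generation target and the evaluator into an experiment.
import { evaluate } from 'langsmith/evaluation';

async function runLangsmithEvaluation(): Promise<void> {
  // Placeholder fallback; the real default dataset name may differ
  const datasetName = process.env.LANGSMITH_DATASET_NAME ?? 'your-dataset-name';

  await evaluate(
    // Target: generate a workflow from each dataset example's prompt
    async (inputs: Record<string, unknown>) => {
      const workflow = await generateWorkflowFromPrompt(String(inputs.prompt));
      return { workflow };
    },
    {
      data: datasetName,
      evaluators: [workflowEvaluator],
      experimentPrefix: 'workflow-builder-evaluation',
    },
  );
}

// Stand-ins for the agent call and the evaluator sketched above
declare function generateWorkflowFromPrompt(prompt: string): Promise<object>;
declare function workflowEvaluator(
  run: unknown,
  example?: unknown,
): Promise<{ key: string; score: number }>;
```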
4. CLI Evaluation
The CLI evaluation provides local testing capabilities:
Runner (cli/runner.ts):
- Orchestrates parallel test execution with concurrency control
- Manages test case generation when enabled
- Generates detailed reports and saves results
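The concurrency control can be pictured as a simple worker pool. This is an illustrative sketch, not the actual `cli/runner.ts` implementation.

```typescript
// Hypothetical sketch of running test cases with a bounded level of parallelism.
async function runWithConcurrency<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = Number(process.env.EVALUATION_CONCURRENCY ?? 5),
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Start up to `concurrency` workers that each pull the next unprocessed item
  const workers = Array.from({ length: Math.min(concurrency, items.length) }, async () => {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  });

  await Promise.all(workers);
  return results;
}
```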
Display (cli/display.ts):
- Progress bar management for real-time feedback
- Console output formatting
- Error display and reporting
Evaluation Metrics
The system evaluates workflows across five categories:
- Functionality (30% weight)
  - Does the workflow achieve the intended goal?
  - Are the right nodes selected?
- Connections (25% weight)
  - Are nodes properly connected?
  - Is data flow logical?
- Expressions (20% weight)
  - Are n8n expressions syntactically correct?
  - Do they reference valid data paths?
- Node Configuration (15% weight)
  - Are node parameters properly set?
  - Are required fields populated?
- Structural Similarity (10% weight, optional)
  - How closely does the structure match a reference workflow?
  - Only evaluated when a reference workflow is provided
Violation Severity Levels
Violations are categorized by severity:
- Critical (-40 to -50 points): Workflow-breaking issues
- Major (-15 to -25 points): Significant problems affecting functionality
- Minor (-5 to -15 points): Non-critical issues or inefficiencies
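As a concrete illustration of how the category weights and per-violation deductions could combine into an overall score. The exact formula lives in utils/evaluation-calculator.ts and may differ; this is only a sketch.

```typescript
// Hypothetical scoring sketch: each category starts at 100, loses points per
// violation, and the overall score is the weighted sum of category scores.
type Category =
  | 'functionality'
  | 'connections'
  | 'expressions'
  | 'nodeConfiguration'
  | 'structuralSimilarity';

const CATEGORY_WEIGHTS: Record<Category, number> = {
  functionality: 0.3,
  connections: 0.25,
  expressions: 0.2,
  nodeConfiguration: 0.15,
  structuralSimilarity: 0.1,
};

interface Violation {
  category: Category;
  pointsDeducted: number; // critical ~40-50, major ~15-25, minor ~5-15
}

function calculateOverallScore(violations: Violation[]): number {
  let total = 0;
  for (const category of Object.keys(CATEGORY_WEIGHTS) as Category[]) {
    const deducted = violations
      .filter((v) => v.category === category)
      .reduce((sum, v) => sum + v.pointsDeducted, 0);
    const categoryScore = Math.max(0, 100 - deducted);
    total += CATEGORY_WEIGHTS[category] * categoryScore;
  }
  // Note: when no reference workflow is provided, the real calculator likely
  // skips structuralSimilarity and re-normalizes the remaining weights.
  return total; // 0-100
}
```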
Running Evaluations
CLI Evaluation
# Run with default settings
pnpm eval
# With additional generated test cases
GENERATE_TEST_CASES=true pnpm eval
# With custom concurrency
EVALUATION_CONCURRENCY=10 pnpm eval
Langsmith Evaluation
# Set required environment variables
export LANGSMITH_API_KEY=your_api_key
# Optionally specify dataset
export LANGSMITH_DATASET_NAME=your_dataset_name
# Run evaluation
pnpm eval:langsmith
Configuration
Required Files
nodes.json
IMPORTANT: The evaluation framework requires a nodes.json file in the evaluations root directory (evaluations/nodes.json).
This file contains all n8n node type definitions and is used by the AI Workflow Builder agent to:
- Know what nodes are available in n8n
- Understand node parameters and their schemas
- Generate valid workflows with proper node configurations
Why is this required?
The AI Workflow Builder agent needs access to node definitions to generate workflows. In a normal n8n runtime, these definitions are loaded automatically. However, since the evaluation framework instantiates the agent without a running n8n instance, we must provide the node definitions manually via nodes.json.
How to generate nodes.json:
- Start your n8n instance locally
- Download the node definitions from the locally running instance (http://localhost:5678/types/nodes.json)
- Save them to evaluations/nodes.json, for example:
  curl -o evaluations/nodes.json http://localhost:5678/types/nodes.json
The evaluation will fail with a clear error message if nodes.json is missing.
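For reference, the startup check could look roughly like this. It is a sketch assuming a synchronous load; the function name is not the module's actual export.

```typescript
// Hypothetical sketch of loading node type definitions with a clear failure mode.
import { existsSync, readFileSync } from 'node:fs';
import path from 'node:path';

function loadNodeTypes(evaluationsRoot: string): unknown[] {
  const nodesPath = path.join(evaluationsRoot, 'nodes.json');

  if (!existsSync(nodesPath)) {
    throw new Error(
      `nodes.json not found at ${nodesPath}. Download it from a running n8n instance: ` +
        'curl -o evaluations/nodes.json http://localhost:5678/types/nodes.json',
    );
  }

  return JSON.parse(readFileSync(nodesPath, 'utf-8'));
}
```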
Environment Variables
- N8N_AI_ANTHROPIC_KEY - Required for LLM access
- LANGSMITH_API_KEY - Required for Langsmith evaluation
- USE_LANGSMITH_EVAL - Set to "true" to use Langsmith mode
- LANGSMITH_DATASET_NAME - Override default dataset name
- EVALUATION_CONCURRENCY - Number of parallel test executions (default: 5)
- GENERATE_TEST_CASES - Set to "true" to generate additional test cases
- LLM_MODEL - Model identifier for metadata tracking
Output
CLI Evaluation Output
- Console Display: Real-time progress, test results, and summary statistics
- Markdown Report: results/evaluation-report-[timestamp].md
- JSON Results: results/evaluation-results-[timestamp].json
Langsmith Evaluation Output
- Results are stored in the Langsmith dashboard
- Experiment name format: workflow-builder-evaluation-[date]
- Includes detailed metrics for each evaluation category
Adding New Test Cases
Test cases are defined in chains/test-case-generator.ts. Each test case requires:
- id: Unique identifier
- name: Descriptive name
- prompt: Natural language description of the workflow to generate
- referenceWorkflow (optional): Expected workflow structure for comparison
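A hypothetical entry using only the fields listed above; the values are made up and the exact TypeScript type lives in the generator file.

```typescript
// Illustrative test case; field names follow the list above.
const exampleTestCase = {
  id: 'http-to-slack',
  name: 'Webhook to Slack notification',
  prompt:
    'When a webhook receives a POST request, send its JSON body as a Slack message to the #alerts channel.',
  // referenceWorkflow is optional and only used for structural similarity scoring
};
```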
Extending the Framework
To add new evaluation metrics:
- Update the EvaluationResult schema in types/evaluation.ts
- Modify the evaluation logic in chains/workflow-evaluator.ts
- Update the evaluator in langsmith/evaluator.ts to include new metrics
- Adjust weight calculations in utils/evaluation-calculator.ts
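For example, adding a hypothetical "security" category might look like the sketch below. It assumes the schema is zod-based and that weights are kept in a simple map; both are assumptions about the actual files.

```typescript
// Hypothetical sketch of adding a "security" category, assuming a zod-based schema.
import { z } from 'zod';

// types/evaluation.ts (sketch): add the new per-category score
const evaluationResultSchema = z.object({
  functionality: z.number().min(0).max(100),
  connections: z.number().min(0).max(100),
  expressions: z.number().min(0).max(100),
  nodeConfiguration: z.number().min(0).max(100),
  structuralSimilarity: z.number().min(0).max(100).optional(),
  security: z.number().min(0).max(100), // new category
});

// utils/evaluation-calculator.ts (sketch): rebalance weights so they still sum to 1
const WEIGHTS = {
  functionality: 0.25,
  connections: 0.25,
  expressions: 0.2,
  nodeConfiguration: 0.1,
  structuralSimilarity: 0.1,
  security: 0.1,
};
```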