feat: Evaluation framework for AI Workflow Builder (#18016)

This commit is contained in:
oleg
2025-08-20 11:11:14 +02:00
committed by GitHub
parent 309048ae3c
commit fb3a2ae216
20 changed files with 2586 additions and 4 deletions

View File

@@ -0,0 +1,205 @@
# AI Workflow Builder Evaluations
This module provides an evaluation framework for testing the AI Workflow Builder's ability to generate correct n8n workflows from natural language prompts.
## Architecture Overview
The evaluation system is split into two distinct modes:
1. **CLI Evaluation** - Runs predefined test cases locally with progress tracking
2. **Langsmith Evaluation** - Integrates with Langsmith for dataset-based evaluation and experiment tracking
### Directory Structure
```
evaluations/
├── cli/ # CLI evaluation implementation
│ ├── runner.ts # Main CLI evaluation orchestrator
│ └── display.ts # Console output and progress tracking
├── langsmith/ # Langsmith integration
│ ├── evaluator.ts # Langsmith-compatible evaluator function
│ └── runner.ts # Langsmith evaluation orchestrator
├── core/ # Shared evaluation logic
│ ├── environment.ts # Test environment setup and configuration
│ └── test-runner.ts # Core test execution logic
├── types/ # Type definitions
│ ├── evaluation.ts # Evaluation result schemas
│ ├── test-result.ts # Test result interfaces
│ └── langsmith.ts # Langsmith-specific types and guards
├── chains/ # LLM evaluation chains
│ ├── test-case-generator.ts # Dynamic test case generation
│ └── workflow-evaluator.ts # LLM-based workflow evaluation
├── utils/ # Utility functions
│ ├── evaluation-calculator.ts # Metrics calculation
│ ├── evaluation-helpers.ts # Common helper functions
│   └── evaluation-reporter.ts # Report generation
└── index.ts # Main entry point
```
## Implementation Details
### Core Components
#### 1. Test Runner (`core/test-runner.ts`)
The core test runner handles individual test execution:
- Generates workflows using the WorkflowBuilderAgent
- Validates generated workflows using type guards
- Evaluates workflows against test criteria
- Returns structured test results with error handling
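For illustration, a single test case flows through the runner roughly like this (a sketch using the exported helpers, with imports relative to the `evaluations/` directory; the test case shown is hypothetical):
```typescript
import { setupTestEnvironment, createAgent } from './core/environment.js';
import { runSingleTest } from './core/test-runner.js';
import type { TestCase } from './types/evaluation.js';

async function runOneTest(): Promise<void> {
	// Load nodes.json, initialize the LLM, and optionally set up tracing
	const { parsedNodeTypes, llm, tracer } = await setupTestEnvironment();

	// Each test gets its own agent instance to avoid shared state between runs
	const agent = createAgent(parsedNodeTypes, llm, tracer);

	// Hypothetical test case, for illustration only
	const testCase: TestCase = {
		id: 'hourly-api-digest',
		name: 'Hourly API digest',
		prompt: 'Fetch JSON from an API every hour and post a short summary to Slack.',
	};

	const result = await runSingleTest(agent, llm, testCase);
	console.log(result.error ?? `Overall score: ${result.evaluationResult.overallScore}`);
}
```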
#### 2. Environment Setup (`core/environment.ts`)
Centralizes environment configuration:
- LLM initialization with API key validation
- Langsmith client setup
- Node types loading
- Concurrency and test generation settings
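A small sketch of how these helpers are typically consumed (the values come from the environment variables documented below):
```typescript
import {
	setupTestEnvironment,
	getConcurrencyLimit,
	shouldGenerateTestCases,
	howManyTestCasesToGenerate,
} from './core/environment.js';

async function showEnvironmentConfig(): Promise<void> {
	// Loads nodes.json, validates N8N_AI_ANTHROPIC_KEY, and sets up optional Langsmith tracing
	const { parsedNodeTypes, tracer } = await setupTestEnvironment();

	// Derived from EVALUATION_CONCURRENCY, GENERATE_TEST_CASES, and GENERATE_TEST_CASES_COUNT
	const concurrency = getConcurrencyLimit(); // defaults to 5
	const extraCases = shouldGenerateTestCases() ? howManyTestCasesToGenerate() : 0; // defaults to 10 when enabled

	console.log(`Loaded ${parsedNodeTypes.length} node types`);
	console.log(`Concurrency: ${concurrency}, generated cases: ${extraCases}, tracing: ${Boolean(tracer)}`);
}
```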
#### 3. Langsmith Integration
The Langsmith integration provides two key components:
**Evaluator (`langsmith/evaluator.ts`):**
- Converts Langsmith Run objects to evaluation inputs
- Validates all data using type guards before processing
- Safely extracts usage metadata without type coercion
- Returns structured evaluation results
**Runner (`langsmith/runner.ts`):**
- Creates workflow generation functions compatible with Langsmith
- Validates message content before processing
- Extracts usage metrics safely from message metadata
- Handles dataset verification and error reporting
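At a high level, the runner wires the generator and evaluator into Langsmith's `evaluate()` call. Below is a simplified sketch of that wiring; the real target function in `langsmith/runner.ts` streams `agent.chat()` and extracts the workflow, prompt, and usage from the final agent state, which this sketch elides:
```typescript
import { evaluate } from 'langsmith/evaluation';
import { createLangsmithEvaluator } from './langsmith/evaluator.js';
import { setupTestEnvironment, createAgent } from './core/environment.js';

async function runDatasetEvaluation(datasetName: string): Promise<void> {
	const { parsedNodeTypes, llm, tracer } = await setupTestEnvironment();

	// Simplified target: the real generator streams the agent and reads its final state
	const generateWorkflow = async (inputs: { messages: Array<{ content: string }> }) => {
		const agent = createAgent(parsedNodeTypes, llm, tracer);
		void agent; // workflow generation elided in this sketch
		return {
			workflow: { name: 'stub', nodes: [], connections: {} },
			prompt: String(inputs.messages[0]?.content ?? ''),
		};
	};

	await evaluate(generateWorkflow, {
		data: datasetName,
		evaluators: [createLangsmithEvaluator(llm)],
		maxConcurrency: 7,
		experimentPrefix: 'workflow-builder-evaluation',
	});
}
```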
#### 4. CLI Evaluation
The CLI evaluation provides local testing capabilities:
**Runner (`cli/runner.ts`):**
- Orchestrates parallel test execution with concurrency control
- Manages test case generation when enabled
- Generates detailed reports and saves results
**Display (`cli/display.ts`):**
- Progress bar management for real-time feedback
- Console output formatting
- Error display and reporting
### Evaluation Metrics
The system evaluates workflows across five categories:
1. **Functionality** (35% weight)
   - Does the workflow achieve the intended goal?
   - Are the right nodes selected?
2. **Connections** (25% weight)
   - Are nodes properly connected?
   - Is data flow logical?
3. **Expressions** (25% weight)
   - Are n8n expressions syntactically correct?
   - Do they reference valid data paths?
4. **Node Configuration** (15% weight)
   - Are node parameters properly set?
   - Are required fields populated?
5. **Structural Similarity** (5% weight, optional)
   - How closely does the structure match a reference workflow?
   - Only evaluated when a reference workflow is provided
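The overall score is a weighted average of the applicable category scores, normalized by the total weight actually applied (structural similarity only counts when a reference workflow is provided). A sketch mirroring `calculateWeightedScore` in `chains/workflow-evaluator.ts`:
```typescript
const weights = {
	functionality: 0.35,
	connections: 0.25,
	expressions: 0.25,
	nodeConfiguration: 0.15,
	structuralSimilarity: 0.05,
};

// Scores are on a 0-1 scale; the sum is normalized by the total weight actually applied
function overallScore(scores: Record<keyof typeof weights, number>, hasReference: boolean): number {
	const keys = (Object.keys(weights) as Array<keyof typeof weights>).filter(
		(key) => key !== 'structuralSimilarity' || hasReference,
	);
	const totalWeight = keys.reduce((sum, key) => sum + weights[key], 0);
	const weightedSum = keys.reduce((sum, key) => sum + scores[key] * weights[key], 0);
	return totalWeight > 0 ? weightedSum / totalWeight : 0;
}
```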
### Violation Severity Levels
Violations are categorized by severity:
- **Critical** (-40 to -50 points): Workflow-breaking issues
- **Major** (-15 to -25 points): Significant problems affecting functionality
- **Minor** (-5 to -10 points): Non-critical issues or inefficiencies
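Each violation the evaluator reports carries its severity and the points it deducts from that category (categories start at 100 points, are clamped at 0, and are then scaled to 0-1). A small sketch of the shape, mirroring the schema in `types/evaluation.ts`:
```typescript
// Mirrors the zod violation schema in types/evaluation.ts
type ViolationSeverity = 'critical' | 'major' | 'minor';

interface Violation {
	type: ViolationSeverity;
	description: string;
	pointsDeducted: number; // subtracted from the category's starting score of 100
}
```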
## Running Evaluations
### CLI Evaluation
```bash
# Run with default settings
pnpm eval
# With additional generated test cases
GENERATE_TEST_CASES=true pnpm eval
# With custom concurrency
EVALUATION_CONCURRENCY=10 pnpm eval
```
### Langsmith Evaluation
```bash
# Set required environment variables
export LANGSMITH_API_KEY=your_api_key
# Optionally specify dataset
export LANGSMITH_DATASET_NAME=your_dataset_name
# Run evaluation
pnpm eval:langsmith
```
## Configuration
### Required Files
#### nodes.json
**IMPORTANT**: The evaluation framework requires a `nodes.json` file in the evaluations root directory (`evaluations/nodes.json`).
This file contains all n8n node type definitions and is used by the AI Workflow Builder agent to:
- Know what nodes are available in n8n
- Understand node parameters and their schemas
- Generate valid workflows with proper node configurations
**Why is this required?**
The AI Workflow Builder agent needs access to node definitions to generate workflows. In a normal n8n runtime, these definitions are loaded automatically. However, since the evaluation framework instantiates the agent without a running n8n instance, we must provide the node definitions manually via `nodes.json`.
**How to generate nodes.json:**
1. Run your n8n instance
2. Download the node definitions from the locally running instance (http://localhost:5678/types/nodes.json)
3. Save the node definitions to `evaluations/nodes.json`
The evaluation will fail with a clear error message if `nodes.json` is missing.
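For example, a small helper script can fetch the definitions from a locally running instance (a sketch, assuming the default port 5678 and Node.js 18+ for the global `fetch`):
```typescript
import { writeFileSync } from 'fs';
import { join } from 'path';

async function downloadNodeTypes(): Promise<void> {
	// Assumes n8n is running locally on the default port
	const response = await fetch('http://localhost:5678/types/nodes.json');
	if (!response.ok) {
		throw new Error(`Failed to fetch node types: ${response.status} ${response.statusText}`);
	}
	// Save next to the evaluation code, i.e. evaluations/nodes.json
	writeFileSync(join(__dirname, 'nodes.json'), await response.text(), 'utf-8');
	console.log('Saved node definitions to evaluations/nodes.json');
}

downloadNodeTypes().catch(console.error);
```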
### Environment Variables
- `N8N_AI_ANTHROPIC_KEY` - Required for LLM access
- `LANGSMITH_API_KEY` - Required for Langsmith evaluation
- `USE_LANGSMITH_EVAL` - Set to "true" to use Langsmith mode
- `LANGSMITH_DATASET_NAME` - Override default dataset name
- `EVALUATION_CONCURRENCY` - Number of parallel test executions (default: 5)
- `GENERATE_TEST_CASES` - Set to "true" to generate additional test cases
- `GENERATE_TEST_CASES_COUNT` - Number of additional test cases to generate (default: 10)
- `LLM_MODEL` - Model identifier for metadata tracking
## Output
### CLI Evaluation Output
- **Console Display**: Real-time progress, test results, and summary statistics
- **Markdown Report**: `results/evaluation-report-[timestamp].md`
- **JSON Results**: `results/evaluation-results-[timestamp].json`
### Langsmith Evaluation Output
- Results are stored in Langsmith dashboard
- Experiment name format: `workflow-builder-evaluation-[date]`
- Includes detailed metrics for each evaluation category
## Adding New Test Cases
Test cases are defined in `chains/test-case-generator.ts`. Each test case requires:
- `id`: Unique identifier
- `name`: Descriptive name
- `prompt`: Natural language description of the workflow to generate
- `referenceWorkflow` (optional): Expected workflow structure for comparison
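For example, a new entry appended to `basicTestCases` might look like this (a hypothetical case, not part of the shipped set):
```typescript
import type { TestCase } from '../types/evaluation';

const csvToSlackDigest: TestCase = {
	id: 'csv-to-slack-digest',
	name: 'CSV to Slack digest',
	prompt:
		'Every weekday at 9 AM, read new rows from a CSV file, summarize them with AI, and post the summary to a Slack channel.',
	// referenceWorkflow: { ... }, // optional: supply an expected workflow for structural comparison
};
```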
## Extending the Framework
To add new evaluation metrics:
1. Update the `EvaluationResult` schema in `types/evaluation.ts`
2. Modify the evaluation logic in `chains/workflow-evaluator.ts`
3. Update the evaluator in `langsmith/evaluator.ts` to include new metrics
4. Adjust weight calculations in `utils/evaluation-calculator.ts`
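As a rough illustration of step 1, a new category could be declared alongside the existing ones (a sketch only; the real schema lives in `types/evaluation.ts`):
```typescript
import { z } from 'zod';

// Hypothetical new category, shaped like the existing categoryScoreSchema
const errorHandlingScoreSchema = z.object({
	violations: z.array(
		z.object({
			type: z.enum(['critical', 'major', 'minor']),
			description: z.string(),
			pointsDeducted: z.number().min(0),
		}),
	),
	score: z.number().min(0).max(1),
});

// Then add `errorHandling: errorHandlingScoreSchema` to evaluationResultSchema,
// cover it in the evaluator prompt (chains/workflow-evaluator.ts), report it from
// langsmith/evaluator.ts, and give it a weight in the score calculation.
```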

View File

@@ -0,0 +1,145 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { SystemMessage } from '@langchain/core/messages';
import { ChatPromptTemplate, HumanMessagePromptTemplate } from '@langchain/core/prompts';
import { OperationalError } from 'n8n-workflow';
import { z } from 'zod';
import type { TestCase } from '../types/evaluation';
const testCasesSchema = z.object({
testCases: z.array(
z.object({
id: z.string(),
name: z.string(),
summary: z.string(),
prompt: z.string(),
}),
),
});
const systemPrompt = `You are an expert at generating diverse test cases for an n8n workflow builder AI. Create test cases that cover various real-world scenarios and complexity levels.
## Test Case Requirements:
1. **Simple Test Cases**: Single operation workflows
- API calls
- Data transformations
- File operations
- Basic integrations
2. **Medium Test Cases**: Multi-step workflows with logic
- Conditional logic (IF nodes)
- Data filtering and transformation
- Multiple API integrations
- Error handling
3. **Complex Test Cases**: Advanced workflows
- Parallel execution branches
- Complex error handling and retry logic
- Multiple integrations with data synchronization
- Webhooks and event-driven flows
## Guidelines:
- Create realistic business scenarios
- Include specific requirements that can be evaluated
- Vary the domains (e-commerce, HR, marketing, DevOps, etc.)
- Include both common and edge-case scenarios
- Make prompts clear and unambiguous
- Specify expected node types when possible
## Output Format:
Each test case should have:
- Unique ID (e.g., "test_001")
- Descriptive name
- Brief description
- Clear prompt that a user would give
- Expected node types (array of node names)
- Complexity level
- Relevant tags`;
const humanTemplate = `Generate {count} diverse test cases for workflow generation evaluation.
Focus on:
{focus}
Ensure a good mix of complexity levels and use cases.`;
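/**
* Creates a prompt + structured-output chain that generates evaluation test cases
* @param llm - Language model to use (must support tool binding for structured output)
* @returns Runnable chain producing results matching testCasesSchema
*/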
export function createTestCaseGeneratorChain(llm: BaseChatModel) {
if (!llm.bindTools) {
throw new OperationalError("LLM doesn't support binding tools");
}
const prompt = ChatPromptTemplate.fromMessages([
new SystemMessage(systemPrompt),
HumanMessagePromptTemplate.fromTemplate(humanTemplate),
]);
const llmWithStructuredOutput = llm.withStructuredOutput(testCasesSchema);
return prompt.pipe(llmWithStructuredOutput);
}
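/**
* Generates additional test cases for evaluation using the LLM
* @param llm - Language model to use for generation
* @param count - Number of test cases to generate (default: 10)
* @param focus - Thematic focus for the generated scenarios
* @returns Array of generated test cases
*/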
export async function generateTestCases(
llm: BaseChatModel,
count: number = 10,
focus: string = 'balanced mix of API integrations, data processing, and automation scenarios',
): Promise<TestCase[]> {
const chain = createTestCaseGeneratorChain(llm);
const result = (await chain.invoke({
count,
focus,
})) as z.infer<typeof testCasesSchema>;
return result.testCases;
}
export const basicTestCases: TestCase[] = [
{
id: 'invoice-pipeline',
name: 'Invoice processing pipeline',
prompt:
'Create an invoice parsing workflow using n8n forms. Extract key information (vendor, date, amount, line items) using AI, validate the data, and store structured information in Airtable. Generate a weekly spending report every Sunday at 6 PM using AI analysis and send via email.',
},
{
id: 'ai-news-digest',
name: 'Daily AI news digest',
prompt:
'Create a workflow that fetches the latest AI news every morning at 8 AM. It should aggregate news from multiple sources, use LLM to summarize the top 5 stories, generate a relevant image using AI, and send everything as a structured Telegram message with article links. I should be able to chat about the news with the LLM so at least 40 last messages should be stored.',
},
{
id: 'rag-assistant',
name: 'RAG knowledge assistant',
prompt:
'Build a pipeline that accepts PDF, CSV, or JSON files through an n8n form. Chunk documents into 1000-token segments, generate embeddings, and store in a vector database. Use the filename as the document key and add metadata including upload date and file type. Include a chatbot that can answer questions based on a knowledge base.',
},
{
id: 'email-summary',
name: 'Summarize emails with AI',
prompt:
'Build a workflow that retrieves the last 50 emails from multiple email accounts. Merge all emails, perform AI analysis to identify action items, priorities, and sentiment. Generate a brief summary and send to Slack with categorized insights and recommended actions.',
},
{
id: 'youtube-auto-chapters',
name: 'YouTube video chapters',
prompt:
"I want to build an n8n workflow that automatically creates YouTube chapter timestamps by analyzing the video captions. When I trigger it manually, it should take a video ID as input, fetch the existing video metadata and captions from YouTube, use an AI language model like Google Gemini to parse the transcript into chapters with timestamps, and then update the video's description with these chapters appended. The goal is to save time and improve SEO by automating the whole process.",
},
{
id: 'pizza-delivery',
name: 'Pizza delivery chatbot',
prompt:
"I need an n8n workflow that creates a chatbot for my pizza delivery service. The bot should be able to answer customer questions about our pizza menu, take their orders accurately by capturing pizza type, quantity, and customer details, and also provide real-time updates when customers ask about their order status. It should use OpenAI's gpt-4.1-mini to handle conversations and integrate with HTTP APIs to get product info and manage orders. The workflow must maintain conversation context so the chatbot feels natural and can process multiple user queries sequentially.",
},
{
id: 'lead-qualification',
name: 'Lead qualification and call scheduling',
prompt:
'Create a form with fields for email, company, and role. Build an automation that processes form submissions, enriches them with company data from their website, uses AI to qualify the lead, and sends the data to Google Sheets. For high-score leads it should also schedule a 15-min call in a free slot in my calendar and send a confirmation email to both me and the lead.',
},
{
id: 'multi-agent-research',
name: 'Multi-agent research workflow',
prompt:
'Create a multi-agent AI workflow where different AI agents collaborate to research a topic, fact-check information, and compile comprehensive reports.',
},
];

View File

@@ -0,0 +1,408 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { SystemMessage } from '@langchain/core/messages';
import { ChatPromptTemplate, HumanMessagePromptTemplate } from '@langchain/core/prompts';
import { OperationalError } from 'n8n-workflow';
import type { z } from 'zod';
import { evaluationResultSchema, type EvaluationInput } from '../types/evaluation';
const systemPrompt = `You are an expert n8n workflow evaluator. Your task is to evaluate a generated n8n workflow against a user's requirements and compare it to a reference workflow. Score the workflow across multiple categories and identify specific violations.
## Inputs Provided:
1. **User Prompt**: The original request describing what the workflow should do
2. **Reference Workflow**: An example workflow (optional)
3. **Generated Workflow**: The workflow to evaluate
## Understanding n8n AI Node Architecture
### AI Sub-nodes vs Main Nodes
n8n has two types of connections:
1. **Main connections**: Carry actual data between nodes (use "main" type)
2. **AI connections**: Provide capabilities to AI nodes (use "ai_*" types like ai_document, ai_textSplitter, ai_embedding, ai_tool, ai_languageModel, ai_memory)
### Important: AI Sub-nodes Are NOT Part of Main Data Flow
- Document Loader, Token Splitter, Embeddings nodes are AI sub-nodes
- They connect via ai_* connections to provide capabilities, NOT to process data
- Example: Document Loader -> Vector Store via "ai_document" provides document processing capability
- The actual data flows through main connections: Form -> Vector Store via "main"
### Valid AI Connection Patterns:
- Token Splitter -> Document Loader [ai_textSplitter]
- Document Loader -> Vector Store [ai_document]
- Embeddings -> Vector Store [ai_embedding]
- Tool nodes -> AI Agent [ai_tool]
- These nodes do NOT need main connections from data sources
### Tool Nodes and $fromAI Expressions:
- ANY node ending with "Tool" that has ai_tool connections supports $fromAI expressions
- $fromAI allows the AI Agent to dynamically populate parameters at runtime
- Format: {{ $fromAI('parameterName', 'description', 'type', defaultValue) }}
- This is the CORRECT pattern for tool nodes connected to AI Agents
## Evaluation Categories and Scoring
### 1. Functional Correctness (35% weight)
Evaluate whether the workflow correctly implements what the user EXPLICITLY requested.
**DO NOT penalize for:**
- Missing optimizations not requested by user
- Missing features that would be "nice to have" but weren't specified
- Alternative valid approaches to solve the same problem
**Check for these violations:**
- **Critical (-40 to -50 points)**:
- Missing core functionality explicitly requested
- Incorrect operation logic that prevents the workflow from working
- Workflows missing a trigger node when they need to start automatically or by some external event
- **Major (-15 to -25 points)**:
- Missing explicitly required data transformations
- Incomplete implementation of requested features
- Using completely wrong node type for the task (e.g., Set node when HTTP Request is clearly needed)
- Workflows that would fail immediately on first execution due to structural issues
- **Minor (-5 to -10 points)**:
- Missing optional features explicitly mentioned by user
- Using less optimal but functional node choices
**Questions to consider:**
- Does the workflow perform all EXPLICITLY requested operations?
- Are the operations in the correct logical sequence?
- Does it handle all scenarios mentioned in the user prompt?
- Are data transformations implemented as requested?
### 2. Connections (25% weight)
Evaluate whether nodes are properly connected with correct data flow.
**Understanding AI connections:**
- AI sub-nodes (Document Loader, Token Splitter, Embeddings, etc.) connect via ai_* connections
- They do NOT need main connections from data sources
- Main data flows directly to the consumer node (e.g., Form -> Vector Store)
**Check for these violations:**
- **Critical (-40 to -50 points)**: Disconnected main nodes that process data (not AI sub-nodes), wrong execution order
- **Major (-15 to -25 points)**: Missing data dependencies between main nodes, parallel execution errors
- **Minor (-5 to -10 points)**: Redundant connections, suboptimal routing
**DO NOT penalize:**
- AI sub-nodes without main input connections (they use ai_* connections)
- Document Loader/Token Splitter not connected to Form (correct pattern)
- Tool nodes connected only via ai_tool connections (correct pattern)
**Questions to consider:**
- Are main data processing nodes properly connected?
- Do connections follow the logical flow of data?
- Are AI sub-nodes correctly connected via ai_* connections?
- Are nodes that depend on each other's data properly connected in sequence?
### Understanding Conditional Nodes (IF, Switch)
- Conditional nodes have multiple outputs (true/false branches)
- Not all branches need to be connected if logic doesn't require it
- Empty/unconnected branches are valid when that condition isn't handled
- Focus on whether the INTENDED logic flow is correct
### 3. Expressions (25% weight)
Evaluate whether expressions correctly reference nodes and data using modern n8n syntax.
**Correct n8n expression syntax uses \`{{ $('Node Name').item.json.field }}\` format**
**Valid expression patterns (DO NOT penalize):**
- $fromAI() in ANY tool node: \`{{ $fromAI('parameterName', 'description', 'type', defaultValue) }}\`
- Tool nodes are identified by: node type ending with "Tool" AND having ai_tool connections
- String concatenation with embedded expressions: \`"=Text {{ expression }}"\` or \`"=Text - {{ $now.format('format') }}"\`
- Mixed static text and expressions: \`"=Order #{{ $json.orderId }} processed"\`
- Alternative but functionally equivalent syntax variations
- Expression syntax that would work even if not optimal:
- Single '=' for simple strings (e.g., '=Weekly Report')
- String with embedded expressions (e.g., \`"=Report - {{ $now.format('MMMM d, yyyy') }}"\`)
- Different date formatting approaches that produce valid output
- String concatenation using various valid methods
- Focus on whether expressions would cause runtime FAILURES, not style preferences
**Check for these violations:**
- **Critical (-40 to -50 points)**:
- Invalid JavaScript syntax that would cause runtime errors (unclosed brackets, syntax errors, malformed JSON)
- Referencing truly non-existent nodes or fields that would cause runtime errors
- **Major (-20 to -25 points)**:
- Missing required = prefix for expressions (e.g., \`{{ $json.name }}\` instead of \`={{ $json.name }}\`)
- Using $fromAI in non-tool nodes (would cause runtime error)
- Referencing undefined variables or functions
- Wrong data paths that would prevent execution, type mismatches
- **Minor (-5 to -10 points)**:
- Inefficient but working expressions
- Outdated syntax that still functions (e.g., \`$node["NodeName"]\` instead of \`$('NodeName')\`)
**Valid n8n expression formats (MODERN SYNTAX - Preferred):**
- Single item: \`={{ $('Node Name').item.json.fieldName }}\`
- All items: \`={{ $('Node Name').all() }}\`
- First/last item: \`={{ $('Node Name').first().json.field }}\` or \`={{ $('Node Name').last().json.field }}\`
- Array index: \`={{ $('Node Name').all()[0].json.fieldName }}\`
- Previous node: \`={{ $json.fieldName }}\` or \`={{ $input.item.json.field }}\`
- Current item JSON: \`={{ $json }}\`
- String with text and expression: \`="Text prefix {{ expression }} text suffix"\`
- String with embedded date: \`="Report - {{ $now.format('MMMM d, yyyy') }}"\`
- Tool nodes ONLY: \`={{ $fromAI('parameterName', 'description') }}\` - Dynamic parameter for AI Agent to populate
**Valid JavaScript operations in expressions:**
- Array methods: \`={{ $json.items.map(item => item.name).join(', ') }}\`
- String operations: \`={{ $json.text.split(',').filter(x => x) }}\`
- Math operations: \`={{ Math.round($json.price * 1.2) }}\`
- Object operations: \`={{ Object.keys($json).length }}\`
- Conditional logic: \`={{ $json.status === 'active' ? 'Yes' : 'No' }}\`
**Special n8n variables (DO NOT penalize):**
- \`$now\` - Current date/time with methods like .format(), .toISO()
- \`$today\` - Today's date
- \`$execution.id\` - Current execution ID
- \`$workflow.id\` / \`$workflow.name\` - Workflow metadata
- \`$env\` - Environment variables
- \`$vars\` - Workflow variables
- \`$binary\` - Binary data access
**OUTDATED syntax (Minor penalty - still works but not preferred):**
- \`$node["NodeName"]\` - Old syntax, should use \`$('NodeName')\` instead
- \`$items()\` - Old syntax for accessing all items
**IMPORTANT about the = prefix:**
- The \`=\` sign prefix is REQUIRED when you want to use expressions or mixed text/expressions
- For pure static text without any expressions, the \`=\` is optional (but harmless if included)
- Examples:
- \`"Hello World"\` - Static text, no \`=\` needed
- \`="Hello World"\` - Also valid for static text
- \`="{{ $json.name }}"\` - Expression, \`=\` REQUIRED
- \`="Hello {{ $json.name }}"\` - Mixed text/expression, \`=\` REQUIRED
**Important: $fromAI is ONLY valid in tool nodes (ending with "Tool"), NOT in regular nodes**
**DO NOT penalize these valid expression patterns:**
- Simple = prefix for strings: \`="Hello World"\`
- Mixed text/expression: \`="Total: {{ $json.amount }}"\`
- JavaScript operations that are syntactically correct
- Any working expression format, even if not optimal
- Alternative date formats that produce valid output
- String concatenation in any valid form
### Expression Context Understanding
When evaluating expressions, consider the data flow context:
- Field names may differ between nodes (e.g., 'articles' in one node, 'topArticles' in another)
- Check if the referenced field EXISTS in the source node's output
- Consider field name transformations between nodes
- If a field doesn't exist but a similar one does, it's likely a naming mismatch
- Example: If evaluating \`$('Node').item.json.articles\` but Node outputs 'topArticles', this is a minor issue if the data type matches
### 4. Node Configuration (15% weight)
Evaluate whether nodes are configured with correct parameters and settings.
**Valid placeholder patterns (DO NOT penalize):**
- \`<UNKNOWN>\` values when user didn't specify concrete values
- Empty strings ("") in configuration fields when not provided by user
- Empty strings in resource selectors (base/table/document IDs)
- Placeholder API keys like "YOUR_API_KEY" or similar patterns
- These are ALL valid user configuration points, not errors
**Important**: Empty string ("") and \`<UNKNOWN>\` are BOTH valid placeholders
**Check for these violations:**
- **Critical (-40 to -50 points)**: ONLY for actual breaking issues:
- User provided specific value that's incorrectly implemented
- Truly required parameters completely absent (not empty/placeholder):
- HTTP Request without URL (unless using $fromAI)
- Database operations without operation type specified
- Code node without any code
- Parameters with invalid values that would crash:
- Invalid JSON in JSON fields
- Non-numeric values in number-only fields
- Configuration that would cause runtime crash
- **Major (-15 to -25 points)**:
- Wrong operation mode when explicitly specified by user
- Significant deviation from requested behavior
- Missing resource/operation selection that prevents node from functioning
- **Minor (-5 to -10 points)**:
- Suboptimal but working configurations
- Style preferences or minor inefficiencies
**Special handling for Tool nodes:**
- $fromAI expressions are VALID in ANY tool node (nodes ending with "Tool")
- Tool nodes connected via ai_tool allow AI Agents to populate parameters dynamically
- Format: \`{{ $fromAI('parameter', 'description') }}\` is correct and expected
- DO NOT penalize $fromAI in tool node parameters
**Questions to consider:**
- Are parameters correctly set based on what the user actually specified?
- Are placeholder values used appropriately for unspecified parameters?
- Are the correct operations selected based on user requirements?
- Are field mappings complete for the requested functionality?
### 5. Structural Similarity to Reference (0% if no reference provided)
If a reference workflow is provided, evaluate how well the generated workflow follows similar patterns.
**Only evaluate this if a reference workflow is provided. Check for:**
- Uses similar node types for similar operations
- Follows similar architectural patterns
- Adopts consistent naming conventions
## Context-Aware Evaluation
### Compare Against User Request
- Only penalize missing features that were explicitly requested
- Don't penalize missing optional enhancements
- Consider what information was actually provided by the user
### Severity Guidelines:
- If user didn't provide email addresses, \`<UNKNOWN>\` is expected
- If user didn't specify API keys, placeholder values are valid
- If user didn't provide specific IDs or credentials, empty/placeholder values are correct
- Focus on structural correctness, not specific values
## Common Correct Patterns (DO NOT flag as violations)
### RAG Workflow Pattern:
- Form Trigger -> Vector Store (main) - carries file data
- Token Splitter -> Document Loader (ai_textSplitter) - provides chunking capability
- Document Loader -> Vector Store (ai_document) - provides document processing capability
- Embeddings -> Vector Store (ai_embedding) - provides embedding generation capability
- The Document Loader and Token Splitter do NOT need connections from Form
### AI Agent with Tools:
- Chat Trigger -> AI Agent (main) - carries user input
- ANY Tool node -> AI Agent (ai_tool) - provides tool capabilities
- Tool nodes use $fromAI for dynamic parameter population by the AI
### Tool Node Parameters (ANY node ending with "Tool"):
- Parameters with $fromAI expressions - VALID and expected pattern
- Allows AI to dynamically determine values at runtime
- Examples: email recipients, message content, search queries, API parameters
- Format: {{ $fromAI('key', 'description', 'type', defaultValue) }}
- Only 'key' is required, 'description', 'type', and 'defaultValue' are optional
### Placeholder Values:
- Empty credential fields - user configuration point, not error
- <UNKNOWN> in required fields - valid when user didn't specify
- YOUR_API_KEY placeholders - expected for user configuration
## Scoring Instructions:
1. Start each category at 100 points
2. Deduct points for each violation found based on severity
3. A category score cannot go below 0
4. Convert scores to 0-1 scale by dividing by 100
5. Do NOT calculate the weighted final score yourself - just provide individual category scores
### Severity Level Guidelines:
**When to use Critical (-40 to -50 points):**
- Only for violations that would cause complete failure
- Missing core functionality explicitly requested
- Completely broken connections that prevent execution
- Fatal expression errors that would crash the workflow
- Invalid syntax that prevents parsing
**When to use Major (-15 to -25 points):**
- Issues that significantly impact functionality
- Missing important features explicitly mentioned
- Incorrect data flow that affects results
- Wrong operation modes when specifically requested
- Errors that would likely cause runtime failures
**When to use Minor (-5 to -10 points):**
- Style preferences and inefficiencies
- Alternative valid approaches
- Field naming inconsistencies that don't break functionality
- Missing nice-to-have features not explicitly requested
- Outdated but functional syntax
**Apply severity based on actual impact on workflow execution:**
- Consider: Will this definitely break? (Critical)
- Will this likely cause issues? (Major)
- Is this just suboptimal? (Minor)
- Focus on functional impact, not perfection
Remember: Focus on objective technical evaluation. Be specific about violations and reference exact node names and expressions when identifying issues. Consider the n8n AI architecture and don't penalize valid patterns.
## Final Balance Statement
While being thorough in identifying issues, remember the goal is functional correctness, not perfection. Focus on issues that would actually prevent the workflow from achieving its intended purpose. The evaluator should identify real problems that would cause failures or prevent the requested functionality, not enforce style preferences or require unrequested optimizations.`;
const humanTemplate = `Please evaluate the following workflow:
<user_prompt>
{userPrompt}
</user_prompt>
<generated_workflow>
{generatedWorkflow}
</generated_workflow>
{referenceSection}
Provide a detailed evaluation following the scoring guidelines.`;
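/**
* Creates a prompt + structured-output chain that evaluates a generated workflow
* @param llm - Language model to use (must support tool binding for structured output)
* @returns Runnable chain producing results matching evaluationResultSchema
*/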
export function createWorkflowEvaluatorChain(llm: BaseChatModel) {
if (!llm.bindTools) {
throw new OperationalError("LLM doesn't support binding tools");
}
const prompt = ChatPromptTemplate.fromMessages([
new SystemMessage(systemPrompt),
HumanMessagePromptTemplate.fromTemplate(humanTemplate),
]);
const llmWithStructuredOutput = llm.withStructuredOutput(evaluationResultSchema);
return prompt.pipe(llmWithStructuredOutput);
}
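/**
* Evaluates a generated workflow against the user prompt and optional reference workflow,
* then computes the deterministic weighted overall score
* @param llm - Language model to use for evaluation
* @param input - User prompt, generated workflow, and optional reference workflow
* @returns Evaluation result with per-category scores and overall score
*/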
export async function evaluateWorkflow(llm: BaseChatModel, input: EvaluationInput) {
const chain = createWorkflowEvaluatorChain(llm);
// Format reference section if reference workflow is provided
const referenceSection = input.referenceWorkflow
? `<reference_workflow>
${JSON.stringify(input.referenceWorkflow, null, 2)}
</reference_workflow>`
: '';
const result = await chain.invoke({
userPrompt: input.userPrompt,
generatedWorkflow: JSON.stringify(input.generatedWorkflow, null, 2),
referenceSection,
});
const evaluationResult = result as z.infer<typeof evaluationResultSchema>;
// Calculate the overall score using the deterministic weighted calculation
evaluationResult.overallScore = calculateWeightedScore(evaluationResult);
return evaluationResult;
}
// Helper function to calculate weighted score
export function calculateWeightedScore(result: {
functionality: { score: number };
connections: { score: number };
expressions: { score: number };
nodeConfiguration: { score: number };
structuralSimilarity?: { score: number; applicable: boolean };
}) {
const weights = {
functionality: 0.35,
connections: 0.25,
expressions: 0.25,
nodeConfiguration: 0.15,
structuralSimilarity: 0.05,
};
let totalWeight = 0;
let weightedSum = 0;
// Add scores for categories that are always evaluated
weightedSum += result.functionality.score * weights.functionality;
weightedSum += result.connections.score * weights.connections;
weightedSum += result.expressions.score * weights.expressions;
weightedSum += result.nodeConfiguration.score * weights.nodeConfiguration;
totalWeight =
weights.functionality + weights.connections + weights.expressions + weights.nodeConfiguration;
// Add structural similarity only if applicable
if (result.structuralSimilarity?.applicable) {
weightedSum += result.structuralSimilarity.score * weights.structuralSimilarity;
totalWeight += weights.structuralSimilarity;
}
return totalWeight > 0 ? weightedSum / totalWeight : 0;
}

View File

@@ -0,0 +1,102 @@
import cliProgress from 'cli-progress';
import pc from 'picocolors';
import type { TestCase } from '../types/evaluation.js';
import type { TestResult } from '../types/test-result.js';
import {
calculateTestMetrics,
calculateCategoryAverages,
countViolationsByType,
} from '../utils/evaluation-calculator.js';
import {
displayTestResults,
displaySummaryTable,
displayViolationsDetail,
} from '../utils/evaluation-reporter.js';
/**
* Creates a progress bar for test execution
* @param total - Total number of tests
* @returns Progress bar instance
*/
export function createProgressBar(total: number): cliProgress.SingleBar {
const progressBar = new cliProgress.SingleBar(
{
format: 'Progress |{bar}| {percentage}% | {value}/{total} Tests | {status}',
barCompleteChar: '█',
barIncompleteChar: '░',
hideCursor: true,
},
cliProgress.Presets.shades_classic,
);
progressBar.start(total, 0, { status: 'Starting...' });
return progressBar;
}
/**
* Updates progress bar with current status
* @param progressBar - Progress bar instance
* @param completed - Number of completed tests
* @param total - Total number of tests
* @param status - Optional status message
*/
export function updateProgress(
progressBar: cliProgress.SingleBar,
completed: number,
total: number,
status?: string,
): void {
progressBar.update(completed, {
status: status ?? `${completed}/${total} completed`,
});
}
/**
* Displays evaluation results in the console
* @param testCases - Array of test cases
* @param results - Array of test results
* @param totalTime - Total execution time in milliseconds
*/
export function displayResults(
testCases: TestCase[],
results: TestResult[],
totalTime: number,
): void {
// Display test results
displayTestResults(testCases, results);
console.log();
console.log(pc.green(`✓ All tests completed in ${(totalTime / 1000).toFixed(1)}s`));
// Calculate metrics
const metrics = calculateTestMetrics(results);
const categoryAverages = calculateCategoryAverages(results);
const violationCounts = countViolationsByType(results);
const combinedMetrics = {
...metrics,
categoryAverages,
violationCounts,
};
// Display summary
displaySummaryTable(results, combinedMetrics);
// Display violations if any exist
if (violationCounts.critical > 0 || violationCounts.major > 0 || violationCounts.minor > 0) {
displayViolationsDetail(results);
}
}
/**
* Displays error message and exits
* @param message - Error message
* @param error - Optional error object
*/
export function displayError(message: string, error?: unknown): void {
console.error(pc.red(`${message}`));
if (error) {
console.error(error);
}
process.exit(1);
}

View File

@@ -0,0 +1,104 @@
import pLimit from 'p-limit';
import pc from 'picocolors';
import { createProgressBar, updateProgress, displayResults, displayError } from './display.js';
import { basicTestCases, generateTestCases } from '../chains/test-case-generator.js';
import {
setupTestEnvironment,
createAgent,
getConcurrencyLimit,
shouldGenerateTestCases,
howManyTestCasesToGenerate,
} from '../core/environment.js';
import { runSingleTest, initializeTestTracking } from '../core/test-runner.js';
import type { TestCase } from '../types/evaluation.js';
import {
calculateTestMetrics,
calculateCategoryAverages,
countViolationsByType,
} from '../utils/evaluation-calculator.js';
import { formatHeader, saveEvaluationResults } from '../utils/evaluation-helpers.js';
import { generateMarkdownReport } from '../utils/evaluation-reporter.js';
/**
* Main CLI evaluation runner that executes all test cases in parallel
* Supports concurrency control via EVALUATION_CONCURRENCY environment variable
*/
export async function runCliEvaluation(): Promise<void> {
console.log(formatHeader('AI Workflow Builder Full Evaluation', 70));
console.log();
try {
// Setup test environment
const { parsedNodeTypes, llm, tracer } = await setupTestEnvironment();
// Determine test cases to run
let testCases: TestCase[] = basicTestCases;
// Optionally generate additional test cases
if (shouldGenerateTestCases()) {
console.log(pc.blue('➔ Generating additional test cases...'));
const generatedCases = await generateTestCases(llm, howManyTestCasesToGenerate());
testCases = [...testCases, ...generatedCases];
}
// Get concurrency from environment
const concurrency = getConcurrencyLimit();
console.log(pc.dim(`Running ${testCases.length} test cases with concurrency=${concurrency}`));
console.log();
// Create progress bar
const progressBar = createProgressBar(testCases.length);
// Create concurrency limiter
const limit = pLimit(concurrency);
// Track progress
let completed = 0;
const startTime = Date.now();
const testResults = initializeTestTracking(testCases);
// Run all test cases in parallel with concurrency limit
const promises = testCases.map(
async (testCase) =>
await limit(async () => {
updateProgress(progressBar, completed, testCases.length, `Running: ${testCase.name}`);
// Create a dedicated agent for this test to avoid state conflicts
const testAgent = createAgent(parsedNodeTypes, llm, tracer);
const result = await runSingleTest(testAgent, llm, testCase);
testResults[testCase.id] = result.error ? 'fail' : 'pass';
completed++;
updateProgress(progressBar, completed, testCases.length);
return result;
}),
);
const results = await Promise.all(promises);
const totalTime = Date.now() - startTime;
progressBar.stop();
// Display results
displayResults(testCases, results, totalTime);
// Calculate metrics for report
const metrics = calculateTestMetrics(results);
const categoryAverages = calculateCategoryAverages(results);
const violationCounts = countViolationsByType(results);
const combinedMetrics = {
...metrics,
categoryAverages,
violationCounts,
};
// Generate and save results
const report = generateMarkdownReport(results, combinedMetrics);
const { reportPath, resultsPath } = saveEvaluationResults(results, report);
console.log(`\nReport saved to: ${reportPath}`);
console.log(`Detailed results saved to: ${resultsPath}`);
} catch (error) {
displayError('Evaluation failed', error);
}
}

View File

@@ -0,0 +1,131 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { LangChainTracer } from '@langchain/core/tracers/tracer_langchain';
import { MemorySaver } from '@langchain/langgraph';
import { Client } from 'langsmith/client';
import type { INodeTypeDescription } from 'n8n-workflow';
import { anthropicClaudeSonnet4 } from '../../src/llm-config.js';
import { WorkflowBuilderAgent } from '../../src/workflow-builder-agent.js';
import { loadNodesFromFile } from '../load-nodes.js';
export interface TestEnvironment {
parsedNodeTypes: INodeTypeDescription[];
llm: BaseChatModel;
tracer?: LangChainTracer;
lsClient?: Client;
}
/**
* Sets up the LLM with proper configuration
* @returns Configured LLM instance
* @throws Error if N8N_AI_ANTHROPIC_KEY environment variable is not set
*/
export async function setupLLM(): Promise<BaseChatModel> {
const apiKey = process.env.N8N_AI_ANTHROPIC_KEY;
if (!apiKey) {
throw new Error('N8N_AI_ANTHROPIC_KEY environment variable is required');
}
return await anthropicClaudeSonnet4({ apiKey });
}
/**
* Creates a LangChain tracer for monitoring agent execution
* @param projectName - Name of the LangSmith project
* @returns LangChainTracer instance or undefined if API key not provided
*/
export function createTracer(projectName: string): LangChainTracer | undefined {
const apiKey = process.env.LANGSMITH_API_KEY;
if (!apiKey) {
return undefined;
}
const tracingClient = new Client({ apiKey });
return new LangChainTracer({
client: tracingClient,
projectName,
});
}
/**
* Creates a Langsmith client if API key is available
* @returns Langsmith client or undefined
*/
export function createLangsmithClient(): Client | undefined {
const apiKey = process.env.LANGSMITH_API_KEY;
if (!apiKey) {
return undefined;
}
return new Client({ apiKey });
}
/**
* Sets up the test environment with LLM, nodes, and tracing
* @returns Test environment configuration
*/
export async function setupTestEnvironment(): Promise<TestEnvironment> {
const parsedNodeTypes = loadNodesFromFile();
const llm = await setupLLM();
const tracer = createTracer('workflow-builder-evaluation');
const lsClient = createLangsmithClient();
return { parsedNodeTypes, llm, tracer, lsClient };
}
/**
* Creates a new WorkflowBuilderAgent instance
* @param parsedNodeTypes - Array of parsed node type descriptions
* @param llm - Language model instance
* @param tracer - Optional LangChain tracer
* @returns Configured WorkflowBuilderAgent
*/
export function createAgent(
parsedNodeTypes: INodeTypeDescription[],
llm: BaseChatModel,
tracer?: LangChainTracer,
): WorkflowBuilderAgent {
return new WorkflowBuilderAgent({
parsedNodeTypes,
llmSimpleTask: llm,
llmComplexTask: llm,
checkpointer: new MemorySaver(),
tracer,
});
}
/**
* Get concurrency limit from environment
* @returns Concurrency limit (defaults to 5)
*/
export function getConcurrencyLimit(): number {
const envConcurrency = process.env.EVALUATION_CONCURRENCY;
if (envConcurrency) {
const parsed = parseInt(envConcurrency, 10);
if (!isNaN(parsed) && parsed > 0) {
return parsed;
}
}
return 5;
}
/**
* Check if test cases should be generated
* @returns True if test cases should be generated
*/
export function shouldGenerateTestCases(): boolean {
return process.env.GENERATE_TEST_CASES === 'true';
}
/**
* How many test cases to generate based on environment variable
* @returns Number of test cases to generate (defaults to 10)
*/
export function howManyTestCasesToGenerate(): number {
const envCount = process.env.GENERATE_TEST_CASES_COUNT;
if (envCount) {
const parsed = parseInt(envCount, 10);
if (!isNaN(parsed) && parsed > 0) {
return parsed;
}
}
return 10; // Default to 10 if not specified
}

View File

@@ -0,0 +1,131 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import type { SimpleWorkflow } from '../../src/types/workflow.js';
import type { WorkflowBuilderAgent, ChatPayload } from '../../src/workflow-builder-agent.js';
import { evaluateWorkflow } from '../chains/workflow-evaluator.js';
import type { EvaluationInput, EvaluationResult, TestCase } from '../types/evaluation.js';
import { isWorkflowStateValues } from '../types/langsmith.js';
import type { TestResult } from '../types/test-result.js';
/**
* Creates an error result for a failed test
* @param testCase - The test case that failed
* @param error - The error that occurred
* @returns TestResult with error information
*/
export function createErrorResult(testCase: TestCase, error: unknown): TestResult {
const errorMessage = error instanceof Error ? error.message : String(error);
return {
testCase,
generatedWorkflow: { nodes: [], connections: {}, name: 'Generated Workflow' },
evaluationResult: {
overallScore: 0,
functionality: { score: 0, violations: [] },
connections: { score: 0, violations: [] },
expressions: { score: 0, violations: [] },
nodeConfiguration: { score: 0, violations: [] },
structuralSimilarity: { score: 0, violations: [], applicable: false },
summary: `Evaluation failed: ${errorMessage}`,
},
generationTime: 0,
error: errorMessage,
};
}
/**
* Runs a single test case by generating a workflow and evaluating it
* @param agent - The workflow builder agent to use
* @param llm - Language model for evaluation
* @param testCase - Test case to execute
* @param userId - User ID for the session
* @returns Test result with generated workflow and evaluation
*/
export async function runSingleTest(
agent: WorkflowBuilderAgent,
llm: BaseChatModel,
testCase: TestCase,
userId: string = 'test-user',
): Promise<TestResult> {
try {
const chatPayload: ChatPayload = {
message: testCase.prompt,
workflowContext: {
currentWorkflow: { id: testCase.id, nodes: [], connections: {} },
},
};
// Generate workflow
const startTime = Date.now();
let messageCount = 0;
for await (const _output of agent.chat(chatPayload, userId)) {
messageCount++;
}
const generationTime = Date.now() - startTime;
// Get generated workflow with validation
const state = await agent.getState(testCase.id, userId);
// Validate workflow state
if (!state.values || !isWorkflowStateValues(state.values)) {
throw new Error('Invalid workflow state: missing or malformed workflow');
}
const generatedWorkflow = state.values.workflowJSON;
// Evaluate
const evaluationInput: EvaluationInput = {
userPrompt: testCase.prompt,
generatedWorkflow,
referenceWorkflow: testCase.referenceWorkflow,
};
const evaluationResult = await evaluateWorkflow(llm, evaluationInput);
return {
testCase,
generatedWorkflow,
evaluationResult,
generationTime,
};
} catch (error) {
return createErrorResult(testCase, error);
}
}
/**
* Initialize test tracking map
* @param testCases - Array of test cases
* @returns Map of test ID to status
*/
export function initializeTestTracking(
testCases: TestCase[],
): Record<string, 'pending' | 'pass' | 'fail'> {
const tracking: Record<string, 'pending' | 'pass' | 'fail'> = {};
for (const testCase of testCases) {
tracking[testCase.id] = 'pending';
}
return tracking;
}
/**
* Create a test result from a workflow state
* @param testCase - The test case
* @param workflow - Generated workflow
* @param evaluationResult - Evaluation result
* @param generationTime - Time taken to generate workflow
* @returns TestResult
*/
export function createTestResult(
testCase: TestCase,
workflow: SimpleWorkflow,
evaluationResult: EvaluationResult,
generationTime: number,
): TestResult {
return {
testCase,
generatedWorkflow: workflow,
evaluationResult,
generationTime,
};
}

View File

@@ -0,0 +1,27 @@
import { runCliEvaluation } from './cli/runner.js';
import { runLangsmithEvaluation } from './langsmith/runner.js';
// Re-export for external use if needed
export { runCliEvaluation } from './cli/runner.js';
export { runLangsmithEvaluation } from './langsmith/runner.js';
export { runSingleTest } from './core/test-runner.js';
export { setupTestEnvironment, createAgent } from './core/environment.js';
/**
* Main entry point for evaluation
* Determines which evaluation mode to run based on environment variables
*/
async function main(): Promise<void> {
const useLangsmith = process.env.USE_LANGSMITH_EVAL === 'true';
if (useLangsmith) {
await runLangsmithEvaluation();
} else {
await runCliEvaluation();
}
}
// Run if called directly
if (require.main === module) {
main().catch(console.error);
}

View File

@@ -0,0 +1,171 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import type { EvaluationResult as LangsmithEvaluationResult } from 'langsmith/evaluation';
import type { Run, Example } from 'langsmith/schemas';
import type { SimpleWorkflow } from '../../src/types/workflow.js';
import { evaluateWorkflow } from '../chains/workflow-evaluator.js';
import type { EvaluationInput, CategoryScore } from '../types/evaluation.js';
import {
isSimpleWorkflow,
isValidPrompt,
formatViolations,
type UsageMetadata,
} from '../types/langsmith.js';
// Helper to validate run outputs
function validateRunOutputs(outputs: unknown): {
workflow?: SimpleWorkflow;
prompt?: string;
referenceWorkflow?: SimpleWorkflow;
usage?: Partial<UsageMetadata>;
error?: string;
} {
if (!outputs || typeof outputs !== 'object') {
return { error: 'No outputs found in run' };
}
const runOutputs = outputs as Record<string, unknown>;
if (!isSimpleWorkflow(runOutputs.workflow)) {
return { error: 'Invalid or missing workflow in outputs' };
}
if (!isValidPrompt(runOutputs.prompt)) {
return { error: 'Invalid or missing prompt in outputs' };
}
// Extract usage metadata if available
const usage = extractUsageMetadata(runOutputs.usage);
// Extract reference workflow if available
let referenceWorkflow: SimpleWorkflow | undefined;
if (runOutputs.referenceOutputs && typeof runOutputs.referenceOutputs === 'object') {
const refOutputs = runOutputs.referenceOutputs as Record<string, unknown>;
if (isSimpleWorkflow(refOutputs.workflowJSON)) {
referenceWorkflow = refOutputs.workflowJSON;
}
}
return {
workflow: runOutputs.workflow,
prompt: runOutputs.prompt,
referenceWorkflow,
usage,
};
}
// Helper to extract usage metadata
function extractUsageMetadata(usage: unknown): Partial<UsageMetadata> {
if (!usage || typeof usage !== 'object') return {};
const rawUsage = usage as Record<string, unknown>;
const usageFieldMap: Record<string, keyof UsageMetadata> = {
input_tokens: 'input_tokens',
output_tokens: 'output_tokens',
cache_create_input_tokens: 'cache_creation_input_tokens',
cache_read_input_tokens: 'cache_read_input_tokens',
};
const result: Partial<UsageMetadata> = {};
for (const [sourceKey, targetKey] of Object.entries(usageFieldMap)) {
const value = rawUsage[sourceKey];
if (typeof value === 'number') {
result[targetKey] = value;
}
}
return result;
}
// Helper to convert category scores to Langsmith results
function categoryToResult(key: string, category: CategoryScore): LangsmithEvaluationResult {
return {
key,
score: category.score,
comment: formatViolations(category.violations),
};
}
/**
* Creates a Langsmith evaluator function that uses the LLM-based workflow evaluator
* @param llm - Language model to use for evaluation
* @returns Evaluator function compatible with Langsmith
*/
export function createLangsmithEvaluator(
llm: BaseChatModel,
): (rootRun: Run, example?: Example) => Promise<LangsmithEvaluationResult[]> {
return async (rootRun: Run, _example?: Example): Promise<LangsmithEvaluationResult[]> => {
// Validate and extract outputs
const validation = validateRunOutputs(rootRun.outputs);
if (validation.error) {
return [
{
key: 'evaluationError',
score: 0,
comment: validation.error,
},
];
}
const evaluationInput: EvaluationInput = {
userPrompt: validation.prompt!,
generatedWorkflow: validation.workflow!,
referenceWorkflow: validation.referenceWorkflow,
};
try {
const evaluationResult = await evaluateWorkflow(llm, evaluationInput);
const results: LangsmithEvaluationResult[] = [];
// Add category scores
const categories = [
{ key: 'functionality', score: evaluationResult.functionality },
{ key: 'connections', score: evaluationResult.connections },
{ key: 'expressions', score: evaluationResult.expressions },
{ key: 'nodeConfiguration', score: evaluationResult.nodeConfiguration },
];
for (const { key, score } of categories) {
results.push(categoryToResult(key, score));
}
// Add usage metadata if available
const usageMetrics = [
{ key: 'inputTokens', value: validation.usage?.input_tokens },
{ key: 'outputTokens', value: validation.usage?.output_tokens },
{ key: 'cacheCreationInputTokens', value: validation.usage?.cache_creation_input_tokens },
{ key: 'cacheReadInputTokens', value: validation.usage?.cache_read_input_tokens },
];
for (const metric of usageMetrics) {
if (metric.value !== undefined) {
results.push({ key: metric.key, score: metric.value });
}
}
// Add structural similarity if applicable
if (validation.referenceWorkflow && evaluationResult.structuralSimilarity.applicable) {
results.push(
categoryToResult('structuralSimilarity', evaluationResult.structuralSimilarity),
);
}
// Add overall score
results.push({
key: 'overallScore',
score: evaluationResult.overallScore,
comment: evaluationResult.summary,
});
return results;
} catch (error) {
const errorMessage = error instanceof Error ? error.message : String(error);
return [
{
key: 'evaluationError',
score: 0,
comment: `Evaluation failed: ${errorMessage}`,
},
];
}
};
}

View File

@@ -0,0 +1,168 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models.js';
import type { LangChainTracer } from '@langchain/core/tracers/tracer_langchain.js';
import { evaluate } from 'langsmith/evaluation';
import type { INodeTypeDescription } from 'n8n-workflow';
import pc from 'picocolors';
import { createLangsmithEvaluator } from './evaluator.js';
import type { ChatPayload } from '../../src/workflow-builder-agent.js';
import type { WorkflowState } from '../../src/workflow-state.js';
import { setupTestEnvironment, createAgent } from '../core/environment.js';
import {
generateRunId,
safeExtractUsage,
isWorkflowStateValues,
extractMessageContent,
} from '../types/langsmith.js';
import { formatHeader } from '../utils/evaluation-helpers.js';
/**
* Creates a workflow generation function for Langsmith evaluation
* @param parsedNodeTypes - Node types
* @param llm - Language model
* @param tracer - Optional tracer
* @returns Function that generates workflows from inputs
*/
function createWorkflowGenerator(
parsedNodeTypes: INodeTypeDescription[],
llm: BaseChatModel,
tracer?: LangChainTracer,
) {
return async (inputs: typeof WorkflowState.State) => {
// Generate a unique ID for this evaluation run
const runId = generateRunId();
// Validate inputs
if (!inputs.messages || !Array.isArray(inputs.messages) || inputs.messages.length === 0) {
throw new Error('No messages provided in inputs');
}
// Extract first message content safely
const firstMessage = inputs.messages[0];
const messageContent = extractMessageContent(firstMessage);
// Create agent for this run
const agent = createAgent(parsedNodeTypes, llm, tracer);
const chatPayload: ChatPayload = {
message: messageContent,
workflowContext: {
currentWorkflow: { id: runId, nodes: [], connections: {} },
},
};
// Generate workflow
let messageCount = 0;
for await (const _output of agent.chat(chatPayload, 'langsmith-eval-user')) {
messageCount++;
}
// Get generated workflow with validation
const state = await agent.getState(runId, 'langsmith-eval-user');
// Validate state
if (!state.values) {
throw new Error('No values in agent state');
}
if (!isWorkflowStateValues(state.values)) {
throw new Error('Invalid workflow state: workflow or messages missing');
}
const generatedWorkflow = state.values.workflowJSON;
const messages = state.values.messages;
// Extract usage metadata safely
const usage = safeExtractUsage(messages);
return {
workflow: generatedWorkflow,
prompt: chatPayload.message,
usage,
};
};
}
/**
* Runs evaluation using Langsmith
*/
export async function runLangsmithEvaluation(): Promise<void> {
console.log(formatHeader('AI Workflow Builder Langsmith Evaluation', 70));
console.log();
// Check for Langsmith API key
if (!process.env.LANGSMITH_API_KEY) {
console.error(pc.red('✗ LANGSMITH_API_KEY environment variable not set'));
process.exit(1);
}
try {
// Setup test environment
const { parsedNodeTypes, llm, tracer, lsClient } = await setupTestEnvironment();
if (!lsClient) {
throw new Error('Langsmith client not initialized');
}
// Get dataset name from env or use default
const datasetName = process.env.LANGSMITH_DATASET_NAME ?? 'workflow-builder-canvas-prompts';
console.log(pc.blue(`➔ Using dataset: ${datasetName}`));
// Verify dataset exists
try {
await lsClient.readDataset({ datasetName });
} catch (error) {
console.error(pc.red(`✗ Dataset "${datasetName}" not found`));
console.log('\nAvailable datasets:');
// List available datasets
for await (const dataset of lsClient.listDatasets()) {
console.log(pc.dim(` - ${dataset.name} (${dataset.id})`));
}
console.log(
'\nTo use a different dataset, set the LANGSMITH_DATASET_NAME environment variable',
);
process.exit(1);
}
console.log();
const startTime = Date.now();
// Create workflow generation function
const generateWorkflow = createWorkflowGenerator(parsedNodeTypes, llm, tracer);
// Create LLM-based evaluator
const evaluator = createLangsmithEvaluator(llm);
// Run Langsmith evaluation
const results = await evaluate(generateWorkflow, {
data: datasetName,
evaluators: [evaluator],
maxConcurrency: 7,
experimentPrefix: 'workflow-builder-evaluation',
metadata: {
evaluationType: 'llm-based',
modelName: process.env.LLM_MODEL ?? 'default',
},
});
const totalTime = Date.now() - startTime;
console.log(pc.green(`✓ Evaluation completed in ${(totalTime / 1000).toFixed(1)}s`));
// Display results information
console.log('\nView detailed results in Langsmith dashboard');
console.log(
`Experiment name: workflow-builder-evaluation-${new Date().toISOString().split('T')[0]}`,
);
// Log summary of results if available
if (results) {
console.log(pc.dim('Evaluation run completed successfully'));
console.log(pc.dim(`Dataset: ${datasetName}`));
}
} catch (error) {
console.error(pc.red('✗ Langsmith evaluation failed:'), error);
process.exit(1);
}
}

View File

@@ -0,0 +1,106 @@
import { readFileSync, existsSync } from 'fs';
import { jsonParse, type INodeTypeDescription } from 'n8n-workflow';
import { join } from 'path';
interface NodeWithVersion extends INodeTypeDescription {
version: number | number[];
defaultVersion?: number;
}
export function loadNodesFromFile(): INodeTypeDescription[] {
console.log('Loading nodes from nodes.json...');
const nodesPath = join(__dirname, 'nodes.json');
// Check if nodes.json exists
if (!existsSync(nodesPath)) {
const errorMessage = `
ERROR: nodes.json file not found at ${nodesPath}
The nodes.json file is required for evaluations to work properly.
Please ensure nodes.json is present in the evaluations root directory.
To generate nodes.json:
1. Run the n8n instance
2. Export the node definitions to evaluations/nodes.json
This file contains all available n8n node type definitions needed for validation.
Without nodes.json, the evaluator cannot validate node types and parameters.
`;
console.error(errorMessage);
throw new Error('nodes.json file not found. See console output for details.');
}
const nodesData = readFileSync(nodesPath, 'utf-8');
const allNodes = jsonParse<NodeWithVersion[]>(nodesData);
console.log(`Total nodes loaded: ${allNodes.length}`);
// Group nodes by name
const nodesByName = new Map<string, NodeWithVersion[]>();
for (const node of allNodes) {
const existing = nodesByName.get(node.name) ?? [];
existing.push(node);
nodesByName.set(node.name, existing);
}
console.log(`Unique node types: ${nodesByName.size}`);
// Extract latest version for each node
const latestNodes: INodeTypeDescription[] = [];
let multiVersionCount = 0;
for (const [_nodeName, versions] of nodesByName.entries()) {
if (versions.length > 1) {
multiVersionCount++;
// Find the node with the default version
let selectedNode: NodeWithVersion | undefined;
for (const node of versions) {
// Select the node that matches the default version
if (node.defaultVersion !== undefined) {
if (Array.isArray(node.version)) {
// For array versions, check if it includes the default version
if (node.version.includes(node.defaultVersion)) {
selectedNode = node;
}
} else if (node.version === node.defaultVersion) {
selectedNode = node;
}
}
}
// If we found a matching node, use it; otherwise use the first one
if (selectedNode) {
latestNodes.push(selectedNode);
} else {
latestNodes.push(versions[0]);
}
} else {
// Single version node
latestNodes.push(versions[0]);
}
}
console.log(`\nNodes with multiple versions: ${multiVersionCount}`);
console.log(`Final node count: ${latestNodes.length}`);
// Filter out hidden nodes
const visibleNodes = latestNodes.filter((node) => !node.hidden);
console.log(`Visible nodes (after filtering hidden): ${visibleNodes.length}\n`);
return visibleNodes;
}
// Helper function to get specific node version for testing
export function getNodeVersion(nodes: INodeTypeDescription[], nodeName: string): string {
const node = nodes.find((n) => n.name === nodeName);
if (!node) return 'not found';
const version = (node as NodeWithVersion).version;
if (Array.isArray(version)) {
return `[${version.join(', ')}]`;
}
return version?.toString() || 'unknown';
}

View File

@@ -0,0 +1,68 @@
import { z } from 'zod';
import type { SimpleWorkflow } from '../../src/types/workflow';
// Violation schema
const violationSchema = z.object({
type: z.enum(['critical', 'major', 'minor']),
description: z.string(),
pointsDeducted: z.number().min(0),
});
// Category score schema
const categoryScoreSchema = z.object({
violations: z.array(violationSchema),
score: z.number().min(0).max(1),
});
// Structural similarity schema (with applicable flag)
const structuralSimilaritySchema = z.object({
violations: z.array(violationSchema),
score: z.number().min(0).max(1),
applicable: z
.boolean()
.describe('Whether this category was evaluated (based on reference workflow availability)'),
});
// Main evaluation result schema
export const evaluationResultSchema = z.object({
overallScore: z
.number()
.min(0)
.max(1)
.describe('Weighted average score across all categories (0-1)'),
functionality: categoryScoreSchema,
connections: categoryScoreSchema,
expressions: categoryScoreSchema,
nodeConfiguration: categoryScoreSchema,
structuralSimilarity: structuralSimilaritySchema,
summary: z.string().describe('2-3 sentences summarizing main strengths and weaknesses'),
criticalIssues: z
.array(z.string())
.describe('List of issues that would prevent the workflow from functioning')
.optional(),
});
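// e.g. evaluationResultSchema.parse(llmOutput) throws if the evaluator LLM returns a malformed result.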
// Type exports
export type Violation = z.infer<typeof violationSchema>;
export type CategoryScore = z.infer<typeof categoryScoreSchema>;
export type EvaluationResult = z.infer<typeof evaluationResultSchema>;
// Test case schema for evaluation
export const testCaseSchema = z.object({
id: z.string(),
name: z.string(),
prompt: z.string(),
referenceWorkflow: z.custom<SimpleWorkflow>().optional(),
});
export type TestCase = z.infer<typeof testCaseSchema>;
// Evaluation input schema
export const evaluationInputSchema = z.object({
userPrompt: z.string(),
generatedWorkflow: z.custom<SimpleWorkflow>(),
referenceWorkflow: z.custom<SimpleWorkflow>().optional(),
});
export type EvaluationInput = z.infer<typeof evaluationInputSchema>;

View File

@@ -0,0 +1,140 @@
import type { BaseMessage } from '@langchain/core/messages';
import type { SimpleWorkflow } from '../../src/types/workflow';
import type { AIMessageWithUsageMetadata } from '../../src/utils/token-usage';
// Define strict interfaces
export type UsageMetadata = AIMessageWithUsageMetadata['response_metadata']['usage'];
export interface WorkflowOutput {
workflow?: unknown;
prompt?: unknown;
usage?: unknown;
}
export interface WorkflowStateValues {
messages: BaseMessage[];
workflowJSON: SimpleWorkflow;
[key: string]: unknown;
}
// Type guards - no coercion, just validation
export function isMessageWithMetadata(message: BaseMessage): message is AIMessageWithUsageMetadata {
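// Intentionally shallow: only checks that response_metadata is a non-null object; hasUsageMetadata validates the usage fields themselves.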
return (
message.response_metadata !== undefined &&
message.response_metadata !== null &&
typeof message.response_metadata === 'object'
);
}
export function hasUsageMetadata(metadata: { usage?: unknown }): metadata is {
usage: Partial<UsageMetadata>;
} {
if (!metadata.usage || typeof metadata.usage !== 'object') {
return false;
}
const usage = metadata.usage as Record<string, unknown>;
// Validate each field is either undefined or a number
const validFields = [
'input_tokens',
'output_tokens',
'cache_read_input_tokens',
'cache_creation_input_tokens',
];
return validFields.every(
(field) => usage[field] === undefined || typeof usage[field] === 'number',
);
}
export function isValidPrompt(value: unknown): value is string {
return typeof value === 'string' && value.length > 0;
}
export function isSimpleWorkflow(value: unknown): value is SimpleWorkflow {
if (!value || typeof value !== 'object') return false;
const obj = value as Record<string, unknown>;
return (
Array.isArray(obj.nodes) && obj.connections !== null && typeof obj.connections === 'object'
);
}
export function isWorkflowStateValues(values: unknown): values is WorkflowStateValues {
if (!values || typeof values !== 'object') return false;
if (!('messages' in values) || !('workflowJSON' in values)) return false;
return Array.isArray(values.messages) && isSimpleWorkflow(values.workflowJSON);
}
// Safe extraction without coercion
export function safeExtractUsage(messages: BaseMessage[]): UsageMetadata {
const defaultUsage: UsageMetadata = {
input_tokens: 0,
output_tokens: 0,
cache_read_input_tokens: 0,
cache_creation_input_tokens: 0,
};
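// Sum token usage across every message that carries usage metadata; messages without it are skipped.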
return messages.reduce((acc, message) => {
if (!isMessageWithMetadata(message)) return acc;
if (!hasUsageMetadata(message.response_metadata)) return acc;
const usage = message.response_metadata.usage;
return {
input_tokens: acc.input_tokens + (usage.input_tokens ?? 0),
output_tokens: acc.output_tokens + (usage.output_tokens ?? 0),
cache_read_input_tokens:
(acc.cache_read_input_tokens ?? 0) + (usage.cache_read_input_tokens ?? 0),
cache_creation_input_tokens:
(acc.cache_creation_input_tokens ?? 0) + (usage.cache_creation_input_tokens ?? 0),
};
}, defaultUsage);
}
// Helper to format violations for display
export function formatViolations(violations: Array<{ type: string; description: string }>): string {
if (violations.length === 0) {
return 'All checks passed';
}
return `Found ${violations.length} violation(s): ${violations
.map((v) => `${v.type} - ${v.description}`)
.join('; ')}`;
}
// Generate a unique run ID
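// e.g. "eval-1724140274095-k3j9x2a1p" (millisecond timestamp plus a short random base36 suffix)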
export function generateRunId(): string {
return `eval-${Date.now()}-${Math.random().toString(36).substring(2, 11)}`;
}
// Validate and extract message content
export function extractMessageContent(message: BaseMessage | undefined): string {
if (!message) {
throw new Error('No message provided');
}
// @ts-expect-error We need to extract content from kwargs as that's how Langsmith messages are structured
// eslint-disable-next-line @typescript-eslint/no-unsafe-member-access
const content = message.content ?? message.kwargs?.content;
if (typeof content === 'string') {
return content;
}
if (Array.isArray(content)) {
// Extract text from complex content
const textContent = content
.filter((item) => item?.type === 'text')
.map((item) => (item as { text: string }).text)
.join('\n');
if (textContent) {
return textContent;
}
}
throw new Error('Message content must be a string or contain text content');
}

View File

@@ -0,0 +1,13 @@
import type { TestCase, EvaluationResult } from './evaluation.js';
import type { SimpleWorkflow } from '../../src/types/workflow.js';
/**
* Result of running a single test case
*/
export interface TestResult {
testCase: TestCase;
generatedWorkflow: SimpleWorkflow;
evaluationResult: EvaluationResult;
generationTime: number;
error?: string;
}

View File

@@ -0,0 +1,126 @@
import type { TestResult } from '../types/test-result.js';
/**
* Calculates average scores for each evaluation category
* @param results - Array of test results
* @returns Object with average scores per category
*/
export function calculateCategoryAverages(
results: TestResult[],
): Record<'functionality' | 'connections' | 'expressions' | 'nodeConfiguration', number> {
const successfulTests = results.filter((r) => !r.error).length;
const categoryAverages = {
functionality: 0,
connections: 0,
expressions: 0,
nodeConfiguration: 0,
};
results
.filter((r) => !r.error)
.forEach((r) => {
categoryAverages.functionality += r.evaluationResult.functionality.score;
categoryAverages.connections += r.evaluationResult.connections.score;
categoryAverages.expressions += r.evaluationResult.expressions.score;
categoryAverages.nodeConfiguration += r.evaluationResult.nodeConfiguration.score;
});
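// Convert the accumulated sums into averages; '|| 1' avoids division by zero when every test failed.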
Object.keys(categoryAverages).forEach((key) => {
categoryAverages[key as keyof typeof categoryAverages] /= successfulTests || 1;
});
return categoryAverages;
}
/**
* Counts violations by severity type across all test results
* @param results - Array of test results
* @returns Object with counts for each violation type
*/
export function countViolationsByType(results: TestResult[]): {
critical: number;
major: number;
minor: number;
} {
let criticalCount = 0;
let majorCount = 0;
let minorCount = 0;
results.forEach((r) => {
if (!r.error) {
const allViolations = [
...r.evaluationResult.functionality.violations,
...r.evaluationResult.connections.violations,
...r.evaluationResult.expressions.violations,
...r.evaluationResult.nodeConfiguration.violations,
];
criticalCount += allViolations.filter((v) => v.type === 'critical').length;
majorCount += allViolations.filter((v) => v.type === 'major').length;
minorCount += allViolations.filter((v) => v.type === 'minor').length;
}
});
return { critical: criticalCount, major: majorCount, minor: minorCount };
}
/**
* Calculates test metrics including success rate and average score
* @param results - Array of test results
* @returns Object with calculated metrics
*/
export function calculateTestMetrics(results: TestResult[]): {
totalTests: number;
successfulTests: number;
failedTests: number;
averageScore: number;
successRate: number;
} {
const totalTests = results.length;
const successfulTests = results.filter((r) => !r.error).length;
const failedTests = totalTests - successfulTests;
const averageScore =
successfulTests > 0
? results
.filter((r) => !r.error)
.reduce((sum, r) => sum + r.evaluationResult.overallScore, 0) / successfulTests
: 0;
const successRate = totalTests > 0 ? successfulTests / totalTests : 0;
return {
totalTests,
successfulTests,
failedTests,
averageScore,
successRate,
};
}
/**
* Calculates average generation time for successful tests
* @param results - Array of test results
* @returns Average generation time in milliseconds
*/
export function calculateAverageGenerationTime(results: TestResult[]): number {
const successfulResults = results.filter((r) => !r.error);
if (successfulResults.length === 0) return 0;
const totalTime = successfulResults.reduce((sum, r) => sum + r.generationTime, 0);
return totalTime / successfulResults.length;
}
/**
* Groups test results by their success status
* @param results - Array of test results
* @returns Object with grouped results
*/
export function groupResultsByStatus(results: TestResult[]): {
successful: TestResult[];
failed: TestResult[];
} {
return {
successful: results.filter((r) => !r.error),
failed: results.filter((r) => r.error),
};
}

View File

@@ -0,0 +1,270 @@
import type { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { LangChainTracer } from '@langchain/core/tracers/tracer_langchain';
import { MemorySaver } from '@langchain/langgraph';
import { mkdirSync, writeFileSync } from 'fs';
import { Client } from 'langsmith';
import type { INodeTypeDescription } from 'n8n-workflow';
import { join } from 'path';
import pc from 'picocolors';
import { anthropicClaudeSonnet4 } from '../../src/llm-config.js';
import { WorkflowBuilderAgent } from '../../src/workflow-builder-agent.js';
import type { Violation } from '../types/evaluation.js';
import type { TestResult } from '../types/test-result.js';
/**
* Sets up the LLM with proper configuration
* @returns Configured LLM instance
* @throws Error if N8N_AI_ANTHROPIC_KEY environment variable is not set
*/
export async function setupLLM(): Promise<BaseChatModel> {
const apiKey = process.env.N8N_AI_ANTHROPIC_KEY;
if (!apiKey) {
throw new Error('N8N_AI_ANTHROPIC_KEY environment variable is required');
}
return await anthropicClaudeSonnet4({ apiKey });
}
/**
* Creates a LangChain tracer for monitoring agent execution
* @param projectName - Name of the LangSmith project
* @returns LangChainTracer instance or undefined if API key not provided
*/
export function createTracer(projectName: string): LangChainTracer | undefined {
const apiKey = process.env.LANGSMITH_API_KEY;
if (!apiKey) {
return undefined;
}
const tracingClient = new Client({ apiKey });
return new LangChainTracer({
client: tracingClient,
projectName,
});
}
/**
* Creates a new WorkflowBuilderAgent instance
* @param parsedNodeTypes - Array of parsed node type descriptions
* @param llm - Language model instance
* @param tracer - Optional LangChain tracer
* @returns Configured WorkflowBuilderAgent
*/
export function createAgent(
parsedNodeTypes: INodeTypeDescription[],
llm: BaseChatModel,
tracer?: LangChainTracer,
): WorkflowBuilderAgent {
return new WorkflowBuilderAgent({
parsedNodeTypes,
llmSimpleTask: llm,
llmComplexTask: llm,
checkpointer: new MemorySaver(),
tracer,
});
}
/**
* Groups violations by category for display
* @param violations - Array of violations with category information
* @returns Grouped violations by severity type
*/
export function groupViolationsBySeverity(violations: Array<Violation & { category: string }>): {
critical: Array<Violation & { category: string }>;
major: Array<Violation & { category: string }>;
minor: Array<Violation & { category: string }>;
} {
return {
critical: violations.filter((v) => v.type === 'critical'),
major: violations.filter((v) => v.type === 'major'),
minor: violations.filter((v) => v.type === 'minor'),
};
}
/**
* Formats violations for console display
* @param violations - Array of violations to format
* @param title - Section title
*/
export function displayViolationSection(
violations: Array<Violation & { category: string }>,
title: string,
): void {
if (violations.length === 0) return;
console.log(`\n${title}:`);
violations.forEach((v) => {
const typeFormatted = formatViolationType(v.type);
console.log(
` ${typeFormatted} [${v.category}] ${v.description} ${pc.dim(`(-${v.pointsDeducted} pts)`)}`,
);
});
}
/**
* Logs progress dots during long-running operations
* @param count - Current iteration count
* @param interval - How often to print a dot (default: 10)
*/
export function logProgress(count: number, interval: number = 10): void {
if (count % interval === 0) {
process.stdout.write('.');
}
}
/**
* Formats percentage for display
* @param value - Decimal value between 0 and 1
* @param decimals - Number of decimal places (default: 1)
* @returns Formatted percentage string
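* @example formatPercentage(0.5) // "50.0%"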
*/
export function formatPercentage(value: number, decimals: number = 1): string {
return `${(value * 100).toFixed(decimals)}%`;
}
/**
* Calculates elapsed time and formats it for display
* @param startTime - Start timestamp from Date.now()
* @returns Formatted time string
*/
export function formatElapsedTime(startTime: number): string {
const elapsed = Date.now() - startTime;
if (elapsed < 1000) {
return `${elapsed}ms`;
}
return `${(elapsed / 1000).toFixed(1)}s`;
}
/**
* Formats a score with appropriate color based on value
* @param score - Score between 0 and 1
* @param asPercentage - Whether to format as percentage
* @returns Color-formatted score string
*/
export function formatColoredScore(score: number, asPercentage = true): string {
const value = asPercentage ? formatPercentage(score) : score.toFixed(2);
if (score >= 0.8) return pc.green(value);
if (score >= 0.5) return pc.yellow(value);
return pc.red(value);
}
/**
* Creates a formatted section header
* @param title - Header title
* @param width - Total width of the header line
* @returns Formatted header string
*/
export function formatHeader(title: string, width = 60): string {
const padding = Math.max(0, width - title.length - 4);
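// Split the remaining rule characters as evenly as possible on either side of the title.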
const leftPad = Math.floor(padding / 2);
const rightPad = padding - leftPad;
return pc.blue('═'.repeat(leftPad) + ` ${title} ` + '═'.repeat(rightPad));
}
/**
* Formats a status badge with appropriate icon and color
* @param status - Status type
* @returns Formatted status badge
*/
export function formatStatusBadge(status: 'pass' | 'fail' | 'running' | 'pending'): string {
switch (status) {
case 'pass':
return pc.green('✓ PASS');
case 'fail':
return pc.red('✗ FAIL');
case 'running':
return pc.blue('⚡ RUNNING');
case 'pending':
return pc.gray('○ PENDING');
}
}
/**
* Formats violation type with appropriate color
* @param type - Violation type
* @returns Color-formatted violation type
*/
export function formatViolationType(type: 'critical' | 'major' | 'minor'): string {
switch (type) {
case 'critical':
return pc.red('[CRITICAL]');
case 'major':
return pc.yellow('[MAJOR]');
case 'minor':
return pc.gray('[MINOR]');
}
}
/**
* Formats a test name with dimmed ID
* @param name - Test name
* @param id - Test ID
* @returns Formatted test name
*/
export function formatTestName(name: string, id: string): string {
return `${name} ${pc.dim(`(${id})`)}`;
}
/**
* Collects all violations from test results with their test context
* @param results - Array of test results
* @returns Array of violations with test name and category
*/
export function collectAllViolations(results: TestResult[]): Array<{
violation: Violation & { category: string };
testName: string;
}> {
const allViolations: Array<{
violation: Violation & { category: string };
testName: string;
}> = [];
results.forEach((result) => {
if (!result.error) {
const testViolations = [
...result.evaluationResult.functionality.violations.map((v) => ({
violation: { ...v, category: 'Functionality' },
testName: result.testCase.name,
})),
...result.evaluationResult.connections.violations.map((v) => ({
violation: { ...v, category: 'Connections' },
testName: result.testCase.name,
})),
...result.evaluationResult.expressions.violations.map((v) => ({
violation: { ...v, category: 'Expressions' },
testName: result.testCase.name,
})),
...result.evaluationResult.nodeConfiguration.violations.map((v) => ({
violation: { ...v, category: 'Node Config' },
testName: result.testCase.name,
})),
];
allViolations.push.apply(allViolations, testViolations);
}
});
return allViolations;
}
/**
* Saves evaluation results to disk in both JSON and markdown formats
* @param results - Array of test results
* @param report - Generated markdown report
* @returns Paths to saved files
*/
export function saveEvaluationResults(
results: TestResult[],
report: string,
): { reportPath: string; resultsPath: string } {
const resultsDir = join(process.cwd(), 'evaluations', 'results');
mkdirSync(resultsDir, { recursive: true });
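// Colons from the ISO timestamp are replaced because they are not valid in file names on some platforms.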
const timestamp = new Date().toISOString().replace(/:/g, '-');
const reportPath = join(resultsDir, `evaluation-report-${timestamp}.md`);
const resultsPath = join(resultsDir, `evaluation-results-${timestamp}.json`);
writeFileSync(reportPath, report);
writeFileSync(resultsPath, JSON.stringify(results, null, 2));
return { reportPath, resultsPath };
}

View File

@@ -0,0 +1,264 @@
import Table from 'cli-table3';
import pc from 'picocolors';
import {
formatColoredScore,
formatHeader,
formatPercentage,
formatStatusBadge,
formatTestName,
formatViolationType,
} from './evaluation-helpers.js';
import type { Violation } from '../types/evaluation.js';
import type { TestResult } from '../types/test-result.js';
/**
* Generates a markdown report from evaluation results
* @param results - Array of test results
* @param metrics - Calculated metrics including averages and counts
* @returns Formatted markdown report string
*/
export function generateMarkdownReport(
results: TestResult[],
metrics: {
totalTests: number;
successfulTests: number;
averageScore: number;
categoryAverages: Record<string, number>;
violationCounts: { critical: number; major: number; minor: number };
},
): string {
const { totalTests, successfulTests, averageScore, categoryAverages, violationCounts } = metrics;
let report = `# AI Workflow Builder Evaluation Report
## Summary
- Total Tests: ${totalTests}
- Successful: ${successfulTests}
- Failed: ${totalTests - successfulTests}
- Average Score: ${formatPercentage(averageScore)}
## Category Averages
- Functionality: ${formatPercentage(categoryAverages.functionality)}
- Connections: ${formatPercentage(categoryAverages.connections)}
- Expressions: ${formatPercentage(categoryAverages.expressions)}
- Node Configuration: ${formatPercentage(categoryAverages.nodeConfiguration)}
## Violations Summary
- Critical: ${violationCounts.critical}
- Major: ${violationCounts.major}
- Minor: ${violationCounts.minor}
## Detailed Results
`;
results.forEach((result) => {
report += `### ${result.testCase.name} (${result.testCase.id})
- **Score**: ${formatPercentage(result.evaluationResult.overallScore)}
- **Generation Time**: ${result.generationTime}ms
- **Nodes Generated**: ${result.generatedWorkflow.nodes.length}
- **Summary**: ${result.evaluationResult.summary}
`;
if (
result.evaluationResult.criticalIssues &&
result.evaluationResult.criticalIssues.length > 0
) {
report += '**Critical Issues**:\n';
result.evaluationResult.criticalIssues.forEach((issue) => {
report += `- ${issue}\n`;
});
report += '\n';
}
const allViolations = [
...result.evaluationResult.functionality.violations.map((v) => ({
...v,
category: 'Functionality',
})),
...result.evaluationResult.connections.violations.map((v) => ({
...v,
category: 'Connections',
})),
...result.evaluationResult.expressions.violations.map((v) => ({
...v,
category: 'Expressions',
})),
...result.evaluationResult.nodeConfiguration.violations.map((v) => ({
...v,
category: 'Node Configuration',
})),
];
if (allViolations.length > 0) {
report += '**Violations**:\n';
allViolations.forEach((v) => {
report += `- [${v.type.toUpperCase()}] ${v.category}: ${v.description}\n`;
});
report += '\n';
}
});
return report;
}
/**
* Displays test results summary in the console
* @param testCases - Array of test cases
* @param results - Array of test results
*/
export function displayTestResults(
testCases: Array<{ id: string; name: string }>,
results: TestResult[],
): void {
console.log();
console.log(formatHeader('Test Results', 70));
console.log();
for (const testCase of testCases) {
const result = results.find((r) => r.testCase.id === testCase.id);
if (result) {
const status = result.error ? 'fail' : 'pass';
const badge = formatStatusBadge(status);
const score = result.error ? 'N/A' : formatColoredScore(result.evaluationResult.overallScore);
console.log(` ${badge} ${formatTestName(testCase.name, testCase.id)}`);
console.log(
` Score: ${score} | Nodes: ${result.generatedWorkflow?.nodes?.length ?? 0} | Time: ${result.generationTime}ms`,
);
if (result.error) {
console.log(` ${pc.red('Error:')} ${pc.dim(result.error)}`);
}
}
}
}
/**
* Displays the evaluation summary table
* @param results - Array of test results
* @param metrics - Calculated metrics
*/
export function displaySummaryTable(
_results: TestResult[],
metrics: {
totalTests: number;
successfulTests: number;
averageScore: number;
categoryAverages: Record<string, number>;
violationCounts: { critical: number; major: number; minor: number };
},
): void {
const { totalTests, successfulTests, averageScore, categoryAverages, violationCounts } = metrics;
const failedTests = totalTests - successfulTests;
const summaryTable = new Table({
head: ['Metric', 'Value'],
style: { head: ['cyan'] },
});
summaryTable.push(
['Total Tests', totalTests.toString()],
['Successful', pc.green(successfulTests.toString())],
['Failed', failedTests > 0 ? pc.red(failedTests.toString()) : '0'],
['Average Score', formatColoredScore(averageScore)],
[pc.dim('─'.repeat(20)), pc.dim('─'.repeat(20))],
['Functionality', formatColoredScore(categoryAverages.functionality)],
['Connections', formatColoredScore(categoryAverages.connections)],
['Expressions', formatColoredScore(categoryAverages.expressions)],
['Node Config', formatColoredScore(categoryAverages.nodeConfiguration)],
[pc.dim('─'.repeat(20)), pc.dim('─'.repeat(20))],
[
'Critical Issues',
violationCounts.critical > 0 ? pc.red(violationCounts.critical.toString()) : '0',
],
['Major Issues', violationCounts.major > 0 ? pc.yellow(violationCounts.major.toString()) : '0'],
['Minor Issues', pc.dim(violationCounts.minor.toString())],
);
console.log();
console.log(formatHeader('Summary', 70));
console.log(summaryTable.toString());
}
/**
* Displays detailed violations grouped by severity
* @param results - Array of test results
*/
export function displayViolationsDetail(results: TestResult[]): void {
// Collect all violations with test context
const allViolations: Array<{
violation: Violation & { category: string };
testName: string;
}> = [];
results.forEach((result) => {
if (!result.error) {
const testViolations = [
...result.evaluationResult.functionality.violations.map((v) => ({
violation: { ...v, category: 'Functionality' },
testName: result.testCase.name,
})),
...result.evaluationResult.connections.violations.map((v) => ({
violation: { ...v, category: 'Connections' },
testName: result.testCase.name,
})),
...result.evaluationResult.expressions.violations.map((v) => ({
violation: { ...v, category: 'Expressions' },
testName: result.testCase.name,
})),
...result.evaluationResult.nodeConfiguration.violations.map((v) => ({
violation: { ...v, category: 'Node Config' },
testName: result.testCase.name,
})),
];
allViolations.push.apply(allViolations, testViolations);
}
});
if (allViolations.length === 0) return;
console.log();
console.log(formatHeader('Violations Detail', 70));
// Group violations by severity
const criticalViolations = allViolations.filter((v) => v.violation.type === 'critical');
const majorViolations = allViolations.filter((v) => v.violation.type === 'major');
const minorViolations = allViolations.filter((v) => v.violation.type === 'minor');
// Display critical violations
if (criticalViolations.length > 0) {
console.log();
console.log(pc.red('Critical Violations:'));
criticalViolations.forEach(({ violation, testName }) => {
console.log(
` ${formatViolationType('critical')} [${violation.category}] ${violation.description}`,
);
console.log(` ${pc.dim(`Test: ${testName} | Points: -${violation.pointsDeducted}`)}`);
});
}
// Display major violations
if (majorViolations.length > 0) {
console.log();
console.log(pc.yellow('Major Violations:'));
majorViolations.forEach(({ violation, testName }) => {
console.log(
` ${formatViolationType('major')} [${violation.category}] ${violation.description}`,
);
console.log(` ${pc.dim(`Test: ${testName} | Points: -${violation.pointsDeducted}`)}`);
});
}
// Display minor violations
if (minorViolations.length > 0) {
console.log();
console.log(pc.gray('Minor Violations:'));
minorViolations.forEach(({ violation, testName }) => {
console.log(
` ${formatViolationType('minor')} [${violation.category}] ${violation.description}`,
);
console.log(` ${pc.dim(`Test: ${testName} | Points: -${violation.pointsDeducted}`)}`);
});
}
}

View File

@@ -20,8 +20,9 @@
"deps:report": "madge src/index.ts --json > deps-report.json && echo 'Dependency report saved to deps-report.json'",
"deps:orphans": "madge src/index.ts --orphans",
"deps:all": "pnpm run deps:graph && pnpm run deps:graph:service && pnpm run deps:graph:tools && pnpm run deps:circular && pnpm run deps:report",
"eval": "tsx evaluations/run-evaluation.ts",
"eval:generate": "GENERATE_TEST_CASES=true tsx evaluations/run-evaluation.ts"
"eval": "tsx evaluations",
"eval:langsmith": "USE_LANGSMITH_EVAL=true tsx evaluations",
"eval:generate": "GENERATE_TEST_CASES=true tsx evaluations"
},
"main": "dist/index.js",
"module": "src/index.ts",

View File

@@ -66,7 +66,7 @@ export class ParameterUpdatePromptBuilder {
const examples = this.selectRelevantExamples(context);
if (examples.length > 0) {
sections.push('\n## Relevant Examples');
- sections.push(...examples);
+ sections.push.apply(sections, examples);
}
}

View File

@@ -1,10 +1,12 @@
import { AIMessage } from '@langchain/core/messages';
- type AIMessageWithUsageMetadata = AIMessage & {
+ export type AIMessageWithUsageMetadata = AIMessage & {
response_metadata: {
usage: {
input_tokens: number;
output_tokens: number;
cache_read_input_tokens?: number;
cache_creation_input_tokens?: number;
};
};
};