Architecture & Design

Watcher is an open-source ETL metadata framework for high-performance data pipeline monitoring and observability. It is built with FastAPI and tuned for low-latency responses, so that calls to it add minimal overhead to your data pipelines.

Design Philosophy

Watcher is built on several core design principles that guide its architecture and implementation:

Configuration as Code

Watcher is designed to reflect configuration stored in source control. Updates to that configuration are picked up through hash-based change detection and applied automatically, which keeps Watcher synchronized with the configuration in your code.

Key Benefits:

  • Version Control Integration - Pipeline and lineage definitions stored alongside ETL code

  • Automatic Synchronization - Changes in code automatically update Watcher configuration

  • Reproducibility - Same configuration across all environments

  • Code Review - Pipeline changes go through the same review process as code changes

Recommended Practice: Store your Pipeline configuration and Address Lineage in source control within your pipeline code for optimal integration.
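
The hash-based change detection described above can be illustrated with a minimal sketch. It assumes the configuration is a JSON-serializable dictionary and that SHA-256 over a canonical serialization is an acceptable fingerprint. The function names and the example pipeline fields are hypothetical and are not taken from Watcher's code.

```python
import hashlib
import json
from typing import Optional


def config_hash(config: dict) -> str:
    """Return a stable SHA-256 digest of a configuration dictionary."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically equal configs always produce the same digest.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_update(stored_hash: Optional[str], config: dict) -> bool:
    """Report whether the incoming config differs from what was last stored."""
    return stored_hash != config_hash(config)


# Hypothetical pipeline definition kept in source control next to the ETL code.
pipeline = {"name": "daily_orders", "pipeline_type": "extraction", "freshness_hours": 24}

stored = config_hash(pipeline)
# Same content with keys in a different order: no update needed.
unchanged = needs_update(stored, dict(reversed(list(pipeline.items()))))
# A changed value produces a different digest: update needed.
changed = needs_update(stored, {**pipeline, "freshness_hours": 12})
```

Comparing one digest per definition keeps the unchanged case cheap: nothing is written unless the digest differs from the stored one.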

Efficiency & Performance

Watcher is designed to be efficient so that it has minimal impact on the data pipelines it monitors. Non-essential operations run in the background.

Performance Features:

  • Async Operations - Non-blocking I/O for maximum throughput

  • Background Processing - Heavy operations don’t impact API response times

  • Connection Pooling - Efficient database connection management

  • Optimized Queries - Minimal database round trips with strategic indexing
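
The idea of answering a request immediately while slow work continues in the background can be sketched as follows. Watcher hands such work to Celery. This sketch uses only asyncio so that it runs without extra dependencies, and every name in it is hypothetical.

```python
import asyncio

processed: list = []


async def heavy_analysis(execution_id: str) -> None:
    """Stand-in for a slow, non-essential job (hypothetical)."""
    await asyncio.sleep(0.05)
    processed.append(execution_id)


async def record_execution(execution_id: str, background: set) -> dict:
    """Acknowledge the request at once and defer the slow work."""
    task = asyncio.create_task(heavy_analysis(execution_id))
    # Keep a reference so the task is not garbage-collected while running.
    background.add(task)
    task.add_done_callback(background.discard)
    return {"execution_id": execution_id, "status": "accepted"}


async def main() -> dict:
    background: set = set()
    response = await record_execution("exec-1", background)
    # The response is available before the slow job has finished.
    pending_at_response = len(processed) == 0
    # Drain outstanding work before shutting down.
    await asyncio.gather(*background)
    return {"response": response, "pending_at_response": pending_at_response}


result = asyncio.run(main())
```

The caller's latency is bounded by the cost of scheduling the task, not by the cost of the task itself.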

Scalability

Watcher is designed to scale to large amounts of data: a single instance can handle thousands of pipelines and millions of executions.

Scalability Features:

  • Horizontal Scaling - Multiple Celery workers for background processing

  • Database Optimization - Efficient queries and indexing for large datasets

  • Resource Management - Optimized memory and CPU usage

  • Load Distribution - Celery task distribution across multiple workers

Reliability

Watcher is designed to be deployed on Kubernetes to allow for replicas and failover, ensuring high availability and fault tolerance.

Reliability Features:

  • Container Orchestration - Kubernetes deployment for high availability

  • Task Persistence - Redis-backed queues prevent task loss

  • Error Recovery - Automatic retry logic with exponential backoff

  • Health Monitoring - Comprehensive health checks and alerting
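
Retry with exponential backoff, mentioned under Error Recovery, can be sketched in plain Python. This is an illustration of the technique, not Watcher's implementation. The base delay, the cap, and the use of full jitter are assumptions. Celery offers comparable behavior through task options such as `autoretry_for`, `retry_backoff`, and `retry_jitter`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound on the wait before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def retry(func: Callable[[], T], max_retries: int = 3, base: float = 1.0,
          cap: float = 60.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, retrying on any exception with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Full jitter spreads retries out so workers do not retry in lockstep.
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
    raise AssertionError("unreachable")
```

With the defaults, the delay bound doubles on each attempt (1 s, 2 s, 4 s, and so on) until it reaches the 60 s cap.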

Observability

Watcher is designed to be observable through its integration with Logfire. An external service that monitors Watcher itself is essential for active monitoring.

Observability Features:

  • Centralized Logging - Aggregated logs from all components

  • Performance Metrics - Request/response time monitoring

  • Error Tracking - Automatic error detection and alerting

  • Debugging Support - Detailed request tracing and analysis

High-Level Architecture

Watcher is built on a modern, high-performance stack designed for scalability and reliability:

Core Components:

  • FastAPI - High-performance web framework with async support

  • PostgreSQL - Reliable, open-source database with advanced features

  • Celery - Distributed task queue for background processing

  • Redis - Message broker and caching layer

  • Docker - Containerization for deployment and scaling

FastAPI Framework

FastAPI was chosen for its simplicity, speed, and modern async capabilities:

Key Features:

  • orjson Integration - Fast JSON serialization/deserialization for optimal performance

  • Gunicorn + Uvicorn - Production-ready ASGI server with worker management

  • asyncpg Integration - High-performance asynchronous PostgreSQL driver

  • Connection Pooling - Efficient database connection management

  • Pydantic Validation - Automatic request/response validation with database constraint enforcement

  • Full Async Support - Asynchronous operations throughout the application

Performance Benefits:

  • Sub-second response times for all API endpoints

  • Minimal overhead on pipeline execution

  • High concurrency support for multiple simultaneous requests

PostgreSQL Database

PostgreSQL provides the robust, scalable foundation for Watcher’s metadata storage:

Optimizations:

  • RETURNING Clauses - Return the inserted or updated row from the same statement, avoiding a follow-up SELECT and its extra round trip

  • Strategic Indexing - Comprehensive indexing strategy designed for large table growth

  • BIGINT Support - 64-bit keys handle massive scale, beyond the roughly 2.1 billion values a 32-bit INTEGER key allows

  • Check Constraints - Data quality enforcement at the database level

  • Custom Enums - Type-safe enumeration support for data validation

  • JSONB Support - Flexible metadata storage with efficient querying
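
The RETURNING optimization can be sketched as a single statement that both writes a row and reads back its generated values. The table and column names below are hypothetical. The sketch assumes a connection object exposing an awaitable `fetchrow(query, *args)` method, as asyncpg connections do, and uses an in-memory stub so that it runs without a database.

```python
import asyncio
from typing import Any, Optional, Protocol


class Connection(Protocol):
    """The single method this sketch relies on."""

    async def fetchrow(self, query: str, *args: Any) -> Optional[Any]: ...


# Hypothetical table and column names, for illustration only.
INSERT_EXECUTION = """
    INSERT INTO pipeline_execution (pipeline_id, status)
    VALUES ($1, $2)
    RETURNING id, status
"""


async def start_execution(conn: Connection, pipeline_id: int) -> Any:
    """Insert a row and read back its generated values in one round trip."""
    # Without RETURNING, this would need an INSERT followed by a separate SELECT.
    return await conn.fetchrow(INSERT_EXECUTION, pipeline_id, "running")


class StubConnection:
    """In-memory stand-in so the sketch runs without a database."""

    def __init__(self) -> None:
        self.round_trips = 0

    async def fetchrow(self, query: str, *args: Any) -> Optional[Any]:
        self.round_trips += 1
        return {"id": 1, "status": args[1]}


stub = StubConnection()
row = asyncio.run(start_execution(stub, pipeline_id=7))
```

Against a real PostgreSQL connection, the same call would return the generated `id` without a second query.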

Celery Background Processing

Celery handles time-consuming operations to keep the main API responsive:

Performance Features:

  • Rate Limiting - Protects database from overwhelming requests

  • Task Duration Logging - Performance monitoring and analysis

  • Scalable Workers - Horizontal scaling for increased throughput

  • Retry Logic - Automatic failure recovery with exponential backoff
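
Rate limiting of the kind listed above is commonly implemented with a token bucket. The sketch below illustrates that technique in plain Python. It is not Watcher's or Celery's code, and the class name and parameters are assumptions. In Celery itself, a limit is typically declared with the task-level `rate_limit` option.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker would call `try_acquire()` before each database-heavy task and requeue or delay the task when it returns False.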

Redis Message Broker

Redis serves as the message broker and provides additional functionality:

Primary Functions:

  • Celery Queue - Task distribution and management

  • Persistence - Queue durability for reliability

  • Task Duration Storage - Performance metrics collection

  • Queue Monitoring - Real-time queue status and health checks
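
Task duration collection can be sketched with a timing decorator. An in-memory dictionary stands in for the Redis store here, and the decorator and task names are hypothetical. How Watcher actually records durations is not specified in this section.

```python
import functools
import time
from collections import defaultdict
from typing import Any, Callable

# In-memory stand-in for the Redis store: task name -> durations in seconds.
durations: dict = defaultdict(list)


def timed(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record how long each call to `func` takes, including failed calls."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # The finally clause runs whether the task succeeded or raised.
            durations[func.__name__].append(time.perf_counter() - start)

    return wrapper


@timed
def refresh_lineage(n: int) -> int:
    """Hypothetical background task."""
    return sum(range(n))


total = refresh_lineage(1000)
```

Recording in a `finally` clause means slow failures show up in the metrics alongside slow successes.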

Docker Containerization

Docker provides standardized deployment and development environments:

Container Strategy:

  • FastAPI Application - Web API container with Gunicorn workers

  • Celery Workers - Background task processing containers

  • Development Environment - Docker Compose for local development

  • Production Deployment - Kubernetes-ready container images

Logfire Integration

Logfire provides comprehensive logging and monitoring capabilities:

Features:

  • Centralized Logging - Aggregates logs from FastAPI and Celery workers

  • Error Rate Monitoring - Automatic alerting for error rate spikes

  • Performance Tracking - Request/response time monitoring

  • Debugging Support - Detailed request tracing and error analysis

Benefits:

  • Observability - Complete visibility into system behavior

  • Alerting - Proactive issue detection and notification

  • Debugging - Easy troubleshooting with detailed logs

  • Performance Analysis - Identify bottlenecks and optimization opportunities

Performance Design Goals

Watcher was designed with one primary goal: minimal impact on your data pipelines.

Key Principles:

  • Fast Response Times - Every API call optimized for speed

  • Efficient Database Queries - Minimal database round trips

  • Background Processing - Heavy operations don’t block API responses

  • Resource Optimization - Efficient memory and CPU usage

  • Scalable Architecture - Grows with your data infrastructure

Result: The Watcher framework provides comprehensive metadata management and monitoring capabilities while maintaining negligible performance impact on your data pipelines. This ensures that adding observability doesn’t slow down your data processing workflows.