Blog

Distributed data scanning: How we built scalable privacy-first processing

September 16, 2025
3 min. read
Felix Dent
Principal Engineer

Building a distributed task framework that processes sensitive data at scale without moving it outside customer networks.

When you're scanning terabytes of sensitive data for security violations, single-threaded processing doesn't cut it. Traditional pipeline orchestration tools like Airflow can distribute work, but they come with a problem: they assume data can be pulled out of your customer's network for central processing.

We needed something different. A system that could scale data processing across multiple workers while keeping data exactly where it belongs in the customer’s environment.

The problem: Scale vs. privacy in data scanning

Most data security tools face a fundamental tradeoff. You can either:

  1. Scale efficiently by moving data to centralized processing clusters
  2. Maintain privacy by keeping data in the customer's network

Traditional tools choose scale. They extract data, process it centrally, then send back results. This approach works for performance but breaks down when customers have strict data residency requirements.

Our customers often can't move sensitive data outside their networks. Healthcare companies with HIPAA requirements. Financial services with regulatory constraints. Government agencies with classification levels.

For them, the privacy-first approach isn't optional; it's mandatory.

Our approach: Distributed processing within customer networks

We built what we call Distributed GAI (General App Inspector), a system that brings compute to the data instead of data to compute.

Architecture: Celery + Redis + Kubernetes

The system runs on three core components:

Celery for task distribution: Handles job queuing and worker coordination. When we need to scan a large dataset, Celery breaks it into chunks and distributes them across available workers.

Redis for state management: Stores task state and intermediate results. If a worker goes down mid-scan, Redis ensures we can resume exactly where we left off.

Kubernetes for worker orchestration: Manages worker lifecycle and scaling. Most importantly, it lets us use spot instances for cost efficiency.
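
To make the wiring concrete, here's a minimal sketch of how these three pieces could fit together. The app name, Redis URLs, and settings below are illustrative assumptions, not our exact configuration; the important part is that Redis acts as both the task queue and the result store, and that late acknowledgement lets a lost worker's task be re-queued.

# Illustrative wiring of Celery to Redis (names and URLs are placeholder assumptions)
from celery import Celery

celery = Celery(
    'gai_scanner',                               # hypothetical app name
    broker='redis://redis.scanner.svc:6379/0',   # Redis queues the chunk tasks
    backend='redis://redis.scanner.svc:6379/1',  # and stores task state and results
)

celery.conf.update(
    task_acks_late=True,              # acknowledge a task only after it finishes...
    task_reject_on_worker_lost=True,  # ...so a killed spot worker puts its task back on the queue
    worker_prefetch_multiplier=1,     # don't strand queued chunks on an interrupted worker
)

Keeping both the broker and the result backend on Redis inside the customer's environment means the control plane never has to leave the network either.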

Processing flow: Chunked and parallel

Here's how a typical scan works:

# Simplified example of our chunking logic
def distribute_scan(dataset_path, chunk_size=1000):
    chunks = split_dataset(dataset_path, chunk_size)

    # Fan one task per chunk out to the workers and keep the async handles
    pending = [scan_chunk.delay(chunk.path, chunk.metadata) for chunk in chunks]

    # Combine findings once every worker has reported back
    return aggregate_results(result.get() for result in pending)

@celery.task(bind=True)
def scan_chunk(self, chunk_path, metadata):
    try:
        sensitive_data = analyze_for_pii(chunk_path)
        return {
            'chunk_id': metadata['chunk_id'],
            'findings': sensitive_data,
            'status': 'complete'
        }
    except Exception as exc:
        # Retry on another worker if a spot instance is interrupted mid-chunk
        raise self.retry(exc=exc, countdown=60, max_retries=3)

Instead of processing an entire database in sequence, we:

  1. Split data into chunks - Break large datasets into manageable pieces
  2. Process in parallel - Distribute chunks across multiple workers
  3. Handle interruptions gracefully - Resume tasks if spot instances get terminated
  4. Aggregate results - Combine findings from all workers (see the Celery sketch after this list)
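
In Celery terms, the fan-out and aggregation steps map naturally onto a chord: a group of chunk tasks followed by a callback that runs once they have all finished. The sketch below assumes a hypothetical merge_findings task rather than our actual aggregation code:

# Sketch: fan-out plus aggregation as a Celery chord (merge_findings is illustrative)
from celery import chord

@celery.task
def merge_findings(chunk_results):
    # chunk_results is the list of per-chunk dicts returned by scan_chunk
    return {
        'chunks_scanned': len(chunk_results),
        'findings': [f for r in chunk_results for f in r['findings']],
    }

def distribute_scan_as_chord(dataset_path, chunk_size=1000):
    chunks = split_dataset(dataset_path, chunk_size)
    # The callback fires only after every chunk task in the group has completed
    return chord(
        scan_chunk.s(chunk.path, chunk.metadata) for chunk in chunks
    )(merge_findings.s())

Either shape works; the chord just makes the "wait for every chunk, then aggregate" step explicit.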

Cost optimization: Spot instances with resilience

The key insight was using spot instances: AWS instances that can be interrupted at any time but cost 60-90% less than on-demand capacity.

Most systems avoid spot instances for critical workloads because interruptions cause data loss. We built interruption handling into the architecture:

  • Stateless workers: No worker holds critical state that can't be reconstructed
  • Checkpoint-based resumption: Tasks save progress to Redis at regular intervals (see the sketch after this list)
  • Automatic retry: When a spot instance disappears, Kubernetes schedules the task on a new worker
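
Here's a rough sketch of what checkpoint-based resumption can look like. The key names and the iter_records and analyze_record_for_pii helpers are hypothetical; the point is that a retried task reads the last offset its predecessor wrote to Redis and skips work that's already done:

# Sketch of checkpointing chunk progress to Redis (helpers and key names are illustrative)
import redis

state = redis.Redis(host='redis.scanner.svc', port=6379, db=2)

def scan_chunk_with_checkpoints(chunk_path, chunk_id, checkpoint_every=500):
    checkpoint_key = f'scan:checkpoint:{chunk_id}'
    # A retried task resumes from the last offset the interrupted worker recorded
    start = int(state.get(checkpoint_key) or 0)

    findings = []
    for offset, record in enumerate(iter_records(chunk_path, skip=start), start=start):
        findings.extend(analyze_record_for_pii(record))
        if (offset + 1) % checkpoint_every == 0:
            state.set(checkpoint_key, offset + 1)

    state.delete(checkpoint_key)  # chunk finished; clear its checkpoint
    return findings

Combined with the retry logic above, an interrupted chunk costs at most checkpoint_every records of rework rather than a full rescan.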

This lets us process data at a fraction of traditional costs while maintaining reliability.

Why existing tools don't work

Airflow and similar orchestrators excel at complex workflows but assume you can move data freely. They're designed for centralized data lakes, not distributed customer environments.

Pipeline-as-a-Service tools offer convenience but require data egress. When your customer is a bank that can't move transaction data outside their VPC, these tools are non-starters.

Traditional scanning tools often use single-threaded processing or require expensive dedicated infrastructure.

We needed something that combined the scalability of modern orchestration with the privacy constraints of regulated industries.

The results: Faster scans, lower costs, zero data movement

Six months after deploying the distributed system:

10x faster processing. What used to take hours now completes in minutes. Parallel processing across dozens of workers transforms scan performance.

70% cost reduction. Spot instances deliver massive savings while our retry logic handles interruptions seamlessly.

Zero data egress. Sensitive data never leaves the customer's network. Processing happens where the data lives.

Flexible deployment. Customers can run our system in their own infrastructure or let us manage it without changing our privacy guarantees.

What's next

The distributed processing framework is just the beginning. We're exploring how this approach enables new capabilities:

Real-time sensitive data detection as data moves through systems, not just at rest

Cross-environment scanning that works consistently across cloud, on-premises, and hybrid setups

Intelligent resource allocation that automatically scales workers based on data volume and sensitivity

As data volumes grow and privacy requirements tighten, the ability to process data where it lives without compromising scale or performance becomes increasingly critical.

The future of data security isn't about building bigger centralized systems. It's about building smarter distributed ones.
