Building a distributed task framework that processes sensitive data at scale without moving it outside customer networks.
When you're scanning terabytes of sensitive data for security violations, single-threaded processing doesn't cut it. Traditional pipeline orchestration tools like Airflow can distribute work, but they come with a problem: they assume your customer's data can leave the customer's network.
We needed something different: a system that could scale data processing across multiple workers while keeping data exactly where it belongs, in the customer's environment.
The problem: Scale vs. privacy in data scanning
Most data security tools face a fundamental tradeoff. You can either:
- Scale efficiently by moving data to centralized processing clusters
- Maintain privacy by keeping data in the customer's network
Traditional tools choose scale. They extract data, process it centrally, then send back results. This approach works for performance but breaks down when customers have strict data residency requirements.
Our customers often can't move sensitive data outside their networks. Healthcare companies with HIPAA requirements. Financial services with regulatory constraints. Government agencies with classification levels.
For them, the privacy-first approach isn't optional; it's mandatory.
Our approach: Distributed processing within customer networks
We built what we call Distributed GAI (General App Inspector), a system that brings compute to the data instead of data to compute.
Architecture: Celery + Redis + Kubernetes
The system runs on three core components:
Celery for task distribution: Handles job queuing and worker coordination. When we need to scan a large dataset, Celery breaks it into chunks and distributes them across available workers.
Redis for state management: Stores task state and intermediate results. If a worker goes down mid-scan, Redis ensures we can resume exactly where we left off.
Kubernetes for worker orchestration: Manages worker lifecycle and scaling. Most importantly, it lets us use spot instances for cost efficiency.
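To make the coordination concrete, here's a minimal sketch of how the first two pieces connect, assuming a Celery app backed by the in-cluster Redis (the `gai` app name and the Redis URLs are illustrative, not our exact configuration). The late-acknowledgment settings are what allow a task to be re-queued if its worker vanishes mid-scan:

# Minimal sketch: Celery app wired to the in-cluster Redis (URLs are illustrative)
from celery import Celery

celery = Celery(
    'gai',
    broker='redis://redis.scanner.svc:6379/0',   # task queue
    backend='redis://redis.scanner.svc:6379/1',  # task state and results
)

celery.conf.update(
    task_acks_late=True,              # only acknowledge a task once it has finished
    task_reject_on_worker_lost=True,  # re-queue tasks from workers that disappear
    worker_prefetch_multiplier=1,     # don't strand queued chunks on a doomed worker
)

Kubernetes then simply runs containers that start these workers; scaling the Deployment up or down changes how many chunks can be scanned in parallel.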
Processing flow: Chunked and parallel
Here's how a typical scan works:
# Simplified example of our chunking logic
def distribute_scan(dataset_path, chunk_size=1000):
    chunks = split_dataset(dataset_path, chunk_size)
    # Fan each chunk out to the worker pool and keep the async handles
    pending = [scan_chunk.delay(chunk.path, chunk.metadata) for chunk in chunks]
    # Block until every chunk has been scanned, then merge the findings
    return aggregate_results([task.get() for task in pending])

@celery.task(bind=True)  # `celery` is the app instance configured above
def scan_chunk(self, chunk_path, metadata):
    try:
        sensitive_data = analyze_for_pii(chunk_path)
        return {
            'chunk_id': metadata['chunk_id'],
            'findings': sensitive_data,
            'status': 'complete',
        }
    except Exception as exc:
        # Retry logic for spot instance interruptions
        raise self.retry(exc=exc, countdown=60, max_retries=3)
Instead of processing an entire database in sequence, we:
- Split data into chunks - Break large datasets into manageable pieces
- Process in parallel - Distribute chunks across multiple workers
- Handle interruptions gracefully - Resume tasks if spot instances get terminated
- Aggregate results - Combine findings from all workers (see the sketch below)
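The aggregation step itself is straightforward. Here's a minimal sketch, assuming each worker returns the per-chunk dictionary shown above (the report shape here is illustrative):

def aggregate_results(chunk_results):
    # Merge per-chunk findings into a single scan report
    report = {'findings': [], 'chunks_scanned': 0}
    for result in chunk_results:
        if result.get('status') == 'complete':
            report['findings'].extend(result['findings'])
            report['chunks_scanned'] += 1
    return report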
Cost optimization: Spot instances with resilience
The key insight was using spot instances: AWS instances that can be interrupted at any time but cost 60-90% less than on-demand.
Most systems avoid spot instances for critical workloads because interruptions cause data loss. We built interruption handling into the architecture:
- Stateless workers: No worker holds critical state that can't be reconstructed
- Checkpoint-based resumption: Tasks save progress to Redis at regular intervals (sketched below)
- Automatic retry: When a spot instance disappears, Kubernetes schedules the task on a new worker
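Here's a minimal sketch of the checkpointing idea, assuming every worker can reach the same Redis instance; the key format and the `list_files` helper are illustrative, not our exact implementation:

import redis

r = redis.Redis(host='redis', port=6379)  # illustrative connection settings

@celery.task(bind=True)
def scan_chunk_resumable(self, chunk_path, metadata):
    checkpoint_key = f"checkpoint:{metadata['chunk_id']}"
    # Resume from the last position recorded for this chunk, if any
    start = int(r.get(checkpoint_key) or 0)
    findings = []
    for position, file_path in enumerate(list_files(chunk_path)):
        if position < start:
            continue  # already processed before the interruption
        findings.extend(analyze_for_pii(file_path))
        r.set(checkpoint_key, position + 1)  # checkpoint after every file
    r.delete(checkpoint_key)
    return {'chunk_id': metadata['chunk_id'], 'findings': findings, 'status': 'complete'}

If a spot instance is reclaimed halfway through a chunk, the retried task picks up at the last recorded position instead of rescanning everything.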
This lets us process data at a fraction of traditional costs while maintaining reliability.
Why existing tools don't work
Airflow and similar orchestrators excel at complex workflows but assume you can move data freely. They're designed for centralized data lakes, not distributed customer environments.
Pipeline-as-a-Service tools offer convenience but require data egress. When your customer is a bank that can't move transaction data outside their VPC, these tools are non-starters.
Traditional scanning tools often use single-threaded processing or require expensive dedicated infrastructure.
We needed something that combined the scalability of modern orchestration with the privacy constraints of regulated industries.
The results: Faster scans, lower costs, zero data movement
Six months after deploying the distributed system:
10x faster processing. What used to take hours now completes in minutes. Parallel processing across dozens of workers transforms scan performance.
70% cost reduction. Spot instances deliver massive savings while our retry logic handles interruptions seamlessly.
Zero data egress. Sensitive data never leaves the customer's network. Processing happens where the data lives.
Flexible deployment. Customers can run our system in their own infrastructure or let us manage it without changing our privacy guarantees.
What's next
The distributed processing framework is just the beginning. We're exploring how this approach enables new capabilities:
- Real-time sensitive data detection as data moves through systems, not just at rest
- Cross-environment scanning that works consistently across cloud, on-premises, and hybrid setups
- Intelligent resource allocation that automatically scales workers based on data volume and sensitivity
As data volumes grow and privacy requirements tighten, the ability to process data where it lives without compromising scale or performance becomes increasingly critical.
The future of data security isn't about building bigger centralized systems. It's about building smarter distributed ones.