# Real-time data flow tracking: From code to cloud
Learn how we built a system to track sensitive data flows from source code to AI models in real time using DuckDB and efficient sampling.
When we started building our data security platform, we had a straightforward problem: existing tools only show you where data sits, not where it goes. They take snapshots of databases and file systems, but miss how data actually moves through your infrastructure.
This creates blind spots. Data flows through microservices, gets cached in Redis, processed by feature flags, and fed into ML models. Traditional scanning tools see none of this.
We needed to trace complete data flows from code to runtime to AI systems, continuously.
## The problem: Incomplete visibility across code, runtime, and cloud
Existing DSPM tools scan databases and file systems periodically. They'll find PII in your PostgreSQL database, but miss that it's also flowing through your feature flag service, cached in Redis, processed by three microservices, and feeding your ML personalization model.
But here's the deeper challenge: no single data source gives you complete visibility.
Source code analysis is static. We can identify table names, column names, and sometimes database names, but we may not always know the actual database host at runtime.
Runtime monitoring captures live data flows but misses the broader context of how systems connect.
Third-party integrations operate in black boxes with limited visibility into their internal data handling.
AI workflows make this worse. When engineers deploy new AI features, there's no unified view of training data sources, prompt inputs, or model outputs across these disparate monitoring approaches.
Deployment velocity doesn't match security visibility. Engineers ship daily; security scans run weekly. But more critically, each monitoring approach only sees part of the puzzle.
## Our approach: Correlating incomplete data sources for complete data journeys
We built visibility into four key points, but more importantly, we built the ability to correlate incomplete information across these sources to reconstruct complete data flows:
- Source code analysis - Parse repositories to map data flows at the application level
- Runtime monitoring - Track data movement through live infrastructure
- Third-party integrations - Monitor flows to SaaS tools and external APIs
- AI systems - Trace data through ML pipelines and model inference
The insight: we don't need to capture everything from each source, but we need to be able to match partial identifiers across sources to solve the complete data journey puzzle.
## The correlation challenge: Matching partial clues
Here's a real example of the puzzle we solve daily:
From source code analysis: We find database queries referencing the table customer_profiles with columns email, ssn, and phone, and a connection string pointing to ${DB_HOST}.
From runtime monitoring: We see data flowing from a service to an external endpoint, but connection details are dynamic.
From cloud inventory: We discover an RDS instance and a third-party integration API key.
From AI systems: We detect training data being fed into a personalization model.
The challenge: How do we connect these partial clues into a complete data journey?
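To make the puzzle concrete, here is a rough sketch of what those partial clues might look like once each source lands as rows in our datasets. The values and column names below are illustrative, not our actual schema:

```sql
-- Hypothetical rows showing the partial clues each source contributes
WITH source_code_hints(service_name, table_name, column_names, connection_pattern) AS (
    VALUES ('profile-service', 'customer_profiles', 'email, ssn, phone', '${DB_HOST}/customers')
),
runtime_connections(source_service, destination_host, data_volume) AS (
    VALUES ('profile-service', 'prod-rds-42.internal', 1048576)  -- host only known at runtime
),
cloud_resources(resource_id, host_address, database_names) AS (
    VALUES ('rds-prod-42', 'prod-rds-42.internal', 'customers')
)
SELECT * FROM source_code_hints, runtime_connections, cloud_resources;
```

No single row tells the full story, but the overlapping fields (service name, host address, database name) are what let us stitch the journey together.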
## Data storage: Parquet + DuckDB for intelligent correlation
When you're correlating data flows across potentially thousands of services, storage and query performance become critical. We settled on a combination that scales to billions of rows from multiple data sources: we store everything in Parquet datasets for efficient columnar storage and use DuckDB as our in-memory analytical engine. This lets us perform complex correlation queries that match partial identifiers across sources:
```sql
-- Example: Correlating partial database identifiers across sources
WITH source_code_hints AS (
    SELECT service_name, table_name, column_names, connection_pattern
    FROM code_analysis_data
    WHERE table_name LIKE '%customer%'
),
runtime_connections AS (
    SELECT source_service, destination_host, data_volume, timestamp
    FROM runtime_monitoring_data
    WHERE timestamp > NOW() - INTERVAL '1 hour'
),
cloud_resources AS (
    SELECT resource_id, host_address, database_names
    FROM cloud_inventory_data
    WHERE resource_type = 'database'
)
SELECT
    sc.service_name,
    sc.table_name,
    sc.column_names,
    cr.resource_id,
    rc.data_volume,
    rc.timestamp
FROM source_code_hints sc
-- Match code-level hints to observed runtime flows by service name
JOIN runtime_connections rc ON sc.service_name = rc.source_service
-- Resolve the dynamic destination host to a known cloud resource
JOIN cloud_resources cr ON rc.destination_host = cr.host_address
-- Fuzzy-match the database name embedded in the connection pattern
WHERE cr.database_names LIKE '%' || string_split(sc.connection_pattern, '/')[1] || '%';
```
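The three relations referenced above don't require a separate load step. As a minimal sketch (the file paths and dataset layout are illustrative), each one can simply be a DuckDB view over the underlying Parquet files:

```sql
-- Expose each Parquet dataset to the correlation query as a view
CREATE VIEW code_analysis_data AS
    SELECT * FROM read_parquet('datalake/code_analysis/*.parquet');

CREATE VIEW runtime_monitoring_data AS
    SELECT * FROM read_parquet('datalake/runtime_monitoring/*.parquet');

CREATE VIEW cloud_inventory_data AS
    SELECT * FROM read_parquet('datalake/cloud_inventory/*.parquet');
```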
As we capture more metadata about datastores and data flows, we're able to increase our accuracy in correctly identifying the edges between systems. Each new data source (additional runtime monitoring, expanded third-party integrations, deeper source code analysis) helps us rapidly solve more pieces of the organization's data flow puzzle.

Each intelligent service upstream of our data processing publishes data rapidly in its own schema. Our product pipelines then unify this data into a single delta lake per type of data we want to analyze:
- data_flows - Real-time data movement patterns across all sources
- source_code_mappings - Static analysis results with database and API references
- runtime_connections - Live infrastructure connectivity and data volumes
- third_party_integrations - External service data flows and transformations
- ai_model_lineage - Training data sources and model input/output tracking

This unified approach allows us to create snapshots of fully resolved data journeys over time. We're not just tracking current data flows; we're building a complete historical record of how sensitive data moves through your infrastructure.

This foundation will soon unlock time travel across data stores and data journeys: "Show me exactly how customer PII flowed through our systems on any date in the past quarter."
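As a sketch of what answering that question could look like under this model (the partition layout and column names are assumptions, not our exact schema), the PII query above becomes a filter over a snapshotted dataset:

```sql
-- Hypothetical layout: one hive-partitioned dataset per data type, keyed by snapshot date
SELECT source_service, destination, classification, data_volume
FROM read_parquet('data_flows/snapshot_date=*/*.parquet', hive_partitioning = true)
WHERE snapshot_date = '2025-03-15'  -- any date in the retained history
  AND classification = 'PII';
```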
## Real-time detection: Catching drift before it becomes a breach
Static security configurations drift over time. A developer adds a new service. A data scientist experiments with a different ML library. Someone connects a new SaaS tool.
Our system continuously scans for what we call "pipeline-aware drifts": changes in data flow patterns that could indicate new security risks, such as:
- Sensitive data suddenly flowing to a new third-party service
- AI models beginning to process data they weren't originally trained on
- Changes in access patterns that suggest potential insider threats
- New edges discovered between systems through improved correlation accuracy
When we detect anomalous flows, we don't just alert; we provide actionable context about what changed, when, and what the business impact might be. More importantly, as our correlation algorithms improve with more metadata, we can automatically discover previously unknown connections between services by matching partial identifiers across data sources.
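As an illustration of the first kind of drift (sensitive data reaching a destination we have never seen it reach before), a check can be expressed as a baseline comparison in DuckDB. The table and column names below are hypothetical:

```sql
-- Flag PII flows to destinations that do not appear in the 30-day baseline
WITH baseline AS (
    SELECT DISTINCT source_service, destination_host
    FROM runtime_monitoring_data
    WHERE timestamp BETWEEN NOW() - INTERVAL '30 days' AND NOW() - INTERVAL '1 hour'
),
recent AS (
    SELECT source_service, destination_host, SUM(data_volume) AS bytes_moved
    FROM runtime_monitoring_data
    WHERE timestamp > NOW() - INTERVAL '1 hour'
      AND sensitivity = 'PII'  -- hypothetical classification column
    GROUP BY source_service, destination_host
)
SELECT r.source_service, r.destination_host, r.bytes_moved
FROM recent r
WHERE NOT EXISTS (
    SELECT 1
    FROM baseline b
    WHERE b.source_service = r.source_service
      AND b.destination_host = r.destination_host
);
```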
## The results: Complete data journey reconstruction
Six months after deployment, here's what we're seeing:
80% reduction in manual oversight. Security teams spend their time investigating real risks instead of chasing false positives or missing blind spots entirely.
Complete data flow visibility. By correlating partial identifiers across source code, runtime monitoring, and cloud inventory, we can trace complete data journeys even when no single source has the full picture.
Automatic edge discovery. The system identifies previously unknown connections between services and data stores by matching database names, table schemas, and connection patterns across different monitoring sources.
Historical data journey analysis. Compliance evidence takes minutes, not weeks. When auditors ask about data handling practices, our customers can generate comprehensive reports showing exactly how data flowed through their infrastructure on any historical date.
Proactive AI governance. Instead of security being a bottleneck for AI deployment, it becomes an enabler: teams can ship AI features confidently, knowing that training data sources, model inputs, and outputs are all tracked as part of complete data journeys.
## What's next
We're just getting started. The future of data security isn't about building higher walls around static data; it's about understanding and controlling how data moves through increasingly complex, AI-driven systems.
As AI systems become more autonomous and data becomes more executable, the security tools we build today will determine whether the next generation of software can be both innovative and trustworthy.
The world's data is moving faster than ever. It's time our security moved with it.