How we built a multi-cloud asset discovery system that processes massive datasets without breaking the bank.
When you're scanning cloud infrastructure across AWS, GCP, Azure, Databricks, and Snowflake, you quickly run into a data volume problem. A single AWS account can generate billions of rows of metadata across all its resources and configurations.
Multiply that across multiple cloud providers and you're looking at massive datasets that need to be stored, kept up to date, and queried with sub-second response times.
Traditional databases don't scale economically for this use case. We needed something different.
The problem: Multi-cloud visibility at enterprise scale
Modern enterprises don't use just one cloud provider. They run some workloads on AWS, others on GCP, maybe Databricks for ML workloads, and Snowflake for analytics. Each service generates its own metadata about resources, access patterns, and configurations.
Security teams need a unified view across all these providers. They need to answer questions like:
- What sensitive data do we have across all our cloud environments?
- Who has access to what resources?
- Which resources haven't been accessed in months and are costing us money?
- Are we compliant with GDPR, CCPA, and HIPAA across all environments?
The challenge isn't just aggregating this data; it's doing it at scale without massive infrastructure costs.
Why traditional databases fail at cloud-scale metadata
When we started building our cloud inventory system, the obvious choice seemed like a traditional relational database. Scan cloud APIs, normalize the data, store it in PostgreSQL or similar.
This approach breaks down quickly:
Volume overwhelms single instances. Cloud metadata isn't small. AWS alone can return gigabytes of resource information for a large enterprise account. When you're tracking historical changes and access patterns, storage requirements explode.
Scaling databases gets expensive fast. RDS instances with enough compute and storage to handle enterprise cloud metadata cost thousands per month. As data grows, costs scale linearly or worse.
Compute and storage are tightly coupled. Traditional databases bundle storage and compute in a single instance. You can't scale one without the other, leading to resource waste and cost inefficiency.
Our approach: Delta Lake for cloud metadata
Instead of fighting database limitations, we chose a different architecture: Delta Lake with decoupled storage and compute.
Architecture: S3 + structured data + flexible compute
Our system works in three layers:
Storage layer: All metadata lives in Google Cloud Storage as structured Parquet files managed as Delta Lake tables. (This could be S3 or any other object store; it's just a configuration change.) This gives us:
- Massive scalability at object storage prices
- ACID transactions on data lake storage
- Time travel and versioning built in (see the example just after this list)
- Schema evolution without migrations
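As a sketch of what the time travel point looks like in practice, here is the Spark/Delta SQL form; the version number and date are illustrative, and the cloud_inventory table is the one defined under "Data organization" below:
-- Every write produces a new table version; the change log is queryable
DESCRIBE HISTORY cloud_inventory;

-- Read the table as it existed at an earlier version or point in time
SELECT count(*) FROM cloud_inventory VERSION AS OF 12;
SELECT count(*) FROM cloud_inventory TIMESTAMP AS OF '2025-01-01';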
Compute layer: Our services use native Delta Lake libraries to read data directly, or load datasets into DuckDB for complex queries when needed. There are no permanent database instances consuming resources 24/7. DuckDB excels at analytical queries on large datasets with minimal overhead, which makes it a natural fit for cloud metadata analysis.
Query interface: Applications connect through standard SQL interfaces, but the underlying storage is optimized for cloud-scale data.
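To give a feel for the compute layer, here is a minimal sketch of querying the lake from DuckDB, assuming DuckDB's delta extension is available and using a placeholder bucket path; the same tables can equally be read through the native Delta Lake libraries mentioned above:
-- Load DuckDB's Delta Lake reader
INSTALL delta;
LOAD delta;

-- Query a Delta table in object storage directly; only the columns and
-- partitions the query needs are actually fetched
SELECT cloud_provider, count(*) AS resource_count
FROM delta_scan('s3://example-bucket/cloud_inventory')
GROUP BY cloud_provider;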
Data organization: Partitioned by cloud and resource type
We organize metadata using Delta Lake's partitioning features:
-- Example of our partitioning strategy
CREATE TABLE cloud_inventory (
  resource_id STRING,
  resource_type STRING,
  cloud_provider STRING,
  region STRING,
  last_accessed TIMESTAMP,
  data_classification STRING,
  access_permissions ARRAY<STRING>,
  cost_monthly DECIMAL(10, 2),
  created_at TIMESTAMP,
  -- Delta can't partition on an expression directly, so the year is derived
  -- as a generated column and used as the partition column
  created_year INT GENERATED ALWAYS AS (YEAR(created_at))
)
USING DELTA
PARTITIONED BY (cloud_provider, resource_type, created_year)
This partitioning strategy enables efficient queries like "show me all S3 buckets from the last quarter" without scanning irrelevant data.
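A sketch of that query against the table above (the resource_type value is illustrative, and Delta can typically derive the created_year partition filter from the created_at predicate, so only the matching partitions get scanned):
-- "All S3 buckets created in the last quarter"
SELECT resource_id, region, created_at
FROM cloud_inventory
WHERE cloud_provider = 'aws'
  AND resource_type = 's3_bucket'
  AND created_at >= date_sub(current_date(), 90)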
Cost optimization: Pay for what you use
The key insight is separating storage from compute:
Storage costs: Google Cloud Storage for structured metadata costs roughly $0.020 per GB per month. Even massive datasets with billions of rows cost hundreds of dollars per month, not thousands.
Compute costs: We only pay for compute when actively processing queries. A typical cloud scan might use compute for 10-15 minutes, then shut down until the next scheduled scan.
No idle resources: Traditional databases consume resources 24/7, whether you're querying them or not. Our approach eliminates idle costs entirely.
Real-world benefits: Discovery and cost savings
Six months after deployment, the cloud inventory system delivers measurable impact:
Complete multi-cloud visibility. Security teams can see all resources across AWS, GCP, Azure, and specialty platforms in a single interface. No more blind spots or manual account switching.
Automated compliance reporting. GDPR, CCPA, and HIPAA reports generate automatically by querying metadata across all environments. What used to take weeks of manual collection now happens in minutes.
Cost optimization insights. We identify unused resources that haven't been accessed in months. One customer found $50k/month in forgotten S3 buckets and idle compute instances.
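As a rough sketch, that kind of finding falls out of a simple query over the inventory table; the 90-day cutoff here is an illustrative threshold, not a fixed rule:
-- Resources untouched for ~3 months, ranked by what they cost each month
SELECT resource_id, resource_type, cloud_provider, cost_monthly
FROM cloud_inventory
WHERE last_accessed < date_sub(current_date(), 90)
ORDER BY cost_monthly DESC
LIMIT 100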
Scalable architecture. The system handles petabyte-scale enterprises as easily as smaller deployments. Storage and compute scale independently based on actual usage.
Lessons from DSPM experience
This architectural approach builds on lessons learned from the broader DSPM industry. Traditional database approaches often hit scaling walls when dealing with enterprise cloud environments: single-instance databases struggle with data volume, scaling becomes expensive, and query performance degrades as datasets grow.
Our Delta Lake architecture avoids these constraints by decoupling storage from compute and leveraging purpose-built tools like DuckDB for analytical workloads.
What's next: From inventory to active security
Cloud inventory is foundational, but it's just the beginning. The real value comes from layering security intelligence on top of comprehensive asset discovery.
Data classification at scale. Once you know what cloud resources exist, the next question is what sensitive data they contain. Our Q3 roadmap includes automated PII, PHI, and financial data detection across all discovered assets.
Access pattern analysis. Knowing who can access what becomes powerful when combined with usage patterns. Unused high-privilege access represents significant security risk.
Real-time security posture. Static inventory snapshots miss the dynamic nature of cloud environments. We're building continuous monitoring that detects configuration changes and access pattern anomalies in real time.
The cloud inventory foundation makes all of this possible. You can't secure what you can't see, and at enterprise scale, seeing everything requires an architecture that can handle massive data volumes economically.
Cloud inventory gives us that foundation. Now we're building the security intelligence on top.