#reading-list
**URL:** https://www.warpstream.com/blog/the-original-sin-of-cloud-infrastructure
**Date Read:** 2025-04-27
## Summary
- **Thesis**: most big-data infrastructure and database projects date from before The Cloud™️ was the dominant computing paradigm; in the cloud era, you have to specialize the entire technology stack around the cloud's unit economics to get cost efficiency and reliability.
- **Business Model I**: Cassandra, Kafka, and other databases were originally designed for on-prem deployment, generally self-hosted and managed by customers. Vendors built around these databases (DataStax, Confluent) made money via paid consulting services and support contracts.
- **Business Model II**: Hosted services. The rise of the SaaS business model let vendors host databases and other infra in their own cloud enclave, selling API/network access to customers. However, the _**products themselves are a terrible fit**_ for the cloud's unit economics. This is what Richie refers to as _“The Original Sin of Cloud Infrastructure”_: infra companies tried to just lazily take existing products off the shelf and sell them, rather than rework them to fit the cloud's particular strengths.
	- In particular, networking costs dominate vendor-hosted infra offerings and make them considerably more expensive than hosting the infrastructure yourself.
- Author’s anecdote: _**“Transmitting a single GiB of data through Kafka in the cloud costs more than storing a GiB of data in S3 for 2 months”**_
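	- A back-of-envelope sketch of why that anecdote is plausible, assuming typical AWS list prices (inter-AZ transfer billed at $0.01/GB on each side of the hop, S3 Standard at roughly $0.023/GB-month) and a conventional 3-broker Kafka deployment with replication factor 3 spread across 3 AZs; these figures are my assumptions, not from the article:

```python
# Assumed AWS list prices (us-east-1-style); verify against current pricing.
INTER_AZ_PER_GB = 0.02          # $0.01 egress + $0.01 ingress per cross-AZ hop
S3_STANDARD_PER_GB_MONTH = 0.023

# Writing 1 GiB: the producer hits the partition leader, which sits in a
# different AZ 2/3 of the time; the leader then replicates to 2 followers
# placed in the other two AZs.
produce_hop = (2 / 3) * INTER_AZ_PER_GB
replication_hops = 2 * INTER_AZ_PER_GB
# Reading it back once: the consumer fetches from the leader, again
# crossing AZs 2/3 of the time.
consume_hop = (2 / 3) * INTER_AZ_PER_GB

kafka_transfer_cost = produce_hop + replication_hops + consume_hop
s3_two_months = 2 * S3_STANDARD_PER_GB_MONTH

print(f"Kafka network cost per GiB: ${kafka_transfer_cost:.3f}")  # ~$0.067
print(f"S3 storage, 2 months:       ${s3_two_months:.3f}")        # ~$0.046
```

	- Under these assumptions the network transfer alone (~$0.067/GiB) already exceeds two months of S3 storage ($0.046/GiB), before counting compute or disks.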
- **Business Model III**: Sell a BYOC offering. Host the dataplane in the customer’s VPC, host the controlplane for the system in the vendor’s VPC.
- Goal: reduce networking costs, reap the benefits of remotely-managed infra
	- Reality: the vendor is now running `N` Kafka clusters and must staff a centralized team to manage all that complexity, which grows with the number of customers. It's incredibly hard to create efficiency gains because you're still running technology not built for the cloud, so there will always be something wrong.
- **The Author’s Way**: Keep the protocols, _**but rebuild the entire database**_ to suit the BYOC cloud business model.
> _This is the promise of BYOC done right – it’s how great things can be when infrastructure is truly designed for the cloud. **If a vendor is going to run software in the customer’s cloud account, that software has to be trivial to manage**. Trivial as in “completely stateless and almost impossible to mess up”, trivial as in: “if you accidentally delete the entire Kubernetes cluster running the software, you won’t lose any data”, trivial as in: “scaling in or out just means adding or removing containers”. Not trivial as in “we wrote a really sophisticated Kubernetes operator to manage this extremely stateful software that thinks it’s running in a datacenter”. In other words, not all BYOC implementations are created equal._
- **Duffy Editorial**: I led the development of Palantir’s streaming product, which also meant becoming the main operator of dozens of customer-specific Kafka clusters, each running in an isolated VPC. Over that time I wrote various runbooks teaching others how to do tasks such as reassigning topic partitions, gradually rolling out new broker nodes, temporarily enabling retention to clear up disk space, changing compression codec configurations, altering consistency settings, etc. etc. This was really hard, and over the course of 2 years we didn’t find any real silver bullets of automation or efficiency gains. We acknowledged internally that Kafka was a terrible piece of infrastructure to be responsible for. WarpStream seems to be doing all of the right things and I’m really excited to watch them grow.