How Did Cloudflare Build a Unified AI Data Lakehouse?

Managing a global network that handles more than one billion events every second across hundreds of cities requires a level of architectural precision that few organizations ever achieve. For a long period, the sheer velocity and volume of this information led to a fragmented digital environment where critical assets were scattered across disparate databases, streaming platforms, and various third-party cloud providers. This fragmentation created a substantial barrier to efficiency, as the lack of a centralized repository meant that even the most basic queries required navigating a labyrinth of isolated systems. To regain control over this expanding digital universe, the company embarked on a strategic overhaul to centralize its assets into a cohesive ecosystem known as Town Lake. This transformation was far more than a simple storage consolidation; it represented a fundamental shift in how the organization perceives and utilizes its intellectual property. By unifying these disparate streams, the company created a single source of truth that powers both human-driven analysis and automated intelligence.

The implementation of Town Lake allowed the organization to move beyond traditional storage limitations and embrace a model that prioritizes utility and accessibility. Central to this success was the development of Skipper, an AI-powered agent designed to open the information within the lakehouse to the whole company. In the past, uncovering critical business insights often required specialized technical knowledge, leaving many employees unable to use the data they needed for their daily operations. With the introduction of this AI system, the barrier to entry has been significantly lowered: users can ask complex questions in plain English and receive data-driven answers in real time. This transition from a siloed, expert-only model to a transparent and inclusive one has changed the internal culture, making every team more data-literate and more responsive to the needs of the global network.

Facing the Challenges of Hyper-Growth

The Cost of Fragmented Information

Before the centralization effort began, the organization struggled with a phenomenon commonly referred to as data sprawl, where valuable information was locked away in isolated systems such as Postgres and Kafka. This fragmentation created significant hurdles for engineers and analysts who were frequently forced to navigate a complex web of varying credentials and distinct query languages just to perform basic cross-functional analysis. Because there was no central map or catalog to guide them, finding specific tables or understanding the relationship between different datasets often required relying on tribal knowledge. This meant that newer employees or those outside of specific technical circles had to spend hours or even days reaching out to various colleagues simply to locate the correct source for their work. Such a reliance on informal communication networks slowed down the pace of innovation and introduced unnecessary friction into every major project.

Beyond the immediate loss of productivity, this fragmented approach to information management led to deep-seated issues regarding the consistency and reliability of business intelligence. When information exists in multiple silos without a unified governing structure, different departments often end up with conflicting versions of the same metric. For instance, a marketing team might pull usage stats from one source while the finance team pulls similar figures from another, leading to discrepancies that complicate executive decision-making. The absence of a standardized pipeline meant that every query was a high-effort endeavor, often resulting in duplicated work as different teams independently built similar but slightly different views of the same underlying facts. This lack of cohesion was not just an administrative nuisance; it was a strategic liability that hindered the company’s ability to respond quickly to market shifts and emerging security threats across its global infrastructure.

Accuracy and Cost in Scaling Infrastructure

Another significant challenge faced during the era of rapid expansion was the tension between system performance and the precision of analytical insights. To maintain the speed required for high-volume traffic monitoring, many older analytics pipelines relied on sampled data, which provided a general overview but lacked the granular detail necessary for high-precision tasks. While sampling is often sufficient for identifying broad trends, it is notoriously inadequate for mission-critical functions like billing accuracy or detailed security investigations where every single event matters. Relying on approximations meant that the organization risked missing small but significant anomalies that could indicate a sophisticated cyberattack or a subtle bug in a customer’s configuration. As the company scaled, the need for a system that could handle full-fidelity information without sacrificing performance became an undeniable priority for the engineering leadership.
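The gap between sampled and full-fidelity data can be shown with a small calculation. The sketch below is illustrative only: the event count, the anomaly positions, and the 1-in-100 systematic sampling scheme are assumptions chosen for the example, not figures from the company's pipelines.

```python
# Hypothetical stream: 1,000,000 events, 7 of which are anomalous.
TOTAL_EVENTS = 1_000_000
ANOMALY_POSITIONS = {13, 40_007, 250_001, 499_999, 612_345, 777_777, 903_211}


def is_anomalous(event_index: int) -> bool:
    return event_index in ANOMALY_POSITIONS


def count_full(total: int) -> int:
    """Inspect every event (full fidelity)."""
    return sum(1 for i in range(total) if is_anomalous(i))


def estimate_sampled(total: int, rate: int) -> int:
    """Inspect every `rate`-th event, then scale the count back up."""
    seen = sum(1 for i in range(0, total, rate) if is_anomalous(i))
    return seen * rate


full_count = count_full(TOTAL_EVENTS)
sampled_estimate = estimate_sampled(TOTAL_EVENTS, rate=100)
```

Here the full scan finds all seven anomalies, while the sampled pass happens to see none of them and estimates zero. For a broad traffic trend that error would be negligible; for a bill or a security investigation it is the whole answer.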

Financial considerations also played a pivotal role in the decision to overhaul the existing infrastructure and build an internal lakehouse solution. In previous years, the company relied heavily on external vendors and third-party cloud platforms for internal reporting and storage, which created a mounting financial burden as the volume of traffic continued to surge. These external dependencies not only increased operational costs but also introduced strategic risks, as sensitive internal metrics were hosted on infrastructure that the company did not fully control. By moving away from these costly third-party solutions and building Town Lake on its own network, the organization was able to significantly reduce overhead while simultaneously improving security and performance. This shift ensured that internal data lived within the same robust and secure environment that the company provides to its own customers, reinforcing the principle of using its own products to solve its most complex internal challenges.

Engineering the Town Lake Solution

A Robust Foundation for Analytics

To address the complexities of its vast and varied data streams, the organization built Town Lake as a unified data lakehouse that bridges the gap between the flexibility of a traditional data lake and the high performance of a structured database. The architectural centerpiece of this system is Apache Trino, a distributed SQL query engine that provides a single, unified interface for information stored across multiple formats and locations. Trino allows engineers to perform complex joins between different sources, such as merging customer metadata from a relational database with real-time usage statistics from a streaming platform, all within a single query. This capability eliminates the need for expensive and time-consuming data movement, as the engine can pull and process the necessary information in place. By standardizing on a single SQL-based interface, the company has made it possible for any employee with basic database knowledge to explore the entire ecosystem without learning a dozen different tools.
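The idea of one SQL statement spanning two separate stores can be sketched without a Trino cluster. The example below uses Python's built-in sqlite3 module as a stand-in: two in-memory databases play the roles of the relational source and the streaming source, and a single query joins them. The table names, columns, and rows are hypothetical. In Trino, the same pattern would address tables through catalog-qualified names configured by the operator.

```python
import sqlite3

# One connection, two independent in-memory databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS streaming")

# "Relational" source: customer metadata (hypothetical schema).
conn.execute(
    "CREATE TABLE main.customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?, ?)",
    [(1, "Acme", "EU"), (2, "Globex", "US")],
)

# "Streaming" source: usage events (hypothetical schema).
conn.execute("CREATE TABLE streaming.usage_events (customer_id INTEGER, requests INTEGER)")
conn.executemany(
    "INSERT INTO streaming.usage_events VALUES (?, ?)",
    [(1, 100), (1, 250), (2, 300)],
)

# A single query joins both sources; no data is copied between them first.
query = """
SELECT c.name, c.region, SUM(u.requests) AS total_requests
FROM main.customers AS c
JOIN streaming.usage_events AS u ON u.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, c.region
ORDER BY total_requests DESC
"""
rows = conn.execute(query).fetchall()
```

The analyst writes one statement and never has to know which system holds which table, which is the property the article attributes to the Trino layer.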

Supporting this high-performance query engine is a storage layer designed for both massive scale and long-term durability, built on the company’s own R2 object storage and the Apache Iceberg table format. Apache Iceberg provides the structure and management capabilities required to treat a collection of files in object storage as if they were a traditional database table, enabling features like ACID transactions and schema evolution. One of the most impactful features of this combination is time travel, which allows users to query historical versions of a dataset to see how it looked at a specific point in the past. This is particularly valuable for debugging and auditing, as it provides a clear record of how information has changed over time. Additionally, the system includes automatic data compaction, which merges billions of small files into larger, more efficient ones, ensuring that performance remains high even as the total volume of stored information reaches petabyte scale.
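Time travel is easiest to understand as "read the newest snapshot committed at or before a given moment." The toy class below illustrates that lookup rule only. It is not how Iceberg stores data (real snapshots reference shared data files rather than holding full copies), and the class name, timestamps, and rows are invented for the example.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Row = Tuple[str, int]


class SnapshotTable:
    """Toy table that keeps every committed version of its contents."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []             # commit times, strictly increasing
        self._snapshots: List[Tuple[Row, ...]] = []  # table contents at each commit

    def commit(self, ts: int, rows: List[Row]) -> None:
        if self._timestamps and ts <= self._timestamps[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        self._timestamps.append(ts)
        self._snapshots.append(tuple(rows))

    def read(self, as_of: Optional[int] = None) -> Tuple[Row, ...]:
        """Return the latest snapshot, or the one current at time `as_of`."""
        if not self._snapshots:
            raise LookupError("table has no snapshots")
        if as_of is None:
            return self._snapshots[-1]
        idx = bisect_right(self._timestamps, as_of)
        if idx == 0:
            raise LookupError("no snapshot exists at or before the requested time")
        return self._snapshots[idx - 1]


table = SnapshotTable()
table.commit(100, [("acme", 10)])
table.commit(200, [("acme", 10), ("globex", 25)])
table.commit(300, [("acme", 12), ("globex", 25)])

current = table.read()
as_of_250 = table.read(as_of=250)  # state after the second commit
```

Because older snapshots are never modified, an auditor can compare `as_of_250` with `current` and see exactly which rows changed in between.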

Automating Data Processing and Access

Managing the immense lifecycle of information within Town Lake required the development of specialized internal tools designed to handle the heavy lifting of extraction, transformation, and loading. One such tool, known as Transformer, utilizes a serverless approach to manage the cycles of data movement across the network, ensuring that information flows smoothly from its point of origin to its final destination in the lakehouse. Transformer is capable of handling complex logic and scheduling without requiring a massive, always-on infrastructure, which keeps operational costs low while maintaining the high availability needed for real-time analytics. This automation ensures that the data in Town Lake is always fresh and ready for consumption, reducing the manual effort previously required by data engineering teams and allowing them to focus on more strategic architectural tasks rather than routine maintenance.

In parallel with the processing pipelines, the company implemented a service called Lifeguard to manage the complex world of dynamic permissions and access control. In an environment where information is centralized, ensuring that only the right people have access to the right tables is a critical security requirement. Lifeguard integrates directly with the company’s existing identity management systems to grant or revoke access based on a user’s role, project requirements, and security clearance. This automated approach replaces the old, manual ticket-based system for granting permissions, which was often a significant bottleneck for researchers and analysts. By providing a streamlined and secure way to manage access, Lifeguard ensures that data remains protected while still being accessible enough to support the organization’s goals of democratization and rapid internal innovation across all departments.
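The article does not describe Lifeguard's internals, so the following is only a minimal sketch of role-based access with a deny-by-default rule. The class, table names, and role names are hypothetical, and a real service would pull roles from the identity provider rather than from an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AccessPolicy:
    # Maps a table name to the roles allowed to read it.
    # A table with no entry is readable by nobody.
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, table: str, role: str) -> None:
        self.grants.setdefault(table, set()).add(role)

    def revoke(self, table: str, role: str) -> None:
        self.grants.get(table, set()).discard(role)

    def can_read(self, table: str, user_roles: Set[str]) -> bool:
        # Access requires at least one of the user's roles to be granted.
        return bool(self.grants.get(table, set()) & user_roles)


policy = AccessPolicy()
policy.grant("billing.invoices", "finance-analyst")

analyst_allowed = policy.can_read("billing.invoices", {"finance-analyst"})
sales_allowed = policy.can_read("billing.invoices", {"sales"})
unknown_table_allowed = policy.can_read("security.raw_logs", {"finance-analyst"})
```

Granting and revoking become data changes rather than tickets, which is the bottleneck the automated approach is said to remove.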

Implementing a Default-Closed Security Model

Privacy and Protection by Design

A fundamental pillar of the Town Lake strategy was the transition to a default-closed security posture, a significant departure from the more open data cultures found in many large technology firms. In this environment, every new table or dataset added to the lakehouse is completely inaccessible to the general employee population by default. To gain access, a dataset must undergo a rigorous and comprehensive review process that involves both automated scanning and human oversight. This process is designed to identify and categorize sensitive information before it is ever made available for wider consumption, ensuring that privacy is a primary consideration from the moment data is ingested. By enforcing this strict entry requirement, the company has created a culture where data protection is not an afterthought but a prerequisite for any new analytical project or business initiative.

This proactive approach to security is further bolstered by a sophisticated classification system that automatically flags potential risks within the massive datasets flowing into the lakehouse. When a team proposes opening a table for general use, the system analyzes the schema and the underlying content to detect anything that might violate internal policies or global privacy regulations. If sensitive fields are discovered, the review team works with the data owners to ensure that appropriate protections, such as encryption or anonymization, are in place before any permissions are granted. This layer of governance provides the organization with the confidence to centralize its most valuable assets without fearing that a simple configuration error could lead to a major privacy incident. It ensures that the benefits of a unified lakehouse do not come at the expense of the trust that customers place in the company to handle their information responsibly.

Managed Access and Auditing

To maintain a high level of security without hindering the productivity of legitimate users, the organization developed a specialized service known as Skimmer, which employs artificial intelligence to continuously scan datasets for personally identifiable information. Skimmer acts as a persistent guardian, searching through billions of rows to locate sensitive details like email addresses, phone numbers, or IP addresses that might have been overlooked during the initial ingestion process. Even when an employee is granted access to a specific table, the system automatically redacts these sensitive details by default, allowing the user to perform their analysis on the non-sensitive portions of the data. This masking ensures that the vast majority of internal work can proceed without exposing personal data to individuals who do not have a strictly defined business need to see it.
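Redaction by default can be sketched with simple pattern matching. The article says Skimmer uses AI to find personal data; the regular expressions below are a much cruder stand-in, used only to show the shape of the masking step. The patterns, placeholder strings, and function names are assumptions for the example.

```python
import re
from typing import Dict

# Deliberately simple patterns; they will miss edge cases a real detector handles.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return IPV4_RE.sub("[REDACTED_IP]", text)


def redact_row(row: Dict[str, object]) -> Dict[str, object]:
    """Mask string values in a row; leave non-string values untouched."""
    return {k: redact(v) if isinstance(v, str) else v for k, v in row.items()}


masked = redact_row(
    {"message": "login by alice@example.com from 203.0.113.7", "status": 200}
)
```

The analyst still sees the event and its status code, which is usually enough for aggregate analysis, while the identifying details stay hidden.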

For those rare occasions when a developer or security researcher genuinely needs to access the raw, unredacted information for troubleshooting or investigation, the system requires an intentional action known as a session flip. This process temporarily elevates the user’s access level but does so under a magnifying glass, creating a detailed and immutable audit trail that security teams can review at any time. Every instance of a session flip is logged with information about who accessed the data, what specific query was run, and the business justification provided for the request. This high level of accountability serves as a powerful deterrent against the misuse of information and provides the security department with the tools they need to conduct thorough forensic investigations if a policy violation is suspected. The combination of AI-driven redaction and rigorous auditing creates a balanced ecosystem where data utility and personal privacy coexist.
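A session flip combines two behaviors: masked reads by default, and logged unmasked reads once a justification has been supplied. The sketch below models just that. The class and column names are hypothetical, the in-memory list stands in for an immutable audit store, and the time limit on elevated access mentioned above is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional

SENSITIVE_COLUMNS = {"email", "client_ip"}  # hypothetical classification result


@dataclass(frozen=True)
class AuditEntry:
    user: str
    query: str
    justification: str
    at: datetime


class Session:
    def __init__(self, user: str, audit_log: List[AuditEntry]) -> None:
        self.user = user
        self._audit_log = audit_log
        self._justification: Optional[str] = None

    def flip(self, justification: str) -> None:
        """Elevate the session; a non-empty business justification is mandatory."""
        if not justification.strip():
            raise ValueError("a business justification is required")
        self._justification = justification

    def fetch(self, query: str, rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
        if self._justification is None:
            # Default path: sensitive columns are masked, nothing is logged.
            return [
                {k: "[REDACTED]" if k in SENSITIVE_COLUMNS else v for k, v in r.items()}
                for r in rows
            ]
        # Elevated path: record who ran what, and why, before returning raw rows.
        self._audit_log.append(
            AuditEntry(self.user, query, self._justification, datetime.now(timezone.utc))
        )
        return rows


audit_log: List[AuditEntry] = []
session = Session("dana", audit_log)
sample = [{"email": "a@example.com", "status": 200}]

masked_rows = session.fetch("SELECT email, status FROM logins", sample)
session.flip("investigating a suspected account takeover")
raw_rows = session.fetch("SELECT email, status FROM logins", sample)
```

The elevated read cannot happen without leaving a record, so the audit trail is a side effect of access rather than something a user has to remember to create.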

Empowering the Workforce with Skipper AI

Bridging the Gap with Natural Language

The most visible component of the new data strategy is Skipper, a sophisticated AI agent that serves as the primary interface for employees interacting with the Town Lake infrastructure. Built entirely on the company’s own serverless developer platform, Skipper is designed to translate the complex technical requirements of data querying into the simplicity of natural language conversations. This allows a member of the sales team, for example, to ask a question like “Which customers in Europe saw the highest growth in traffic last month?” without needing to know the specific SQL syntax or table names required to generate that report. By acting as an intelligent intermediary, Skipper effectively removes the technical barriers that previously prevented a large portion of the workforce from engaging with the organization’s massive wealth of information.

Skipper is not merely a basic chatbot that returns canned responses; it is a reasoning engine capable of understanding the intent behind a user’s question and executing the steps needed to find an answer. When a query is received, the AI analyzes the request, determines which datasets are relevant, and formulates a plan to retrieve and analyze the information. This process often involves synthesizing data from multiple sources and applying complex business logic that would be difficult for a human to manage manually in a short timeframe. The result is a system that lets every employee do their own data analysis, fostering a more informed and agile workforce. Insights are no longer the exclusive domain of a few specialized teams but are available to anyone with a question and a business need to answer it.

Technical Innovation and Performance

To ensure that the AI provides reliable and accurate answers, the engineering team implemented a robust grounding system that provides the model with five distinct layers of context. This context includes detailed information about table schemas, human-written documentation, and even the original SQL code used by engineers to create and maintain the data tables. By analyzing how humans have historically interacted with the data, the AI can learn the nuances of the company’s specific business logic, such as how certain billing categories are defined or how different types of network traffic are classified. This deep level of understanding is critical for avoiding hallucinations, a common problem where AI models generate plausible-sounding but factually incorrect information. Grounding the AI in the actual technical and operational reality of the organization ensures that the answers it provides are not just fast, but fundamentally correct.
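Grounding of this kind usually comes down to assembling the right context and placing it in front of the model's instructions. The sketch below covers only the three layers the article names (schemas, documentation, and defining SQL); the remaining layers are not described in the source, so they are left out. The data class, prompt wording, and sample table are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableContext:
    name: str
    schema: str         # column names and types
    documentation: str  # human-written description
    defining_sql: str   # SQL that builds or maintains the table


def build_grounding_prompt(question: str, tables: List[TableContext]) -> str:
    """Assemble per-table context ahead of the user's question."""
    sections = []
    for t in tables:
        sections.append(
            f"### Table: {t.name}\n"
            f"Schema:\n{t.schema}\n"
            f"Documentation:\n{t.documentation}\n"
            f"Defining SQL:\n{t.defining_sql}"
        )
    context = "\n\n".join(sections)
    return (
        "Answer using only the tables described below. "
        "If they cannot answer the question, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


prompt = build_grounding_prompt(
    "What was total billable traffic last month?",
    [
        TableContext(
            name="billing.daily_usage",
            schema="customer_id BIGINT, day DATE, billable_bytes BIGINT",
            documentation="One row per customer per day; excludes internal test traffic.",
            defining_sql="SELECT customer_id, day, SUM(bytes) AS billable_bytes FROM ...",
        )
    ],
)
```

Including the defining SQL matters because it carries business rules, such as what counts as billable, that a bare schema does not reveal.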

A significant technical breakthrough in the development of Skipper was the implementation of a high-performance interaction method known as Code Mode. In traditional AI architectures, a model might make multiple, slow back-and-forth requests to a database, which can lead to high latency and increased costs. With Code Mode, the AI instead writes a complete script that is executed within a secure, sandboxed environment directly on the company’s network. This single-step execution allows the model to search through datasets, perform complex mathematical calculations, and even generate visual charts all at once. This approach significantly improves the speed of the user experience, providing answers in seconds rather than minutes, while also reducing the computational overhead. By executing code locally within a controlled environment, the company maintains strict security standards while providing a fluid and powerful tool for data exploration.
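The core of the approach is that the model emits one script and the platform runs it once, instead of the model requesting each step separately. The sketch below shows that execution pattern under loud assumptions: the `Warehouse` class, the script text, and the `result` convention are invented, and Python's `exec` with a trimmed set of builtins is not a security boundary. A production system would rely on a genuinely isolated runtime.

```python
from typing import Dict, List


class Warehouse:
    """Hypothetical read-only data access object exposed to generated scripts."""

    def __init__(self, tables: Dict[str, List[dict]]) -> None:
        self._tables = tables

    def query(self, table: str) -> List[dict]:
        return list(self._tables[table])


def run_generated_script(script: str, warehouse: Warehouse) -> dict:
    """Run a model-written script in one pass and return what it stored in `result`."""
    namespace = {
        # Expose only a few builtins. This limits accidents, not attackers.
        "__builtins__": {"sum": sum, "len": len, "max": max},
        "warehouse": warehouse,
        "result": {},
    }
    exec(script, namespace)
    return namespace["result"]


# Stand-in for a script the model might write: fetch, aggregate, and rank in one go.
GENERATED_SCRIPT = """
rows = warehouse.query("traffic")
result["total"] = sum(r["requests"] for r in rows)
result["top"] = max(rows, key=lambda r: r["requests"])["customer"]
"""

warehouse = Warehouse(
    {
        "traffic": [
            {"customer": "acme", "requests": 120},
            {"customer": "globex", "requests": 300},
            {"customer": "initech", "requests": 80},
        ]
    }
)
answer = run_generated_script(GENERATED_SCRIPT, warehouse)
```

In a step-by-step tool-calling design, the fetch, the sum, and the ranking would each cost a round trip through the model; here all three run in a single execution.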

Real-World Impact and Strategic Lessons

Driving Business Results

The successful integration of Town Lake and Skipper has delivered tangible improvements across multiple facets of the organization’s operations, with the billing and security departments seeing some of the most dramatic gains. In the past, discrepancies between the data shown on customer dashboards and the information used for internal invoicing could lead to complex support tickets and customer frustration. By moving to a unified lakehouse, the company ensured that both internal and external views are powered by the same underlying high-fidelity datasets. This synchronization has significantly improved billing accuracy and transparency, allowing the finance team to resolve disputes faster and provide customers with more detailed insights into their usage patterns. The ability to rely on a single source of truth has eliminated a major source of operational friction and strengthened customer trust.

In the realm of cybersecurity, the unified data platform has become an indispensable tool for identifying and mitigating global threats in near real-time. Security researchers can now use Skipper to query massive datasets spanning the entire network, allowing them to spot emerging attack patterns that might have been invisible in a fragmented environment. For example, when a new type of botnet activity is detected, the team can immediately analyze historical traffic to see where the threat originated and how it has evolved over the past few days. The speed and ease with which these queries can be performed allow the company to deploy updated protection rules to its global network much faster than was previously possible. This rapid response capability is a direct result of having a centralized, AI-enhanced data environment that can handle the full scale of the company’s traffic without the delays associated with manual data gathering.

Engineering Insights and AI Prompting

The journey of building this complex system provided the engineering team with several valuable insights that challenge common assumptions about AI development and infrastructure management. One of the most important lessons learned was that when it comes to AI prompting, less is often more. Initially, developers tried to micromanage the AI’s thought process by providing highly detailed, step-by-step instructions for every possible scenario. However, they found that the model performed significantly better when given high-level guidance and the freedom to reason through problems on its own. By trusting the model to handle the nuances of data interpretation within a well-defined context, the team was able to create a more flexible and robust system that could adapt to a wider range of user queries without constant manual updates.

Another critical takeaway from the project was the realization that the success of glamorous AI features depends entirely on the boring but essential work of infrastructure and metadata management. Without the rigorous access controls provided by Lifeguard, the clean data formats of Iceberg, and the comprehensive catalogs of DataHub, the AI would have been unable to provide useful or secure answers. The project demonstrated that a powerful AI is only as good as the data it can access and the rules it must follow. This perspective has led the organization to prioritize foundational data engineering as the primary driver of its AI strategy, ensuring that every new feature is built on a stable and well-governed base. This focus on the fundamentals has allowed the company to move quickly and innovate without compromising the integrity or security of its most critical information assets.

Charting the Path for Future Innovation

Toward Self-Serve Data Engineering

Looking toward the future, the organization is focused on expanding the capabilities of the Town Lake ecosystem to support a fully self-serve model for data engineering and analysis. The goal is to move beyond a world where specialized data teams act as gatekeepers for information and instead create a system where any department can deploy and manage its own curated datasets. In this vision, a team could simply provide a SQL file and a brief description of their requirements, and the platform would automatically handle the complexities of ingestion, scheduling, monitoring, and security review. This would drastically reduce the time it takes for a new idea to go from a conceptual question to a production-ready analytical tool, further accelerating the pace of innovation across the entire company.

To achieve this level of automation, the engineering team is working on deeper integrations between Skipper and the company’s internal communication and ticketing systems. Imagine a scenario where a security alert is triggered, and a specialized AI agent automatically gathers all the relevant data from the lakehouse, performs an initial analysis, and presents the findings directly within a chat thread for the on-call engineer. By embedding these data-driven insights into the tools that employees already use every day, the company aims to make informed decision-making a seamless part of every workflow. This shift toward a more proactive and integrated data environment will ensure that the organization remains at the forefront of the industry, capable of navigating the challenges of a rapidly changing digital landscape with precision and speed.

The Blueprint for Scalable Ecosystems

The development of Town Lake and Skipper has provided a successful blueprint for how modern organizations can manage the dual challenges of massive data growth and the need for advanced AI integration. By dogfooding its own products like R2 and Workers, the company has not only solved its own internal problems but has also demonstrated the power and security of its platform to the wider world. This experience highlighted that the most effective AI tools are not those built in isolation but those that are deeply integrated into a secure, well-governed, and high-performance data infrastructure. The transition from a fragmented collection of silos to a unified, intelligent lakehouse has fundamentally transformed the company’s ability to operate at a global scale, providing a clear path for other organizations seeking to harness the power of their own information.

The strategic decision to centralize assets into a single ecosystem has proved to be a pivotal moment in the organization’s technological evolution, enabling a new level of transparency and efficiency. By focusing on the foundational elements of storage, metadata, and security, the team created an environment where advanced intelligence can thrive without compromising privacy or performance. The project is a reminder that innovation often comes from combining cutting-edge technology with disciplined engineering practices and a clear vision for democratized access. Its lessons continue to shape how the company builds scalable systems that empower its workforce and protect its customers, and the result is a more resilient, data-driven culture that is better equipped to face the complexities of the modern digital era.
