Data Talks on the Rocks 4 - Alexey Milovidov, ClickHouse

Author: Michael Driscoll
Date: October 7, 2024
Data Talks on the Rocks is a series of interviews with thought leaders and founders discussing the latest trends in data and analytics.

Data Talks on the Rocks 1 features: 

  • Edo Liberty, founder & CEO of Pinecone
  • Erik Bernhardsson, founder & CEO of Modal Labs
  • Katrin Ribant, founder & CEO of Ask-Y and co-founder of Datorama

Data Talks on the Rocks 2 features: 

  • Guillermo Rauch, founder & CEO of Vercel
  • Ryan Blue, founder & CEO of Tabular, which was recently acquired by Databricks

Data Talks on the Rocks 3 features:

  • Lloyd Tabb, creator of Malloy and founder of Looker

While he was on a world tour, I was lucky enough to interview Alexey Milovidov, co-founder and CTO of ClickHouse. We chatted about the inception of ClickHouse and of course AI, but also had the time to dive deep into ClickHouse’s architecture, different database categories, and pain points in the market that ClickHouse uniquely solves.

I’ve noted some of my favorite highlights below:

Alexey (00:25:28): I am a ClickHouse user... I notice when anything is slightly not right... I always notice that something has to be improved. It’s not about technology, but about the attention to detail. The same thing applies when you install and start using ClickHouse. Actually, you don’t even have to install it; you just download and run it. That’s the magic.
Michael (00:50:45): The experience people have with real-time analytics... it’s an overloaded term: real-time can mean real-time reads or it can mean real-time writes.
Alexey: It should mean both.
Michael: It should mean both, but the experience of a lot of organizations is that they have a greater need for real-time reads, than they do for real-time writes.
Alexey: Don’t compromise on this. Don’t accept defeat.

FYI - portions of the transcript were edited for brevity and clarity. For the full, unscripted interview, watch the video.

Michael: Welcome to Data Talks on the Rocks. I’m Mike Driscoll, co-founder and CEO of Rill Data and today I’m delighted to have Alexey Milovidov, co-founder and CTO of ClickHouse with me. Alexey, welcome to Data Talks on the Rocks.

Alexey: Thank you.

Michael (00:00:20): We’re going to talk about ClickHouse. The database ecosystem is an incredibly crowded space. In the last few years we’ve seen lots of new entrants, and despite that crowd, ClickHouse has emerged as one of the fastest growing and most popular database technologies in the world. You have over 2,000 contributors to the project, and there are thousands of companies around the globe using ClickHouse for real-time use cases - Uber, Spotify, Cloudflare, eBay, and many others. I want to start with what I know about ClickHouse, and what many others know about ClickHouse: it is insanely fast. But before we dive into the architecture and go deeper into that, I really want to understand, from your perspective, why you believe ClickHouse is so popular and why developers are choosing it. I want to frame this discussion from the perspective of people who might be listening to this interview or searching the web - maybe an engineering leader or developer who needs to choose a new database to power a new initiative, or needs to migrate from an existing database that’s not scaling or meeting their needs. What are some pain points that ClickHouse uniquely solves which are not being solved by other databases or technologies in the market like Snowflake or Postgres?

Alexey (00:01:56): To me, ClickHouse is the only working technology, the only working analytics database. You take any other alternative like Snowflake and it doesn’t work for low latency queries. You can’t use it as a backend for real-time applications to serve user requests directly. 

You take BigQuery; it also doesn’t work. It can handle large data volumes, but not in an interactive fashion. You take Redshift and it also doesn’t work in real time; it does not provide all the features of ClickHouse, and it doesn’t have the same attention to detail. If you take Postgres, it is just not an analytical database. There are a few modules for Postgres, but they are far behind the quality that is actually required. If you take some other technology like Vertica, you will find that it was acquired by private equity a long time ago and will squeeze money from you, and the quality is far behind. So, ClickHouse is actually the most popular open source analytical database as of today, and I really like it.

Michael (00:03:17): It seems to me, as an observer, that more and more AI companies have been adopting ClickHouse for what feel like different use cases. They don't seem like they are all the same. Maybe you can just share with us, what you are seeing. What are some of the features of ClickHouse that are making it widely adopted by folks like Weights & Biases and others out there?

Alexey (00:03:42): There are many scenarios. One is about what you do to prepare your data to train machine learning or AI models. If you’re old fashioned, you’ll say machine learning; if you’re hip, you’ll say AI. This includes the scenario of feature stores. You collect a lot of data - you track your users for advertising. It’s a boring and maybe hated use case, but it exists. You collect this data, and for your model you have to organize this data into high-quality features. This process of data preparation, data aggregation, data summarization, and even profiling is done easily by ClickHouse.

Another scenario is MLOps and ML observability: how to make the process of machine learning not messy and chaotic - essentially, how to organize this process. This is what companies like LangChain and Weights & Biases use ClickHouse for.

Another scenario is semantic search and analytics on embeddings. Vector embeddings are a way to transform multimodal data - images, text, videos - into something that is suitable for search and analytics: multidimensional vectors. You can run queries that extract a subset of these vectors and search for similar ones, aggregate, and summarize to provide recommendations. I find that companies are using ClickHouse for this scenario, especially when the number of vectors is massive, cannot be indexed in advance, and cannot fit in memory. In these cases, many specialized vector databases stop working, but ClickHouse continues to work. Though we still have lots of work to do to optimize vector indexes - we still have to introduce some vector indexes that we do not have - we want to better serve this use case.
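To make the embeddings scenario concrete, here is a minimal sketch of brute-force similarity search over a plain Array(Float32) column. It assumes a ClickHouse server on localhost and the clickhouse-connect Python client; the table, columns, and embeddings are invented for illustration.

```python
# Minimal sketch: brute-force vector search over an Array(Float32) column.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
# Table, columns, and embeddings are made up for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS docs
    (
        id UInt64,
        title String,
        embedding Array(Float32)
    )
    ENGINE = MergeTree
    ORDER BY id
""")

client.insert(
    "docs",
    [
        (1, "intro to OLAP",  [0.1, 0.9, 0.0]),
        (2, "vector search",  [0.8, 0.1, 0.1]),
        (3, "time series db", [0.2, 0.7, 0.1]),
    ],
    column_names=["id", "title", "embedding"],
)

# cosineDistance is a built-in ClickHouse function; smaller means more similar.
query_vector = [0.15, 0.85, 0.0]
result = client.query(f"""
    SELECT id, title, cosineDistance(embedding, {query_vector}) AS dist
    FROM docs
    ORDER BY dist ASC
    LIMIT 2
""")
for row in result.result_rows:
    print(row)
```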

Michael (00:06:16): To repeat that back, there are three different use cases for AI where ClickHouse shines. One is feature stores, which sounds a lot like a database [data preparation, data science, data profiling, data cleaning].  The second is telemetry or observability, coming off the training of these models [not only training but also inferences, from inferences you get lots of logs and you can analyze it]. The third is like Pinecone and others, vector search and vector embedding.

Alexey (00:07:02): And you don’t have to limit yourself to a single scenario of vector search. You can search the metadata of your data. You don’t only search - you can also do analytics. You take the results, you put them into the context of the LLM, and you get something interesting out.

Michael (00:07:26): Does ClickHouse have a native data type for vector embeddings? 

Alexey (00:07:35): Yeah, what is interesting is that we have had this data type since 2012.

Michael (00:07:40): What was the motivation for that when it was built? Was it created for a search use case?

Alexey (00:07:42): It was not related to AI at all. It was used to collect identifiers and categories into an array, and for vector search we just use this array data type.

Michael (00:08:08): You shared last night that there is a new data type that ClickHouse supports natively, a JSON data type. Clearly, when it comes to observability and the data exhaust coming off of many systems, JSON is the lingua franca for how most systems emit data, so it’s no surprise that AI companies are also generating a lot of JSON logs that need to be made sense of. Beyond speed and beyond performance, which are top of mind when you think about ClickHouse, what are some other aspects that differentiate ClickHouse - cost-efficiency, the open source license, developer ergonomics? What are your opinions as the lead author of this project? What do you see developers responding to more compared to other tools?
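As an aside, the JSON type Michael refers to can be tried in a few lines. This sketch assumes a local ClickHouse server and the clickhouse-connect client; because the type was new (beta) at the time of this interview, it passes the experimental setting, which newer versions may not need. The table and fields are invented.

```python
# Sketch of the (then new) JSON column type: store semi-structured log events
# and read individual paths back as subcolumns with dot syntax.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
settings = {"allow_experimental_json_type": 1}  # may be unnecessary on newer versions

client.command("""
    CREATE TABLE IF NOT EXISTS app_logs
    (
        ts      DateTime,
        payload JSON
    )
    ENGINE = MergeTree
    ORDER BY ts
""", settings=settings)

client.command("""
    INSERT INTO app_logs VALUES
        (now(), '{"level": "error", "service": "api",    "latency_ms": 912}'),
        (now(), '{"level": "info",  "service": "worker", "latency_ms": 14}')
""", settings=settings)

# Individual JSON paths are addressable directly, without parsing at query time.
result = client.query("""
    SELECT ts, payload.service, payload.level, payload.latency_ms
    FROM app_logs
    ORDER BY ts
""", settings=settings)
for row in result.result_rows:
    print(row)
```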

Alexey (00:09:10): The feature set, and it comes from practice. For many use cases, it feels really natural when you see the features in ClickHouse. If you do clickstream analytics, web analytics, or telemetry, you’ll find functionality exactly for this. If you read the documentation from start to end, you’ll notice there are five or more functions just to calculate quantiles. There is a reason for that. People who build these applications will find that this is made exactly for them. It’s the same in other areas. I was surprised to see many blockchain analytics applications; they say data types like UInt256 are exactly the right data types for Ethereum. We actually designed them for other use cases, but they fit very naturally.
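For the curious, the quantile family is easy to see in a single query. This sketch runs against the built-in numbers() table function and assumes a local server with the clickhouse-connect client; the 0.95 level and the particular variants shown are just examples.

```python
# A few of the "five or more" quantile variants, compared on synthetic data.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

result = client.query("""
    SELECT
        quantile(0.95)(number)        AS sampling_based,  -- default, approximate
        quantileExact(0.95)(number)   AS exact,           -- exact, uses more memory
        quantileTDigest(0.95)(number) AS tdigest,         -- approximate, tiny state
        quantileTiming(0.95)(number)  AS timing           -- tuned for latency-like values
    FROM numbers(30000)
""")
print(result.result_rows[0])
```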

Michael (00:10:26): We’ll talk about some of those features in a bit. I want to go down a level and talk about the ClickHouse architecture. Nothing is ever built from scratch. What were some other technologies that inspired the design, the architecture, and the choices behind ClickHouse? What other databases have you experienced using, or studying in your earlier experiences or educational experiences? And were there any contrarian choices made where you decided to go against some of the conventions that other databases tools had chosen? What precedents led you to design ClickHouse the way you did?

Alexey (00:11:20): I started working on ClickHouse in 2009 for one particularly challenging task, one particularly large workload: web analytics. Actually, I did not want to implement ClickHouse because it sounded insane. If I tell you “go write a new database that is better than everything existing”, you would probably not do it. But I researched the area where comparable technologies existed and I can name all of them. In 2009, I considered Infobright; it does not exist anymore. The code base was open source and it moved to …

Michael (00:12:12): Wasn’t there also an Infobright engine that you could use inside MySQL?

Alexey (00:12:20): Yes, but when I tested it, it was not open source and it didn’t even support compression. Compression is essential for analytical databases. Actually, that was InfiniDB - Infobright supports compression, InfiniDB does not. Infobright still exists, but in a very unusual form. MonetDB, a research database from CWI in the Netherlands, worked, but not for every query. I remember I tested it on a set of queries and it did not work when I needed to do a GROUP BY on a string. It did not work well, which was surprising. I looked at it and threw it away. Another was Vertica.

Michael (00:13:21): Which started as C-Store, I believe…

Alexey (00:13:22): Yeah, it started as C-Store, created by Michael Stonebraker; then it was HP Vertica, and currently it’s OpenText Vertica.

Michael (00:13:39): What was the issue with C-Store? I know there was an open source version, but what was your experience with trying to run it?

Alexey (00:13:51): It did not even build. It was not a real project, not even an open source product. Vertica worked, but the cost was…

Michael (00:14:04): That’s great. You’re clearly not alone in evaluating those systems. Certainly, a lot of the engines today have done the same work you did, looked at those precedents, and chosen to go down the path of building something new. This is an educate-the-audience type of question. For those who are not deeply versed in columnar, MPP database architecture, I’d like to talk about database categories. Broadly, we have categories like OLAP databases (Druid, Pinot, DuckDB), OLTP systems, time series databases like Prometheus, and document stores like MongoDB and Elasticsearch. Where do you position ClickHouse amongst all these different axes today? Where would you say ClickHouse lands, and to the extent that it overlaps, what are some of those categories?

Alexey (00:15:22): I like your classification because I also like to collect different categories of databases. We can also talk about graph databases or the new modern one, vector databases. 

ClickHouse is OLAP or, to put it simply, an analytical database. The main difference between transactional and analytical databases is that with analytical databases you take large volumes, or a large subset, of data and summarize it as quickly as possible.

Michael (00:16:11): You put ClickHouse in the OLAP category. Clearly that’s where a lot of people see it, but last night I saw you present ClickHouse as a drop-in replacement for Prometheus. Traditionally, observability platforms have used time series databases under the hood. But lately, we’ve seen lots of observability platforms adopt ClickHouse under the hood - examples are Sentry and PostHog. We’re seeing a lot of different companies embed ClickHouse, using it as a time series database. Where do you see the similarities and differences between ClickHouse as an OLAP engine and Prometheus as a time series database?

Alexey (00:17:13): It’s interesting. Some people noticed how well ClickHouse works in the scenario of time series data, and they mistakenly label ClickHouse a time series database. And I have to correct them. ClickHouse is much more; the area of possible applications is much wider. Time series databases are suited for a narrow use case: when you have time and some metrics, that’s it. Honestly, you don’t need time series databases, because this use case, as we have proved, can be covered by a good analytical database.

Michael (00:18:00): Where do time series databases break down? Among [those who] read the data engineering subreddits, there’s a meme on the internet right now where a man says to his girlfriend, “take me somewhere expensive,” and then there’s a picture of them eating at a restaurant called Datadog. Clearly these time series databases that use tags and have a single time series per metric get very expensive as companies scale up. Why is that, and where do you think ClickHouse in particular, or OLAP engines in general, are a better alternative to a time series database with high-cardinality data?

Alexey (00:19:00): With this type of data, you don’t have to limit yourself to just this data model. You might imagine you could specialize your data structures, but actually you have to get the fundamentals right - data compression, the query engine, how data is actually sorted on disk, what the storage is. If you get those right, you can solve this more efficiently than when you just focus on a single data model. And to your question as to why Datadog is so expensive? Because they are greedy.
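As an illustration of what "getting the fundamentals right" can look like for metrics data, the sketch below defines a table whose sorting key clusters each series together on disk and whose columns use specialized compression codecs. The schema, names, and codec choices are an example only, not a recommendation from the interview; it assumes a local server and the clickhouse-connect client.

```python
# Illustrative metrics table: ORDER BY clusters each series on disk, and
# per-column codecs (DoubleDelta for timestamps, Gorilla for gauge values)
# compress time series data heavily before general-purpose ZSTD is applied.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS metrics
    (
        metric LowCardinality(String),
        host   LowCardinality(String),
        ts     DateTime64(3) CODEC(DoubleDelta, ZSTD),
        value  Float64       CODEC(Gorilla, ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (metric, host, ts)
""")

# A typical "time series" question answered by a plain analytical query.
result = client.query("""
    SELECT metric, toStartOfMinute(ts) AS minute, avg(value) AS avg_value
    FROM metrics
    GROUP BY metric, minute
    ORDER BY metric, minute
""")
print(result.result_rows)
```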

Michael (00:19:57): [laughs] One can argue that the market supports the price, and they are charging a fair price. Let’s talk about other OLAP engines, and rather than criticizing them, I want to turn this question on its head. When we started this conversation, we talked about the framing of an engineering leader or developer who needs to make a decision: “what’s the fast analytical database that I want to choose?” Right now a lot of people are looking at Apache Druid, Apache Pinot, DuckDB; there are some new contenders out there, there are always new folks showing up. Rather than talking about what you might not like about those OLAP engines, I’d like to hear, to the extent that you do, what you admire about these different OLAP engines. And maybe, if there is anything that you’ve learned from them, or anything that you may consider building ClickHouse towards in the future?

Alexey (00:21:00): First of all, don’t worry, I like every database technology. Every database is interesting in its own regard. Let’s talk about Druid. I remember when Druid was created at Metamarkets. I first learned about it when I found a really good article about the application of HyperLogLog algorithms. [article link]

Michael (00:21:27): As you know, I was at the company that created Druid. I was the CTO at Metamarkets and I pushed really hard for us to adopt HyperLogLog inside of Druid. It was being used at Facebook at the time. Facebook had an implementation of HyperLogLog for approximate counting. So, absolutely, I was very fond of that algorithm and that blog post, for sure.

Alexey (00:21:56): There are many technologies whose authors thought you should do something really complex to solve analytics. An example is multidimensional OLAP, where you pre-aggregate data by every required dimension. In contrast, ClickHouse tries to approach this in a simple way, maybe in a brute-force way. Don’t aggregate anything by default. Imagine a relational database with flat tables where you can calculate everything on the fly - and surprisingly, it works. If you compare it to Druid or other technologies like Apache Kudu or Apache Kylin - there are so many Apache technologies - they try to stick together multiple components. One will be responsible for data ingestion, another for aggregation, yet another for something else, and you have to write complex JSON or YAML configurations just to get data into the database. I don’t like it. Because if I start to read the documentation for, let’s say, Apache Kudu, it starts to get boring after the first paragraph.

Michael (00:23:57): I’m going to shift forward. You talked about getting tired of reading long documentation, and you were referring to some other database technologies out there that are quite complex - you have to manage a number of instance types in your cluster topology just to make them work. One of the things I personally admire about ClickHouse is the developer experience. As you said, there aren’t a lot of different types of instances, one involved in ingestion and a different one involved in joining subqueries… Getting started with ClickHouse is very easy. Using ClickHouse in general just feels good. What are the principles that have informed the design choices for [that] developer experience? When you think about introducing a new feature - a JSON data type, or support for consuming Kafka streams, or being able to automatically ingest data from an S3 bucket, a few of the things you talked about last night - how do you make sure that those features don’t end up having a wall of documentation that you have to scroll through just to get them to work?

Alexey (00:25:20): We ensure that we do have all the documentation for these features, but the main reason is that I am a ClickHouse user, and I am a really bad user. I notice when anything is slightly not right. When I go to production, I open the logs and I always notice that something has to be improved. When I open the website and, say, my display is not high DPI and a gradient doesn’t look correct, I report this. It’s not about technology, but about the attention to detail. The same thing applies when you install and start using ClickHouse. Actually, you don’t even have to install it; you just download and run it. That’s the magic. When the command line application opens, I ensure there is nothing strange for me. But I also think about all of our users.

Michael (00:26:33): It’s very common for people to say that great products are built by folks who are building that product for themselves. Clearly this is an example of that. You’re using the product every day - I’ve seen it on our way from Brussels to Amsterdam. Clearly, you have a deep connection with your product and technology that informs the design choices. I want to shift a little bit. There’s a vibrant community of open source users that are part of the ClickHouse ecosystem - 2,000 contributors to the project. On the other hand, you are the co-founder of a commercial company that has raised hundreds of millions of dollars, with customers demanding new features. How do you balance the demands of that vocal open source community, which in many ways helped get ClickHouse to where it is today, against the needs of commercial customers who may have a different set of needs and roadmap requests? How do you, as the author and largest contributor to ClickHouse, make sense of these competing, potentially inconsistent sets of requests?

Alexey (00:28:15): I would say the requests are mostly the same for open source and commercial users. When they are not the same, I try to slightly generalize and think more about these ideas. Maybe I will implement some feature set or functionality inside of ClickHouse that would automatically serve both types of audience. But we do have the principle, customer first: if something is needed for our customers, we will do it. 

Michael (00:28:50): What is the biggest difference between open source ClickHouse and ClickHouse Cloud?

Alexey (00:28:57): The table engine. ClickHouse Cloud has a different table engine that was developed to allow scaling - dynamic scaling, changing the number of nodes quickly. This makes it possible for us to run our service for many thousands of customers and, second, means they don’t have to plan their usage in advance. For open source users, we have different scenarios. For many of those scenarios it’s enough to provision a couple of machines, a couple of replicas, run ClickHouse, and the problem is solved. Some scenarios require loading several petabytes of data a day; then they have to actually think about their infrastructure, because it will cost real money, but ClickHouse scales for this workload - both the open source and the cloud version of ClickHouse.

Michael (00:30:18): I want to talk about ClickHouse past, present, and future. We’ve talked about some of the inspiration for the architecture of ClickHouse. Have there been any significant pivots or changes in the architecture of the ClickHouse project since its inception? Is there anything you thought you had right when you were designing ClickHouse and later learned, through real-world experience, that you needed to change - and what were the pressures that drove those changes?

Alexey (00:30:54): I remember 10 years ago, in 2014, ClickHouse was just starting to be used in production, but there was no replication. We were thinking about how to get replication into ClickHouse, and my colleagues who were Java engineers suggested [that we] use ZooKeeper. This decision was quite unnatural for me, but I accepted it. We started to use ZooKeeper for metadata, but two years ago we replaced ZooKeeper with our own implementation, ClickHouse Keeper, which is, by the way, compatible at the wire protocol level. So the API is the same, everything is the same, but it’s written in C++ because it’s more memory efficient, and it fixes a few unusual corner cases with ZooKeeper. By the way, ZooKeeper is a great technology, but it’s another component in a different language that you have to know how to deal with. Currently ClickHouse does everything itself.

Michael (00:32:33): You’re not the only open source project to refactor ZooKeeper out of its project. At last night’s Meetup, you mentioned the Analyzer and talked about the two year journey to get from alpha to beta to now being standard with every ClickHouse cluster that's launched in ClickHouse Cloud and available for all. Tell us a little bit about why the Analyzer took two years to build, and more importantly what’s possible now that wasn’t before now that the Analyzer exists?

Alexey (00:33:10): The Analyzer is the infrastructure for query analysis, query plans, and query optimization. Many people ask me why we had to implement it. There are open source projects for query analysis - Apache Calcite (written in Java), Substrait, Greenplum’s optimizer - why did I not just pick one of them? The reason is that ClickHouse has a lot of convenient and good features. It is a SQL database, and its SQL dialect contains a lot of extensions. For example, you can use aliases - names for expressions - in any part of other expressions. There are extensions such as ARRAY JOIN and lambda functions, higher-order functions, where SQL - actually a declarative language - starts to feel like a functional language, like Haskell. Some people like it. It is not as complex as Haskell.

We did not want to lose any of that, but we had to support all the corner cases, and most importantly, we had to do it in a 100% compatible manner. We didn’t want to make it like Python 2 and Python 3, where there are several incompatibilities and migration takes 10 years. We wanted to make it so that we just switch - enable it by default - and the user only gets the advantages: better performance, more features that were not available before, and no incompatibilities. And there are many such features, [such as] recursive CTEs. When we enabled the Analyzer, it was just a few days’ change; it was easy.
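The extensions Alexey mentions are easy to see in a single query. This sketch shows an alias reused inside other expressions, a lambda passed to a higher-order function, and arrayJoin unfolding an array into rows; it assumes a local server and the clickhouse-connect client.

```python
# A small illustration of ClickHouse's SQL extensions: aliases reused inside
# other expressions, lambdas / higher-order functions, and arrayJoin.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

result = client.query("""
    SELECT
        number,
        number * 10                         AS scaled,      -- alias defined here...
        scaled + 1                          AS scaled_plus, -- ...and reused here
        arrayMap(x -> x * x, range(number)) AS squares,     -- lambda + higher-order function
        arrayJoin(squares)                  AS one_square   -- unfolds the array into rows
    FROM numbers(1, 4)
""")
for row in result.result_rows:
    print(row)
```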

Michael (00:35:32): What are some real world use cases you think the Analyzer will have the biggest impact on? From your experience talking to users of ClickHouse out there.

Alexey (00:35:42): I think of data warehousing use cases. We discussed real-time analytics and analytical applications, but there is also classic data warehousing: you collect a lot of information about your company, about artifacts produced by your company, and you join it all for business analytics. The problem in this use case, compared to applications, is that these queries are ad hoc queries written by users or, even worse, generated by business intelligence systems. These queries are huge and contain subqueries, joins, correlated subqueries, and the database system has to deal with all this mess. It has to optimize. It has to rewrite these queries. Our ability to analyze and rewrite queries was quite limited before the Analyzer.

Michael (00:36:55): For instance, Tableau or Looker - if I look at the SQL generated by those systems, it can be quite verbose. You’re saying that now ClickHouse is better suited to handle some of the queries generated by those tools, in the same way that other databases like Oracle, Snowflake, BigQuery, and Redshift have their own analyzers that take those verbose queries and do some optimization to make them work.

Alexey (00:37:32): Yes. When you talk about business intelligence systems, I would draw a line and put Rill at the far end of the line. In between there would be Tableau and Looker. Then on the opposite side would be MicroStrategy and very old tools. Rill works perfectly with ClickHouse. Tableau has a connector for ClickHouse. But the main point is that every system has to work, even such old tools.

Michael (00:38:21): We’re talking about the past, present, and future. We talked about the Analyzer’s two-year journey, much like your decision to replace ZooKeeper in ClickHouse with ClickHouse Keeper. Obviously, as with Java, you could have done the same thing with your Analyzer and [just used] Apache Calcite, but there are a lot of advantages to leveraging something built natively to work inside of ClickHouse and take advantage of the unique extensions to SQL that ClickHouse has built. What is a similarly thorny, long-range problem with high impact, like the Analyzer, that you are excited to tackle with ClickHouse?

Alexey (00:39:07): First let’s talk about the product and our customers. When we first introduced our product, the early customers were - imagine a blockchain company from Hong Kong. Now there are more companies - imagine a bank in Germany. We will have more and more such customers. The main challenge is to make the product good for these new, large customers. We don’t even have every requested feature in our product yet. For example, it started with AWS support, then GCP half a year later, then Azure. Just a few weeks ago, we introduced a BYOC solution.

Michael (00:40:01): Bring-Your-Own-Cloud, for those not familiar with BYOC - since the C could stand for ClickHouse too. BYOC, much like with Databricks, means customers can bring their own VPC or cloud environment and ClickHouse Cloud will run within that environment.

Alexey (00:40:22): Now let’s imagine there is some old telecommunications company who says “we won’t use any sort of cloud, anytime, but we want to use ClickHouse and we want to use your product with the scalability that you provide. We want to use your engine, we want to use your security features.” We want to make our product good even for these types of customers. 

Michael (00:40:49): We talked about BYOC - bringing ClickHouse Cloud into the cloud environment of a major bank or pharmaceutical company or automotive company. Now you are talking about the possibility of bringing ClickHouse, and some of the features of ClickHouse Cloud, into an on-prem environment?

Alexey (00:41:14): It’s one of the possibilities. Why not? Why not bring the full feature set of our product to these types of customers? 

Michael (00:41:26): I’m going to challenge you on that a little bit. Doesn’t that feel like building a bridge to the past? Or do you see - maybe from your own experience observing the ClickHouse community - that there is a substantial number of companies out there that are not running in one of the major clouds? Is there a real market of customers that still have not adopted a traditional cloud vendor for their compute workloads?

Alexey (00:41:57): There definitely is, but what’s more interesting is this: let’s take a large, conservative company. We approach this company and ask if they’ll use ClickHouse, and they say probably not. But after a while, we discover that some of their engineers have already installed ClickHouse and are using it in production.

Michael (00:42:21): That’s often the case when the executive team does not know everything the engineering team is doing. It’s also the beauty of open source where you don’t need permission as an engineer working at a New York City hedge fund… you don’t have to get the finance team to approve an invoice. So I’m sure there are many instances of ClickHouse out there, many more than folks are aware of.

Now I want to talk about the ecosystem at large. One of the guests we’ve had previously on Data Talks on the Rocks was Ryan Blue, co-creator of Apache Iceberg and co-founder of Tabular, the commercializers of Iceberg, which was recently acquired by Databricks for a few billion dollars.

We’ve also seen lately that Microsoft has made a huge investment in their Microsoft Fabric architecture. These are just a few signals that data lake architectures are on the rise. The idea of object storage as a foundational fabric for where we store our data is really being embraced. 

In private conversations with engineering leaders and developers - some of the same people who are thinking about what their business should adopt as a data stack architecture - it feels like the data lake is winning. Because of [its] interoperability, the fact that formats like Iceberg are not proprietary, and the lower costs, it is being widely embraced. What does the rise of the data lake architecture mean for ClickHouse? And how is it influencing the roadmap?

Alexey (00:44:34): Data lakes look like a data format for tables that is separated from the actual query engine. So you can use different query engines on top of the same data and even share the data between different organizations. So it is quite nice. By the way, recently I made a prototype that demonstrates that the ClickHouse format is also a good data lake format.

Michael (00:45:08): A competitor to Parquet? You’re saying ClickHouse’s native underlying storage format itself could be something you store in object storage, and that it could become a standard like Parquet or LanceDB or other formats out there?

Alexey (00:45:30): It does not have to be as widespread as Iceberg. The point of ClickHouse is not only a good data format and a good query engine, but also integration. With ClickHouse you can connect data from Postgres, MySQL, and MongoDB and join everything together. In the same way, you can read a bunch of Parquet files from S3, Azure, or even Hive. You can read data from Iceberg, but the support for Iceberg in ClickHouse still requires a lot of work. We do have support for Iceberg, but, for example, we still have to work on schema evolution, and we still have to add a distributed query engine for Iceberg. We have a distributed query engine that works for a bunch of Parquet files; it will not be hard to do the same for Iceberg or other data lakes, and it makes sense. But there is a big difference between an actual database with tightly integrated formats and data lakes. The difference is, first, the ability to insert data with low latency from many clients. If you do it with Iceberg, the latency could be as high as minutes, so it’s not real-time analytics anymore. The other difference is the tight integration with the query engine. With all the data types that ClickHouse supports, all the indexes, it allows faster querying. Regardless of what you do with Iceberg, ClickHouse with its native format is faster.
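For reference, reading Parquet straight out of object storage looks roughly like this; the bucket, path, and column below are placeholders, and the sketch assumes a local server with the clickhouse-connect client.

```python
# Sketch of querying Parquet files in S3 directly via the s3 table function.
# The bucket, path, and column names are placeholders; NOSIGN is for public buckets.
# Assumes a local ClickHouse server and `pip install clickhouse-connect`.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

result = client.query("""
    SELECT count() AS rows, uniq(user_id) AS approx_users
    FROM s3(
        'https://example-bucket.s3.amazonaws.com/events/*.parquet',
        NOSIGN,
        'Parquet'
    )
""")
print(result.result_rows)
```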

Michael (00:47:17): In some ways, it’s not “either/or”, it’s “both”. The question is about architecture. Some people think of the data lake architecture that the data will live in the data lake and we will bring the compute engine to run over that data lake, that’s one approach. The other approach is there will be data in the data lake as a permanent archive of our data, but a subset of that data that we choose to frequently query, we’re going to orchestrate into another engine in a much more efficient format for querying, and serve those queries that require higher performance, higher frequency, more efficiency out of that orchestrated subset of data, that might live in ClickHouse, in ClickHouse format. 

Alexey (00:48:22): Yes, it should work. It has its use cases. But the question is, why not put everything into ClickHouse?

Michael (00:48:30): Certainly. I think the answer, coming from a data engineer or VP of engineering perspective, is the same reason why they don’t put all their data into any one single tool. I think it’s about interoperability, but that may change. I suspect the answer is that the world is not ready to adopt a one-size-fits-all database. Thinking about the landscape, we have tools like Iceberg, ClickHouse, Trino, and cloud data warehouses. Where do you see ClickHouse fitting in this landscape? Where do you think ClickHouse is a complement to another tool or technology, and where could [ClickHouse] be a replacement?

Alexey (00:49:37): There are underlying fundamental technologies like object storage - AWS S3 and MinIO - and we are definitely not going to replace MinIO; we integrate with these technologies. Another area of complementary technologies is streaming engines like Kafka and Redpanda. Actually, I don’t use Kafka; I just insert data into ClickHouse with the INSERT statement.

Michael (00:50:11): What do you think of object-store-backed alternatives to Kafka like WarpStream?

Alexey (00:50:19): Yes, it sounds very natural. There is some compromise: roughly half a second of latency on inserts. It is mostly unavoidable - half a second is about the 90th percentile latency when you use WarpStream with object storage - but apparently it’s still good enough.

Michael (00:50:42): The experience people have with real-time analytics - well, first of all, it’s an overloaded term: real-time can mean real-time reads or it can mean real-time writes.

Alexey (00:50:55): It should mean both.

Michael (00:50:56): It should mean both, but the experience of a lot of organizations is that they have a greater need for real-time reads than they do for real-time writes.

Alexey (00:51:08): Don’t compromise on this. Don’t accept defeat. 

Michael (00:51:30): There are cost trade-offs. Anyone who thinks about real-time vs. micro-batching vs. batching…

Alexey (00:51:08): Don't think. Use ClickHouse.

Michael (00:51:30): I’d be remiss if I were doing an interview and didn’t bring up the topic that’s on everyone’s mind. You’re doing a world tour as part of your role here at ClickHouse; San Francisco is the first stop, and in some ways you are in the epicenter of the AI revolution going on, and of the innovation and development of LLMs in particular. Two questions in terms of how AI can impact ClickHouse - the technology, the community, and the users. Where do you see AI having an impact on ClickHouse’s internals? We were just talking about the Analyzer, for instance, and Jeff Dean wrote a paper on learned index structures, which was pretty interesting. How do you see AI impacting ClickHouse’s internals now and in the future?

Alexey (00:52:28): AI is quite a wide term. We should think about it in different terms - machine learning, AI, and inside of AI there is generative AI, large language models, etc. Let’s start with classic, old school machine learning, even statistical methods, even Bayesian optimization.

We already use it inside of ClickHouse, for example, when we do dynamic dispatch for different CPU instruction sets and for different computational kernels. For example, AVX512 could be supported on the CPU but actually be less efficient than AVX2, and no one will tell you about this.

We do the following: we start by picking a computational kernel at random, collect the statistics, then use a simple Bayes formula from high school to start picking the best one a little more frequently. I would not say this is AI; it is definitely not an LLM. It is not generative AI.
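ClickHouse's actual dynamic dispatch lives in its C++ codebase; the Python sketch below only illustrates the idea Alexey describes - try each implementation a few times, keep statistics, and then sample from a simple posterior so the faster kernel is chosen more and more often while slower ones are still occasionally re-tested.

```python
# Illustration only (not ClickHouse's code) of picking the fastest of several
# equivalent "kernels" from observed runtimes, Thompson-sampling style.
import math
import random
import time
from collections import Counter

class KernelPicker:
    def __init__(self, kernels):
        self.kernels = kernels                         # name -> callable
        self.samples = {name: [] for name in kernels}  # observed runtimes (s)

    def pick(self):
        # Warm-up: force a few observations of every kernel first.
        for name, s in self.samples.items():
            if len(s) < 3:
                return name
        # Draw a plausible mean runtime for each kernel from a normal
        # approximation of its posterior; choose the smallest draw.
        def draw(s):
            n = len(s)
            mean = sum(s) / n
            var = sum((x - mean) ** 2 for x in s) / n + 1e-12
            return random.gauss(mean, math.sqrt(var / n))
        return min(self.samples, key=lambda name: draw(self.samples[name]))

    def run(self, *args):
        name = self.pick()
        start = time.perf_counter()
        result = self.kernels[name](*args)
        self.samples[name].append(time.perf_counter() - start)
        return name, result

def sum_builtin(xs):   # fast "kernel"
    return sum(xs)

def sum_loop(xs):      # slow "kernel" computing the same result
    total = 0
    for x in xs:
        total += x
    return total

picker = KernelPicker({"builtin": sum_builtin, "loop": sum_loop})
data = list(range(100_000))
chosen = Counter(picker.run(data)[0] for _ in range(200))
print(chosen)  # after warm-up, the faster kernel should dominate
```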

But generative AI still makes sense for database development. For example, what about testing the database and finding the corner cases? What about making databases accessible to people who don’t write SQL? That is quite popular, and we have these features in ClickHouse Cloud. There are also some internal algorithms that don’t affect query results - they are deterministic from the results standpoint - but there are a lot of degrees of freedom in how these algorithms might work. One example is caching. There are many caching algorithms, like LRU and others, so let’s put AI there.

Michael (00:55:01): I want to highlight what you’ve just shared: when people ask “why is ClickHouse so fast”, you’re saying that internally, when a query runs, there is dynamic optimization occurring in terms of which instruction sets and computational kernels are used. Where does that state live? If I rerun queries again and again, does it get flushed every time I restart ClickHouse? The machine learns a better approach - how does ClickHouse retain that learning between sessions?

Alexey (00:55:47): It doesn’t. It lives in memory; when ClickHouse restarts, it goes away. And it is very practical. Let me provide an example about caching algorithms. There are LRU [least recently used] algorithms; there are many. There is also one very simple algorithm: make a cache of limited size, like 1,000 entries, put entries into the cache, and when it is full, just erase everything and start from scratch. It is very practical, it has minimal overhead, it works perfectly, and you can write this code in under a minute - half a minute. This is the kind of thing we have to learn from game developers and even from competitive programmers. I’m not a competitive programmer, but I like to learn from my colleagues who have this experience.
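The "erase everything when full" cache Alexey describes really does take only a few lines; here is a sketch of it in Python (the class name and sizes are arbitrary).

```python
# The trivial cache described above: bounded size, and when it fills up,
# drop everything and start over. Minimal overhead, no eviction bookkeeping.
class ClearAllCache:
    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.entries = {}

    def get(self, key, default=None):
        return self.entries.get(key, default)

    def put(self, key, value):
        if len(self.entries) >= self.max_entries:
            self.entries.clear()  # "when it is full just erase everything"
        self.entries[key] = value

cache = ClearAllCache(max_entries=3)
for key in ["a", "b", "c", "d"]:
    cache.put(key, key.upper())
print(cache.get("a"), cache.get("d"))  # None D -- "a" was wiped when "d" arrived
```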

Michael (00:57:20): By the way, I’ve heard it said before, [about] the best algorithm to predict the weather, by Nate Silver - who covers politics and is also apparently a poker player and a statistician. He wrote a book about prediction, and he said that for a long time the best algorithm for predicting tomorrow’s weather was to take today’s weather and replicate it.

The simplest algorithms are sometimes the best.

Alexey (00:57:24): It is important to draw a bridge between these algorithms. The first algorithm is to take yesterday’s measurement to predict the next day’s weather. The second is to take several days and build an autoregressive model like ARIMA. The next is to take thousands of data points, turn them into tokens, and apply GPT models, which are also autoregressive. If you understand the connection between these points, you will also understand why it could give some marginal benefits.

Michael (00:58:07): We’re going to move to a question about your background. I wasn’t initially going to ask about this, but one thing you talked about is that in 2009 you created ClickHouse, and you didn’t want to create ClickHouse. In fact, you tried these other tools and they didn’t work. If you had told someone that’s what you were going to do, they would’ve probably said you were crazy, or if someone had told you “I have an idea, I want to create a database”, you would’ve told them that’s crazy. As the creator of a database technology, what have you found to be the most useful academic subjects you’ve studied that have relevance to the development of ClickHouse? I’m surprised how much something like machine learning - you referenced Bayesian statistics - has come into play in why ClickHouse is so performant. Maybe you could share a bit about your academic background and what you studied that, looking back, was extremely helpful in your work as the author of ClickHouse.

Alexey (00:59:44): I studied mathematics and graduated as a mathematician. Sometimes I like to see connections between different areas. One example: let’s suppose we have a hash table, and let’s suppose it is a linear probing hash table. It’s a line, an array. When you insert an item, you calculate the hash, from the hash you derive the place, and you put the item there. If the place is occupied, you take the nearby place and put it there. So there will be some spans of contiguous occupied places. The question is: what is the distribution of the lengths of these spans? Let me give you one answer that is incorrect, and maybe you can explain why it is incorrect. The incorrect answer: it can be modeled with a statistical process named the Chinese restaurant process. Do you know it? It’s very funny. Imagine a Chinese restaurant with round tables. A guy comes to the restaurant and looks at these tables; there is some number of people at each of these tables. He will select a table with probability proportional to the existing number of people at that table, and with some other probability he takes an empty table. This process models distributions such as the distribution of city populations, like cities in the United States.

Michael: Like a Zipf distribution.

Alexey: Yeah, like a Pareto distribution. And something similar happens for this hash table: there is a higher probability for an item to extend an existing chain, so it looks like it is modeled by the Chinese restaurant process. Now where is the mistake?

Michael (01:02:05): … it seems that each item should be mapped to a random location in the hash table, and if that first location is taken, it takes an adjacent one. I don’t see why a Pareto distribution would be the right one, since there’s no conglomeration effect there.

Alexey (01:02:58): If you have a small fill factor, there will be almost no such effect, but if the fill factor is 50% or above, then the chains will combine together. The mistake is that I explained this without considering that chains can also collide and merge together, and the performance will degrade.

Michael (01:03:44): Interesting, so then it depends on how populated the hash table is.
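A quick simulation makes the effect visible: at low fill factors the runs of occupied slots stay short, but around 50% and above they start colliding and merging, and the longest run grows rapidly. This is only an illustration of the discussion, not ClickHouse code.

```python
# Simulate linear probing at several fill factors and report the longest run
# of occupied slots, showing how clusters merge as the table fills up.
import random

def longest_cluster(table_size, fill_factor, seed=0):
    rng = random.Random(seed)
    occupied = [False] * table_size
    for _ in range(int(table_size * fill_factor)):
        pos = rng.randrange(table_size)   # stands in for the hash of a key
        while occupied[pos]:              # linear probing: walk to a free slot
            pos = (pos + 1) % table_size
        occupied[pos] = True
    # Rotate so the scan starts at an empty slot; runs then never wrap around.
    start = occupied.index(False)
    rotated = occupied[start:] + occupied[:start]
    best = run = 0
    for slot in rotated:
        run = run + 1 if slot else 0
        best = max(best, run)
    return best

for fill in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"fill factor {fill:.0%}: longest cluster = {longest_cluster(50_000, fill)}")
```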

Alexey (01:03:52): There are other interesting examples where you don’t need higher education in mathematics; you just have to pay attention to some details. Say you have a tuple of two integers. How do you construct a hash function for this tuple?

Michael (01:04:14): I’m just making stuff up, but treat them as prime numbers and multiply the prime numbers to create a single axis that is unique.

Alexey (01:04:29): Okay, but let me explain. I see this mistake in many code bases: they just compute a hash of each element of the tuple and XOR them. And this is absolutely incorrect.

Michael (01:04:44): Right, of course, because it’s not… it’s actually… it’s not, right? It’s not two-dimensional, yeah, it seems like…

Alexey (01:04:54): Right. If you have a dataset where the components are identical for some reason, all of those tuples will be mapped to a single value, the hash function degenerates, and you will have atrociously bad performance.
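The failure mode is easy to reproduce. The sketch below uses Python's built-in hash purely to make the point (any per-element hash combined with XOR has the same problem), and the "better" combiner is just one hash_combine-style mix chosen for illustration, not a specific ClickHouse function.

```python
# Why XOR-ing element hashes is a bad tuple hash: equal elements cancel out,
# so every (x, x) pair maps to 0, and (a, b) collides with (b, a).
def bad_tuple_hash(a, b):
    return hash(a) ^ hash(b)

def better_tuple_hash(a, b):
    # An order-sensitive hash_combine-style mix with no cancellation.
    h = hash(a) & 0xFFFFFFFFFFFFFFFF
    h ^= (hash(b) + 0x9E3779B97F4A7C15 + (h << 6) + (h >> 2)) & 0xFFFFFFFFFFFFFFFF
    return h & 0xFFFFFFFFFFFFFFFF

pairs = [(1, 1), (2, 2), (3, 3), (1, 2), (2, 1)]
for pair in pairs:
    print(pair, bad_tuple_hash(*pair), better_tuple_hash(*pair))
# bad_tuple_hash: 0 for every (x, x) pair, and the same value for (1, 2) and (2, 1)
```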

Michael (01:05:05): As I understand it - and I think I knew this already from when we met in Amsterdam - you studied mathematics as an undergraduate at your university, and given what you’ve just shared, it is clear that that background has been quite useful in the design and architecture of ClickHouse. Of all the different mathematical concepts, what would you say is the most impactful for database design?

Alexey (01:05:52): Hmm, let me think about this question because it is difficult to take one single mathematical discipline. All of them are interconnected.

Michael (01:06:08): Okay, I’m going to ask a follow-up based on my own experience. One concept that comes up all the time in computer science and has deep mathematical roots is hash functions. Why are hashes so incredibly useful in computer science generally, but in databases specifically?

Alexey (01:06:38): It’s because hashing is a bridge between randomness and determinism. In mathematics, you have probability theory, and you could say that probability theory is about non-determinism. But the fact is that every statement in it is deterministic. It works with probability distributions, and probability distributions are well defined - I would say, though it’s slightly incorrect, that they are functions. These functions are deterministic; you can combine them, analyze them, etc. What about hash functions? You use the methods from probability theory but also make everything 100% deterministic. It allows you to make data sketches, these probabilistic data structures, and at the same time these probabilistic data structures will give you the same answer every time. Hashing is the single thing that makes this possible.
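To see the point in miniature, here is a toy HyperLogLog-style distinct-count sketch: the only "randomness" comes from the hash function, so the estimate is approximate yet exactly repeatable. It is an illustration of the idea, not ClickHouse's uniq implementation.

```python
# Toy HyperLogLog-style estimator: hashing supplies the randomness, so the
# result is approximate but deterministic - the same input always gives the
# same answer. Illustration only, not ClickHouse's uniq implementation.
import hashlib
import math

def hll_estimate(values, p=12):
    m = 1 << p                        # number of registers
    registers = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)           # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard HLL bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:      # small-range correction: linear counting
        return m * math.log(m / zeros)
    return raw

data = [f"user_{i % 50_000}" for i in range(200_000)]  # 50,000 distinct values
print(round(hll_estimate(data)))  # close to 50,000, identical on every run
```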

Michael (01:08:00): I love probabilistic data structures - HyperLogLog being one of those, which I feel is still underused in the analytics community today, but maybe that will change if we can find easier ways to expose these probabilistic data structures [like] data sketches and HyperLogLog. I’m going to go back to one other topic and then we’re going to bring it to an end. We talked about AI, and it was amazing to hear you talk about the difference between machine learning, generative AI, and LLMs. One area where LLMs have clearly been widely adopted is the creation of code, namely software development. As one of the world’s top software developers, are you using Cursor or ChatGPT?

Alexey (01:09:12): Yes, I use ChatGPT, but not for C++. I use it for programming languages that are “less favorable” from my standpoint - JavaScript and Python - and it helps. But the problem is that recently I asked ChatGPT to write code in JavaScript to make signatures for requests to AWS S3, and it made five mistakes. Then I spent four hours trying to correct these mistakes. It would have been better if I had written it myself - or maybe I should have used Anthropic or something.

Michael (01:10:01): The story people tell is that at the current stage, these AI coding partners are like having a junior programmer you can delegate to. Clearly, with the junior programmer you asked to write this JavaScript, you had to check the work, and it took you four hours to check it. It’s not a senior programmer, so for the languages you know well, like C++, there’s not the same need to delegate to it.

Alexey (01:10:28): But I would like to. I would be very happy if these generative models could generate production grade code, large projects. I would use it.

Michael (01:10:41): The last question is about the craft of software development and software engineering. When I check the ClickHouse open source [repo] - and we can all look on GitHub and see who is contributing to the project - every time I’ve checked, you are there as the most prolific contributor to the codebase. How do you stay inspired and creative when you’re working on a long-term and incredibly complex project? It’s been over a decade now; you’re coming up on the 15-year anniversary of starting this journey. And I would observe that when I see you present, there is a sense of joy that comes through in your talks and your demos. I’m just curious, how have you kept that joy and inspiration for working on ClickHouse alive after 15 years?

Alexey (01:11:56): Sometimes I’m thinking about a problem and I know exactly how I should do it. Then I want to do it right now. At 3am. On Saturday. I just go and do it. 

Michael (01:12:28): I certainly have more questions - we could go deeper in a number of directions. I’m honored and thankful that you took the time to speak today as a guest on Data Talks on the Rocks, an emerging series of interviews we will be doing with other leaders, founders, and technologists. Thank you, Alexey, for joining me. I hope more folks will be inspired by your work and this discussion today to pursue their own passions in creating world-changing technologies.
