“We did something crazy: we rolled our own database.”
– Eric Tschetter, creator of Druid
Ten years ago today, the Druid data store was introduced to the world by Eric Tschetter, its creator, working at a small start-up named Metamarkets. Eric had left LinkedIn six months earlier to join us as the first full-time employee, and I was the CTO and co-founder, working in a shoebox office[1] off South Park in San Francisco. In his blog post, Introducing Druid: Real-Time Analytics at a Billion Rows Per Second, he shared the rationale for Druid’s creation:
“Here at Metamarkets we have developed a web-based analytics console that supports drill-downs and roll-ups of high dimensional data sets – comprising billions of events – in real-time. This [post introduces] Druid, the data store that powers our console. Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did something crazy: we rolled our own database. Druid is the distributed, in-memory OLAP data store that resulted.”
The initial responses from HackerNews were predictably skeptical:
- “It’s always tempting to build it yourself.”
- “They should have just used QlikView.”
- “HANA has been doing this for at least 5 years now.”
Ignoring the naysayers, Eric continued to lead the engineering team in building out Druid as the core engine for the Metamarkets platform. A year and a half later, in October 2012, we open sourced Druid in a talk we gave at the O’Reilly Strata Conference [2]. Open source Druid has now been adopted by hundreds of leading companies around the globe, notably Airbnb [3], Lyft [4], eBay [5], Netflix [6], Salesforce [7], Pinterest [8], Yahoo! [9], and Snap [10].
While most of this is public, there is one piece of history about Druid that hasn’t previously been shared. Before Metamarkets’ acquisition by Snap in 2017, I retained a few keepsakes from the early days. One of them was an email titled, quite simply, “Druid — the spec” from February 7, 2011. It is 78 lines and 553 words long. It lays out a simple proposal for Druid’s architecture, with a postscript “I’m going to take this project on as a background thread, working on it whenever there aren’t other more pressing things to deal with.”
In the subsequent eight weeks, Eric not only wrote Druid but also pushed it into production. He sunset our HBase cluster on April 4, 2011, replacing it with the first Druid service. That original Druid cluster has been continuously operational for over a decade, having processed over 100 trillion events and 100 billion queries and served thousands of end users in its lifetime.
Druid architecture — “the spec”
While I could share my own interpretations about the modest beginnings of technology innovation, often stories and source materials speak for themselves. That’s the appeal of the collections of stories in books like Revolution in the Valley and Founders at Work. I hope this story of Druid’s origins will be valuable to others out there who have a crazy idea for a new software architecture, and steel them with the confidence that their contribution could indeed make a dent in the technology universe.
Footnotes
[1] The shoebox was 300 Brannan Street, San Francisco. It was packed to the gills with startups, like many office buildings around South Park at the time. It was an unofficial, physical incubator where the rents were (then) affordable and the amenities were few. Guillermo Rauch, who went on to create Next.js and Vercel, was downstairs from us; Daniel Gross, then a 19-year-old prodigy and media darling, was working on Greplin (eventually Cue) down the hall. The entire building reeked of a sickly sweet odor from its first-floor restaurant, aptly named Ozone Thai.
[2] Beyond Hadoop: Fast Ad-Hoc Queries on Big Data. Michael Driscoll, Eric Tschetter. O’Reilly Strata Conference 2012. I remember that the night before the Strata talk, our head of sales & marketing, Eric, and I were holed up in a Manhattan hotel room rehearsing while Eric pored over the code base to clear it for public release.
[3] How Superset and Druid Power Real-Time Analytics at AirBnB. (June 2017) by Maxime Beauchemin, YouTube.
[4] Streaming SQL and Druid at Lyft. (August 2018) by Arup Malakar, YouTube.
[5] Monitoring at eBay with Druid (May 2019). Mohan Garadi, eBay Blog.
[6] How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience (March 2020). Ben Sykes, Netflix Tech Blog.
[7] Salesforce – Delivering High-Quality Insights Interactively Using Apache Druid at Salesforce (2020). Dun Lu, Salesforce Engineering Blog.
[8] Powering Pinterest ad analytics with Apache Druid (Jan 2020). Filip Jaros and Weihong Wang, Pinterest Engineering Blog.
[9] Yahoo Casts Real-Time OLAP Queries with Druid (Aug 2015). Datanami.
[10] Data analytics and processing at Snap (Sep 2018). Charles Allen, Slideshare.
Apache, Apache Druid, Druid, the Apache Druid logo, Apache Superset, and Superset are registered trademarks or trademarks of the Apache Software Foundation.