Apache Druid is an open source data store designed for high performance (sub-second) OLAP queries on large (terabyte) datasets. It is most commonly used for operational analytics use cases, where quick decisions must be made on data that is being streamed in.
Common Druid use cases include clickstream analytics, anomaly detection, network monitoring, customer behavior analysis and digital marketing, but it is applicable in any environment where you need real-time data ingestion, fast aggregation, and low-latency queries. In today’s competitive business landscape, analytics is core to understanding both user and product behavior, and providing fast decision making on real-time data is an increasingly critical component in your data strategy.
Apache Druid’s price performance advantage
Druid’s query performance is typically at least 10X faster than common second-generation cloud warehouses such as Snowflake and BigQuery. But what about the cost? With Druid, the more queries you run, the better your price performance will be. If you run fewer than 1,000 queries per month, the cost of that speed may be comparable to other data warehouse solutions, but at 10 million queries a month, your cost will be a fraction (less than 1%) of what you would pay on those solutions. For time series analysis on large streamed datasets, this price performance advantage lets you do far more analysis on the same budget than you could with a non-operational database such as Snowflake, BigQuery, or Redshift.
Why is Druid so fast?
Druid’s powerful performance comes from an architecture that leverages ideas from data warehouses, time series databases, and search systems. Key characteristics from each of these architectures are brought together to create a highly performant, scalable, and self-healing database that supports high ingestion rates and low-latency queries.
- Column-oriented storage means it only loads the columns requested for a particular query.
- Data is distributed across potentially hundreds of processing units and organized in a manner that allows it to be quickly accessed and aggregated. The system is scalable and can be expanded on demand to handle larger and larger amounts of data with no decrease in speed.
- Recently ingested data is kept highly accessible, in memory, for fast access. Historical data is organized by time across the distributed nodes in a manner that allows fast time-based access to the data.
- Indexing is used to ensure fast filtering and searching across columns.
- Approximation algorithms are integrated to generate counts and distinct counts substantially faster than exact computations. If your use case only requires approximate counts (within 3-5% of exact counts), these approximation algorithms can provide an additional 10X speedup.
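To make the first point in the list concrete, here is a minimal sketch of the difference between a row layout and a column layout. It is a toy illustration, not Druid's segment format; the field names and the `clicks_for_country` helper are made up for the example.

```python
# Row-oriented layout: each record keeps all of its fields together,
# so a scan has to walk every field of every record.
rows = [
    {"ts": "2024-01-01T00:00", "country": "US", "page": "/home", "clicks": 3},
    {"ts": "2024-01-01T00:01", "country": "DE", "page": "/docs", "clicks": 1},
    {"ts": "2024-01-01T00:02", "country": "US", "page": "/docs", "clicks": 5},
]

# Column-oriented layout: one array per field. A query can read just the
# arrays it needs and ignore the rest.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def clicks_for_country(cols, country):
    """Sum clicks for one country, touching only two of the four columns."""
    return sum(
        clicks
        for value, clicks in zip(cols["country"], cols["clicks"])
        if value == country
    )


us_clicks = clicks_for_country(columns, "US")
print(us_clicks)
```

The `ts` and `page` columns are never read by this query, which is the saving a columnar store is after when tables have many columns and queries touch only a few.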
By leveraging these key features from data warehouses, time series databases, and search systems, Druid provides a highly cost effective and scalable solution to time series analysis and aggregation of very large scale data.
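The approximation idea is easiest to see in code. The sketch below estimates a distinct count with a K-minimum-values estimator, which is a simple stand-in chosen for readability and is not the algorithm Druid ships. The function names and the default `k` are assumptions for illustration, and accuracy depends on the `k` you pick.

```python
import hashlib
import heapq


def _hash01(value: str) -> float:
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    heap = []        # max-heap (stored negated) of the k smallest hashes seen
    members = set()  # the same hashes, for fast duplicate checks
    for value in values:
        h = _hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    kth_smallest = -heap[0]
    # If n distinct hashes are spread uniformly over [0, 1), the k-th smallest
    # sits near k / n, so n is roughly (k - 1) / kth_smallest.
    return int((k - 1) / kth_smallest)


user_ids = [f"user-{i % 50000}" for i in range(200000)]
print(approx_distinct(user_ids))
```

The estimator keeps at most `k` hashes no matter how many events stream through, which is why this family of techniques can be much cheaper than an exact distinct count that must remember every value.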
Druid configuration — it ain’t easy!
But nothing comes for free and, despite its name, Druid cannot perform magic! With open source Apache Druid, you download the Druid software, create your own data cluster, and then tune it based on your needs. Achieving performance is easy, but achieving price performance requires an intimate understanding of your performance needs and the ability to scale your cluster up and down based on ingestion and query demand.
Cluster tuning for price performance involves:
- Managing the transfer of historical data as it moves from the “recent” bucket, and configuring the memory, cache size, and number of historicals.
- Sizing the memory and cache of time-based segments, and optimizing segment size for maximum performance on historicals.
- Managing process threads and buffer configuration to efficiently compute query results. The thread pool limits concurrent queries, and misconfiguration or overloading can result in poor query performance.
- Configuring Brokers, which route queries, and Middle Managers, which forward tasks to JVMs. Middle Managers are particularly vulnerable to memory-related crashes and may require restarts.
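As a taste of the arithmetic involved in the thread and buffer item, the sketch below estimates the direct memory a Historical process needs from its processing settings. The formula and the starting values (one thread fewer than the core count, a 500 MiB buffer, roughly one merge buffer per four threads) follow our reading of the Druid basic cluster tuning guide, so verify them against the guide for your Druid version. The function names are hypothetical helpers, not part of Druid.

```python
MIB = 1024**2


def historical_direct_memory_bytes(num_threads, num_merge_buffers, buffer_size_bytes):
    """Direct memory estimate: (threads + merge buffers + 1) * buffer size."""
    return (num_threads + num_merge_buffers + 1) * buffer_size_bytes


def suggest_processing_config(cores, buffer_size_bytes=500 * MIB):
    """Derive starting-point processing settings from the core count.

    The ratios used here are assumptions taken as starting points only;
    real clusters need tuning against their actual query load.
    """
    num_threads = max(1, cores - 1)
    num_merge_buffers = max(2, num_threads // 4)
    return {
        "druid.processing.numThreads": num_threads,
        "druid.processing.numMergeBuffers": num_merge_buffers,
        "druid.processing.buffer.sizeBytes": buffer_size_bytes,
        # Not a Druid property: the direct memory the JVM must be allowed.
        "direct_memory_bytes": historical_direct_memory_bytes(
            num_threads, num_merge_buffers, buffer_size_bytes
        ),
    }


config = suggest_processing_config(cores=16)
for key, value in config.items():
    print(f"{key} = {value}")
```

Even this one calculation has to be redone whenever you change instance sizes, and it covers only one process type, which gives a sense of how much tuning surface a self-managed cluster has.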
Apache Druid is open source, and detailed configuration information is available at the Apache site. You’ll find the above information and much, much more at https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html.
The obvious question is: do you want to manage your own Druid cluster? Many companies have gone this route and it is feasible with a dedicated DevOps team. If you do have DevOps resources to spare, managing your own Druid cluster may be the right approach, but for most companies, that overhead is a burden.
Alternatively, many companies offload some of the complexity by using a service provider to manage their cluster. Druid service providers offer professional services to manage your cluster for you, or provide tools that ease the burden of configuring, monitoring, and healing your clusters.
Rill to the rescue
If you want the price performance of Apache Druid without the DevOps or maintenance overhead, Rill is the perfect solution for you. Our fully managed SaaS offering leverages Kubernetes auto-scaling to remove the burden of configuration. Your team simply logs into Rill to access your Druid cluster. All scaling and configuration is performed by Rill and the cluster is dynamically adjusted based on your team’s ingestion and query needs. User and group management and enterprise level security allow you to share your analytics with appropriate team members or customers. Ultimately you and your team can focus on your core business values rather than configuring servers.
Rill supports interoperability, both with your ETL tools on the ingestion side and with your data visualization tools on the front end. Ingest data from Kafka or from common data warehouses, and visualize it using Tableau, Looker, or your favorite data visualization tool.
At Rill our goal is to encapsulate the complexity of Druid behind the curtain of a seamless and secure cloud service. If you are looking to access mission-critical operational analytics capabilities with a fast time to value, give Rill a try. We are happy to give you a hand bringing your data into Rill and Druid and getting you the fast access your business use cases demand.