Data Talks on the Rocks is a series of interviews with thought leaders and founders discussing the latest trends in data and analytics.
Data Talks on the Rocks 1 features:
- Edo Liberty, founder & CEO of Pinecone
- Erik Bernhardsson, founder & CEO of Modal Labs
- Katrin Ribant, founder & CEO of Ask-Y, and co-founder of Datorama
Data Talks on the Rocks 2 features:
- Guillermo Rauch, founder & CEO of Vercel
- Ryan Blue, founder & CEO of Tabular, which was recently acquired by Databricks
Data Talks on the Rocks 4 features:
- Alexey Milovidov, co-founder & CTO of ClickHouse
Data Talks on the Rocks 5 features:
- Hannes Mühleisen, creator of DuckDB
Data Talks on the Rocks 6 features:
- Simon Späti, technical author & data engineer
On June 10, 2024, we held a fireside chat with one of the data world's top innovators, Lloyd Tabb. While thousands of data professionals gathered nearby for Snowflake Summit and Data & AI Summit, we were thrilled to have such a great turnout of industry leaders and founders at Stable Cafe.
Below is the video and full transcript of the event where you will hear about:
- Lloyd's journey building multiple versions of tools that help people look at and understand data, and how that has shaped Malloy.
- Lloyd's bets on the shifting data ecosystem over the years - SQL, DuckDB, AI, separation of compute and storage.
- Lloyd's advice on how to get data tools right and the key thing people are missing.
Michael:
It seems like every tool that you've built since working on dBase has had an L in it in some way, shape, or form. LiveOps, LiveWire, Looker, and then the double L in Malloy. Tell me a little bit about these L tools.
Lloyd (00:27)
LiveOps was this amazing company, one of the very first gig economy companies. I was at Netscape and we met this team called NewHoo, which was building a crowdsourced replacement for the Yahoo directory. I don't know if you remember the Yahoo directory, but it was human edited, and they had engaged 50,000 people to work on this directory, and I got very excited about that. So we brought them into the Mozilla network and we renamed it the Open Directory. I ended up working with those guys and then just saw the power of putting people to work on the Internet.
After Netscape, I taught middle school for a while and did some philanthropic work, and then a friend of mine was talking about a company that he was working on. He was doing basically a Twilio company, and he met a company in Florida that was doing a home-based telephone operator thing. So anyway, LiveOps ended up putting 30,000 telephone operators to work, answering phone calls on the Internet. People asked for the data and we needed tooling to be able to figure out how to manage all these people. There were lots of questions - who are they? Are they doing well on the calls? What's the revenue like? Is this media buy working for our clients? It's a huge data problem and we wanted everybody in the company to be able to understand what's going on with the data. So I wrote L Tool, which was LiveOps tools, so that everybody could see what was going on in the data.
Michael (02:05)
In particular, what was that tool? Was it a browser based tool? Was it a CLI that people could use?
Lloyd (02:12)
It was a semantic modeling language. So it was actually semantic data structures that described all of the aggregate calculations and the dimensional calculations in code. It was actually written in Perl, and then it had a web-based interface so that anybody could write queries by clicking on dimensions, measures, filters, and sorting. It automatically drilled - we discovered the drilling pattern - so everybody in the company could actually understand what was happening in all of the call volume data, could drill in, and could understand and do the forecasting.
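To make "dimensions and measures in code" concrete: the original L Tool was written in Perl, but a minimal sketch of the same idea in present-day Malloy syntax might look like this (the table and every field name here are made up for illustration):

```malloy
// Hypothetical semantic model for call-center data; illustrative only.
source: calls is duckdb.table('calls.parquet') extend {
  // A dimensional calculation, defined once in the model
  dimension: is_long_call is duration_minutes > 30

  // Aggregate calculations, also defined once and reused by every query
  measure:
    call_count is count()
    total_revenue is revenue.sum()
    avg_duration is duration_minutes.avg()
}

// A UI can then let anyone combine these by clicking:
run: calls -> {
  group_by: is_long_call
  aggregate: call_count, total_revenue
}
```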
Michael (02:45)
Can you tell me why my customer service calls always take 37 minutes to get something very simple?
Lloyd (02:54)
So customer service is really hard, but orders are really easy. So all of our customers were 800 numbers that were selling things at the time. So we could judge those. And it turns out that in that case, a long call is not a bad call.
Michael (03:11)
Right. Correlated to the spend.
Lloyd (03:15)
It's not. Sometimes the short call works, but it's not correlated. It's not like it's right or wrong.
Michael (03:23)
It's uncorrelated or orthogonal. Okay, so this L tool taught some of the patterns that I think clearly inspired your later work. What was the back end for it?
Lloyd (03:35)
So at LiveOps it was MySQL. We worked against replicas - basically the way we would architect these things is you'd have a transactional server and then a replica, and you could get real-time answers to what was going on in the data, only seconds behind. A lot of times with reporting it's delayed, and latency means that you can't really use it - like you can't do self-driving cars if there's high latency, right? So there are a lot of things that don't work. It's funny we don't often talk about latency in data with reporting, but low latency reporting is super important, especially if you're going to drive it or try to run your business by it. So it's relatively easy to extract data and do transforms and stuff, but building low latency reporting is complicated and it's hard.
Michael (04:24)
So now let's take the next step, the next L tool that came out of this work was the actual Looker tool.
Lloyd (04:33)
Actually there were two more after that.
Michael (04:36)
So let's talk about these next two tools.
Lloyd (04:40)
The L tool happened again at a company called Luminate. We were an ad tech company, basically trying to do ad placement and Instagram-like things. It was pre-influencer. We were trying to figure that market out. We didn't figure it out. It eventually got acquired by Yahoo!, but not for a great thing, you know, for an acquihire kind of thing. But we built L Tool there, and then I did it at another company that was trying to do staffing and crowdsourcing, but that didn't work at all. But I wrote it then too. Then after that one, I met my co-founder, Ben Porterfield, and I said, “Hey Ben, I built this thing three times. I know every company that can't see their customer directly needs this tool.” And so I sat down and started writing it, and Ben and I rolled forward, and eight years later we were acquired by Google.
Michael (05:45)
So you and Ben started Looker in 2011?
Lloyd (05:51)
We started in 2011. We didn't incorporate until January of 2012. We actually had our first customer before we bothered to incorporate. We incorporated to take the check.
Michael (06:19)
One of the biggest decisions that I remember when first seeing Looker was you made a bet on cloud data warehouses or cloud databases, and of course, today that seems like a no brainer. But tell me a little bit about that architectural choice to plug in to the cloud or whether that was not the case initially, and it was an evolution.
Lloyd (06:45)
Actually, the bet was SQL. So remember, at this time it's all about Hadoop and processes written in MapReduce, and that's the way people are building things, because supposedly SQL can't handle data that large. And we're like, no, you're wrong, it should be SQL. So we started, and we originally connected to Postgres and MySQL. And as soon as we saw Redshift, we said, “wow, look at this, we win! It can go 100 times faster than Postgres”.
Michael (07:18)
Did the massive expansion of Redshift propel Looker forward because it led to people looking for tools that worked?
Lloyd (07:29)
Yes, it was relatively easy to set up. You can show value almost immediately. People were finding Redshift without us and then not being able to manage it very well. That’s when we could come in. So the way we sold Looker was we would never ask for money upfront. We would just say, let us hook up Looker to your data. Let us show you what your future can be like. We would go into a trial. Our focus was on workflows.
Our focus was to make the data developer as efficient as you can possibly make them and turn them into the hero in the company. If we can make the data person successful in the organization and actually give sight to the rest of the organization there, that person's career is going to change and they're going to pay whatever we ask for. So Looker is notoriously expensive, right? The reason is that we were doing value based pricing. We would walk into a company, in a couple of hours, we would set it up, connect to whatever their main transactional table was, then ask them what they wanted to know. Then in the sales call being able to show them the answers to their own data, that would just turn everybody on. They would go, “wow, I get it.” Then we would let it run for a while and then people would start sending URLs around with, “look at this and look at that and look at that”. We would have engagement metrics on that and as soon as we had five or six people doing this, we would go, okay, now it's time to pay the price.
Michael (09:13)
Many of the challenges that you ran into at Looker over time inspired this next phase. Tell me a little bit about the evolution for how we got from Looker to now.
Lloyd (09:31)
Looker is designed to create a visual interface to data - real-time transactional stuff or analytical stuff - so you can understand it. It changes the relationship with data. The normal relationship is that an engineer produces a data pipeline, a data analyst takes the output from the pipeline and produces reports, and the consumer looks at the reports, right? Then they have another question, it goes back to the analyst, they get in line, and the analyst reworks the reports. Looker changes that relationship: the analyst builds the semantic model and can just send URLs to people, and the people can self-explore the results. So that's all visual, right? That's all in the UI.
So data scientists didn't like Looker very much, and data engineers don't like Looker very much, because the visual is the end state of the data. It's great if you're looking for something or you're trying to do an investigation, but we hadn't worked our way down the stack. Everybody is still transforming their data with SQL. Data scientists are using SQL to do transformation. How can we address them?
So we realized that the semantic model is right - dimensions and measures. But Looker stored its reports in databases, which was a huge mistake. You want those in the semantic model. We spent all this time trying to figure out how to validate that the fields referenced in the reports were still in the semantic model, and when the semantic model changes, the reports break. So in Malloy we put all of the reports in the semantic model itself, in terms of views as building blocks. We now have a semantic model for the dimensions, measures, joins - the building blocks. Then we have a very simplified language for writing very complicated queries. The semantic model describes your data, and then transformation is really simple.
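As a rough sketch of "reports live in the semantic model", in recent Malloy syntax (the table and fields below are hypothetical): a report is just a named `view` declared alongside the dimensions and measures it uses, so it cannot silently drift out of sync with them.

```malloy
// Illustrative model; the table and field names are assumptions.
source: flights is duckdb.table('flights.parquet') extend {
  dimension: is_long_haul is distance > 1000
  measure: flight_count is count()

  // The "report" itself lives in the model as a view,
  // built only from the building blocks declared above.
  view: by_carrier is {
    group_by: carrier
    aggregate: flight_count
  }
}

run: flights -> by_carrier
```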
Michael (11:40)
What's the macro trend? You talked about when Redshift came out you thought, okay, we win. It is clear running MapReduce on Hadoop clusters is not the way that companies are going to make sense of their data. What do you feel is the kind of macro trend that Malloy might ride that makes you say, okay Malloy wins?
Lloyd (12:05)
There are things that we haven't done yet that we want to do. We’re still experimental, we're still early.
The data is everywhere and federating data is really hard, and every SQL engine has very different semantics. To do anything besides a simple group-by aggregate is hard. Arrays in every single dialect do not look at all like one another; un-nesting doesn't look the same anywhere either. So the basic window functions and operations on a normal rectangle are the same, but everything else is different. The engines are all coming to the same conclusions, because they're all building the same features - they're just all building them very differently. So we think there's an opportunity for Malloy to unify that.
Michael (13:16)
Now with Malloy supporting all these different dialects, tell us a little bit about how you're going to manage to talk to Snowflake and Redshift and others?
Lloyd (13:27)
Malloy already does this. You can take a Malloy model and just change the database it references, and it will run the same code on all of them. There's a core Malloy library that's guaranteed to execute the same on all of these dialects, and that includes the array aggregation and the nesting stuff. The complicated features are all uniform in Malloy, but you can still access the individual database-specific functions if you want.
So there are two bets. If I walk down the street, everybody will tell me that I'm crazy and that nobody's going to give up on SQL. I hear that all the time, right? Nobody wants to learn a new language. SQLGlot is that bet - the bet that the dialect differences don't really matter and people are never going to move away from SQL. SQLGlot is the translator for that, there are a lot of people making that bet, and it's an interesting strategy. But maybe the world will move forward, and I'm an optimist. If I can make you 10X more productive with your data, maybe you'll do it. That's my hope, and that's my bet in moving forward.
I want to also say something else - why I spell my name in lowercase. I do that because I'm always part of a team. Everything that I've done in my career, I've done with a team of people, and everything that I think has not necessarily come out of my head; it comes from the team thinking about things together. I may just be the collector of it, but I just wanted to call that out.
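A minimal sketch of that portability, with hypothetical connection names and tables (how connections are named depends on your Malloy setup): the query stays untouched; only the source line changes.

```malloy
// Identical model and query for both engines; only the source differs.
source: orders is duckdb.table('orders.parquet') extend {
  measure:
    order_count is count()
    total_revenue is revenue.sum()
}
// To target a warehouse instead, swap just the table reference, e.g.:
// source: orders is bigquery.table('shop.analytics.orders') extend { ... }

run: orders -> {
  group_by: status
  aggregate:
    order_count
    total_revenue
}
```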
Michael (15:29)
Well, it's certainly something that I think when we talk to people who've been at Looker, one of the things that always strikes me is the culture and the people. There's a lot of heart in the folks that I think have chosen to work with you and you've chosen to work with. I think actually one of the things I've always loved about the data community is the people. It seems to draw a certain style of folks, maybe it’s empathy for all of the data munging we’ve all done. You start to have a lot of empathy for other members of your tribe.
Lloyd (16:20)
Data is such a bespoke art. No two companies do data the same. It's like everything's custom tailored to the situation. Everybody is assembling out of the cloth that they find in different places. No two stacks look alike. It's kind of nuts that we don't all wear the same thing, it's like everybody puts together their own data outfit and it's wild actually.
Michael (16:49)
Well, I think a lot of people think about data stacks like one data stack should be the same as another, but, you know, every factory is different. So working with data has almost the same entropy as working with atoms: when you're a manufacturer, whether you're making iPhones or making clothes, it's a very different factory. So I'm going to now push a little further on Malloy and some of the bets that I see as a member of the Malloy community and as someone who's dabbled in it. One of the bets I've seen is on the other side of the continuum. You talked about Redshift, saying, “gosh, we started out with Postgres and MySQL replicas, and then Redshift came along and we're running at scale”. One of the bets I see for Malloy is DuckDB, and DuckDB is the opposite of Redshift. It's small and lightweight and embeddable and in-process. Maybe explain a little bit why I see so much gravitation this way.
Lloyd (17:48)
When I started working on computers, I had 640K of memory. My laptop right now, with DuckDB running on it, is faster than a Redshift cluster - that's why DuckDB is so important.
I can link it into my application. I can link it to Malloy in VS Code, and it's faster than Redshift was. And it can handle data sets - it can't handle Meta-scale data sets, right, but it can handle data sets that are large for most enterprises. I really feel like they've done an amazing job with the SQL dialect. I became close to the team, and early on we contracted with them to do Malloy features for us, and they're just great to work with. So I think it's really amazing.
Michael (18:53)
Okay, the topic that a lot of people are here for - the one folks on a Moscone stage somewhere in the urban jungle a few blocks away are talking about - is AI. You're now a part of a company that is obviously making some multibillion dollar bets in that space, and you're part of the data visionaries there. Maybe without going into what Meta's plans for AI are, I would ask you two questions as a long-time practitioner who created some of the tools that many data folks use. How do you think AI might change the way that data developers and data practitioners work? And maybe the other version of it is: what won't change?
Lloyd (19:46)
I think AI is pretty good at doing simple translation. At Looker, I would go around and, for everybody that started at the company, I would do a “teach you how to use Looker” lesson. We had this airline data set that I've been using since the beginning of time, and I would say, okay, which carrier has the most flights to JFK? How do you answer that question? What I was trying to do was teach them to translate that into UI clicks. It's relatively easy to do. So that level of question answering, I think, is relatively easy. We'll get there, and the semantic model makes it much better because you have dimensional freedom - you can pick whatever dimension you want, you can take whatever measure you want, and the measures are named. So the semantic model makes this super easy to do. Then the filtering part is a little harder, because you need to do some kind of inspection into the dataset to see what the interesting dimensions are. So Malloy is really good at this, or could be really good at this. There's a bunch of people who are using Malloy for this, which I'm happy about.
I think the harder questions are: I have this data warehouse with all these tables in it, do something for me with the tables? Good luck with that. I think that's a pipe dream. We walk into a data warehouse and there are 10,000 tables, half of them have the same columns, some columns repeated across different tables, etc. What makes sense out of that? Good luck.
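That "translate the question into clicks" exercise maps one-to-one onto a query. A sketch in Malloy, assuming a hypothetical flights model with a `flight_count` measure (illustrative names, not the actual training data set):

```malloy
// "Which carrier has the most flights to JFK?" - illustrative names only.
source: flights is duckdb.table('flights.parquet') extend {
  measure: flight_count is count()
}

run: flights -> {
  where: destination = 'JFK'   // the filtering step
  group_by: carrier            // the chosen dimension
  aggregate: flight_count      // the named measure
  order_by: flight_count desc
  limit: 1
}
```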
Michael (21:29)
Right. But I think if we infuse a little bit of semantic modeling on those thousands of tables first, then we could do a lot after that. What are some of the emerging use cases you’ve seen or what are you most surprised by when you see Malloy in the wild?
Lloyd (22:00)
There's a professor at the University of Washington who has been teaching his business students Malloy instead of SQL. They learn basic SQL, they learn Malloy, and he's having a really successful class with it, which is the most important thing for me. Is it learnable? Can somebody be productive with this tool? That's what we're working on right now. We're on Malloy 4.0 because we've thrown the language away three times. It's hard to figure out how to get it so that people can understand it, operate on it, and build up familiarity. So the fact that it's learnable is the most important and surprising and hopeful thing. Then the semantic modeling thing has always been really exciting, too - that there are people who look at it for a long time and say this is the way we should solve this.
Michael (23:04)
You said you had a stint as a teacher at one point in your career, what are some of the lessons that you’ve taken from that experience in education to translate into your role as an entrepreneur?
Lloyd (23:21)
Teaching middle school shaped Looker's culture, clearly. I worked at this great private school in Santa Cruz. I ran an after-school computer club, and there was a woman named Cindy Zimmerman, who I loved, who was not that technical, but she was really good with the kids. She ran a computer club where you had to make something. The rule at the computer club was, if somebody asked you a question, you stopped and answered it, because that was just the rule. That was the way that it worked. And you could never make anybody feel bad about asking a question. That created an amazing learning environment for these kids, and they were doing amazing things that Cindy didn't know how to do. When I was there, the kids would help each other, and then questions would eventually get to me, and sometimes I didn't know the answer either. A lot of those kids learned to program and did all kinds of crazy things. That was around 2005. In 2011, when we started Looker, I started hiring some of those kids to work. It was totally crazy, right? And in fact, the person who started at Meta today is one of those kids, someone I've known since kindergarten.
Michael (24:55)
Wow. That's the long game.
Lloyd (24:57)
That's the long game. So I think that not knowing why you're doing something, but doing it anyway, and helping people is always a great thing to do.
Michael (25:15)
What a great story for anyone who's thinking about making their hiring pipeline a little healthier. Go start an after-school computer club and you'll have amazing candidates…just wait 17 years. I know that a lot of folks out there are trying to build what has been variously called the metrics layer or the semantic layer. You're such a positive guy, so I'm sure you'll find a positive way to state it, but what you said when we were hanging out earlier was that a lot of people underestimate just how hard it is to get it right. And especially in technology, a lot of young technologists will come in and re-discover things. So, now that you're on your fourth version of Malloy and your eighth version of tools that help people look at, understand, and interface with data, what are some of the things that you feel like you're getting right this time, but that maybe you got wrong before and that others may also not be getting right?
Lloyd (26:44)
So the core is that it's a developer tool. It's a programming language. You have to treat data people like developers. That's first and foremost, and we got that right in Looker. I think some of the other layers have not gotten that right. You need to build the tooling around the language in order to get anywhere - making sure that the development experience is fast, and that the cycle you're thinking about is for the native developer, so that they can be productive in your world. It's the first thing to get right. It's why Malloy runs as a VS Code extension today; it's where it lives. We keep trying to make that better.
The other thing in the semantic layer is that if you're just dealing with a rectangle, you're not going to add enough value. Everybody knows how to work in a rectangle in data - it's group-by aggregate. We learned this in kindergarten. It's the first thing you learn. How many coins are in the pile? Let's separate them into two piles, nickels and pennies. Filter out the pennies. Okay, now how much is this pile? It's simple aggregation, and it turns out that's all we ever do with data.
But the hard thing is that when we start joining in data, everything gets really complicated. All of the tools try to solve this problem differently. Materializing tables and then joining the materialized tables is one solution. But the thing that Looker solved that no one had solved before was that join relationships don't affect aggregate calculations. That gives you the ability to just pick a dimension from anywhere in this graph and pick a measure from anywhere in the graph, and it will produce the correct result. That was an innovation. You can pick as many dimensions as you want, as many measures as you want - it's always going to do the right thing and always going to produce a good rectangle.
What I learned from Malloy is that data exists in a graph, and SQL models are in a flat space. You join it into a matrix and you've lost the graph. Malloy doesn't lose the graph: you join the graph, you've got the graph. And it turns out that you also don't want to produce a table; you want to produce a graph. So Malloy also produces the graph as output. That is the thing we learned. The way we understand data is we dimensionalize by something, we measure something, and we filter. Those are the tools that we have, and putting it in the graph lets you do all of that simultaneously.
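A hedged sketch of both ideas in Malloy (every table, key, and field below is hypothetical): the join is declared once in the graph, measures aggregate correctly from anywhere in it despite join fan-out, and `nest:` makes the output itself a graph instead of a flat rectangle.

```malloy
// Illustrative only; table and field names are assumptions.
source: carriers is duckdb.table('carriers.parquet') extend {
  primary_key: code
}

source: flights is duckdb.table('flights.parquet') extend {
  join_one: carriers with carrier    // the graph, declared once
  measure: flight_count is count()

  view: carrier_summary is {
    group_by: carriers.nickname      // dimension from anywhere in the graph
    aggregate: flight_count          // stays correct despite the join
    nest: top_destinations is {      // output is a graph, not a table
      group_by: destination
      aggregate: flight_count
      limit: 3
    }
  }
}

run: flights -> carrier_summary
```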
Michael (29:39)
In the way that tools influence each other - tools influence databases, and databases influence data collection and ETL - I think I saw you write or speak about how the most common form of data these days is JSON, right? But all of the ETL tools out there, all of the ingestion architectures and pipelines, essentially flatten those JSON events out and put them into a bunch of rectangles. Now Malloy comes along and says, okay, people write joins and then figure out the network graph and maintain it, but wouldn't it be better just to leave the data in its original form instead of pulling all these rectangles out, flattening them, and then having to manage this and that and that?
Lloyd (30:30)
Yeah, that’s exactly it. BigQuery did this well, right? Originally BigQuery didn't have joins. They just pulled in the nested records, wrote them out in a column store, and then they could query them. It took them years to figure out how to get the SQL right so that you could do this. It's not an easy problem. And it turns out that un-nesting these structures is the way that you want to do it, or you want to be able to correlate subqueries into the array. They got it right. Again, this is one of those advanced SQL things - you can do it in all the other dialects, but it's very difficult. You want to be able to run these column stores against the data without moving it or transforming it.
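This is what Malloy leans into. A sketch under assumed names (a hypothetical `orders.parquet` whose `items` column is a repeated record, the shape JSON events naturally land in): the nested array is queried in place, with no UNNEST and no flattening pipeline.

```malloy
// Illustrative: assumes 'items' is a repeated record column,
// e.g. items: [{product, price}, ...] - the shape raw JSON events take.
source: orders is duckdb.table('orders.parquet') extend {
  measure: order_count is count()
}

run: orders -> {
  group_by: items.product          // reaches into the nested array directly
  aggregate:
    order_count                    // counts orders, not exploded item rows
    item_revenue is items.price.sum()
  limit: 5
}
```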
Michael (31:19)
One last question, then I'm going to open it up to the audience. We saw an announcement that Tabular - not to be confused with CSV data, a much more valuable business that would be bought for $10 billion next week - the folks behind the Apache Iceberg project, got acquired for over $1 billion by Databricks. In general we're seeing this shift towards object storage, and DuckDB in some ways represents the ability to not even have to run that Redshift cluster. So I would ask: what are you seeing in terms of Malloy and its relationship with this shifting landscape of the layers below Malloy, namely object stores, and whether Parquet files will end up being the future of data storage rather than cloud data warehouses?
Lloyd (32:33)
The separation of compute and storage is actually really interesting. BigQuery did that a long time ago. It allows you to take large compute clusters and point them at data wherever it happens to be sitting, and where it's sitting can be in different formats. I think we're there, actually. One thing we haven't talked about is data privacy. Privacy and access control is tough. So while data can move around all these places, you also have to secure it in all these places. So again, it's going to be bespoke; I don't think it's going to be one-size-fits-all. I think you're going to have data that you absolutely need to have in the warehouse, secured in a way that has very tight access controls on who can see it. And then I think you're going to have other data that isn't as important to secure, that may flow around in a different way.
Michael (33:33)
I think a tool like Malloy is key so that you can bring in those data environments in a safe and secure way.
Lloyd (33:44)
The footprint of Malloy is designed so that one Malloy query is always one SQL query. So the goal here is that it can sit anywhere that SQL sits. Instead of having to write the select statement, you write the Malloy statement there, and you write it more efficiently. So we’re hoping to be everywhere. That’s our goal. That’s our design point. I don’t know if we’ll get there, but we’re going to try. And if we don’t get there, I hope the ideas live on.