Data Talks on the Rocks is a series of interviews from thought leaders and founders discussing the latest trends in data and analytics.
Data Talks on the Rocks 1 features:
- Edo Liberty, founder & CEO of Pinecone
- Erik Bernhardsson, founder & CEO of Modal Labs
- Katrin Ribant, founder & CEO of Ask-Y, and co-founder of Datorama
Data Talks on the Rocks 2 features:
- Guillermo Rauch, founder & CEO of Vercel
- Ryan Blue, co-founder & CEO of Tabular, which was recently acquired by Databricks
Data Talks on the Rocks 3 features:
- Lloyd Tabb, creator of Malloy and founder of Looker
Data Talks on the Rocks 4 features:
- Alexey Milovidov, co-founder & CTO of ClickHouse
Data Talks on the Rocks 5 features:
- Hannes Mühleisen, creator of DuckDB
Recently, I've been collaborating with Simon Späti on a number of essays. For the sixth round of Data Talks on the Rocks, I interview Simon and we dive deep into the following topics.
- The journey from being a data engineer to technical author.
- The hype behind Bluesky - beyond the growing community, it has rich data that is open and available.
- The job to be done for data folks - are we analysts? are we data developers? are we data engineers?
- The latest trends in the data space - object storage, schema evolution, data modeling, and declarative data stacks.
Check out the video interview and full transcript below.
Michael Driscoll
Welcome everybody to Data Talks on the Rocks. I'm excited today to have as my guest Simon Späti. Simon, welcome to the Data Talks on the Rocks podcast.
Simon Späti (00:00:12)
Thank you so much for having me. Yeah, I'm looking forward to our talk.
Michael Driscoll (00:00:17)
I thought I would, in my own words, introduce you and say a few words about your background, and then we'll get right into some of the topics of discussion. You and I have been following each other on the Internet now for quite a while, as two folks very excited and interested in the data world. You've been a data engineer, a technical writer, a blogger. I've been following your blog for quite a while. And most recently, we've been working together on a number of essays on interesting topics in the data space. You're also an author of a book called Data Engineering Design Patterns, which is available. We'll certainly share links after this, and in the transcript for folks who are interested in learning more.
But before we dive into some of the contemporary topics in the data space, I just want to start with a little bit about your background in your own words, how you got into data. It's been a couple of decades. You've been working in various roles at various companies, and now as a writer, freelancer, and author. But yeah, tell me a bit about your journey into this data space, how you got here.
Simon Späti (00:01:37)
Yeah, I have a bit of an unorthodox life path. I started in 2003 with an apprenticeship as a computer scientist. Here in Switzerland we have this dual system where you can actually work and study at the same time, so that gave me a head start. I had four years of apprenticeship where I was already doing actual work. There I worked with Oracle and SAP, I was in the service desk, and I really got into all the different parts. At the end I basically chose between SAP and Oracle databases, because both were database related and that was what interested me, and I chose the Oracle database. From there I moved to a bank, and there we had data warehouses. We did the traditional ETL, and that's actually how it all started. We had a lot of bash scripts starting all our ETLs. We had an ODS (operational data store) that held intermediate data for fast querying, for things like anomaly detection that had to happen on the fly. Then we had a long-term data warehouse where we ran reports to check that everything was okay, that the customers' money was correct and so on. That was around 2007, I would say. I also registered my domain quite early, in 2004, and I always tinkered with it. At some point I had a party website; this was before Facebook and everything. I was the guy people came to: when we were at parties, I took the pictures. I built that with PHP, HTML, and CSS, and that's actually how I got into web development a bit, and that stayed with me up until today. I changed the domain since, but the old domains still redirect to my current one.
Fast forward: I started doing consulting work. Then I switched to Microsoft SQL Server because the demand was higher than for Oracle at that time. Later I went abroad for three years to Copenhagen to practice my English; as you can hear, I'm not a native English speaker. But there I really got into language. I read more books, I got into the Tim Ferriss podcast, I listened to all the great people, and then I started to listen to more books as well. That's when I started to write my first blog posts.
I think one of my first posts was about data engineering and the future of data warehousing. So I was actually there right when data engineering started. Data engineering is not that old yet; before that it was the BI engineer, and I was mainly writing about the convergence of, or the difference between, the two. Also in Copenhagen I got more into Big Data, as it was called back then. That was my first experience with Python; before, I was just writing stored procedures in PL/SQL or T-SQL on Oracle or Microsoft. But then we did a data science hackathon. The whole Airbus was flying to Toulouse, and we did a two-day hackathon on the Flightradar24 data. That was really my first experience with a little more programming, and also the fun stuff that I really liked, and that stayed with me too. Then at some point I went back to Switzerland. The writing got more and more involved. After that I started as a data engineer and technical author at Airbyte, so I was kind of making writing my part-time job, or full-time job if you want. I still do data engineering and then write about it.
Two months ago I started my own company, so to speak. I'm a full-time author at the moment, but I'm still doing data engineering and everything related to open data engineering. So basically from BI engineer to data engineer to author, writing my book. My book is not finished; it's an unfinished book, but I'm planning to finish it. I don't give myself a deadline, so I can really write whenever I have something to write and dive deep into a chapter. I release it chapter by chapter, and it's online. That's what I'm doing on the side.
Michael Driscoll (00:06:50)
You talked about your journey to becoming a data engineer and author. One thing you and I discussed over the last few weeks, and recently there was an article from Chris Riccomini about this, is what we should call those of us who work in data. Are we analysts? Are we data developers? Are we data engineers? What do you think is the right term to describe somebody who works soup to nuts, from data to dashboard, on these data stacks? What's the term of art that you would use?
Simon Späti (00:07:36)
I would first ask where you are living, because I think it really depends on that. Here in Europe, and specifically in Switzerland, maybe things are a bit slower than in the US, right? Which is where I mostly connect and write about. I'm not even sure the data analyst has arrived here yet. You could say it's already dead elsewhere, but here it hasn't even arrived. Also dbt is getting bigger and bigger here. But I would say it depends where you live. I'm also not so keen on the titles themselves. For me it's like before, when it was the DBA, the database administrator, right? We still have DBAs today. I just think the business intelligence engineer knows a little bit more about the business and really has that hat on, and maybe the data engineer is more of a Python programmer. But there is a whole spectrum: a data engineer could also, depending on the size of the company, do DevOps and ramp up the whole platform. It really depends where you live and which company you work at. But an analyst, I would say, is also kind of a BI engineer, right? Because an analyst really knows the domain, is a domain expert and knows the business, and that's what a BI engineer is doing as well. Maybe an analyst, if you really want to separate them, is doing more hands-on programming. But in the end, analyst for me is almost a synonym for dbt: if someone works heavily with dbt and not much beyond it, that's kind of an analyst for me. But yeah, there are many definitions and overlaps.
Michael Driscoll (00:09:32)
It seems like a trend I've observed, at least, is that a lot of folks start out working at the application layer, designing dashboards with Tableau or Metabase or Superset or Power BI or Looker, and then bit by bit they end up moving deeper and deeper into the stack. Because that's frequently where the thorniest challenges or the gnarliest knots are found, right? Not the application layer, but the database layer, where you have to think about data modeling. And once they're working on data modeling in the database, they start thinking about orchestration and going one step further down.
Simon Späti (00:10:23)
Yeah. Totally. Yeah.
Michael Driscoll (00:10:24)
Well, one topic I thought we would bring up, because you and I are both very active, you even more so than I am, as members of this virtual community of data practitioners: in the last couple of weeks probably the most exciting thing that's happened in the data world has been this vast migration of folks from X, formerly known as Twitter, to Bluesky. The second thing about Bluesky is that it's not just a virtual community with qualities I think we appreciate, but it also itself has some incredibly rich data that is open and available. That's kind of a dream for data scientists or data engineers, or whatever we want to call ourselves.
I guess I would ask you, and maybe we'll get into both topics, but first: why? What's your opinion? What changed? I think the data community was kind of casting about for a place to live. There are Slack channels, like dbt's, that have a pretty healthy community. But what do you think has changed in the last few weeks? And why, suddenly, are folks landing on Bluesky?
Simon Späti (00:11:54)
I think there are multiple reasons. First of all, everybody was kind of searching for a new place, right? Do you go to Threads, or LinkedIn, or Instagram? Every platform has its own advantages, but for me, when I saw Bluesky, it felt like another Twitter, so I immediately felt at home. Then the more I used it, the more I got to know the background of it, and I think it's fascinating how it's built. I didn't know that in the beginning. But they have everything: it's an open social network, right? Even the app is open, and on top of that you can build multiple apps. There's also the protocol; it has its own protocol, the AT Protocol. So if Bluesky goes away, you still have all your data. In fact, there are already two other UIs that I know of, a Hacker News-like alternative and an event RSVP platform where you can sign up for events, and they all work with the Bluesky or atproto login. And then it just opens up a Pandora's box of possibilities, because you can do so many things. On top of that, even the data is open. You can go directly to the API, run a DuckDB query, and query your last 100 posts. In fact, I did that in one of my own blog posts, or a post on Bluesky. And on top of that there's also the Jetstream, which just streams the whole thing. There's a funny implementation of that: you can just open it and see all the posts coming at you, right? So it's really interesting that you can just plug in. I think Twitter had that before as well, right, the API was open; now it's closed, or you pay for it. Especially for data people, we can actually analyze the whole social network. And I think you also did some fun experiments there.
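As a rough illustration of the kind of query Simon describes, here is a small DuckDB sketch against Bluesky's public API. The endpoint, its parameters, the response shape, and the handle are assumptions that may change, so treat it as a starting point rather than a reference:

```python
# Query your most recent Bluesky posts with DuckDB over the public XRPC API.
# Assumed endpoint and response shape; the handle below is a placeholder.
# Requires `pip install duckdb`.
import duckdb

handle = "example.bsky.social"  # hypothetical handle
url = (
    "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"
    f"?actor={handle}&limit=100"
)

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # lets DuckDB read JSON over HTTPS

# Unnest the feed array and pull a few fields out of each post record.
con.sql(f"""
    SELECT
        item.post.record.createdAt AS created_at,
        item.post.record.text      AS text,
        item.post.likeCount        AS likes
    FROM (SELECT unnest(feed) AS item FROM read_json_auto('{url}'))
    ORDER BY created_at DESC
""").show()
```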
Michael Driscoll (00:14:37)
Right, in the last week I think we've all been diving into this rich signal stream, because it's not every day that this kind of data becomes available. And I hope it stays open and available; I'm optimistic. Certainly that's a core value of this organization. I'm not sure, they're not quite for-profit, I think they're a B Corp, but of course we all saw that OpenAI started out as a nonprofit and then went in a different direction. So hopefully they'll stay the course here, open and public and available with all of their data the way they are today.
I think the other thing that's amazing about Bluesky is that it's such a small group of folks, only about 20 people in the entire company. And here they are, last I checked, with over 20 million users. So it's remarkable what they've been able to achieve in terms of their infrastructure.
Simon Späti (00:15:53)
On-prem. Everything runs on-prem, right?
Michael Driscoll (00:15:56)
Right, and they're running their own servers; I think they made that choice for cost reasons. They've got Martin Kleppmann advising them, the author of Designing Data-Intensive Applications, so they've got some smart people involved there. And obviously the fact that the site has remained available and up during this massive spike is again a testament to that very thoughtful architecture.
Simon Späti (00:16:33)
Yeah, it's amazing. And also that you could even run your own server, right? If you want, you could have your posts hosted on your own server. I think they have this Personal Data Server, the PDS, and the implementation at Bluesky is based on SQLite, so every user has their own SQLite database. And now they have maybe around 20 million of them. It's so interesting how they built it.
Michael Driscoll (00:17:06)
We talked about using Bluesky data as maybe a way to talk about data stacks more generally. The data space has always been an exciting place for those of us who are in it, but it does feel like some tectonic plates have shifted in the last couple of years in data, and one of the goals of you and I connecting here is to talk about some of these big macro trends that are happening. Speaking of SQLite, maybe we'll talk about an analogous technology, which is DuckDB. I had as one of my guests on Data Talks on the Rocks Hannes, the creator of DuckDB, and his inspiration for DuckDB was to build a SQLite but for OLAP: SQLite is transactional, and DuckDB is, of course, analytical. You've been doing a lot of work with DuckDB. In your perspective, why do people love DuckDB? Where does it shine? You've been doing some writing for MotherDuck, which is the commercial company building DuckDB in the cloud, and talking about use cases for DuckDB in production. From your research and your writing, why are people so attracted to this technology? And, most interestingly, where is it being used? What are the use cases in production for DuckDB?
Simon Späti (00:18:58)
Yeah, for me, when I first encountered it, it was like a dream come true in a way. When I started my career, we started with SSAS, the Analysis Services from Microsoft, and we used them everywhere. I think they're still heavily used today because they're just so fast, right? If you want to open your dashboard with sub-second response times, you need a very fast backend. But it was just always so hard, especially with SSAS. It was a drag-and-drop, click-based tool, so everything needed to be clicked. If you had many cubes, you had to go through each one, and you couldn't really version them, because there were IDs and graphical artifacts within the code. So it was really, really hard. Then a decade later we had open source OLAP solutions like ClickHouse and Druid and Pinot, and all this great stuff that is out there now. It gets easier and easier, but it was still hard to ramp up your own servers, right? They're really, really fast, also on event data, so you can stream large amounts into them. But they have their backend server, they have the frontend, so they have many processes running, right...
Michael Driscoll (00:20:27)
Various services, right? Almost half a dozen services that you need to run.
Simon Späti (00:20:34)
Yeah, and now, I don't know if it's ten years later or five or six years later, we have DuckDB, which you can just brew install. And then you have OLAP that is fast. That was the first thing: you just have a fast OLAP engine. It's not a huge server, it's more local-first, obviously, but the data volumes we handled back then can easily be handled with it now.
But there's also the fact that you can do much more with it; that's what I only encountered later, right? With Rill, you have an interactive way of fetching the data, instead of waiting to import the data, then seeing what you have, then clicking. So you can interactively explore your data. That's one use case. Another one that I really like is the zero-copy layer. DuckDB is mostly not used as storage, right? It's mostly used as this intermediate layer that makes everything magically faster. Once I had to export from Postgres to Postgres. Normally you do a data dump and then import it again, right? But if you use DuckDB in memory as a zero-copy layer, that speeds up the whole thing, just because it has a faster reader and uses some advanced technology. That's another cool use case, because it's just a single binary. Same as SQLite, you can also run it as part of your pipeline. It also opens things up for enterprises: you can be more secure if the data shouldn't leave your machine, or even shouldn't leave your web browser, right? You can run it within the web browser with WASM, which is quite popular.
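A minimal sketch of the zero-copy idea Simon mentions, using an in-memory DuckDB instance to move a table between two Postgres databases without an intermediate dump. It assumes DuckDB's postgres extension; the connection strings, schema, and table names are hypothetical:

```python
# DuckDB as a pass-through layer: copy a table from one Postgres to another.
# Requires `pip install duckdb`; connection details below are placeholders.
import duckdb

con = duckdb.connect()  # in-memory DuckDB, nothing is persisted here
con.sql("INSTALL postgres; LOAD postgres;")

# Attach both databases (adjust host/user/dbname for your environment).
con.sql("ATTACH 'dbname=source_db host=localhost user=postgres' AS src (TYPE POSTGRES)")
con.sql("ATTACH 'dbname=target_db host=localhost user=postgres' AS dst (TYPE POSTGRES)")

# Stream rows through DuckDB's fast Postgres reader straight into the target.
con.sql("CREATE TABLE dst.public.orders AS SELECT * FROM src.public.orders")
```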
Michael Driscoll (00:22:40)
I think you wrote that blog post about folks who are using DuckDB in production. What would you highlight as one of the most interesting production use cases, and how is DuckDB being used at that company?
Simon Späti (00:23:00)
I think Okta was very interesting. There's a whole YouTube video from Jake Thomas on processing 7.5 trillion records, and they have a very cool architecture where they split the work across multiple DuckDB processes. There's also a startup with just three or four people, Spare Cores, that does analytics on cloud infrastructure. I haven't dug too deep into how they really do it, but I think they do a lot in the browser, analytics on the fly, instead of having a very large backend. Also Hugging Face, which built this protocol so that you can query their datasets directly from there. I'm almost more fascinated by the range of use cases it allows than by any single one, right? There are so many. You can use it in an interactive data app. You can do pipeline processing, or just have lightweight compute that you can run your tests on before you go to the cloud, where it's very expensive; so you can do a lot of things beforehand, and if you really need a big cluster, you can still use Spark. But maybe all the CI/CD stuff can be done locally, and then you save a lot of costs, right? That is a big factor for many: they save a lot of money and gain speed. That is, I think, the biggest one I see.
Michael Driscoll (00:24:48)
It seems like one of the macro trends that DuckDB has been drafting off of is this shift towards object storage. We've obviously seen Amazon's S3 become this foundational fabric for where data lives. And with DuckDB, I think all of us really love that ability to just write a SELECT against an S3 bucket directly with one line of SQL and query the data. In fact, as you mentioned, DuckDB doesn't even really need to be the storage, right? It's simply a compute engine, and your storage is actually your data lake or your object store, the Parquet files living there. Among the choices that people are making in their data stacks is: where do we put the data? Do we put it in a data lake, or do we put it into Snowflake? That's often one of the first questions data engineers have to answer. If they do put their data into a data lake, they also need to answer the question of what the format of that data should be. Should it be, God forbid, CSV, or Parquet, or gzipped JSON? Should it be Delta Lake or Iceberg? And now, of course, those are both under the umbrella of Databricks.
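The one-line "select from an S3 bucket" pattern described above might look roughly like this with DuckDB's Python API. The bucket, path, column name, and region are placeholders, and credentials are assumed to be configured separately (for example via DuckDB's S3 settings or secrets):

```python
# Query Parquet files sitting in object storage; DuckDB is only the compute engine.
# Requires `pip install duckdb` and S3 credentials configured for DuckDB.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'us-east-1';")  # assumed region

con.sql("""
    SELECT date_trunc('day', created_at) AS day, count(*) AS posts
    FROM read_parquet('s3://my-data-lake/bluesky/posts/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").show()
```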
Once they've made those choices, there are other choices to be made in the data stack, such as what language they want to use to operate on that data.
Maybe we'll even go one step back, before we talk about the architecture of this emerging modern data stack that's out there, and "modern" should be in heavy quotes. Before we even talk about landing the data: you obviously worked at Airbyte in the past, and the first step is getting access to and ingesting the data, orchestrating data from a source. What have you seen as the best practices and emerging technologies in the space of orchestration? Just to be concrete, you and I were talking about the Jetstream from Bluesky. It's very easy for anyone to say, I'm going to go grab, in real time, a few hundred thousand Bluesky posts. That's easy. But it's far harder to say, I'm going to set up a service that is going to continuously ingest 100,000 posts a minute; I think right now, last I looked, it was about 100,000 posts or events a minute coming across the Bluesky Jetstream. And if we were to build a data stack to make sense of this stream, getting a durable ingestion architecture up would be the first challenge. What are you observing these days in this space? There's Airbyte, there's Dagster, there's Airflow, and I think Mage is a newer entrant. Tell us a little bit about orchestration and what you're seeing at that layer of the stack.
Simon Späti (00:28:46)
Yeah, it depends on what you want to do, right? As you said, you can just export the posts and do some analytics with DuckDB, and then you're fine; you can put a tool on top of the API. But if you want to really make sense of it... At the end of the day, you usually start small. You have some data, but then you figure out in the dashboard that the data is not as good, or there are missing rows, and now you need to clean this date format, or, as we have seen with the Jetstream, there's a lot of JSON, right? You might need to normalize that data somewhere, and then they might change the format; at some point they will add new attributes, so it will break at some point. I wrote some Python code myself to export the Jetstream. First of all, you need somewhere for it to run. I can run it on my local machine, but my local machine is not running all the time, so you would need a Lambda function or somewhere in the cloud to constantly extract it. Then you need to land it somewhere, probably in S3, so you have cheap storage and can do what you want with it later. That's schema on read: you figure out the schema later, when you actually want to analyze it. But if you want to implement real analytics on top of it, I would actually start way back: what is the data flow? What is my data engineering lifecycle? What do I need to do? Orchestration is one part, but you might also need lineage, or a data catalog where you see your data, right? It doesn't need to be a tool, but it needs to be some kind of query where you can see all your data. You might need a BI tool. I think these thoughts, this brainstorming, gets a little bit lost, at least in what I read sometimes, because it's just: you have some data. But in a real enterprise, you actually need to make it sustainable, and you need to adapt to changes, because the schema changes. My last chapter was about schema changes, and I analyzed how the handling of schema evolution has changed over time. When I started, we just had a release management team that organized all the changes, and once the release was closed you couldn't change anything anymore; you had to wait for the next release. Then we had slowly changing dimensions, where the changes in the data themselves have an evolution. You have Data Vault, a methodology that more or less automatically adapts to changes from the source system. You have schema registries, and I compared that to NoSQL, because in NoSQL you don't have fixed schemas, right? You have a flexible schema between the source and the destination. But looking back to when I started 20 years ago, we still do this schema evolution, these schema changes, right? That didn't go away. It's just maybe not the first thought when you start, but if you think it through thoroughly, you will be in a better state at the end, because you can foresee, okay, what happens if Bluesky changes something in the Jetstream. Actually, I saw a PR recently where they changed something in the Jetstream so the apps didn't need to change, but the ones consuming the Jetstream did, right?
So that’s just some thoughts that come to mind.
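To make the ingestion side concrete, here is a rough sketch of a long-running Jetstream consumer that lands raw events in S3 for schema-on-read later, in the spirit of what Simon and Michael describe. The public endpoint, its query parameter, the bucket name, and the batch size are assumptions; a production version would also need retries, partitioning, and monitoring:

```python
# Continuously consume the Bluesky Jetstream and land gzipped NDJSON batches in S3.
# Requires `pip install websockets boto3`. Endpoint and bucket are placeholders.
import asyncio
import gzip
import time

import boto3
import websockets

JETSTREAM_URL = (
    "wss://jetstream1.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"  # assumed filter parameter
)
BUCKET = "my-bluesky-landing-zone"  # hypothetical bucket
BATCH_SIZE = 10_000                 # flush roughly every N events

s3 = boto3.client("s3")

def flush(events: list[str]) -> None:
    """Write one gzipped NDJSON object per batch, keyed by ingestion time."""
    key = f"jetstream/raw/{time.strftime('%Y/%m/%d/%H%M%S')}.ndjson.gz"
    body = gzip.compress("\n".join(events).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)

async def consume() -> None:
    buffer: list[str] = []
    while True:  # reconnect loop: the socket will drop eventually
        try:
            async with websockets.connect(JETSTREAM_URL) as ws:
                async for message in ws:
                    buffer.append(message)  # keep raw JSON; schema-on-read later
                    if len(buffer) >= BATCH_SIZE:
                        flush(buffer)
                        buffer = []
        except Exception as exc:
            print(f"stream dropped ({exc}), reconnecting...")
            await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(consume())
```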
Michael Driscoll (00:33:17)
Schema evolution, of course, is one of those persistent challenges in data stacks. One observation: there's a quote that someone said, "all money in software is made with one of two strategies: bundling or unbundling." And I think what we saw during the last couple of years, initially with the so-called modern data stack, was a pretty significant unbundling of a lot of different features of data, analytics, and data engineering, where you had many businesses being formed to do all the different parts. You had businesses around data catalogs, like the folks who started a company around Amundsen, or I guess Amundsen is an open source project that had a company built around it. You had companies around data contracts. You had companies around metrics layers, right? So there was a real proliferation of a kind of unbundled data stack. Fast forward a couple of years, and it does feel like several of those companies have been acquired. I think dbt acquired one of the metrics layer companies, the authors of MetricFlow, and I think another one of those companies pivoted in a different direction. But now we're seeing more of a bundling, where folks are realizing that when they're building their data stack, they don't want to have 17 different SaaS vendors in it. And potentially one of the challenges with schema evolution is that the more moving parts you have in your stack, the more it's like a game of telephone: the more likely things are going to break as you move data from ingestion to object storage to database to metrics layer to BI tool.
Another piece that we collaborated on, which maybe addresses some of these moving parts and how we deal with schema evolution and change management in the data stack, is what you wrote about the declarative data stack. I think you and I have both observed that this is one way to wrestle with some of that complexity: if the protocol changes its JSON schema, that leads to a whole set of knock-on effects; someone who built a Bluesky metrics dashboard has a lot of stuff to manage when the AT Protocol JSON schema changes. The declarative data stack seems like a philosophy where some of these problems are made a little easier, a little more soluble, maybe. Tell us a bit about how you came up with the term declarative data stack, what it means, what problems it addresses, what is probably the most valuable thing about it, and some examples you've seen. Obviously I have a horse in this race, but you have more of an independent lens, as someone who's working with several businesses out there.
Simon Späti (00:37:35)
I think why it connects well is that if you have multiple tools floating around and you need to integrate them, it's very hard if you do that imperatively. Imperative means you define the how: you implement how you want things done, and if something changes, you need to implement that in your code. The core concept of the declarative approach is that you only describe the what, what you want. It's like SQL, right? Well, it's not entirely true everywhere, because you have different SQL dialects, but with SQL in general you don't say how you want to query the data. The engine can be different: it can be ClickHouse, it can be SQL Server. They have their own optimizations, they run their query plans, they check whether there is an index, and they manage that themselves. You only declare what you want, and that makes things much easier, right? So if you zoom out, in the end you just say: actually, I just want a BI dashboard. But I still need the data, right? If you could somehow declare that, and we discussed this in the article, if you could do that in a single function, that would ease things up. Because then you have a concept similar to Kubernetes, where you define everything in one YAML; it can be quite large, but it's one specification, and Kubernetes itself will then try to bring the system into that state. If a pod is down, it will try to ramp up the next one, so you don't need to. You have less complexity, and you can better integrate multiple tools; that is the ultimate goal. Obviously there needs to be some kind of interaction between the tools, but that's the vision. Another example is Markdown. You can write Markdown in Obsidian; even Google Docs now has Markdown export and import. You decouple the compute, or the engine, from the actual data, which is the text in that case. So different engines can be built, and they do the how: how do I convert this into a text that everybody can read? The same thing could happen here: we have one declarative stack where we define what we want, but then there might be different engines, maybe even sub-engines. An orchestrator could be one engine, and there are different orchestrators to choose from. But if we have one definition, let's say we want a transformation, we want to clean the date, and we can define that declaratively, then each engine can read it and do its best in its own part. The orchestrator will do the best orchestration, better than a BI tool or an ingestion tool, but the BI tool will show the data much better. So if we can combine these into one single function, that would also go well with the functional data engineering philosophy: everything is repeatable and idempotent, so you can restart with the same data and get the same outcome. Because the hard part about data is that you have state, right? And you also have upstream data that you cannot influence. We cannot tell Bluesky, please don't change anything, or please don't write any asterisks in your posts. That is out of our hands.
I also read some blogs that said they specifically didn't choose Markdown because of weird edge cases, like what if someone writes Markdown syntax but doesn't actually want it rendered, and stuff like this, right? But it's a really fascinating topic. It's not always the best approach: if you have a small stack, an imperative workflow, implementing everything in a monolith or something, might even be better. But if you have really large, complex systems, and that's also why Kubernetes is mostly used in large organizations, then it's really a game changer, in my opinion. That's why I'm really fascinated by tools that allow this, like Rill as well. I've built BI dashboards all my life, and it was always hard to just duplicate a dashboard and maybe change a metric, because you needed to click through it, and it's hard to store that. So that's kind of the wish, I would say. It's very hard to be declarative end to end and then integrate everything, but I think it's a good vision to have in the back of our minds and to work toward. It also makes integration easier. If you have a clear definition, it's almost like data contracts, where you define the interface, and then the orchestrator, the BI tool, and the data catalog can all work against that interface. So you have a kind of contract, which is the declaration. Of course, you would then need a standard declaration somewhere, but I think that's a good way to follow, or at least to try.
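Purely as an illustration of "declare the what, let engines handle the how," here is a toy sketch. The spec format is invented for this example and not tied to Rill, Kubernetes, or any real tool; the point is only that each engine reads the part of the declaration it is responsible for:

```python
# A toy declarative stack spec; every field below is hypothetical.
from dataclasses import dataclass

@dataclass
class StackSpec:
    source: dict      # where the data comes from
    transform: dict   # what the cleaned data should look like
    dashboard: dict   # what the consumer wants to see

spec = StackSpec(
    source={"type": "jetstream", "land_in": "s3://bucket/raw/"},
    transform={"clean": ["createdAt -> date"], "dedupe_on": "cid"},
    dashboard={"metric": "posts_per_hour", "dimension": "language"},
)

# Each "engine" consumes only its own part of the declaration; swapping an
# engine changes how the work gets done, not the spec itself.
def run_ingestion(s: dict) -> None:
    print(f"ingest {s['type']} into {s['land_in']}")

def run_transform(t: dict) -> None:
    print(f"apply rules {t['clean']}, dedupe on {t['dedupe_on']}")

def run_dashboard(d: dict) -> None:
    print(f"serve {d['metric']} by {d['dimension']}")

run_ingestion(spec.source)
run_transform(spec.transform)
run_dashboard(spec.dashboard)
```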
Michael Driscoll (00:44:05)
It seems that, particularly in the data world, when you look at the evolution of different technologies, one example would be the early days of Hadoop, where you had MapReduce. When I think about data pipelines, about getting data from point A to point B, and about data transformation in particular, for a long time a lot of data transformations have been imperatively defined. Folks would use tools like Java, and Python is probably now one of the most popular. But it does feel like there's been a shift: SQL is no longer being used just for querying data; increasingly, SQL is being pushed down the stack into the transformation layer.
I guess one observation would be that the challenge of putting SQL in the transformation layer is that, as you mentioned, the declarative approach separates the what from the how, and when you separate those, you run into the fact that in data transformations performance is often one of the highest concerns, and you need to have control over the how.
And so it's only recently, it feels, that SQL has started to become more and more used in data pipelines for transformation rather than Python or Java or other code. And I would observe that this ELT movement was about loading the data. I think Snowflake, for self-interested reasons, was very much pushing folks to load data into Snowflake and then do the transformation there. And the appeal of that was great: we can use SQL, which is a much nicer language to work with data. We'll import all the data into a Snowflake table, call it the raw data, and then do a series of transformations. Of course, then I think people woke up to the fact that that was extremely expensive. Maybe the promise of DuckDB on the data lake is that you have all of the benefits of a declarative grammar, declarative SQL, for data transformation, but you don't have to pay the Snowflake tax, nor do you have to pay the network fees of moving the data from your S3 bucket into Snowflake and then pay a 5x premium on all your compute cycles.
Simon Späti (00:47:23)
Yeah, I think one of the keys there is also the openness of the format. It goes for any data warehouse, right? If you put your data into an Azure data warehouse or similar, you're kind of locked into a proprietary format that you can query with their tools, mostly SQL, but you don't have access to the engine, as you said, or the compute. If you have an open format, and I think that's also why they're so in trend, you as a company can store your data, which is the asset of every company, in a format that is open and accessible, and that opens up so many possibilities. You can use other tools to do the compute, maybe locally, or you can run an orchestrator on top of it. And even more: you don't need to move the data from one data warehouse to the next anymore. You can even use sharing, like Delta Sharing, where I think you can share live tables. So instead of sharing a Dropbox link, you just share the link, and then people can query the data directly with any tool, instead of duplicating the data. That's another benefit of it; that's really the key. And it's really nice to see the big vendors embrace this: Microsoft, I think, implements Delta in their Azure data warehouse, and Snowflake has Iceberg tables. So it's really a benefit for everyone that they like the open formats, and the community can also build tools on them. For me that is really a big win. But again, then usually what happens is that you have a kind of data lake that becomes a bit of a data swamp sometimes, because you can just easily dump data, and everybody can easily read it. That was the advantage of a data warehouse: you had a core team who made sure the quality was okay and the data was updated every day. But if you make it more open, people, like the data scientists, go there, do their stuff, and copy the data. So before I said you can easily share without copying, but the copying happens because it's open and accessible, right?
Michael Driscoll (00:50:13)
Double-edged sword.
Right. Well, I think we'll come to the last topic of our discussion, which is something that those of us who've been working in data for a long time deeply appreciate: the importance of data modeling. For some time, in the early days of big data, there was this feeling that we could just have unstructured data, the data lake, aka the data swamp, right? We could just do schema on read, and we didn't really need to do much modeling: we'd dump all of the JSON event streams into a data lake and the application layer would take care of the transformation. But real-world experience has shown, time and again... We're talking about what's changed in the last 20 years, and what hasn't changed is that SQL continues to be pretty important, and data modeling continues to be incredibly relevant. Without data modeling, you'll just end up with garbage in, garbage out, right?
Pedram Navid, who now works at Dagster, has a long-running joke on Twitter, and maybe now Bluesky, where he asks: have we agreed on what the definition of a user is yet? Many of us know that even the simple act of counting things, like how much revenue did we have last month, or how many users did we have last month, can produce very different answers depending on how the data is modeled.
I'd love to hear your perspective on the recent evolutions you've observed in data modeling. The modeling of metrics, of course, is a topic of interest, along with modeling data in tables, modeling data in data lakes, and materializing models. What are some of the trends you've observed as you've been writing your book and working with a number of leading companies in the data ecosystem?
Simon Späti (00:53:14)
I think metrics are one that is really fascinating to me: you have, again, a declarative definition of metrics, you can keep them in one place, and then you can even give APIs to the consumers. They can have a SQL API, but maybe some would like a REST API, and some would like an MDX API to query it with Excel, right? Excel is still quite a famous BI tool. But you have to have a definition of the metrics somewhere to actually allow this. If you have the metrics spread across different BI tools, or you don't have a place where you define them, you end up having the same definitions multiple times. And at the end of the day, what I've experienced is that no company is the same, right? Profit is calculated differently in every company, and some have a huge history of how that is calculated; there are so many factors that go into that supposedly simple KPI. But that's also where the business modeling, the business value, comes from: really implementing these business rules. What struck me lately about SDF is that they have business rules as code. If you have different rules, like a customer being named one way in one table and another way in another, maybe you can define a rule in one place and reuse it. That would be the next layer, right? You have the metrics and the rules, and then maybe the transformation. So that's an interesting aspect. Another one, which I wrote about, is materialization, or caching, right? Where do you cache? You can cache in so many places. If you dump your Bluesky data into S3, that's a form of caching: you just store it there so you can aggregate faster. Then you can put it all in a cube, so it's sub-second; that's another cache. You can use dbt, which is basically doing aggregation on aggregation, and each of those can be a cache, but it can also just be a view. That's the whole point of modeling, right? What do you want? Do you need fast query times? Then you cache, probably at the very end. Or maybe you don't mind waiting five minutes for your query, because it can even run overnight.
So where you actually persist and materialize your data is a very fascinating question, and it's one that is answered in the data modeling part, right? That's the technical aspect of it. But there's also conceptual modeling, where you have a logical model that you can discuss with the business people. Then you define: okay, you want the Bluesky data, but at which granularity? Do you need it daily? Maybe weekly is enough for your use case, and then you can save a lot of money by just saving it as a weekly or monthly partition. And then, by which dimensions do you want to query it? There we are in dimensional modeling: you need to foresee which dimensions you want.
These are things that haven't changed a lot. What has changed almost on a daily basis is the tools we use, and I'm actually very happy about that. When I go back 20 years, at every company I was building the same ETL jobs over and over again from scratch, and you were not allowed to take the code to the next company, right? So you basically started from scratch. Nowadays many say it's a bad thing that there are so many tools, but as a data engineer I appreciate it so much. You can just go on GitHub, pip install, and then you have a whole ingestion pipeline and a BI tool. You can pick and choose for free, in a way; you just need to learn the tools and put them together. But there we are at the modeling of tools. So we have the modeling of tools, the modeling of the data, and then the business requirements. If you put them in a kind of Venn diagram and find the sweet spot, you're probably in a good position. But sometimes we, me included, are a little too into fancy technologies, because that's fun, and it's also cool if you can show your Bluesky posts in your terminal with one simple query. That's really cool. But you should always start with the use case, from the right side, and then go top-down.
Michael Driscoll (00:58:28)
It's right to left, as we say: start with the business requirements and then move backwards. That's probably what makes data modeling so difficult, because it's not just a technical problem. It sits at the interface between the technology and the actual business. The definition of how you count a user, or how you count profit or revenue, is ultimately a business definition, and so it requires integrating some business domain knowledge. I think the other interesting thing about data modeling is that there's enormous opportunity for efficiency gains with better data modeling. But unfortunately, many of the businesses and technologies out there have perverse incentives not to want their users to be efficient data modelers, because they get paid for consumption. They get paid on compute cycles, and so for Databricks and Snowflake, the two big gorillas on the technology stage, it's not really in their interest to get people to think about more efficient data models or caching layers that would reduce their consumption, or only reluctantly will they be brought along on that.
Simon Späti (01:00:05)
Yeah. First of all, if you're not writing code, you're not working, right? If you're just thinking, that's not seen as real work; you cannot show it off, so it's hard to quantify that value. On the other hand, when you use Databricks or Snowflake, you have something in a day, right? Or in hours. It's really easy to get started, and that's a real benefit compared to building a data warehouse back in the day, when we were talking about three months just to get the first dashboard or the first KPI. So it's a different mindset. Today you can do it much faster: there's data warehouse automation, there are agile data warehousing and BI approaches, and you can be much faster in the modeling phase too. But it still takes time, and you need to be willing to invest in that. I see it a little bit as the responsibility of the data engineer to push back a little in the beginning and say, "Okay, we need to think this through," and take some time up front, because usually that pays back by a magnitude later. If something is wrong, then we go into those schema changes and have to change everything. If you have a better model to start with, you have much less friction, and I think you will end up in a much better place.
Michael Driscoll (01:01:41)
Well, let's probably end on that insight, which is that one of the most important qualities, in my experience, of a good data engineer or a good data practitioner is judgment and taste. Data folks are often dropped into the fog of war, right, when someone says, hey, can we make sense of this data?
Simon Späti (01:02:11)
There's good memes for that.
Michael Driscoll (01:02:14)
Well, we'd all like to think the world is well structured and there's good documentation and annotation. But in my experience, which I'm sure is also your experience, we're often just thrown in: here's a login to a database, the column schema has never been documented, and Carol, who works in the Utah office, you can just ask her what CT underscore...
Simon Späti (01:02:51)
6, 7, 8, 9. Yeah.
Michael Driscoll (01:02:54)
What that means, right. So I think the ability to have good taste and good judgment, and to navigate decisions under uncertainty, is really important. It's not just technical prowess, the ability to write code that's efficient and performant; it's somewhat softer skills. Maybe EQ versus IQ.
Simon Späti (01:03:26)
It's actually really interesting, because that leads right into the next topic that we didn't even discuss: GenBI. There we have another component that actually generates things, and then the human gets even more important, to verify and to make sure we don't produce things that might be just totally wrong. So what you just said will get even more important with AI and the generative stuff.
Michael Driscoll (01:03:55)
Right, I think that's a great point. And AI, whether it's Cursor or Claude or GPT-4o, these things are increasingly good at technical execution. But what's still needed on the human side is to provide some guidance, some judgment, and, frankly, some curation or evaluation of the outputs. Does the answer make sense?
Simon Späti (01:04:35)
Yeah, and sometimes it's also just a feeling, intuition, or experience. Some things you can only know by experience.
Michael Driscoll (01:04:49)
Simon, it has been such a delight and a pleasure to have you here on Data Talks on the Rocks.
Simon Späti (01:04:57)
Thank you so much for having me.
Michael Driscoll (01:05:00)
I look forward to maybe getting further into this topic of GenBI in the future; I know you recently wrote about it, and we collaborated on that essay of late. We will share some of the links in the blog post that we'll put up with the transcript, for anyone who's interested in following up on the things we discussed, whether the Bluesky Jetstream or some of these different articles that we've all been reading.
We'll also share both your and my Bluesky handles for those who want to continue these discussion threads further.
Thank you again. I look forward to our next discussion, and thanks for joining us here.