No matter how strong your engineering team, it’s likely you haven’t run any codebase analytics. Yet analyzing the history of your codebase can yield metrics on code quality, code complexity, and productivity – metrics that any engineering team would love to have. These analyses work best with a continually updated data pipeline and an interactive data visualization. Typically, building that is more than a weekend project.
To help with this, we've developed a Rill project that analyzes GitHub repositories. The result is a live, interactive dashboard that offers insights into your engineering organization. The project's README includes a step-by-step guide on how to deploy it for your own Git repo. In this post, we'll provide examples of the metrics and insights you can gain from the dashboard.
Analysis of DuckDB’s Repo
In our example, we showcase DuckDB’s GitHub repository. DuckDB is a central dependency of Rill, and we’re big fans. First, we won’t bury the lede – check out the live, interactive dashboard, then on we go.
Measuring Engineering Productivity
To start with the basics, let’s look at how DuckDB’s engineering productivity has trended over time. In our Rill dashboard, we can filter for June and see a month-over-month comparison. The current time period (June) is in bright blue; the previous period (May) is in light gray.
It’s easy to see that code commits are down in June. That’s interesting, and more context would likely explain it. Every metric in this dashboard should be interpreted alongside other data points gathered in your organization. Next, let’s dig into some metrics that may be upstream of a productivity slowdown: code churn and code complexity.
Observing Code Churn
To analyze code churn, we can look at which files were present in the most commits in 2023. By adding a filter for files in the src directory, we can isolate the commits to the core components of DuckDB.
In this list, we can see several files related to CSV ingestion: read_csv.cpp, parallel_csv_reader.cpp, and base_csv_reader.cpp. Their high commit counts suggest that CSV ingestion may be a tricky feature to support.
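The churn count described above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual logic: the commit records, the flat `src/` paths, and the `churn_by_file` helper are all made up for the example.

```python
from collections import Counter

# Hypothetical sample data: each commit lists the file paths it touched.
# The paths are illustrative and do not mirror DuckDB's real layout.
commits = [
    {"sha": "c1", "files": ["src/read_csv.cpp", "src/base_csv_reader.cpp"]},
    {"sha": "c2", "files": ["src/read_csv.cpp", "README.md"]},
    {"sha": "c3", "files": ["src/read_csv.cpp", "src/base_csv_reader.cpp",
                            "tools/setup.py"]},
]


def churn_by_file(commits, prefix="src/"):
    """Count how many commits touched each file under `prefix`."""
    counts = Counter()
    for commit in commits:
        # set() ensures a file is counted at most once per commit.
        for path in set(commit["files"]):
            if path.startswith(prefix):
                counts[path] += 1
    return counts


# Files ranked by the number of commits they appeared in.
top_files = churn_by_file(commits).most_common(2)
print(top_files)
```

With this sample data, `read_csv.cpp` appears in three commits and `base_csv_reader.cpp` in two, so they rank first and second.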
Code churn metrics can help engineers identify areas of technical debt and can help product managers understand resource requirements for a product roadmap. Let’s take a look at code churn from another angle: engineering effort spent on maintenance vs. new features.
Maintenance vs. New Features
One particularly useful metric in our Rill dashboard is the “Code deletion %” measure. This measures what percent of code changes in a file were “deletes” versus “additions.” A high percentage may indicate that the commit involved maintenance – perhaps a bug fix, refactor, or optimization. A low percentage likely indicates the commit added fresh code, and a new feature could be on its way.
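As a rough sketch, one way to compute such a measure is deleted lines divided by all changed lines (added plus deleted). That formula is our assumption for illustration; the dashboard's exact definition may differ. The change records and helper names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-file change records: (path, lines_added, lines_deleted).
changes = [
    ("src/optimizer/a.cpp", 10, 30),
    ("src/optimizer/b.cpp", 10, 10),
    ("tools/nodejs/index.js", 90, 10),
]


def deletion_pct(added, deleted):
    """Deleted lines as a percentage of all changed lines (assumed formula)."""
    total = added + deleted
    return 100.0 * deleted / total if total else 0.0


def deletion_pct_by_dir(changes, depth=2):
    """Aggregate added/deleted lines per directory, then compute the percentage."""
    totals = defaultdict(lambda: [0, 0])
    for path, added, deleted in changes:
        # Drop the file name, then keep the first `depth` directory levels.
        parts = path.split("/")[:-1]
        directory = "/".join(parts[:depth]) or "."
        totals[directory][0] += added
        totals[directory][1] += deleted
    return {d: deletion_pct(a, r) for d, (a, r) in totals.items()}


by_dir = deletion_pct_by_dir(changes)
for directory, pct in sorted(by_dir.items()):
    print(f"{directory}: {pct:.1f}% deletions")
```

In this made-up data, the optimizer directory is deletion-heavy (suggesting maintenance), while the NodeJS directory is mostly additions (suggesting new code).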
Here we look at DuckDB’s directory-level deletion percentage for June:
There are a couple of clear patterns. In the top image, we can see that the team worked on query performance – across the optimizer and planner – this past month. In the bottom image, we can see that the DuckDB team is adding new code to their NodeJS, Python, Julia, and R language support.
Finding Code Complexity
Code complexity is a leading indicator that can help you get ahead of code churn. One heuristic we can use to assess code complexity is the number of files touched per commit. A low number of files per commit likely indicates the code is more encapsulated. Many files per commit could point to code that is less encapsulated, harder to review, and, as a result, more error-prone. If files-per-commit trends up, code churn may follow.
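A minimal sketch of the files-per-commit heuristic follows. The commit records and the `files_per_commit` helper are hypothetical and only illustrate the calculation, not how the dashboard computes it.

```python
from statistics import mean

# Hypothetical sample data: each commit lists the file paths it touched.
commits = [
    {"sha": "c1", "files": ["src/a.cpp"]},
    {"sha": "c2", "files": ["src/a.cpp", "src/b.cpp", "src/c.cpp", "README.md"]},
    {"sha": "c3", "files": ["src/b.cpp", "src/c.cpp"]},
]


def files_per_commit(commits, prefix="src/"):
    """Map each commit SHA to the number of distinct files it touched under `prefix`."""
    return {
        c["sha"]: len({p for p in c["files"] if p.startswith(prefix)})
        for c in commits
    }


counts = files_per_commit(commits)

# Commits that touched the most files come first.
widest = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Tracking the average over time shows whether files-per-commit trends up.
average = mean(counts.values())
print(widest[0], average)
```

Here commit `c2` touches the most `src/` files, making it the first candidate to inspect for complexity.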
Here, we look at DuckDB’s June commits that touched the most files in the src directory.
The commits above may indicate complex parts of the codebase. However, note that files-per-commit isn’t always a good proxy for complexity. For example, simple refactors that rename or move a common utility function will touch many files.
Try Rill for Yourself!
We hope these example analyses encourage you to analyze your own repository. You can check out Rill’s GitHub Analytics README for a step-by-step guide on how to set up the dashboard with your own Git data.
At a high level, the steps are:
- Set up a cloud storage bucket and related authentication
- Enter your credentials in the provided download_commits.py, and run the script
- Edit the Rill code artifacts to point to your storage bucket
- Run rill start to explore the source data and preview the dashboard on your local machine
- Run rill deploy to publish the dashboard to Rill Cloud
We’d like to continue to improve this dashboard. If you have ideas, please let us know! You can design more metrics yourself in Rill’s responsive SQL editor. Check out our quick start guide for an introduction.
If you get stuck or want to share what you're building, connect with us in our Discord channel! We’d love to hear from you.