```python
import duckdb
import pandas as pd
import numpy as np
import random
from IPython.display import display, Markdown
```
Lately I’ve been struggling through some SQL query development, and I thought it would be useful to write out some of what I’ve been learning in a blog post. Specifically, I’ve been thinking about how to do various grouping tasks when data have a hierarchical structure. I’ll cover the following SQL concepts:
- The `GROUP BY` clause
- The `PARTITION BY` clause
- An aggregation task complicated by a dataset’s hierarchy
This post is also an opportunity to overview how I display and run SQL code with syntax highlighting in Quarto – a nontrivial task as it turns out!
To start, I’ll synthesize a simple dataset that examines car accidents on the Interstate 280 and 580 highways in the Bay Area. I’ll go over the specifics later, but for now just note that this is a standard `pandas` dataframe stored in local memory as `accidents`. It contains data on cars’ trips on the two highways, and a flag denoting whether the trip was associated with an accident or not.
```python
np.random.seed(42)
n = 1000
data = {
    "license_plate": [f"{random.randint(90000, 90099)}" for _ in range(n)],
    "highway": [random.choice(["I280", "I580"]) for _ in range(n)],
    "trip_date": [f"2024-11-{random.randint(13, 15)}" for _ in range(n)],
    "accident_flag": np.random.binomial(1, .1, n),
}
accidents = pd.DataFrame(data)
```
Connecting to a `pandas` data frame in `duckdb`
The data I simulated are quite small, so using SQL isn’t going to be very practically useful. However, I find that it’s helpful to play with small SQL databases like this to build general comfort with the language and to get a better intuition of how operations will play out when I’m working at scale.
I’m going to use the `duckdb` database system to directly query the `pandas` data frame I have in local memory (note that `duckdb` can also directly query Parquet files, CSV files, and more).
```python
duckdb.sql("SELECT * FROM accidents LIMIT 5;")
```
┌───────────────┬─────────┬────────────┬───────────────┐
│ license_plate │ highway │ trip_date │ accident_flag │
│ varchar │ varchar │ varchar │ int32 │
├───────────────┼─────────┼────────────┼───────────────┤
│ 90092 │ I580 │ 2024-11-14 │ 0 │
│ 90095 │ I580 │ 2024-11-14 │ 1 │
│ 90080 │ I280 │ 2024-11-15 │ 0 │
│ 90026 │ I280 │ 2024-11-15 │ 0 │
│ 90093 │ I280 │ 2024-11-13 │ 0 │
└───────────────┴─────────┴────────────┴───────────────┘
Executing and displaying SQL in Quarto
The fact that `duckdb` can directly query data in memory is great, but two issues arise: the SQL code doesn’t have any syntax highlighting, and it would be preferable if the output were a data frame rather than plain text. I’ll create a wrapper function, `run_query`, that solves both issues:
````python
def run_query(sql_str: str) -> pd.DataFrame:
    display(Markdown("```sql\n" + sql_str.strip("\n") + "\n```"))
    return duckdb.sql(sql_str).df()
````
Note the following code elements:

- `display(Markdown(...))` will display `sql_str`, the string code for the query, wrapped in a ```` ```sql ```` fence. The fence alerts Quarto that it should provide SQL syntax highlighting for the code output.
- `duckdb.sql(sql_str).df()` returns the results of the SQL query as a `pd.DataFrame` object rather than the text table rendered above.
For the rest of the post, I will use that function within cells set to `echo: false` to carry out my SQL queries. Thus, you won’t see the Python function call, just a code chunk of the SQL query and the query result. Here’s an example query, looking at the first five rows of `accidents`:
| | license_plate | highway | trip_date | accident_flag |
|---|---|---|---|---|
| 0 | 90092 | I580 | 2024-11-14 | 0 |
| 1 | 90095 | I580 | 2024-11-14 | 1 |
| 2 | 90080 | I280 | 2024-11-15 | 0 |
| 3 | 90026 | I280 | 2024-11-15 | 0 |
| 4 | 90093 | I280 | 2024-11-13 | 0 |
Note the hierarchy of these trip data. We start with dates and proceed downward:

- Date
  - Highway
    - A highway will have multiple cars that use it.
    - Car (license plate)
      - A car can make multiple trips.
      - Trip (each row is a trip)
        - Accident flag (associated with the trip)
I’ll look at a few key SQL methods using these data.
Aggregating with `GROUP BY`
Let’s say we were interested in counting total trips and total accidents for each highway on each day. This is a group aggregation task, where we want to calculate summary statistics (two counts) for subsets of the data grouped by one or more columns. In SQL this is done using the `GROUP BY` clause and an aggregation function (`COUNT` and `SUM` in this case). Equivalent `pandas` code is available via the navbar. In this case, I like the SQL and `pandas` syntax about equally.
| | highway | trip_date | n_trips | n_accidents |
|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19.0 |
| 1 | I280 | 2024-11-14 | 181 | 22.0 |
| 2 | I280 | 2024-11-15 | 176 | 15.0 |
| 3 | I580 | 2024-11-13 | 171 | 16.0 |
| 4 | I580 | 2024-11-14 | 167 | 13.0 |
| 5 | I580 | 2024-11-15 | 154 | 15.0 |
```python
(
    accidents
    .groupby(['highway', 'trip_date'], as_index=False)
    .agg(
        n_trips=('trip_date', 'size'),
        n_accidents_date_highway=('accident_flag', 'sum')
    )
    .sort_values(by=['highway', 'trip_date'])
)
```
| | highway | trip_date | n_trips | n_accidents_date_highway |
|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19 |
| 1 | I280 | 2024-11-14 | 181 | 22 |
| 2 | I280 | 2024-11-15 | 176 | 15 |
| 3 | I580 | 2024-11-13 | 171 | 16 |
| 4 | I580 | 2024-11-14 | 167 | 13 |
| 5 | I580 | 2024-11-15 | 154 | 15 |
Note that the `GROUP BY` operation reduces our data granularity down to the unique highway-date pairs. This is typical aggregation behavior. Also note that in the SQL code, for trips I use `COUNT(*)`, which simply counts all rows within the group, while for accidents I use `SUM(accident_flag)` within groups, since the column is a 1/0 integer flag.
Aggregating with `PARTITION BY`
The Partition is an odd cousin of the Group that I became aware of only recently. In essence, it accomplishes the same thing as `GROUP BY`, but it doesn’t compress the data or otherwise change granularity. Instead, it returns a dataframe with a group’s aggregate result replicated at every row where that group exists. I can express this more clearly via example.
| | license_plate | highway | trip_date | accident_flag | n_accidents_date_highway |
|---|---|---|---|---|---|
| 0 | 90021 | I280 | 2024-11-13 | 0 | 19.0 |
| 1 | 90031 | I280 | 2024-11-13 | 0 | 19.0 |
| 2 | 90019 | I280 | 2024-11-13 | 0 | 19.0 |
| 3 | 90059 | I280 | 2024-11-13 | 1 | 19.0 |
| 4 | 90054 | I280 | 2024-11-13 | 0 | 19.0 |
| ... | ... | ... | ... | ... | ... |
| 995 | 90002 | I580 | 2024-11-15 | 0 | 15.0 |
| 996 | 90000 | I580 | 2024-11-15 | 0 | 15.0 |
| 997 | 90062 | I580 | 2024-11-15 | 0 | 15.0 |
| 998 | 90037 | I580 | 2024-11-15 | 0 | 15.0 |
| 999 | 90003 | I580 | 2024-11-15 | 0 | 15.0 |

1000 rows × 5 columns
```python
(
    accidents
    .assign(
        n_accidents_date_highway=accidents
        .groupby(['highway', 'trip_date'])['accident_flag']
        .transform('sum')
    )
    .sort_values(by=['highway', 'trip_date'])
)
```
| | license_plate | highway | trip_date | accident_flag | n_accidents_date_highway |
|---|---|---|---|---|---|
| 4 | 90093 | I280 | 2024-11-13 | 0 | 19 |
| 12 | 90053 | I280 | 2024-11-13 | 0 | 19 |
| 23 | 90086 | I280 | 2024-11-13 | 0 | 19 |
| 27 | 90023 | I280 | 2024-11-13 | 0 | 19 |
| 29 | 90057 | I280 | 2024-11-13 | 0 | 19 |
| ... | ... | ... | ... | ... | ... |
| 964 | 90004 | I580 | 2024-11-15 | 0 | 15 |
| 969 | 90070 | I580 | 2024-11-15 | 0 | 15 |
| 983 | 90050 | I580 | 2024-11-15 | 0 | 15 |
| 988 | 90000 | I580 | 2024-11-15 | 0 | 15 |
| 993 | 90095 | I580 | 2024-11-15 | 0 | 15 |

1000 rows × 5 columns
Note in the output that the number of accidents on I280 on 11/15/2024 is the same as it was in the grouped output above, but it repeats for every row from highway I280 on 11/15/2024 in the partitioned version. The `pandas` implementation for this task, available via the navbar, relies on the combination of `groupby()` and `transform()`, which I find unintuitive.
Thus, `PARTITION BY` preserves the long format of the data, while still allowing for the creation of aggregate columns. This can be useful behavior, though it can introduce some confusion, since we now have a column of group-level results in a data frame that is not compressed to the group level. I try to use column naming to make that clear, but it’s a little difficult.
Aggregation across common table expressions
Now let’s use `GROUP BY` and `PARTITION BY` together in a complex query. At this point I’ll stop providing `pandas` translations because I view it as less analogous – were I doing queries like this in `pandas`, I would stop to define intermediary tables and things would generally look different.
Suppose we want to know the number of accidents on each highway each day, but we also want some information about whether that number of accidents is typical for that highway, or if it’s an outlier. We can use a z-score to measure the position of an accident count within the highway’s historical distribution, but the task is complicated: we must first compute the count of accidents at the highway-date level, then compute z-scores within each highway.
In SQL, complex queries like this often involve Common Table Expressions (CTEs). I find working with CTEs more than a little difficult because it’s hard to determine what each CTE outputs. Below, I show a full query that uses both `GROUP BY` and `PARTITION BY` to complete my task. I also include a navigation bar that allows one to see what each of the three intermediary select statements outputs, thus expressing the query as a series of smaller queries and making the CTE content more transparent.
```sql
WITH grouped_table as (
    SELECT
        highway,
        trip_date,
        COUNT(*) as n_trips,
        SUM(accident_flag) as n_accidents
    FROM
        accidents
    GROUP BY
        highway,
        trip_date
),
partitions_table as (
    SELECT
        *,
        AVG(n_accidents) OVER (PARTITION BY highway) as highway_average,
        STDDEV_POP(n_accidents) OVER (PARTITION BY highway) as highway_std
    FROM grouped_table
)
SELECT
    highway,
    trip_date,
    n_trips,
    n_accidents,
    (n_accidents - highway_average) / highway_std as highway_z_score
FROM partitions_table;
```
| | highway | trip_date | n_trips | n_accidents | highway_z_score |
|---|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19.0 | 0.116248 |
| 1 | I280 | 2024-11-14 | 181 | 22.0 | 1.162476 |
| 2 | I280 | 2024-11-15 | 176 | 15.0 | -1.278724 |
| 3 | I580 | 2024-11-15 | 154 | 15.0 | 0.267261 |
| 4 | I580 | 2024-11-13 | 171 | 16.0 | 1.069045 |
| 5 | I580 | 2024-11-14 | 167 | 13.0 | -1.336306 |
We start with the same basic grouped table, where each row is a unique highway-date combination with its total trips and total accidents.
| | highway | trip_date | n_trips | n_accidents |
|---|---|---|---|---|
| 0 | I280 | 2024-11-15 | 176 | 15.0 |
| 1 | I580 | 2024-11-13 | 171 | 16.0 |
| 2 | I280 | 2024-11-14 | 181 | 22.0 |
| 3 | I580 | 2024-11-14 | 167 | 13.0 |
| 4 | I280 | 2024-11-13 | 151 | 19.0 |
| 5 | I580 | 2024-11-15 | 154 | 15.0 |
We proceed to use `PARTITION BY` statements on that grouped table, finding the average and standard deviation across dates for each highway.
| | highway | trip_date | n_trips | n_accidents | highway_average | highway_std |
|---|---|---|---|---|---|---|
| 0 | I580 | 2024-11-13 | 171 | 16.0 | 14.666667 | 1.247219 |
| 1 | I580 | 2024-11-14 | 167 | 13.0 | 14.666667 | 1.247219 |
| 2 | I580 | 2024-11-15 | 154 | 15.0 | 14.666667 | 1.247219 |
| 3 | I280 | 2024-11-15 | 176 | 15.0 | 18.666667 | 2.867442 |
| 4 | I280 | 2024-11-14 | 181 | 22.0 | 18.666667 | 2.867442 |
| 5 | I280 | 2024-11-13 | 151 | 19.0 | 18.666667 | 2.867442 |
Finally, we can write a straightforward select statement that gets our desired final columns out of `partitions_table`.
| | highway | trip_date | n_trips | n_accidents | highway_z_score |
|---|---|---|---|---|---|
| 0 | I580 | 2024-11-13 | 171 | 16.0 | 1.069045 |
| 1 | I580 | 2024-11-14 | 167 | 13.0 | -1.336306 |
| 2 | I580 | 2024-11-15 | 154 | 15.0 | 0.267261 |
| 3 | I280 | 2024-11-15 | 176 | 15.0 | -1.278724 |
| 4 | I280 | 2024-11-14 | 181 | 22.0 | 1.162476 |
| 5 | I280 | 2024-11-13 | 151 | 19.0 | 0.116248 |
Multiple aggregations in one table
Let’s say that, in addition to trips that were associated with accidents on highways, we were interested in counting cars that were associated with accidents on highways. This would give us a sense of whether some cars are involved in multiple accidents on the same highway in the same day. Thus, we want to know:
- The number of unique cars that traveled on each highway on each day, and
- The number of unique cars that were associated with an accident on each highway on each day,

alongside the trip count and the accident count, all in one table.
This implies two distinct levels of aggregation within a table, which can be complex in SQL. To start, I’ll just look at how I would count up unique cars alone. We’ll lean on the `COUNT(DISTINCT {column})` aggregation function in SQL, which counts unique values in some column over a grouping.
> When the DISTINCT clause is provided, only distinct values are considered in the computation of the aggregate. This is typically used in combination with the count aggregate to get the number of distinct elements; but it can be used together with any aggregate function in the system.
>
> – DuckDB Documentation
| | highway | trip_date | n_cars |
|---|---|---|---|
| 0 | I280 | 2024-11-14 | 83 |
| 1 | I280 | 2024-11-15 | 83 |
| 2 | I580 | 2024-11-13 | 86 |
| 3 | I580 | 2024-11-14 | 83 |
| 4 | I580 | 2024-11-15 | 78 |
| 5 | I280 | 2024-11-13 | 82 |
That’s fine and good, but we also want to know the count of those cars that were involved in accidents on each highway/day. This is a little complicated due to the structure of the data. Recall that the accident flag is defined at the trip level, not the car level.
- Date
  - Highway
    - A highway will have multiple cars that use it.
    - Car (license plate)
      - A car can make multiple trips.
      - Trip (each row is a trip)
        - Accident flag (associated with the trip)
We’ll need to move that flag up a level in the hierarchy, so that we have something like the following:

- Date
  - Highway
    - A highway will have multiple cars that use it.
    - Car (license plate)
      - Accident flag (associated with the car)
One approach to this problem is to use `GROUP BY`, grouping on highway, date, and car, then figuring out whether that group (car-highway-date) had any accident flag via `CASE WHEN SUM(accident_flag) > 0 THEN 1 ELSE 0 END`. Then, I can use `GROUP BY` again to return to the highway-date level and aggregate the number of cars and the number of cars with accidents. After that, I can do a similar grouping exercise to count up the number of trips and trips resulting in accidents, and then I can merge the two distinct tables. In other words:
1. define the car-with-accident flag
2. group by date-highway, aggregate cars
3. group by date-highway, aggregate trips
4. merge (2) and (3)
Here’s the whole process, executed.
```sql
WITH car_table as (
    SELECT
        license_plate,
        highway,
        trip_date,
        CASE WHEN SUM(accident_flag) > 0 THEN 1 ELSE 0 END AS car_w_accident
    FROM
        accidents
    GROUP BY
        license_plate,
        highway,
        trip_date
),
cars_w_accident_count as (
    SELECT
        highway,
        trip_date,
        COUNT(DISTINCT license_plate) as n_cars,
        SUM(car_w_accident) as n_cars_w_accident
    FROM
        car_table
    GROUP BY
        highway,
        trip_date
),
accident_count as (
    SELECT
        highway,
        trip_date,
        COUNT(*) as n_trips,
        SUM(accident_flag) as n_accidents
    FROM
        accidents
    GROUP BY
        highway,
        trip_date
    ORDER BY
        highway,
        trip_date
)
SELECT
    a.highway,
    a.trip_date,
    a.n_trips,
    a.n_accidents,
    b.n_cars,
    b.n_cars_w_accident
FROM
    accident_count a
    JOIN cars_w_accident_count b
        ON (a.highway = b.highway AND a.trip_date = b.trip_date);
```
| | highway | trip_date | n_trips | n_accidents | n_cars | n_cars_w_accident |
|---|---|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19.0 | 82 | 18.0 |
| 1 | I280 | 2024-11-14 | 181 | 22.0 | 83 | 20.0 |
| 2 | I280 | 2024-11-15 | 176 | 15.0 | 83 | 15.0 |
| 3 | I580 | 2024-11-13 | 171 | 16.0 | 86 | 16.0 |
| 4 | I580 | 2024-11-14 | 167 | 13.0 | 83 | 12.0 |
| 5 | I580 | 2024-11-15 | 154 | 15.0 | 78 | 15.0 |
Create an accident flag at the car/day/highway level. Note the use of `CASE WHEN`.
| | license_plate | highway | trip_date | car_w_accident |
|---|---|---|---|---|
| 0 | 90093 | I280 | 2024-11-13 | 0 |
| 1 | 90062 | I580 | 2024-11-13 | 0 |
| 2 | 90086 | I280 | 2024-11-13 | 0 |
| 3 | 90050 | I580 | 2024-11-13 | 0 |
| 4 | 90069 | I580 | 2024-11-14 | 0 |
| ... | ... | ... | ... | ... |
| 490 | 90006 | I580 | 2024-11-13 | 0 |
| 491 | 90021 | I280 | 2024-11-13 | 0 |
| 492 | 90009 | I580 | 2024-11-14 | 0 |
| 493 | 90024 | I280 | 2024-11-14 | 0 |
| 494 | 90068 | I580 | 2024-11-13 | 0 |

495 rows × 4 columns
Aggregate cars.
| | highway | trip_date | n_cars | n_cars_w_accident |
|---|---|---|---|---|
| 0 | I580 | 2024-11-13 | 86 | 16.0 |
| 1 | I280 | 2024-11-15 | 83 | 15.0 |
| 2 | I280 | 2024-11-14 | 83 | 20.0 |
| 3 | I580 | 2024-11-14 | 83 | 12.0 |
| 4 | I280 | 2024-11-13 | 82 | 18.0 |
| 5 | I580 | 2024-11-15 | 78 | 15.0 |
Aggregate trips (this is a table we’ve already created once before).
| | highway | trip_date | n_trips | n_accidents |
|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19.0 |
| 1 | I280 | 2024-11-14 | 181 | 22.0 |
| 2 | I280 | 2024-11-15 | 176 | 15.0 |
| 3 | I580 | 2024-11-13 | 171 | 16.0 |
| 4 | I580 | 2024-11-14 | 167 | 13.0 |
| 5 | I580 | 2024-11-15 | 154 | 15.0 |
Merge the trip and car aggregates.
| | highway | trip_date | n_trips | n_accidents | n_cars | n_cars_w_accident |
|---|---|---|---|---|---|---|
| 0 | I580 | 2024-11-13 | 171 | 16.0 | 86 | 16.0 |
| 1 | I280 | 2024-11-15 | 176 | 15.0 | 83 | 15.0 |
| 2 | I280 | 2024-11-14 | 181 | 22.0 | 83 | 20.0 |
| 3 | I580 | 2024-11-14 | 167 | 13.0 | 83 | 12.0 |
| 4 | I280 | 2024-11-13 | 151 | 19.0 | 82 | 18.0 |
| 5 | I580 | 2024-11-15 | 154 | 15.0 | 78 | 15.0 |
Seems great, right? Maybe not – it’s a very long and overly complicated query because it goes down two separate paths of aggregation:
- group by date-highway, aggregate cars
- group by date-highway, aggregate trips
If we can avoid entering two distinct group/aggregation paths, the merge can be eliminated, we won’t need CTEs, and this can be simplified to the following:
- group by date-highway, aggregate cars, aggregate trips
It certainly isn’t immediately obvious (to me), but we can actually exploit a combination of `CASE WHEN` and `COUNT(DISTINCT)` to accomplish this simplification:

- group by date-highway
  - aggregate cars with `CASE WHEN` and `COUNT(DISTINCT)`
  - aggregate trips with `COUNT(*)` and `SUM`
| | highway | trip_date | n_trips | n_accidents | n_cars | n_cars_w_accident |
|---|---|---|---|---|---|---|
| 0 | I280 | 2024-11-13 | 151 | 19.0 | 82 | 18 |
| 1 | I580 | 2024-11-15 | 154 | 15.0 | 78 | 15 |
| 2 | I280 | 2024-11-14 | 181 | 22.0 | 83 | 20 |
| 3 | I280 | 2024-11-15 | 176 | 15.0 | 83 | 15 |
| 4 | I580 | 2024-11-13 | 171 | 16.0 | 86 | 16 |
| 5 | I580 | 2024-11-14 | 167 | 13.0 | 83 | 12 |
This exploits the fact that `NULL` values are ignored by most aggregate functions. Thus, we create a column that is essentially the same as the accident flag, but instead of 1/0, it contains the license plate number for the car (if it was on a trip that resulted in an accident) or `NULL`. When we count distinct values for that column, grouped by highway and date, we effectively count up how many unique cars associated with accidents there were in that highway-date group.
Citation
@online{amerkhanian2024,
author = {Amerkhanian, Peter},
title = {Grouping {Problems} with {SQL}},
date = {2024-11-20},
url = {https://peter-amerkhanian.com/posts/sql-count-cases/},
langid = {en}
}