Grouping Problems with SQL – Peter Amerkhanian

import duckdb
import pandas as pd
import numpy as np
import random
from IPython.display import display, Markdown

Lately I’ve been struggling through some SQL query development, and I thought it would be useful to write out some of what I’ve been learning in a blog post. Specifically, I’ve been thinking about how to do various grouping tasks when data have a hierarchical structure. I’ll cover the following SQL concepts:

The GROUP BY clause
The PARTITION BY clause
An aggregation task complicated by a dataset’s hierarchy

This post is also an opportunity to overview how I display and run SQL code with syntax highlighting in Quarto – a nontrivial task as it turns out!

To start, I’ll synthesize a simple dataset that examines car accidents on the Interstate 280 and 580 highways in the Bay Area. I’ll go over the specifics later, but for now just note that this is a standard pandas dataframe stored in local memory as accidents. It contains data on cars’ trips on the two highways, and a flag denoting whether the trip was associated with an accident or not.

np.random.seed(42)
n = 1000
data = {
    "license_plate": [f"{random.randint(90000, 90099)}" for _ in range(n)],
    "highway": [random.choice(["I280", "I580"]) for _ in range(n)],
    "trip_date": [f"2024-11-{random.randint(13, 15)}" for _ in range(n)],
    "accident_flag": np.random.binomial(1, .1, n),
}
accidents = pd.DataFrame(data)

Connecting to a `pandas` data frame in `duckdb`

The data I simulated are quite small, so using SQL isn’t going to be very practically useful. However, I find that it’s helpful to play with small SQL databases like this to build general comfort with the language and to get a better intuition of how operations will play out when I’m are working at scale.

I’m going to use the duckdb database system to directly query the pandas data frame I have in local memory (Note that duckdb can also directly query parquet files, csv files, and more).

duckdb.sql("SELECT * FROM accidents LIMIT 5;")

┌───────────────┬─────────┬────────────┬───────────────┐
│ license_plate │ highway │ trip_date  │ accident_flag │
│    varchar    │ varchar │  varchar   │     int32     │
├───────────────┼─────────┼────────────┼───────────────┤
│ 90092         │ I580    │ 2024-11-14 │             0 │
│ 90095         │ I580    │ 2024-11-14 │             1 │
│ 90080         │ I280    │ 2024-11-15 │             0 │
│ 90026         │ I280    │ 2024-11-15 │             0 │
│ 90093         │ I280    │ 2024-11-13 │             0 │
└───────────────┴─────────┴────────────┴───────────────┘

Executing and displaying SQL in Quarto

The fact that duckdb can directly query data in memory is great, but two issues arise – the SQL code doesn’t have any syntax highlighting, and it would be preferable if the output were a data frame rather than plain text. I’ll create a wrapper function, run_query that solves these issues:

def run_query(sql_str: str) -> pd.DataFrame:
  display(Markdown("```sql\n" + sql_str.strip("\n") + "\n```"))
  return duckdb.sql(sql_str).df()

Note the following code elements:

display(Markdown(...)) will display sql_str, the string code for the query, but with a wrapper around it as follows:

```{sql}
sql_str 
```

The wrapper alerts Quarto that it should provide SQL syntax-highlighting for the code output.

duckdb.sql(sql_str).df() returns the results of the SQL query as a pd.DataFrame object rather than the text-table rendered above.

For the rest of the post, I will use that function, within cells with the setting echo: false to carry out my SQL queries. Thus, you won’t see the python function, just a code chunk of the SQL query code and the query result. Here’s an example query, looking at the first five rows of accidents:

SELECT * FROM accidents
LIMIT 5;

	license_plate	highway	trip_date	accident_flag
0	90092	I580	2024-11-14	0
1	90095	I580	2024-11-14	1
2	90080	I280	2024-11-15	0
3	90026	I280	2024-11-15	0
4	90093	I280	2024-11-13	0

Note the hierarchy of these trip data. We start with Dates and proceed downward:

Date
- Highway
  - A highway will have multiple cars that use it.
    - Car (license plate)
      - A car can make multiple trips.
        
        Trip (each row is a trip)
        
        Associated with:
        
        Accident flag

I’ll look at a few key SQL methods using these data.

Aggregating with `GROUP BY`

Let’s say we were interested in counting total trips and total accidents for each highway on each day. This is a group aggregation task, where we want to calculate summary statistics (two counts) for subsets of the data grouped by one or more columns. In SQL this is done using the GROUP BY clause and an aggregation function (COUNT and SUM in this case). Equivalent pandas code is available via the navbar. In this case, I like the SQL and pandas syntax about equally.

SQL
pandas

SELECT
  highway,
  trip_date,
  COUNT(*) as n_trips,
  SUM(accident_flag) as n_accidents
FROM
  accidents
GROUP BY
  highway,
  trip_date
ORDER BY
  highway,
  trip_date

	highway	trip_date	n_trips	n_accidents
0	I280	2024-11-13	151	19.0
1	I280	2024-11-14	181	22.0
2	I280	2024-11-15	176	15.0
3	I580	2024-11-13	171	16.0
4	I580	2024-11-14	167	13.0
5	I580	2024-11-15	154	15.0

(
    accidents
    .groupby(['highway', 'trip_date'], as_index=False)
    .agg(
        n_trips=('trip_date', 'size'),
        n_accidents_date_highway=('accident_flag', 'sum')
    )
    .sort_values(by=['highway', 'trip_date'])
)

	highway	trip_date	n_trips	n_accidents_date_highway
0	I280	2024-11-13	151	19
1	I280	2024-11-14	181	22
2	I280	2024-11-15	176	15
3	I580	2024-11-13	171	16
4	I580	2024-11-14	167	13
5	I580	2024-11-15	154	15

Note that the GROUP BY operation reduces our data granularity down to the unique highway-date pairs. This is typical aggregation behavior. Also note in lines 4 and 5 of the SQL code, that for accidents, I use COUNT(*). This simply counts all rows within the group. In the case of the accident flags, I use SUM(accident_flag) within groups since the column is a 1/0 integer flag.

Aggregating with `PARTITION BY`

The Partition is an odd cousin of the Group that I became aware of only recently. In essence, it accomplishes the same thing as GROUP BY, but it doesn’t compress the data or otherwise change granularity. Instead, it returns a dataframe with a group’s aggregate result replicated at every row where that group exists. I can express this more clearly via example.

SQL
pandas

SELECT
    *,
    SUM(accident_flag)
        OVER (PARTITION BY highway, trip_date)
    as n_accidents_date_highway
FROM accidents
ORDER BY
  highway,
  trip_date;

	license_plate	highway	trip_date	accident_flag	n_accidents_date_highway
0	90021	I280	2024-11-13	0	19.0
1	90031	I280	2024-11-13	0	19.0
2	90019	I280	2024-11-13	0	19.0
3	90059	I280	2024-11-13	1	19.0
4	90054	I280	2024-11-13	0	19.0
...	...	...	...	...	...
995	90002	I580	2024-11-15	0	15.0
996	90000	I580	2024-11-15	0	15.0
997	90062	I580	2024-11-15	0	15.0
998	90037	I580	2024-11-15	0	15.0
999	90003	I580	2024-11-15	0	15.0

1000 rows × 5 columns

(
    accidents
    .assign(n_accidents_date_highway=
            accidents
            .groupby(['highway', 'trip_date'])['accident_flag']
            .transform('sum')
    )
    .sort_values(by=['highway', 'trip_date'])
)

	license_plate	highway	trip_date	accident_flag	n_accidents_date_highway
4	90093	I280	2024-11-13	0	19
12	90053	I280	2024-11-13	0	19
23	90086	I280	2024-11-13	0	19
27	90023	I280	2024-11-13	0	19
29	90057	I280	2024-11-13	0	19
...	...	...	...	...	...
964	90004	I580	2024-11-15	0	15
969	90070	I580	2024-11-15	0	15
983	90050	I580	2024-11-15	0	15
988	90000	I580	2024-11-15	0	15
993	90095	I580	2024-11-15	0	15

1000 rows × 5 columns

Note in the output that the number of accidents on I280 on 11/15/2024 is the same as it was in the grouped output above, but it repeats for every row from highway I280 on 11/15/2024 in the partitioned version. The pandas implementation for this task, available via the navbar, relies on the combination of groupby() and transform(), which I find unintuitive.

Thus, PARTITION BY preserves the long format of the data, while still allowing for the creation of aggregate columns. This can be useful behavior, though can introduce some confusion, since we now have a column of group-level results in a data frame that is not compressed to the group level. I try to use column naming to make that clear, but it’s a little difficult.

Aggregation across common table expressions

Now let’s use GROUP BY and PARTITION BY together in a complex query. At this point I’ll stop providing pandas translations because I view it as less analogous – were I doing queries like this in pandas, I would be stopping to define intermediary tables and things would generally look different.

Suppose we want to know the number of accidents on each highway each day, but we also want to have some information about whether that number of accidents is typical for that highway, or if it’s an outlier. We can use a z-score to measure the position of an accident count within the highway’s historical distribution, but computing both the count of accidents at the highway-date level, then computing z-scores per highway is complicated.

In SQL, complex queries like this often involve Common Table Expressions (CTEs). I find working with CTEs more than a little difficult because it’s hard to determine what each CTE outputs. Below, I show a full query that uses both GROUP BY and PARTITION BY to complete my task. I also include a navigation bar that allows one to see what each of the three intermediary select statements output, thus expressing the query as a series of smaller queries and making the CTE content more transparent.

WITH grouped_table as (
  SELECT
    highway,
    trip_date,
    COUNT(*) as n_trips,
    SUM(accident_flag) as n_accidents 
  FROM
    accidents
  GROUP BY
    highway,
    trip_date
),
          
partitions_table as (
  SELECT
    *,
    AVG(n_accidents) OVER (PARTITION BY highway) as highway_average,
    STDDEV_POP(n_accidents) OVER (PARTITION BY highway) as highway_std,
  FROM grouped_table
)

SELECT
  highway,
  trip_date,
  n_trips,
  n_accidents,
  (n_accidents - highway_average) / highway_std as highway_z_score
FROM partitions_table;

	highway	trip_date	n_trips	n_accidents	highway_z_score
0	I280	2024-11-13	151	19.0	0.116248
1	I280	2024-11-14	181	22.0	1.162476
2	I280	2024-11-15	176	15.0	-1.278724
3	I580	2024-11-15	154	15.0	0.267261
4	I580	2024-11-13	171	16.0	1.069045
5	I580	2024-11-14	167	13.0	-1.336306

We start with the same basic grouped table, where each row is a unique highway-date combination with its total trips and total accidents.

WITH grouped_table as (
  SELECT
    highway,
    trip_date,
    COUNT(*) as n_trips,
    SUM(accident_flag) as n_accidents 
  FROM
    accidents 
  GROUP BY
    highway,
    trip_date
)
SELECT * FROM grouped_table

	highway	trip_date	n_trips	n_accidents
0	I280	2024-11-15	176	15.0
1	I580	2024-11-13	171	16.0
2	I280	2024-11-14	181	22.0
3	I580	2024-11-14	167	13.0
4	I280	2024-11-13	151	19.0
5	I580	2024-11-15	154	15.0

We proceed to use PARTITION BY statements on that grouped table, finding the average and standard deviation across dates for each highway.

WITH partitions_table as (
  SELECT
    *,
    AVG(n_accidents) OVER (PARTITION BY highway) as highway_average,
    STDDEV_POP(n_accidents) OVER (PARTITION BY highway) as highway_std,
  FROM grouped_table
)
SELECT * FROM partitions_table;

	highway	trip_date	n_trips	n_accidents	highway_average	highway_std
0	I580	2024-11-13	171	16.0	14.666667	1.247219
1	I580	2024-11-14	167	13.0	14.666667	1.247219
2	I580	2024-11-15	154	15.0	14.666667	1.247219
3	I280	2024-11-15	176	15.0	18.666667	2.867442
4	I280	2024-11-14	181	22.0	18.666667	2.867442
5	I280	2024-11-13	151	19.0	18.666667	2.867442

Finally, we can write a straightforward select statement that gets our desired final columns out of partitions_table.

SELECT
  highway,
  trip_date,
  n_trips,
  n_accidents,
  (n_accidents - highway_average) / highway_std as highway_z_score
FROM partitions_table;

	highway	trip_date	n_trips	n_accidents	highway_z_score
0	I580	2024-11-13	171	16.0	1.069045
1	I580	2024-11-14	167	13.0	-1.336306
2	I580	2024-11-15	154	15.0	0.267261
3	I280	2024-11-15	176	15.0	-1.278724
4	I280	2024-11-14	181	22.0	1.162476
5	I280	2024-11-13	151	19.0	0.116248

Multiple aggregations in one table

Let’s say that, in addition to trips that were associated with accidents on highways, we were interested in counting cars that were associated with accidents on highways. This would give us a sense of if some cars are involved in multiple accidents on the same highway in the same day. Thus, we want to know:

The number of unique cars that traveled on each highway on each day
The number of unique cars that were associated with an accident on each highway on each day, alongside
The trip count and
The accident count, all in one table.

This implies two distinct levels of aggregation within a table, which can be complex in SQL. To start this, I’ll just look at how I would count up unique cars alone. We’ll lean on the COUNT(DISTINCT {column}) aggregation function in SQL, which counts unique rows in some column over a grouping.

When the DISTINCT clause is provided, only distinct values are considered in the computation of the aggregate. This is typically used in combination with the count aggregate to get the number of distinct elements; but it can be used together with any aggregate function in the system.
– DuckDB Documentation

SELECT
  highway,
  trip_date,
  COUNT(DISTINCT license_plate) AS n_cars
FROM
  accidents 
GROUP BY
  highway,
  trip_date

	highway	trip_date	n_cars
0	I280	2024-11-14	83
1	I280	2024-11-15	83
2	I580	2024-11-13	86
3	I580	2024-11-14	83
4	I580	2024-11-15	78
5	I280	2024-11-13	82

That’s fine and good, but we also want to know the count of those cars that were involved in accidents on each highway/day. This is a little complicated due to the structure of the data. Recall that the accident flag is defined at the trip level, not the car level.

Date
- Highway
  - A highway will have multiple cars that use it.
    - Car (license plate)
      - A car can make multiple trips.
        
        Trip (each row is a trip)
        
        Associated with:
        
        Accident flag

We’ll need to move that flag up a level in the hierarchy, so that we have something like the following:

Date
- Highway
  - A highway will have multiple cars that use it.
    - Car (license plate)
      - Accident flag

One approach to this problem is to use GROUP BY, with grouping on highway, date, and car, then figuring out if that group (car-highway-date) had any accident flag via CASE WHEN SUM(accident_flag) > 0 THEN 1 ELSE 0 END. Then, I can use GROUP BY again to return to the highway-date level and aggregate the number of cars and number of cars with accidents. After that, I can do a similar grouping exercise to count up the number of trips and trips resulting in accident, and then I can merge the two distinct tables. In other words:

define the car with accident flag
group by date-highway, aggregate cars
group by date-highway, aggregate trips
merge (1) and (2)

Here’s the whole process, executed.

WITH car_table as (SELECT
  license_plate,
  highway,
  trip_date,
  CASE WHEN SUM(accident_flag) > 0 THEN 1 ELSE 0 END AS car_w_accident
FROM
  accidents 
GROUP BY
  license_plate,
  highway,
  trip_date
),

cars_w_accident_count as (SELECT 
  highway,
  trip_date,
  COUNT(DISTINCT license_plate) as n_cars,
  SUM(car_w_accident) as n_cars_w_accident
FROM
  car_table
GROUP BY
  highway,
  trip_date),

accident_count as (SELECT
  highway,
  trip_date,
  COUNT(*) as n_trips,
  SUM(accident_flag) as n_accidents
FROM
  accidents 
GROUP BY
  highway,
  trip_date
ORDER BY
  highway,
  trip_date)

SELECT
a.highway,
a.trip_date,
a.n_trips,
a.n_accidents,
b.n_cars,
b.n_cars_w_accident
FROM 
  accident_count a
  JOIN cars_w_accident_count b
    ON (a.highway = b.highway AND a.trip_date = b.trip_date)

	highway	trip_date	n_trips	n_accidents	n_cars	n_cars_w_accident
0	I280	2024-11-13	151	19.0	82	18.0
1	I280	2024-11-14	181	22.0	83	20.0
2	I280	2024-11-15	176	15.0	83	15.0
3	I580	2024-11-13	171	16.0	86	16.0
4	I580	2024-11-14	167	13.0	83	12.0
5	I580	2024-11-15	154	15.0	78	15.0

Create an accident flag at the car/day/highway level. Note the use of CASE WHEN.

WITH car_table as (SELECT
  license_plate,
  highway,
  trip_date,
  CASE WHEN SUM(accident_flag) > 0 THEN 1 ELSE 0 END AS car_w_accident
FROM
  accidents 
GROUP BY
  license_plate,
  highway,
  trip_date
)
SELECT * FROM car_table

	license_plate	highway	trip_date	car_w_accident
0	90093	I280	2024-11-13	0
1	90062	I580	2024-11-13	0
2	90086	I280	2024-11-13	0
3	90050	I580	2024-11-13	0
4	90069	I580	2024-11-14	0
...	...	...	...	...
490	90006	I580	2024-11-13	0
491	90021	I280	2024-11-13	0
492	90009	I580	2024-11-14	0
493	90024	I280	2024-11-14	0
494	90068	I580	2024-11-13	0

495 rows × 4 columns

Aggregate cars.

WITH cars_w_accident_count as (SELECT 
  highway,
  trip_date,
  COUNT(DISTINCT license_plate) as n_cars,
  SUM(car_w_accident) as n_cars_w_accident
FROM
  car_table
GROUP BY
  highway,
  trip_date)
SELECT * FROM cars_w_accident_count;

	highway	trip_date	n_cars	n_cars_w_accident
0	I580	2024-11-13	86	16.0
1	I280	2024-11-15	83	15.0
2	I280	2024-11-14	83	20.0
3	I580	2024-11-14	83	12.0
4	I280	2024-11-13	82	18.0
5	I580	2024-11-15	78	15.0

Aggregate trips (this is a table we’ve already created once before).

WITH accident_count as (SELECT
  highway,
  trip_date,
  COUNT(*) as n_trips,
  SUM(accident_flag) as n_accidents 
FROM
  accidents 
GROUP BY
  highway,
  trip_date
ORDER BY
  highway,
  trip_date)
SELECT * FROM accident_count;

	highway	trip_date	n_cars	n_cars_w_accident
0	I580	2024-11-13	86	16.0
1	I280	2024-11-15	83	15.0
2	I280	2024-11-14	83	20.0
3	I580	2024-11-14	83	12.0
4	I280	2024-11-13	82	18.0
5	I580	2024-11-15	78	15.0

Merge the trip and car aggregates.

SELECT
a.highway,
a.trip_date,
a.n_accidents,
a.n_accidents,
b.n_cars,
b.n_cars_w_accident
FROM 
  accident_count a
  JOIN cars_w_accident_count b
    ON (a.highway = b.highway AND a.trip_date = b.trip_date)

	highway	trip_date	n_accidents	n_accidents_1	n_cars	n_cars_w_accident
0	I580	2024-11-13	16.0	16.0	86	16.0
1	I280	2024-11-15	15.0	15.0	83	15.0
2	I280	2024-11-14	22.0	22.0	83	20.0
3	I580	2024-11-14	13.0	13.0	83	12.0
4	I280	2024-11-13	19.0	19.0	82	18.0
5	I580	2024-11-15	15.0	15.0	78	15.0

Seems great, right? Maybe not – it’s a very long and overly complicated query because it goes down two separate paths of aggregation:

group by date-highway, aggregate cars
group by date-highway, aggregate trips

If we can avoid entering two distinct group/aggregation paths, the merge can be eliminated, we won’t need CTEs, and this can be simplified to the following:

group by date-highway, aggregate cars, aggregate trips

It certainly isn’t immediately obvious (to me), but we can actually exploit a combination of CASE WHEN and COUNT (DISTINCT) to accomplish this simplification:

group by date-highway.
- aggregate cars with CASE WHEN and COUNT (DISTINCT)
- aggregate trips with COUNT(*) and SUM

SELECT
    highway,
    trip_date,
    COUNT(*) as n_trips,
    SUM(accident_flag) as n_accidents,
    COUNT(DISTINCT license_plate) AS n_cars,
    COUNT(DISTINCT
        (CASE WHEN accident_flag = 1 THEN license_plate ELSE NULL END))
        AS n_cars_w_accident
FROM
  accidents 
GROUP BY
  highway,
  trip_date

	highway	trip_date	n_trips	n_accidents	n_cars	n_cars_w_accident
0	I280	2024-11-13	151	19.0	82	18
1	I580	2024-11-15	154	15.0	78	15
2	I280	2024-11-14	181	22.0	83	20
3	I280	2024-11-15	176	15.0	83	15
4	I580	2024-11-13	171	16.0	86	16
5	I580	2024-11-14	167	13.0	83	12

This exploits the fact that NULL values are ignored by most aggregate functions. Thus, we create a column that is essentially the same as the suspicious transaction flag, but instead of 1/0, it contains the license plate number for the car (if it was in a trip that resulted in an accident) or NULL. When we count distinct values for that column, grouped by highway and date, we then effectively count up how many unique cars associated with accidents there were in that highway-date group.

Citation

BibTeX citation:

@online{amerkhanian2024,
  author = {Amerkhanian, Peter},
  title = {Grouping {Problems} with {SQL}},
  date = {2024-11-20},
  url = {https://peter-amerkhanian.com/posts/sql-count-cases/},
  langid = {en}
}

For attribution, please cite this work as:

Amerkhanian, Peter. 2024. “Grouping Problems with SQL.” November 20, 2024. https://peter-amerkhanian.com/posts/sql-count-cases/.

Connecting to a pandas data frame in duckdb