Postgres Window Functions

What is a Window Function?

A window function is a function that in SQL that is used to perform calculations on a set of rows that are related to the current row, for example ordering by row number. These functions are particularly useful when you need to perform calculations such as getting the rank of a row compared to others or grouping and then ordering by rank.

A window function is made up of 2 parts:

A function. The function is the actual calculation that is performed on the set of rows
The window. The window is the set of rows that are related to the current row and are used to perform the calculation

Window Functions in Postgres

Postgres supports window functions in the form of OVER clauses. The OVER clause is used to specify the window that the function should be applied to.

The syntax for a window function in postgres is:

() OVER ([PARTITION BY xx, xx] [ORDER BY xx, xx])

In use it would look something like

select
    id, name, score,
    row_number() over (order by score desc) as rank
from
    scores;

This would return all the rows, and with each row a rank column which will be filled with the row index ordered by their score descending, which means the first row (highest scorer) will have rank 1, the second row, rank 2 and so on.

`RANK` vs `DENSE_RANK` vs `ROW_NUMBER`

There are 3 common window functions in postgres:

`ROW_NUMBER`

Takes each row - regardless of ties - and assigns a unique number to it, starting and 1 and ending at n. This is useful for when you need a unique value for every row.

id	score	rank
1	12	1
5	10	2
4	10	3
2	8	4
3	7	5

`RANK`

Rank takes each row like ROW_NUMBER but in the event of a tie, it will give both rows the lower rank and then skip the higher rank.

id	score	rank
1	12	1
5	10	2
4	10	2
2	8	4
3	7	5

`DENSE_RANK`

This is similar to RANK except it will not skip the next number on a tie, useful if you do not wants gaps in the rank.

id	score	rank
1	12	1
5	10	2
4	10	2
2	8	3
3	7	4

Examples

Top 3 Scorers

Given the schema below, find the top 3 scorers for both male and female. If there is a tie, treat that as one rank.

create table scores (
    id serial primary key,
    name text not null, -- for this example don't worry about duplicate handling
    gender text not null check (gender in ('m', 'f')),
    score int not null, check (score >= 0)
);

insert into scores (name, gender, score) values
    ('Alice', 'f', 10),
    ('Bob', 'm', 12),
    ('Charlie', 'm', 8),
    ('Diana', 'f', 7),
    ('Eve', 'f', 10),
    ('Frank', 'm', 10),
    ('Grace', 'f', 8),
    ('Hank', 'm', 7),
    ('Ivy', 'f', 12),
    ('Jack', 'm', 10);

The answer:

with ranked_scores as (
    select
        name, gender, score,
        dense_rank() over (partition by gender order by score desc) as rank
    from scores
)
select *
from ranked_scores
where rank <= 3;

name	gender	score	rank
Ivy	f	12	1
Alice	f	10	2
Eve	f	10	2
Grace	f	8	3
Bob	m	12	1
Jack	m	10	2
Frank	m	10	2
Charlie	m	8	3

In the answer we use a CTE, this is beacuse the where clause is executed before the select so we would get rank not defined. The way to solve this is to execute the select before the where by use of a CTE. We also have chosen to use dense_rank instead or rank or row_number, this is because we want the top 3 ranks, so we include all people whos ranks are 1, 2 or 3.

Running Total

Given the schema below find the running total of sales for each product over time.

create table sales (
    id serial primary key,
    product_id int not null,
    sale_date date not null,
    amount numeric not null
);

insert into sales (product_id, sale_date, amount) values
    (1, '2024-01-01', 100),
    (2, '2024-01-01', 200),
    (3, '2024-01-02', 150),
    (2, '2024-01-02', 250),
    (1, '2024-01-03', 200),
    (2, '2024-01-03', 300),
    (3, '2024-01-03', 250),
    (1, '2024-01-04', 300),
    (2, '2024-01-04', 400),
    (3, '2024-01-04', 350),
    (4, '2024-01-05', 200);

The answer:

select
    product_id,
    sale_date,
    amount,
    sum(amount) over (partition by product_id order by sale_date) as running_total
from sales
order by product_id, sale_date;

product_id	sale_date	amount	running_total
1	2024-01-01	100	100
1	2024-01-03	200	300
1	2024-01-04	300	600
2	2024-01-01	200	200
2	2024-01-02	250	450
2	2024-01-03	300	750
2	2024-01-04	400	1150
3	2024-01-02	150	150
3	2024-01-03	250	400
3	2024-01-04	350	750
4	2024-01-05	200	200

In this example we have used a different function to get the running total: sum. This is the right function as we want ot get the total (sum) of the current row + the rows before it. We partition it by the product_id so we can see per product the sales, and then we order it by sale_date so we can see each sale how much the running total goes up by. As we are not using a where clause there is no need to use a CTE, we can use the window function in the regular query.

Published 12/31/2024