For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I’d like to share some sample code, and discuss some of the merits and benefits of each approach.

**Short Summary**

A Year-to-Date solution based on a ` SUM()` window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-

`JOIN`(combined with a

`GROUP BY`), a subquery, and a

`UNION`(also combined with a

`GROUP BY`).

Note: this is the 5th post in a series.

- For sample data and setup, please see the 1st post in this series.
- For a solution based on a self-
`JOIN`and`GROUP BY`, please find the 2nd post in this series. - For a solution based on a subquery, please find the 3rd post in this series.
- For a solution based on a
`UNION`, please find the 4th post in this series.

(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

#### Using window functions

Nowadays, many SQL engines and virtually all major RDBMSes support window functions (sometimes called analytic functions). A window function looks like a classic aggregate function. In some respects it also behaves like one, but at the same time there are essential differences.

##### Aggregate functions

Consider the following example:

selectsum(SalesAmount)asSumOfSalesAmount ,count(*)asRowCountfromSalesYearMonth

The example uses two aggregate functions, `SUM()` and `COUNT()`. It returns a result like this:

SumOfSalesAmount | RowCount |
---|---|

109,846,381.43 | 38 |

Two things are happening here:

- Even though there are multiple rows in the
`SalesYearMonth`table, the result consists of just one row. In other words, a collection of source rows have been*aggregated*into fewer (in this case, only one) result row. - The functions have caclculated a value based on some aspect of the individual rows in the source collection. In the case of
`SUM(SalesAmount)`, the value of the`SalesAmount`column of each individual row was added to obtain a total. In the case of`COUNT(*)`, each row was counted, adding up to the total number of rows.

Because the previous example uses aggregate functions, we cannot also select any non-aggregated columns. For example, while `SalesYear and SalesMonth are present in the individual underlying rows, we cannot simpy select them, because they do not exist in the result row, which is an aggregate.`

##### Window functions

Now, `SUM()` and `COUNT()` also exist as window functions. Consider the following query:

selectSalesYear , SalesMonth , SalesAmount ,sum(SalesAmount)over()asTotalOfSalesAmount ,count(*)over()asRowCountfromSalesYearMonth

You might notice the last two expressions in the ` SELECT`-list look almost identical to the aggregate functions in the previous example. The difference is that in this query, the function call is followed by an

`-clause. Syntactically, this is what distinguishes ordinary aggregate functions from window functions.`

**OVER()**Here is its result:

SalesYear | SalesMonth | SalesAmount | TotalOfSalesAmount | RowCount |
---|---|---|---|---|

2011 | 5 | 503,805.92 | 109,846,381.43 | 38 |

2011 | 6 | 458,910.82 | 109,846,381.43 | 38 |

…more rows… | ||||

2014 | 6 | 49,005.84 | 109,846,381.43 | 38 |

Note that we now get all the rows from the underlying `SalesYearMonth` table: no aggregation has ocurred. But the window functions do return a result that is identical to the one we got when using them as aggregate functions, and that for each row of the `SalesYearMonth` table.

It’s as if for each row of the underlying table, the respective aggregate function was called over all rows in the entire table. Conceptually this is quite like the construct we used in the subquery-solution. The following example illustrates this:

selectSalesYear , SalesMonth , SalesAmount , (selectsum(SalesAmount)fromSalesYearMonth )asTotalOfSalesAmount , (selectcount(*)fromSalesYearMonth )asRowCountfromSalesYearMonth

#### The *window* and the **OVER()**-clause

**OVER()**

Thinking about window functions as a shorthand for a subquery helps to understand how they work and also explains their name: a window function returns the result of an aggregate function on a particular subset of the rows in the query scope. This subset is called the *window* and it is defined by the ` OVER()`-clause.

The parenthesis after the ` OVER`-keyword can be used to define which rows will be considered as window. When left empty (like in the example above) all rows are considered.

##### Controlling the *window* using the **PARTITION BY**-clause

**PARTITION BY**

If you compare the previous example with our prior subquery-solution, you’ll notice that here, we do not have a ` WHERE`-clause to tie the subquery to the current row of the outer query. That’s why our result is calculated over the entire table, rather than with respect to the current year and preceding months, as in our prior subquery-solution. This is equivalent to the empty parenthesis following the

`-keyword in the corresponding window-function example.`

**OVER**In the subquery-solution we wrote a ` WHERE`-clause to specify a condition to tie the rows of the subquery to the current row. For window functions, we can control which rows make up the window window by writing a

`-clause inside the parenthesis following the`

**PARTITION BY**`-keyword.`

**OVER**The ` PARTITION BY`-clause does not let you specify an arbitrary condition, like we could in a subquery. Instead, the relationship between the current row and rows in the window must be expressed through one or more attributes for which they share a common value. The following example may illustrate this:

selectSalesYear , SalesMonth , SalesAmount ,sum(SalesAmount)over(partition bySalesYear)asYearTotalOfSalesAmountfromSalesYearMonth

In the example above, ` sum(SalesAmount) over(partition by SalesYear)` means: calculate the total of

`SalesAmount`over all rows where the value of

`SalesYear`is equal to the value of the

`SalesYear`in the current row.

The equivalent query using subqueries would be:

selectOriginalSales.SalesYear , OriginalSales.SalesMonth , OriginalSales.SalesAmount , (selectsum(YearSales.SalesAmount)fromSalesYearMonthasYearSaleswhereYearSales.SalesYear = OriginalSales.SalesYear )asYearTotalOfSalesAmountfromSalesYearMonthasOriginalSales

The result is shown below:

SalesYear | SalesMonth | SalesAmount | YearTotalOfSalesAmount | |
---|---|---|---|---|

2011 | 5 | 503,805.92 | 12,641,672.21 | |

2011 | 6 | 458,910.82 | 12,641,672.21 | |

…more rows… | ||||

2014 | 6 | 49,005.84 | 20,057,928.81 |

`(Note that ``12,641,672.21` is the sum of the `SalesAmount` for `SalesYear 2011`; `20,057,928.81` is the total for `2014`.)

##### A partition for the preceding months?

It’s great that the ` PARTITION BY`-clause allows us to specify a window for relevant year, but it’s still too wide: we want the window to contain only the rows from the current year, but only for this month and its preceding months. In the subquery-solution this was easy, as we could write whatever condition we want in the

`-clause. So we wrote:`

**WHERE**whereSalesYtd.SalesYear = SalesOriginal.SalesYearandSalesYtd.SalesMonth <= SalesOriginal.SalesMonth

Specifying `SalesYear` in the window functions’ ` PARTITION BY`-clause is equivalent to the first part of the subquery’s

`-clause condition.`

**WHERE**It’s less clear what our partition expression should look like to select all months preceding the current month. It’s not impossible though. For example, we can write an expression to mark whether the current `SalesMonth` is equal to or less than a *specific* month. For example:

-- every month up to and including june is 1, all months beyond june is 0casewhenSalesMonth <=6then1else0end

If we can write such an expression, then of course, we can also use it in a ` PARTITION BY`-clause, like so:

sum(SalesAmount)over(partition bySalesYear ,casewhenSalesMonth <=6then1else0end)

Let’s try and think what this brings us.

Suppose the value for `SalesMonth` is 6 (june), or less? The **CASE** expression would return 1, and the window function would take all rows into account for which this is the case. So january, february, march and so on, up to june would all get the total of those six months – that is, the YTD value for june.

On the other hand, if `SalesMonth` is larger than 6, the **CASE** expression evaluates to 0. So all months beyond june (that is: july, august, and so on up to december) form a partition as well, and for those months, whatever is the sum over those months would be returned.

Now, it’s not really clear what the outcome means in case the month is beyond june. But it doesn’t really matter – what is important, is that we now know how to calculate the correct YTD value for a given month. And, what we did for june, we can do for any other month. So, once we have the YTD expressions for each individual month, we can set up yet another **CASE**-expression to pick the right one according to the current `SalesMonth`.

Putting all that together, we get:

selectSalesYear , SalesMonth , SalesAmount ,caseSalesMonth -- januarywhen1thenSalesAmount -- februarywhen2thensum(SalesAmount)over(partition bySalesYear ,case whenSalesMonth <=2then1else0end) ...more cases for the other months... -- decemberwhen12thensum(SalesAmount)over(partition bySalesYear ,case whenSalesMonth <=12then1else0end)end asYtDOfSalesAmountfromSalesYearMonth

Like with the ` UNION`-solution, we are taking advantage of our knowledge of the calendar, which allows us to create these static expressions. We would not be able to do this in a general case, or where the number of distinct values is very large. But for 12 months, we can manage.

While it’s nice to know that this is possible, there is a much, much nicer way to achieve the same effect – the frame specification.

##### Frame Specification

The frame specification lets you specify a subset of rows *within* the partition. The way you can specify the frame feels a bit odd (to me at least), as it is specified in terms of the current row’s position in the window. Hopefully the following example will make this more clear:

selectSalesYear , SalesMonth , SalesAmount ,sum(SalesAmount)over(partition bySalesYearorder bySalesMonthrows between unbounded precedingand current row)asSalesYtdfromSalesYearMonth

We already discussed the ` PARTITION BY`-clause, all the clause after that are new.

The ` ORDER BY`-clause sorts the rows within the window, in this case by

`SalesMonth`. We need to rows to be ordered because of how the frame specification works: it lets you pick rows by position, relative to the current row. The position of the rows is undetermined unless we sort them explicitly, so if we want to pick rows reliably we need the

`-clause to guarantee the order.`

**ORDER BY**The frame specification follows the ` ORDER BY`-clause. There are a number of possible options here, but I will only discuss the one in the example. In this case, it almost explains itself: we want to use the current row, and all rows that precede it. Since we ordered by

`SalesMonth`, this means all the rows that chronologically precede it. As this selection applies to the current partition, we will only encounter months here that are within the current year.

So here we have it: a YTD calculation implemented using a window functions. It’s about the same amount of code as compared to the subquery solution, but more delcarative, as we do not need to specify the details of a condition. On the other hand, it is also less flexible than a subquery, but in general one should expect the window functions to perform better than the equivalent subquery.

#### Generalizing the solutions

So far all our examples were based on the `SalesYearMonth` table, which provides `SalesYear` and `SalesMonth` as separate columns. One might wonder what would it would take to apply these various methods to a realistic use case.

For example, it is likely that in a real dataset, the time would be available as a single column of a `DATE` or `DATETIME` data type. A single date column potentially affects the YTD calculation in two ways:

- Year: As the YTD is calculated over a period of a year and almost all solutions we described used the
`SalesYear`column explicitly to implement that logic. - Preceding rows: To calculate the YTD for a specific row, there has to be a clear definition of what rows are in the same year, but which precede it. In our examples we could use the
`SalesMonth`column for that, but this might be a but different in a realistic case. - Lowest Granularity: The lowest granularity of the
`SalesMonthYear`table is at the month level, and we collected the YTD values at that level. (If we’d want to be precise we’d have to call that year-to-month).

Apart from the time aspect, the definition of the key affects all solutions that generate “extra” rows and require a ` GROUP BY` to re-aggregate to the original granularity.

##### The Year

The ` ON`-condition of the

`-solution and the`

**JOIN**`-condition of the subquery-solution both rely on a condition that finds other rows in the same year, and the window function-solution uses the year in its`

**WHERE**`-clause.`

**PARTITION BY**It is usually quite simple to extract the year from a date, date/time or timestamp. In Synapse Analytics or MS SQL one can use the ` DATEPART` or

`function to do this.`

**YEAR**The ` UNION`-solution has no direct dependency on the year.

##### The preceding rows

The need to find the preceding rows applies to all solutions that use the year to find the rows to apply the YTD calculation on. In our samples, this could all be solved using the `SalesMonth` column.

Again, it are the ` JOIN`-solution and subquery-solution that used it in their condition, whereas the window function-solution uses it in its

`-clause.`

**ORDER BY**In this case, the fix is more straighforward then with the year: instead of the month column, these solutions can simply use the date or date/time column directly. No conversion or datepart extraction is required.

##### Lowest granularity

The granularity is of special concern to the ` UNION`-solution. The solution relies on an exhaustive and static enumeration of all possible future dates within the year. Already at the month level, this already required a lot of manual code.

Below the month, the next level would be day. While it would in theory be possible to extend the solution to that level, it is already bordering the impractible at the month level.

##### The Key

The key definition affects both the ` JOIN`–solution and the

`-solution, as that both require a`

**UNION**`GROUP BY`over the key.

*Image by James Henry from Pixabay *