For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I’d like to share some sample code, and discuss some of the merits and benefits of each approach.

**Short Summary**

A Year-to-Date solution based on a ` SUM()` window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-

`JOIN`(combined with a

`GROUP BY`), a subquery, and a

`UNION`(also combined with a

`GROUP BY`).

Note: this is the 2nd post in a series. For sample data and setup, please see the first post in this series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

#### Using a self-`JOIN`

The recipe for the set-oriented approach can be directly translated to SQL:

selectSalesOriginal.SalesYear , SalesOriginal.SalesMonth ,max(SalesOriginal.SalesAmount)asSalesAmount ,sum(SalesYtd.SalesAmount)asSalesYtdfromSalesYearMonthasSalesOriginalinner joinSalesYearMonthasSalesYtdonSalesOriginal.SalesYear = SalesYtd.SalesYearandSalesOriginal.SalesMonth >= SalesYtd.SalesMonthgroup bySalesOriginal.SalesYear , SalesOriginal.SalesMonth

##### The self-`JOIN`

In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations. In the code sample about, the `JOIN` -clause takes care of that aspect.

As you can see, the `SalesYearMonth` table appears twice: on the left hand and on the right hand of the `JOIN`-keyword, but using different aliases: `SalesOriginal` and `SalesYtd`. It is a so-called *self-join*.

Even though both aliases refer to an instance of the same `SalesYearMonth` base table, each has a very different role. We can think of the one with the `SalesOriginal` alias as really the `SalesYearMonth` table itself. The `SalesYtd` alias refers to an instance of the `SalesYearMonth` table that, for any given row from `SalesOriginal`, represents a subset of rows that chronologically precedes the row from `SalesOriginal`.

The `ON`-clause that follows controls which combinations should be retained: for each particular row of `SalesOriginal` we only want to consider rows from `SalesYtd` from the same year, which is why the first predicate in the `ON`-clause is:

SalesOriginal.SalesYear = SalesYtd.SalesYear

Within that year, we only want to consider rows that precede it chronologically, and that explains the second predicate:

SalesOriginal.SalesMonth >= SalesYtd.SalesMonth

`GROUP BY` and `SUM()`

It is is important to realize the `JOIN` is only half of the solution.

While the `JOIN` takes care of gathering and combining all related rows necessary to compute the YTD value, the actual calculation is done by the `SUM()` function in the `SELECT`-list, and the `GROUP BY` defines which rows should be taken together to be summed.

In summary:

- the
`JOIN`generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in the`ON`-clause. - The
`GROUP BY`partitions the rows into subsets having the same combinations of values for`SalesYear`and`SalesMonth`. - The
`SUM()`aggregates the rows in each`SalesYear, SalesMonth`partition, turning its associated set of rows into one single row, while adding the values of the`SalesAmount`column together.

Note that the columns in the `GROUP BY` list are qualified by the `SalesOriginal` alias – and not `SalesYtd`. Also note that the `GROUP BY` columns form the key of the original `SalesYearMonth` table – together they uniquely identify a single row from the `SalesYearMonth` table. This is not a coincidence: it expresses precisely that `SalesOriginal` really has the role of being just itself – the `SalesYearMonth` table.

##### What about the other columns?

The `GROUP BY` affects treatment of the non-key columns as well. In this overly simple example, we had only one other column – `OriginalSales.SalesAmount`.

(Note that this is different from `YtdSales.SalesAmount`, which we aggregated using `SUM()` to calculate the YTD value)

Since `OriginalSales.SalesAmount` comes from the `SalesOriginal` instance of the `SalesYearMonth` table, we can reason that after the `GROUP BY` on the key columns `SalesYear` and `SalesMonth`, there must be exactly one `SalesAmount` value for each distinct combination of `SalesYear` and `SalesMonth`. In other words, `SalesAmount` is functionally dependent on `SalesYear` and `SalesMonth`.

Some SQL engines are smart enough to realize this and will let you refer to any expression that is functionally dependent upon the expressions in the `GROUP BY`-list in the `SELECT`-list. Unfortunately, Synapse and MS SQL Server are not among these and if we try we will get an Error:

Msg 8120, Level 16, State 1, Line 11 Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

The error message suggets we can do two things to solve it:

- either we aggregate by wrapping the
`SalesOriginal.SalesAmount`-expression into some aggregate function - or we expand the
`GROUP BY`-list and add the`SalesOriginal.SalesAmount`-expression there.

To me, neither feels quite right.

`SalesAmount` is clearly intended as a measure, and it feels weird to treat them the same as the attributes `SalesYear` and `SalesMonth`. So adding it to the `GROUP BY`-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the `SELECT`-list, where we need it no matter what, and once again in the `GROUP BY`-list, just to satisfy the SQL engine.

So, if we don’t want to put it in the `GROUP BY`-list, we are going to need to wrap it in an aggregate function. We just mentioned that `SalesAmount` is a measure and therefore that does not sound unreasonable. However, we have to be careful which one we choose.

One would normally use `SalesAmount` as an additive measure and be able to use `SUM()` for that. But here, in this context, `SUM()` is definitely the wrong choice!

All we want to do is to “get” back” whatever value we had for `SalesAmount`, in other words, unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value. The “extra” rows generated by the `JOIN` are only needed to do the YTD calculation and should not affect any of the other measures. Using `SUM()` would simply add the `SalesAmount` just as many times as there are preceding rows in the current year, which simply does not have any meaningful application.

What we want instead is to report back the original `SalesAmount` for any given `SalesYear, SalesMonth` combination. We just reasoned that there will be just one distinct `SalesOriginal.SalesAmount` value for any combination of values in `SalesOriginal.SalesYear, SalesOriginal.SalesMonth`, and it would be great if we had an aggregate function that would simply pick the `SalesOriginal.SalesAmount` value from any of those rows. To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.

We can use `MAX()` or `MIN()`, or even `AVG()`. While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.

#### Next installment: Solution 2 – using a subquery

In the next installment we will present and discuss a solution based on a subquery.

*Image by PublicDomainPictures from Pixabay *