Making full use of Azure Synapse capabilities

This blog is part 3 of a series. For the introduction post, click HERE.

When implementing a global e-commerce and digital marketing data warehouse, several factors influence the choice of technology. Cost is naturally a component, but we were also looking for a solution with a scalable, pay-as-you-go architecture, options to load, process and query big data, and connectivity to semi-structured data sources. Hence our choice of Azure Synapse, formerly known as Azure SQL Data Warehouse. What is it? Azure Synapse is essentially a pool of Data Warehouse Units (DWUs). Based on the scale you choose, a set of resources runs in parallel, which you can query with T-SQL. One control node manages multiple compute nodes, so if you store and query data wisely across those nodes, you can use that parallel architecture to process far more data at once than you could with a traditional SQL Server. The language used to define, manipulate and query data may look similar to your average MS SQL Server, but there are some differences. We point out a few things we learned below.

First of all, whereas on a SQL Server you would be busy with table partitioning and indexing, in a SQL pool you need to spend time understanding table distribution and statistics creation. Say your pool spreads a table over 60 distributions. If you create the table without specifying a distribution, its records are spread randomly across those 60. This is great for data skew: the weight of the data is spread evenly. However, if you choose a smart column to hash-distribute on, for instance one that is often used to join or group data (such as a customer ID), records with the same hash key will be co-located on the same node, so the join can be executed within that node, minimizing data movement. For some excellent information on this, click here. Hash-distributing a table only makes sense above a certain volume, though: if it remains small, e.g. country master data, it may be better to simply replicate the complete table to every node for the best query performance. On top of this, creating statistics with a full scan on important columns helps to boost query performance considerably.
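As a sketch of the three approaches (table and column names here are hypothetical, not from our actual warehouse):

```sql
-- Hash-distribute a large fact table on a common join key,
-- so rows with the same customer_id land on the same distribution.
CREATE TABLE dbo.FactSales
(
    sale_id     BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    amount      DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(customer_id));

-- Replicate a small master-data table to every compute node,
-- so joins against it never require data movement.
CREATE TABLE dbo.DimCountry
(
    country_code CHAR(2) NOT NULL,
    country_name NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);

-- Create full-scan statistics on a column that is frequently
-- joined or filtered on, to help the optimizer estimate correctly.
CREATE STATISTICS stat_customer_id
ON dbo.FactSales (customer_id) WITH FULLSCAN;
```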

Secondly, there is a difference between traditional databases and these scalable cloud data warehouses that shows up across the board: no primary key enforcement. We noticed it last year when evaluating Snowflake, and now again in Synapse. You can create a table with a primary key and subsequently load data into it that violates that key. We suspect these cloud services made this choice because the compute cost of enforcing primary keys would be too high, degrading data manipulation performance. The query optimizer, however, does assume the data in the table adheres to the primary key. So if your data is not compliant, you will see very interesting results when doing (distinct) counts on the key, for instance!
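To illustrate with a hypothetical table: in a Synapse dedicated SQL pool, a primary key must in fact be declared NOT ENFORCED, which makes the lack of enforcement explicit in the syntax.

```sql
-- The engine requires NONCLUSTERED NOT ENFORCED on primary keys;
-- it will not reject rows that duplicate the key.
CREATE TABLE dbo.DimCustomer
(
    customer_id   INT NOT NULL,
    customer_name NVARCHAR(200),
    PRIMARY KEY NONCLUSTERED (customer_id) NOT ENFORCED
);

-- Both inserts succeed, even though they duplicate the key.
INSERT INTO dbo.DimCustomer VALUES (1, 'Alice');
INSERT INTO dbo.DimCustomer VALUES (1, 'Alice again');

-- The optimizer may assume customer_id is unique, so results
-- on non-compliant data can be misleading.
SELECT COUNT(DISTINCT customer_id) FROM dbo.DimCustomer;
```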

Another feature we really like in SQL Server is the 'Display Actual Execution Plan' option. In Synapse you can only view the estimated plan. Combined with the massively parallel architecture, this makes it a bit harder to find out why your query runs slow if you are new to Synapse: you have to resort to system views and procedures to discover that, in the end, your query is slow because of large data movements between nodes, for instance.
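A minimal sketch of that diagnostic workflow, using the EXPLAIN statement and the PDW dynamic management views (the query and request ID below are hypothetical):

```sql
-- EXPLAIN returns the estimated plan as XML, including any
-- data movement (shuffle/broadcast) steps.
EXPLAIN
SELECT c.country_name, SUM(f.amount)
FROM dbo.FactSales f
JOIN dbo.DimCountry c ON c.country_code = f.country_code
GROUP BY c.country_name;

-- Find the slowest recent requests on the pool.
SELECT TOP 10 request_id, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

-- Drill into one request's steps to spot expensive
-- data-movement operations between nodes.
SELECT step_index, operation_type, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'  -- hypothetical request id
ORDER BY step_index;
```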

Finally, quite a lot of the T-SQL you are used to is not supported, but this usually comes with well-documented workarounds. For example, sometimes you want to perform an update with a join; Synapse doesn't support that in the FROM clause. Instead, you create a new table as select (CTAS) from the old one and then replace the old one. Similarly, MERGE statements can be replaced with a CTAS. And, as you might have guessed from the previous paragraph, foreign keys cannot be defined on tables, nor can sequences, for which you can often use identity columns instead.
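The CTAS-and-swap pattern for an update-with-join might look like this (table names are hypothetical; assume dbo.SalesUpdates holds the new amounts):

```sql
-- Rebuild the table with the updates applied via a join,
-- keeping the same distribution as the original.
CREATE TABLE dbo.FactSales_new
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT f.sale_id,
       f.customer_id,
       COALESCE(u.amount, f.amount) AS amount  -- take the updated value where one exists
FROM dbo.FactSales f
LEFT JOIN dbo.SalesUpdates u
       ON u.sale_id = f.sale_id;

-- Swap the new table in place of the old one.
RENAME OBJECT dbo.FactSales     TO FactSales_old;
RENAME OBJECT dbo.FactSales_new TO FactSales;
DROP TABLE dbo.FactSales_old;
```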

In the past few months we have gained quite some experience working with Synapse, and we are really excited about it. We learned a lot more than described above and would love to share more experiences with you. If your organization is thinking about implementing Azure Synapse, please do reach out if you need help!

Author: Paul Schoondermark

Image by Colin Behrens from Pixabay 
