
I had an opportunity to work on automating a Snowflake workflow last year. The scenario involved a typical Snowflake-based Data Warehouse setup, where data would be ingested from Azure Blob Storage or AWS S3 and stored in multiple databases. I’m not going to dive into intricate details, but I’d like to write down some of the approaches I wanted to take during this exercise. And yes, that makes this blog post way overdue.

The Story

The requirement was to define a couple of data warehouses and databases, each with different schemas and stages, plus a few roles and resource monitors. In other words, a typical setup on a Snowflake deployment for Data Warehousing. Given the requirements, creating these resources should be fairly straightforward, and maintaining them should also be easy. The data model can change from time to time, hence the emphasis on maintenance.

Approach 1 - Terraform

This was my first foray into automating something that resembled a database (only the Data Warehousing aspects of Snowflake were being considered in the scenario). Coming from a DevOps engineering angle, my first instinct was to go with something like Terraform. The Chan Zuckerberg Initiative’s Snowflake Terraform provider made this a lot easier, since many of the Snowflake constructs considered in the scenario were already written as Terraform Resources. As someone who had wrestled with the older Terraform provider plugin interface (I almost completed writing a provider for the WSO2 API Cloud), having this provider out of the box played a major role in deciding on this approach.

Being somewhat of a database, managing Snowflake with Terraform brings out a totally different set of problems than a typical infrastructure project. This is only natural. Most Terraform providers wrap a symmetric API provided by some kind of public Cloud Service Provider, which makes it easy to write the provider plugin code around the certainties that come with that design.

For Snowflake though, the “API” is the SQL interface that executes Snowflake SQL. This isn’t exactly what can be called symmetric, and it’s not really an API in realistic terms.

Defining the “user” interface

The first hurdle was to design the input contract, the variable structure. For a typical infrastructure project, this is almost always straightforward. The expected input is well structured, well documented, and fairly normalized around RESTful principles. The project’s main variables.tf file ends up as almost a mirror of the main infrastructure provider API. For a module that creates a VPC with a few subnets in AWS, the variables.tf file ends up with variables like availability_zones, cidr_range, private_cidrs, public_cidrs etc, which are then directly passed into the AWS provided modules or the resource calls.
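As a sketch of what that mirroring looks like (a hypothetical VPC module, with illustrative names rather than an actual module’s contract):

```hcl
# variables.tf of a hypothetical VPC module -- each variable maps
# almost one to one onto a field in the AWS API.
variable "availability_zones" {
  type        = list(string)
  description = "AZs to spread the subnets across"
}

variable "cidr_range" {
  type        = string
  description = "CIDR block for the VPC itself"
}

variable "private_cidrs" {
  type        = list(string)
  description = "CIDR blocks for the private subnets"
}

variable "public_cidrs" {
  type        = list(string)
  description = "CIDR blocks for the public subnets"
}
```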

For an open-ended SQL interface, this becomes blurrier. SQL statements could have any number of permutations for a given command. The relationships between different constructs are sometimes decoupled (ex: resource monitor vs data warehouse), but other times tightly coupled (different types of grants). Different combinations of statements could end up producing the same state.

Therefore, a stable but flexible variables.tf needed to be defined. This would need to allow for a changing number of data warehouses, databases, schemas and other resources, while not forcing the business logic to be duplicated for differences in resource counts. At the same time, it should allow different combinations of inputs to end up in the same state where possible (ex: it should be possible to define database role grants when defining a database structure, and it should also be possible to define new role grants and database attachments to those role grants later).

Terraform Variables allow complex scenarios to be defined like this through the use of list, map, and object types. With the object type, we can define a structure for a single entity (a database), and then make use of map to add some flexibility into the variable structure.

Single database object
variable "database" {
  type = object({
    comment        = string,
    retention_days = number,
    schemas = list(object({
      name    = string,
      comment = string,
      privileges = list(object({
        roles     = set(string),
        privilege = string,
      })),
    })),
    privileges = list(object({
      roles     = set(string),
      privilege = string,
    })),
  })
}
Flexible databases input
variable "databases" {
  type = map(object({
    comment        = string,
    retention_days = number,
    schemas = list(object({
      name    = string,
      comment = string,
      privileges = list(object({
        roles     = set(string),
        privilege = string,
      })),
    })),
    privileges = list(object({
      roles     = set(string),
      privilege = string,
    })),
  }))
}

This way, any number of databases can be defined with the same structure without having to mess around with the business logic of the infrastructure code. The same pattern applies to every other resource type: data warehouses, schemas, and resource monitors.

There is a small caveat here though. With the object type, you can’t really define meaningful defaults for the fields inside the object definition. A default value can be defined, but it applies to the whole object, i.e. no deep merging is done (at least at the time this code was written). No partially defined input survives the first round of parsing. A workaround is to define the default values as a separate variable and merge the two at runtime as a local. However, when the object definition is wrapped in a map, this is impossible to do, at least not without seriously degrading the readability of the code. As evident from the above codelet, this is already somewhat complex to read through.
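For the single-object case, the workaround looks roughly like this (a sketch with illustrative variable and local names, showing a shallow merge only):

```hcl
# Defaults kept as their own variable, since object fields
# can't carry per-field defaults here.
variable "database_defaults" {
  type = object({
    comment        = string,
    retention_days = number,
  })
  default = {
    comment        = "managed by terraform"
    retention_days = 1
  }
}

variable "database" {
  # Partial input: fields passed as null fall back to the defaults.
  type = object({
    comment        = string,
    retention_days = number,
  })
}

locals {
  # Field-by-field merge; nested structures like schemas would need
  # their own handling, which is where readability falls apart.
  database = {
    comment        = coalesce(var.database.comment, var.database_defaults.comment)
    retention_days = coalesce(var.database.retention_days, var.database_defaults.retention_days)
  }
}
```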

Code Structure

With Terraform, there is enough flexibility to structure the code into different layers of modules. However, the most common pattern is to modularise the reusable components and define “patterns” that compose the modules together, with the final layer containing the environment-specific details. This final layer becomes the root module that gets invoked with the terraform binary. The level of abstraction, or more accurately the depth of abstraction, could vary, but this is almost always the pattern anyone ends up with when they practice the radical notion of reusability (hint: it’s not radical).

This is exactly how I evolved the code base. There would be different scenarios where the infrastructure patterns would need to change. These patterns would compose different modules, like processing (for data warehousing) and storage (for databases and schemas). Modularisation makes it easy to compose new patterns as the need arises, which, for everyone’s mental health, should be based on customer requirements.

However, working with this model requires engineers with actual software engineering capabilities: understanding basic principles like reusability and modularity, and even why a module should be cohesive but decoupled. I’ll just leave it at that.

Maintenance, or not being able to do that

Terraform, in its bare sense, is an API wrapper. A Terraform provider’s contract is to implement the CRUD operations for a given set of resources, and only that. To do this, the provider code needs a reference point to hold on to during the various CRUD operations for a given resource. This is where the id of a Terraform state resource comes in.

A symmetric API almost guarantees resource identifiers. For AWS, this is the resource ID of each resource (ex: the instance ID for EC2 instances). A resource identifier that’s unique and persistent through changes is key to a successful Terraform provider implementation. Otherwise you have to use unreliable resource identifiers like name labels in the SetId() method call in the CRUD implementation code.

Turns out, treating SQL as an API brings out this exact issue. Resources in Snowflake do not have any unique persistent ID. Well, there could be uniqueness between the names of resources of the same type, but names are never persistent. Therefore, the following scenario breaks completely.

  1. create database db1
  2. create a schema in db1 named sch1
  3. apply changes
  4. change database name to newdb1
  5. apply breaks

This is because during step #5, a sequence of operations happens that breaks halfway through.

  1. a diff of the current state and the intended state is generated; it’s decided to rename db1 to newdb1, and by dependency, schema sch1's entry in the Terraform state will have to be updated to reflect the parent resource name change
  2. db1 gets renamed to newdb1
  3. when the API call, or in this case the SQL statement, is built to update the sch1 schema, the SQL execution fails, because by this time there is no database named db1
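In Snowflake SQL terms, the failing sequence looks roughly like this (the statement shapes are illustrative of what the provider builds, not captured output):

```sql
-- What the apply effectively executes first: the rename succeeds.
ALTER DATABASE db1 RENAME TO newdb1;

-- The schema resource's state still identifies its parent as db1,
-- so the next statement targets a database that no longer exists.
ALTER SCHEMA db1.sch1 SET COMMENT = 'updated';  -- fails: db1 not found
```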

Why did Terraform use db1 instead of the new name newdb1 when issuing the second SQL statement? It’s because the id fields of both the database resource and the parent of the schema resource were persisted with the SQL resource name, not an ID. When the name changed, i.e. behaved in a volatile way, the SQL statement broke.

So either you have to get the naming right the first time for most resources that can act as a dependency for other resources, or you should learn to live with shitty names like db1.

The root cause of this issue lies in one of Terraform’s basic design principles: keeping a state. Declarative tooling usually follows this approach, since storing the previous state is easier than computing it every time a diff is run.

We could try something like Ansible, where being declarative or imperative is a matter of the code and not the tool. However, this issue cannot be worked around like that. Even with an Ansible module, the returns and facts would have to be passed around meticulously to resolve this situation. And we would still end up with brittle code that no one can touch without bringing in the entire team for a code review.

Permissions

Unlike most Cloud or other infrastructure providers, Snowflake resources inherit the roles they were created under. I’m not sure if this design has changed since, but when executing pure SQL, a data engineer would need to switch to non-privileged roles to create resources like databases and schemas.

While this is not the usual way for Terraform, getting around it is relatively easy. Terraform already has a similar pattern, where the state backend and the actual resources being managed could live in different Cloud providers, or in different accounts of the same Cloud provider.

For this scenario, each role could be defined with its own provider config, which can then be referenced by the resource or module calls.
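A sketch of that wiring with the Snowflake provider (the role names, resource types, and attributes here illustrate the pattern and are not a tested configuration):

```hcl
# One provider configuration per Snowflake role, distinguished by alias.
provider "snowflake" {
  alias = "sysadmin"
  role  = "SYSADMIN"
}

provider "snowflake" {
  alias = "securityadmin"
  role  = "SECURITYADMIN"
}

# Databases get created under SYSADMIN...
resource "snowflake_database" "analytics" {
  provider = snowflake.sysadmin
  name     = "ANALYTICS"
}

# ...while role grants are issued under SECURITYADMIN.
resource "snowflake_role_grants" "analytics_read" {
  provider  = snowflake.securityadmin
  role_name = "ANALYTICS_READ"
  roles     = ["TRANSFORMER"]
}
```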

Pets, not cattle

Pets vs Cattle has been a central consideration in most architectural discussions these past few years. The elasticity provided by Cloud Service Providers has normalised replacing older infrastructure resources with new ones. Periodic refreshes, even without any kind of incident, are now the norm, not the exception.

Tools like Terraform are perfectly designed on top of this thinking. State diff? Delete and recreate. Someone accidentally deleted an instance? Delete and recreate.

However, a Snowflake database is not cattle by any standard. These are precious datasets that are sometimes refined, or yet to be refined, and stored again. These could be datasets important for training new models for critical business functions. These could be the results of ELT pipelines that aggregated key business-impacting intelligence. In any scenario, data is not something that’s easily reproducible.

Using Terraform to manage Snowflake resources is a bit reckless in this sense. Terraform is a bit too trigger-happy to delete resources, and an engineer tired after 6 hours of continuous meetings could easily key in yes to a Terraform diff that decided a list of databases should be deleted and recreated to reconcile state. Lifecycle hooks could act as somewhat effective gatekeeping here, but ultimately the danger would still be there, lurking.
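That gatekeeping would look something like this (a sketch; the resource name is illustrative):

```hcl
resource "snowflake_database" "analytics" {
  name = "ANALYTICS"

  lifecycle {
    # Makes terraform error out instead of destroying the database,
    # whether via a computed plan or an explicit terraform destroy.
    prevent_destroy = true
  }
}
```

The protection only holds while the resource block (and the lifecycle argument) stays in the code, which is why the danger still lurks.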

Pros

  1. Easy to get started with thanks to the third party provider and modules available
  2. Single source of truth that can be grepped, updated, and understood
  3. DevOps engineers can easily write the code and incorporate into the usual toolset
  4. Can build an elaborate hierarchy of reusable modules increasing maintainability in the long run
  5. Can integrate easily into workflow orchestration tools

Cons

  1. Data engineers cannot easily understand what’s written in Terraform
  2. Can easily break with some simple changes
  3. A corrupted state file can clear out an entire Data Lake

Approach 2 - Imperative Migrations

Snowflake has released a tool named schemachange (previously known as snowchange; they changed the name, so if you’re as confused as I was by the search results, don’t be), which is a FlywayDB-like migration tool for Snowflake. In brief, you write the changes to Snowflake deployments as SQL scripts numbered in order, just as you would write database migration scripts in any software (at least I hope every enterprise application follows some kind of migration format). schemachange executes those scripts in the numbered order. That’s it; it does nothing more.

This approach allows us to work around a few issues in the previous approach. It gives Data Engineers direct freedom to incorporate their scratch notes into pre-production code. It doesn’t need to work around resource ID issues mentioned in the Terraform approach. There is no state file to protect. And unless explicitly written, it would not delete resources as a side effect.

However this approach has its own set of issues.

Idempotency

Idempotency, or being able to end up with the same state when run multiple times, is something Terraform can boast about right from the start. It’s what it’s born to do. It’s what a declarative approach to Infrastructure-as-Code is supposed to deliver.

However, with this approach, where SQL scripts are the basis of how the resources are being created, the scripts themselves will have to be written in an idempotent way. This time, idempotency cannot be assumed from the tool, rather the business logic should contain it. This isn’t something new for the SQL migration scripts since IF NOT EXISTS is something you’ll be able to grep for in almost every code base with SQL scripts.
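An idempotent migration script in this style would look something like the following (object names are illustrative; the versioned file name format is assumed from the schemachange documentation):

```sql
-- V1.1.0__create_analytics.sql
-- Safe to run repeatedly: every statement is a no-op
-- when the object already exists.
CREATE DATABASE IF NOT EXISTS analytics;

CREATE SCHEMA IF NOT EXISTS analytics.raw;

CREATE WAREHOUSE IF NOT EXISTS transform_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60;
```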

Tracking changes

With Terraform, you’d have one version of the code to check at a given point in time. For a given database resource, you’ll be able to track the changes made in the Git Log. The entire list of resources being managed by Terraform will be in one place (multiple files and directories, yeah, but all at the same temporal point, now).

With migration scripts, to understand why certain Snowflake resources are the way they are, you have to wade through every migration script written going backwards from the current state. This gets more complicated when standards are not followed and the same migration script is used to add new resources.

Batteries not included

schemachange takes care of migration script execution, but defining an input contract for changes is almost impossible. With Terraform, this is a matter of building better UX to bridge the gap between the variables structure and human interaction. However, with schemachange using SQL directly, the only direct improvement that can be made is to build a UI that allows users to write SQL on it. That SQL then has to be parsed for validation and stored properly. With Terraform, validation is built into variables, with a strict schema defined for each variable as described above.
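On top of the type schema, Terraform also supports custom validation rules on variables, the kind of input checking schemachange would need a whole SQL parsing layer to replicate (a sketch with an illustrative rule):

```hcl
variable "retention_days" {
  type        = number
  description = "Time Travel retention for the database"

  validation {
    # Snowflake caps data retention at 90 days, so reject anything outside that.
    condition     = var.retention_days >= 0 && var.retention_days <= 90
    error_message = "retention_days must be between 0 and 90."
  }
}
```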

Reusability

With the migration script approach, there is a certain flexibility provided by the use of SQL. A partially templated set of SQL scripts could avoid the boilerplate SQL needed to build the Snowflake resources.

However, the modularity and reusability of this code has to be built into the file structure, and at most an ad-hoc compiler tool. There is no inherent modularity to make use of here, in contrast to Terraform, where modularity is one of the key design considerations.

Pros

  1. Data engineers can directly write provisioning automation
  2. No danger of data being accidentally deleted
  3. Not easily broken if sanitary code is being written

Cons

  1. Idempotency isn’t a given
  2. Logic spread across different migration scripts
  3. Reusability of the code isn’t built in
  4. Tooling has to be built on top of it to be user friendly
  5. Validation has to be done explicitly, not built into the tool

Approach 3 - Excel sheets

Wait. Take a deep breath, and think twice before you suggest this at a meeting. An Excel sheet is not a tool you should use in production automation. If this is your idea of automation and user input, stop, and let the people who know what’s actually involved do their job.

Having had to hear this being suggested in real life and, worse, seeing dumbass product managers who have no idea about their product celebrating the idea like the next disruption in the industry, and even worse, seeing engineers having to spend their time converting Excel CSV to Terraform .tfvars format, I’ve made it a point to staunchly oppose ideas like this. If you think that’s a bit too harsh, you have better patience than me.

Please do not be deluded into thinking Microsoft Office products are DevOps automation tools.

Conclusion

Some details might have changed since I worked with these tools. snowchange is now schemachange. The Terraform provider for Snowflake seems to have improved, and Snowflake itself has moved into new areas. In any case, I hope the above learnings help you if you find yourself having to automate Snowflake operations.