Thomas Step

← Blog

I recently launched a Slack app to help with channel bloat. Simply installing it would help me out a bunch. I have 1/10 installations required to submit my app to the Slack Marketplace. To say thank you for helping me, I am giving you a free 6 month trial.

re:Invent: Inside A Working Serverless SaaS Reference Solution

This is an overview of a session that I went to during re:Invent 2021. I start by providing the notes I took during the session, and then I will give my take and comments if I have any at the end.

Wednesday 9:15

ARC405

Tod Golding from the AWS SaaS factory team

He has given talks and written content about various SaaS architecture concepts
EKS reference architecture and serverless reference architecture; this one is serverless
Will go over what the moving parts are and the design decisions
It is a 400 level - so there will be implementation details

The fit between serverless and SaaS

Agility
Cost optimization
Operational efficiency
Smaller blast radius
- Deployment and operational footprint is smaller which helps limit the blast radius
Don’t have to worry about forecasting consumption
- Consumed compute infrastructure follows tenant consumption

Key architectural considerations

Design decisions for uniquely serverless architectures
Which deployment models will you support? Tenants in silos with their own infrastructure, or use shared/pooled infrastructure
- Influences authentication, authorization, and identity
How will you automate tier-based deployments?
- Basic vs advanced tenant
- How to deploy all that for a tenant
How do you manage noisy neighbor conditions?
- Serverless stack gives a unique set of tools
How will you isolate tenant resources?
- Don’t let one tenant access another’s resources
How will you support tiered tenant experiences?

High-level architecture

Control plane
- They are dividing SaaS architectures into two planes
- Control is services and capabilities
- Registration, tenant management, user management, automated onboarding, identity, billing, metrics, analytics
- Nothing multi-tenant in the control plane - it manages the multi-tenant environment
Application plane
- SPA hosted in S3, goes through API Gateway to access microservices on the backend
- A lot of the multi-tenancy is in the API Gateway
- The way the microservices are built and deployed has specifics to multi-tenancy
- The microservices are a group of Lambdas - REST API with multiple Lambdas behind it

Provisioning the control plane - this is in the reference architecture code

There is a bootstrap template that creates the following
- Tables that support multi-tenancy
- User pools (AWS Cognito)
- Microservices for control plane (reference architecture has a smaller set than what an actual SaaS would have)
- API Gateway - routes, permissions, etc.
- Web applications - SaaS app and admin app and landing page; this is S3 and CloudFront
- Custom resources for CloudFormation

Two distinct tenant deployment models (silos vs pools)

They tell people to build a pool architecture and then whenever someone comes along and wants a silo, spin it out
- Silos are for higher-end customers (he calls them “platinum”)
Platinum tenants get their own application plane stack and Cognito user pool
Basic tenants get onboarded into the pooled app plane stack

Registering new tenants (back into the control plane)

One experience is for a tenant to hit the landing page, register, then get access to the SaaS app
Another would be to have an internal process that isn’t self-service from the landing page
Request comes into the registration microservice which provisions tenant and configures settings (silo or pool)
- Registration service hits user management service which creates user pool (silo) or user pool group (pool) and creates a tenant admin user
- Registration service then hits tenant management service which creates tenant and stores their config
- Registration service then hits tenant provisioning service which creates stack/infrastructure for siloed tenants

Tier-driven onboarding

Even with Lambdas, we can still have noisy neighbor issues due to concurrency limits
Isolation experience for siloed tenants is also slightly different than pool tenants
Siloed environments are clones of the pool environment, not just infrastructure but also code
- Allowing variations is a step backward and away from the SaaS model

Inside tenant provisioning

The bootstrap template has a custom CloudFormation resource that adds an entry to a DDB table called “tenant creation pipeline table” which is picked up by a CodePipeline called “tenant provisioning pipeline” which creates/provisions the pool infrastructure
Whenever a siloed tenant onboards, a new entry is added to the “tenant creation pipeline table” and the infrastructure is provisioned for them
The tenant creation template sets up DDB tables, Lambdas, API Gateways, and provisions Cognito pools

Mapping the deployments

Need to know which tenants are using which deployment model (silo vs pool)
Uses a “tenant stack mapping” DDB table
Seems cumbersome to me
There is currently no way to deploy blue/green but they are playing with it

Now looking at a little bit of code

Applying tiers and configuring tenants

Setup API keys (this will come into play later in the code and my notes section “Tier based throttling policies”)
Not much else on this slide

Triggering the tenant provisioning

Putting an item into the tenant stack mapping table
Then triggering the CodePipeline that provisions infrastructure
^^that’s all in the REST API that registers users

Managing tenants and users

Tenants and users are managed in the admin console
The SaaS application also gets access to the user management service

Creating the user and tenant

User gets custom claims and is mapped back to the tenant by the tenant ID
Tenant gets info and then puts it into the DDB table

Authentication and authorization

Sign in through Cognito
Lambda authorizer attached to API Gateway
Tenant context gets injected and isolation is applied using that context

Tenant routing

Knowing which stack to point someone at
When a user is logging in:
- Get user pool mapping and then API Gateway URL which is stored in the tenant management DDB table
- Then route the user to the correct user pool for authentication
- JWT from authentication/Cognito is then used to authenticate with API Gateway

Authorization with the two isolation models

Siloed model is easier to isolate because the Lambdas can be scoped to a specific pool
- Lambda is allowed to hit certain tenant tables so we don’t need to worry about making sure it is valid
Pool model needs to resolve isolation policy in Lambda at runtimes
- Lambda authorizer dynamically allows certain DDB tables through the IAM policy returned (So does each tenant get its own DDB table in the provisioning process? He never actually answered this.)
- Would have also been valid to auth in the Lambda business logic level instead of the gateway level but you wouldn’t get the caching benefit of Lambda authorizer
  - Best approach probably depends on the complexity of what users would access

Pool based partition with DDB

Partition key is used as a shard ID
Tenant ID is used to build the shard ID
So the primary key would look something like <tenant ID>-<another ID non-unique between tenants>
Multiple tenants’ data in the same table

Inside application microservices

They use Lambda layers to handle tenant awareness so that microservice can handle business logic only then rely on the layer to handle any tenant-specific stuff
- log_with_tenant_context passes in the event and log message, then layer handles the intricacies
They use tenant IDs as XRay annotations

Tier-based throttling policies

Tenant doesn’t know anything about the API key (see “Applying tiers and configuring tenants” section)
The Lambda authorizer gets the API key from the tenant config table (was stored during provisioning/onboarding)
Authorizer policy then applies API key to a usage plan
I didn’t know that a Lambda authorizer could dynamically apply an API key and usage plan
- “API key can originate from Lambda authorizer”

Takeaways

Serverless and SaaS are a good fit
Find the approach that best addresses your isolation and tiering needs
- Being all pooled all the time is a completely valid option
Embed all provisioning requirements in your onboarding automation
- Create siloed infrastructure auto if needed
Use API keys and usage plans to apply tier-based throttling
Lambda layers to provide shared multi-tenant constructs
Use the serverless model to reduce the operational and development complexity of your SaaS environments
Use the serverless SaaS reference solution to jumpstart your development

My notes:

The approach taken to create infrastructure (pooled or siloed) in the “Inside tenant provisioning” section was very interesting to me. First, I like (and always have liked) the idea of having infrastructure that can be replicated on-demand as many times as requested. Creating IaC in this way allows for anyone to come through and create their own stack and version without messing with anyone else or an “official” environment that is relied on by external users. Of course, this also lends itself perfectly to this application of replicating the production environment for siloed users. Second, provisioning multiple stacks using CodePipeline outside of a traditional CI/CD process involving code lifecycle-based webhooks sounded genius to me. I do not know why I had not thought about doing something like that before to dynamically create infrastructure stacks.

Lambda layers are something that I have thought about using before, but I ultimately did not use them because I was unsure of the proper use cases. After hearing about them in several sessions including this one, Lambda layers would have been the perfect fit for what I originally wanted to use them for: shared code amongst different Lambdas. I think a good application for layers would be crow-api where I currently copy code from the shared/ directory to each of the Lambda’s code directories, but that would be much easier to accomplish by creating a layer and then adding that layer to each Lambda.

My curiosity was piqued when he started talking about data modeling for a multi-tenant application. I believe that I have done something similar to the model that he was discussing in the past, which is reassuring and makes me feel good. However, I might try to dig through his code a little more to make sure that I correctly understood his approach. From where I am coming from a good way to handle multiple tenants is to simply prepend the partition key with a unique property of the tenant (email, ID, etc.) which then forces the tenant to only be able to query against partitions explicitly with their ID.

Categories: aws