I recently started diving into the vast world of DynamoDB. One area that I was introduced to almost immediately was modeling data for DynamoDB, and I wanted to share an experience I just had with how I modeled some data for a project. Since DynamoDB is a NoSQL database, data storage and models need to be thought of differently than in traditional SQL databases. To get the maximum benefit out of this incredibly quick and reliable data storage service, a system’s data and use cases need to be thought through and planned out. Planning data models is something I have little experience with, and that inexperience would have turned my project into a mess had I not caught myself early on.
Before I dive into too many details, I want to preface this post by saying that I know there are much smarter and more experienced people in the world on this topic. I’m looking at Alex DeBrie and Rick Houlihan. There are great resources out there to learn about best practices for DynamoDB, but this post probably is not one of them. It is just a semi-entertaining story of what could have been a large mishap. Check out Rick Houlihan’s re:Invent talks on YouTube if you want a jam-packed hour of DynamoDB.
Over the past few months I have been reading about DynamoDB, its power, and its funky data modeling best practices. The era of deduplication is over. I remember reading about using primary and secondary keys to model various types of relationships like one-to-one and one-to-many. I noticed that the primary key was normally a single ID, name, type, whatever, and then the secondary key was a cluster of various attributes separated by a hash (#). At the time (and in my memory) I could not see the pattern clearly enough while reading to make sense of the reasoning behind it all.
Not too long ago I needed to remodel data in one of my projects called Crow Authentication. Crow is authentication as a service, which means it stores user credentials, but it is also a multitenant application since I offer it to the public. Originally my partition key was a hash of the tenant’s ID and their user’s ID. This is similar to what I mentioned reading about for secondary keys in the prior paragraph. Veterans might cringe at what I did, so let me explain why that was a bad idea.
If I ever needed all records for a given tenant, then I would need to perform what is called a scan operation with a begins_with condition on my partition key. The problem with a scan is that it checks literally every single item in my table. As my table grows, the read request units consumed would increase every time I ran a scan, which would have been fairly frequent since the data access pattern that I was coding at the time would have happened on every visit by a user. Take a second to read that last sentence. “As my table grows.” Not “as my tenant’s data grows.” Over time, that scan would have cost more and more time and money without much of a way to slow it down.
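To make that cost concrete, here is a rough sketch using a toy in-memory list instead of a real table. It is not the DynamoDB API, and the `pk` attribute name, the `tenant3#user0` key format, and both helper functions are made up for illustration. It only shows that a filtered scan examines the whole table even when very few items match.

```python
# Toy in-memory model of a table; NOT the DynamoDB API.
# Assumed key layout (hypothetical): partition key = "<tenant_id>#<user_id>".

def make_table(tenants, users_per_tenant):
    """Build a list of items whose partition key combines tenant and user IDs."""
    return [
        {"pk": f"tenant{t}#user{u}"}
        for t in range(tenants)
        for u in range(users_per_tenant)
    ]

def scan_begins_with(table, prefix):
    """Mimic a scan with a begins_with filter: every item is examined."""
    examined = 0
    matches = []
    for item in table:
        examined += 1  # the item is read before the filter is applied
        if item["pk"].startswith(prefix):
            matches.append(item)
    return matches, examined

small = make_table(tenants=10, users_per_tenant=5)
large = make_table(tenants=1000, users_per_tenant=5)

for table in (small, large):
    matches, examined = scan_begins_with(table, "tenant3#")
    print(f"matched {len(matches)} items, examined {examined}")
```

The tenant has the same five users in both tables, but the number of items examined grows from 50 to 5000 along with the table.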
An alternative method of gathering multiple records in DynamoDB is a query instead of a scan. A query takes less time and fewer read request units but, in turn, requires the partition key. So a query is effectively a scan limited to a specific partition key. What this meant to me was that I needed to remodel my data around a partition key and then have any extra sorting information stored in the sort key. Does this pattern sound familiar? It should, because it is the exact same pattern I referenced in the third paragraph. I couldn’t tell you why I didn’t do this from the start, but here we are. After a few code changes the service was back up and running using a single ID for the partition key and a multipart sort key (multiple pieces of sorting information separated by a hash, #).
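As a rough illustration of the remodeled layout, the sketch below again uses a toy in-memory structure rather than DynamoDB itself. The `pk` and `sk` attribute names and the `user#<id>` sort key format are assumptions made for the example, not the actual Crow schema. Grouping items by partition key loosely mirrors why a query only has to look at one tenant's items.

```python
from collections import defaultdict

# Toy in-memory model; NOT the DynamoDB API.
# Assumed layout (hypothetical): pk = tenant ID, sk = "user#<user_id>".

def make_item(tenant_id, user_id):
    """Tenant ID alone is the partition key; the details go in the sort key."""
    return {"pk": tenant_id, "sk": f"user#{user_id}"}

# Group items by partition key, one bucket per tenant.
partitions = defaultdict(list)
for t in range(1000):
    for u in range(5):
        item = make_item(f"tenant{t}", f"user{u}")
        partitions[item["pk"]].append(item)

def query(partitions, pk):
    """Mimic a query: only the items under the given partition key are read."""
    items = partitions.get(pk, [])
    return items, len(items)

items, examined = query(partitions, "tenant3")
print(f"matched {len(items)} items, examined {examined}")
```

Here the work scales with the size of one tenant's data (five items examined) rather than with the 5000 items in the whole table.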
Here’s what I learned from all that. Partition keys should be somewhat large in scope. I now think of composite keys as a series of information going from broadest to narrowest in scope until it reaches a specific identifying attribute, which exists as the last part of the sort key. In my example, the partition key went from a broad scope (the tenant’s ID) combined with a specific identifying attribute (the user’s ID) to only the broadly scoped information. If all of an entity’s records need to be known, I can issue a query on that entity or partition key (in this example, the tenant’s ID). If a subset of that entity’s records needs to be known, I can issue another query on that entity with a begins_with condition on the sort key containing more narrowly scoped identifying information for the records I am looking for.
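The two kinds of query described above could be expressed roughly like this. The sketch only builds the request parameters for the low-level `query` call of boto3's DynamoDB client, so it runs without AWS access. The table name `crow-example`, the attribute names `pk` and `sk`, the `user#` prefix, and the helper function itself are all hypothetical.

```python
def tenant_query_params(table_name, tenant_id, sort_key_prefix=None):
    """Build keyword arguments for a low-level DynamoDB query request.

    Attribute names "pk" and "sk" are assumptions for this example.
    """
    params = {
        "TableName": table_name,
        # A query needs an equality condition on the partition key.
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "pk"},
        "ExpressionAttributeValues": {":pk": {"S": tenant_id}},
    }
    if sort_key_prefix is not None:
        # Narrow the result to sort keys that start with the given prefix.
        params["KeyConditionExpression"] += " AND begins_with(#sk, :prefix)"
        params["ExpressionAttributeNames"]["#sk"] = "sk"
        params["ExpressionAttributeValues"][":prefix"] = {"S": sort_key_prefix}
    return params

# Every record for one tenant.
all_records = tenant_query_params("crow-example", "tenant3")

# Only that tenant's records whose sort key starts with "user#".
only_users = tenant_query_params("crow-example", "tenant3", "user#")
print(only_users["KeyConditionExpression"])
```

With boto3 installed and credentials configured, either dictionary could be passed along as `boto3.client("dynamodb").query(**only_users)`.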
Of course, other considerations might need to be made around access patterns and the use of secondary indices, but I’ll leave that explaining to the pros for now.