I understand the concept of hot keys in DynamoDB. Suppose a video game database uses country_code as the partition key and player_id as the sort key. When everyone in one country plays in the early evening of that country's timezone, reads and writes pile up on that country_code's partition while the other partitions sit mostly idle. The result is latency that could have been avoided by using player_id as the partition key.

However, my use case is not a distributed application. I am creating a repository for the results of a data scraping script. Updates to the table will only ever come from a single delivery source (an AWS Lambda function) looping through JSON elements. Yes, the updates will be 'hot' in the sense that I'll be writing to a single country_code while every other country's data goes untouched, but because it is an iterative process run by exactly one delivery script, multiple operations will never occur at once.

A reasonable question is why I would do this at all, given the extra thought it requires: why not just use the actual unique ID as the partition key? The data will mostly be read in infrequent machine learning analyses. A Query over the entire sort key range (the unique ID) for a given partition key (the non-unique country_code) might let me kick off that process without scanning the whole table and filtering by country_code afterwards.
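For illustration, the access pattern described above (one Query per country_code instead of a full Scan plus filter) might look roughly like the sketch below. It assumes a boto3 low-level DynamoDB client is passed in, a hypothetical table name such as `scrape_results`, and a string-typed `country_code` partition key; the helper name `query_country` is made up for this example.

```python
from typing import Any, Dict, Iterator


def query_country(client: Any, table_name: str, country_code: str) -> Iterator[Dict[str, Any]]:
    """Yield every item stored under one country_code partition key.

    `client` is assumed to behave like boto3.client("dynamodb").
    The table and attribute names are assumptions for this sketch.
    """
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        # Only the partition key is constrained, so the whole sort key
        # range for this country is returned.
        "KeyConditionExpression": "country_code = :cc",
        "ExpressionAttributeValues": {":cc": {"S": country_code}},
    }
    while True:
        page = client.query(**kwargs)
        yield from page.get("Items", [])
        # Query results are paginated; keep going until no key is returned.
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# for item in query_country(boto3.client("dynamodb"), "scrape_results", "US"):
#     ...
```

Unlike a Scan with a filter expression, which reads and bills for every item in the table before discarding the non-matching ones, a Query like this only reads the items under the requested partition key.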

asked Nov 16, 2024 at 16:45 by slothish1

  • I think you should remove the large question and ask directly what you want to know. – Leeroy Hannigan, Nov 16, 2024 at 17:20

1 Answer

You can use a low-cardinality partition key in DynamoDB as long as no single key value consumes more than 1,000 WCU or 3,000 RCU per second. Those are the throughput limits of a single partition; if you exceed either one, the partition becomes hot and its requests are throttled.
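To put those limits in terms of items, here is a rough back-of-the-envelope sketch. It assumes standard (non-transactional) writes, which cost 1 WCU per 1 KB of item size rounded up, and single-item reads, which cost 1 RCU per 4 KB rounded up when strongly consistent and half that when eventually consistent. The function names are made up for this example.

```python
import math

# Per-partition limits quoted in the answer above.
PARTITION_WCU_LIMIT = 1000
PARTITION_RCU_LIMIT = 3000


def write_units(item_bytes: int) -> int:
    """WCU for one standard write: 1 unit per 1 KB, rounded up."""
    return max(1, math.ceil(item_bytes / 1024))


def read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """RCU for one single-item read: 1 unit per 4 KB, rounded up.

    Eventually consistent reads cost half as much.
    """
    units = max(1, math.ceil(item_bytes / 4096))
    return units if strongly_consistent else units / 2


def max_writes_per_second(item_bytes: int) -> int:
    """How many items of this size fit under the per-partition write limit."""
    return PARTITION_WCU_LIMIT // write_units(item_bytes)


# A 3.5 KB scraped record costs 4 WCU per write, so roughly 250 such
# writes per second can target one country_code before throttling.
print(write_units(3500), max_writes_per_second(3500))
```

Small items leave a lot of headroom: anything up to 1 KB costs a single WCU, so a lone sequential writer would need to sustain about 1,000 writes per second against one key to reach the limit.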

For your use case, the concern is batch loading into DynamoDB. If you can rate-limit your Lambda so that it never consumes more than 1,000 WCU per second for any one country_code, you'll have no issue.
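One way to rate-limit the loader is a small token bucket that the loop consults before each write. This is only a sketch under assumptions: the class name `WriteThrottle` is made up, the clock and sleep functions are injectable purely so the logic can be tested, and the actual `put_item` call is left as a comment because the table layout is not shown in the question.

```python
import time
from typing import Callable


class WriteThrottle:
    """Token bucket that caps consumed write units per second."""

    def __init__(self, units_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.rate = units_per_second
        self.capacity = units_per_second   # at most one second's worth of burst
        self.tokens = units_per_second     # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self, units: float = 1.0) -> None:
        """Block until `units` write units are available, then consume them."""
        if units > self.capacity:
            raise ValueError("a single request exceeds the per-second budget")
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= units:
                self.tokens -= units
                return
            # Sleep just long enough for the missing tokens to accumulate.
            self.sleep((units - self.tokens) / self.rate)


# Possible usage inside the Lambda loop (hypothetical names):
# throttle = WriteThrottle(units_per_second=800)  # headroom below 1,000
# for record in records:
#     throttle.acquire(units=1)   # WCU cost of this item
#     table.put_item(Item=record)
```

Because the bucket starts full, the first second can see a burst on top of the steady rate, so it is probably safer to pick a budget somewhat below 1,000 rather than exactly at the limit.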
