I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:
Chosen cluster:
• Worker Nodes: 6
• Cores per Worker: 48
• Memory per Worker: 384 GB
Data:
• Table A: 158 GB
• Table B: 300 GB
• Table C: 32 MB
Process:
1. Read DataFrames from Delta tables.
2. Perform a broadcast join between Table B and the small Table C.
3. Join the resulting DataFrame with Table A on three columns: id, family, part_id.
4. Upsert (merge) the final result into the destination table.
5. The destination table is partitioned by id, family, date.
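The steps above could be sketched roughly as follows. The paths, table name, and the B–C join key are placeholders I made up for illustration, and the PySpark/Delta imports are kept inside the function so the sketch can be read without a Spark install:

```python
# Join keys and partition columns taken from the process description above.
JOIN_KEYS = ["id", "family", "part_id"]
PARTITION_COLS = ["id", "family", "date"]


def run_pipeline(spark, path_a, path_b, path_c, dest_table):
    """Sketch of the described job; all paths and names are placeholders."""
    from pyspark.sql.functions import broadcast
    from delta.tables import DeltaTable

    df_a = spark.read.format("delta").load(path_a)  # ~158 GB
    df_b = spark.read.format("delta").load(path_b)  # ~300 GB
    df_c = spark.read.format("delta").load(path_c)  # ~32 MB, broadcastable

    # Step 2: broadcast the small table so Table B is not shuffled here.
    bc = df_b.join(broadcast(df_c), on="id", how="left")  # assumed key

    # Step 3: shuffle join with Table A on the three keys.
    joined = bc.join(df_a, on=JOIN_KEYS, how="inner")

    # Steps 4-5: upsert into the partitioned destination via Delta MERGE.
    target = DeltaTable.forName(spark, dest_table)
    cond = " AND ".join(f"t.{k} = s.{k}" for k in JOIN_KEYS)
    (target.alias("t")
        .merge(joined.alias("s"), cond)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```

The merge condition here only uses the three join keys; whether that is the correct uniqueness condition for the destination depends on the actual table semantics.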
The only idea that comes to my mind is switching to more disk-optimized instances. My question is: how can I interpret the Storage tab in the Spark UI and use it to understand how to optimize this job?
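One thing worth checking before changing instance types: heavy disk usage on the big join often shows up as "Spill (disk)" in the Stages tab, which usually means shuffle partitions are too few and therefore too large. A rough sizing calculation for this cluster, assuming the ~300 GB table dominates the shuffle and targeting ~128 MiB per partition (both assumptions):

```python
import math

# Cluster from the question: 6 workers x 48 cores.
total_cores = 6 * 48  # 288 parallel tasks

# Assumption: the ~300 GB side dominates the shuffle;
# aim for roughly 128 MiB per shuffle partition.
shuffle_bytes = 300 * 1024**3
target_partition_bytes = 128 * 1024**2

shuffle_partitions = max(
    total_cores,
    math.ceil(shuffle_bytes / target_partition_bytes),
)
print(shuffle_partitions)  # 2400

# Applied before the big join (or let AQE coalesce partitions adaptively):
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
# spark.conf.set("spark.sql.adaptive.enabled", "true")
```

It is also worth confirming with `df.explain()` that the B–C join actually compiles to a BroadcastHashJoin rather than a SortMergeJoin, since a silently fallen-back shuffle of the 300 GB table would explain a lot of disk usage.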
Tags: apache spark
Source: apache spark - Optimizing PySpark Job with Large Parquet Data and High Disk Usage - Stack Overflow (http://it.en369.cn/questions/1745611991a2159055.html)