
I am trying to speed up a process I have been doing for a long time. Currently I download all the files, then convert them all to CSV, then use bokeh to create an interactive chart for looking at the data. I would like to start converting the first file to CSV as soon as it finishes downloading, while the remaining files are still downloading, and then start creating the interactive chart as soon as each CSV is ready, while downloads and new CSV conversions continue.

Is this possible in Python?

The data files range from 100-500 MB each and there are generally about 40 to process daily. The process currently takes about 15 minutes to complete, and if this new approach could cut that by a third it would help greatly.

I'm not sure whether multiprocessing/multi-threading or async/await would help.

asked Mar 8 at 16:10 by Brent Hodges
  • Maybe first try to use multiprocessing/multi-threading or async/await and you will see whether it helps in your situation. We have no access to your code, so we can't check it. – furas, Mar 8 at 17:56

1 Answer

You need to use a pipeline (this blog explains what a pipeline is).
First, use the standard-library queue module and set up a separate queue for each processing stage. That lets files flow through the pipeline as soon as they're ready.

Then use ThreadPoolExecutor from concurrent.futures. It is well suited to I/O-bound tasks because it lets the program keep doing work while it waits on downloads (or whatever else your program is doing).
In each stage you should have some worker threads that pull from the stage's input queue and push results onto the next stage's queue.
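
For example, here is a minimal sketch of the download stage on its own, just to show why a thread pool helps with I/O-bound work. The URL list and the fetch helper are hypothetical, not from the question:

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Download one file into the current directory; runs in a worker thread.
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return filename

urls = ["https://example.com/a.dat", "https://example.com/b.dat"]  # hypothetical

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch, url) for url in urls]
    for future in as_completed(futures):
        # Each file becomes available here as soon as its own download
        # finishes, so the next stage can start without waiting for the rest.
        print("downloaded:", future.result())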

If you wire everything up correctly, then as soon as a file is downloaded it is immediately available for conversion while other downloads continue; likewise, converted files are immediately handed off for chart creation.

It should look like this:

import queue
from concurrent.futures import ThreadPoolExecutor

class ProcessingPipeline:
    def __init__(self, urls, download_dir, csv_dir, chart_dir, max_workers):
        self.urls = urls
        self.download_dir = download_dir
        self.csv_dir = csv_dir
        self.chart_dir = chart_dir
        self.max_workers = max_workers

        self.download_queue = queue.Queue()
        self.convert_queue = queue.Queue()
        self.chart_queue = queue.Queue()

    # Below should be your functions that process the data:
    # download_worker pulls a URL from download_queue, saves the file to
    # download_dir and puts its path on convert_queue; convert_worker writes
    # the CSV and puts its path on chart_queue; chart_worker builds the chart.
    # Each worker must call task_done() for every item it pulls, and should
    # exit when it receives a None sentinel (see the sketch below).
    ...

    def run(self):
        # Seed the first stage so the download workers have items to pull.
        for url in self.urls:
            self.download_queue.put(url)

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            download_futures = [executor.submit(self.download_worker)
                                for _ in range(min(self.max_workers, len(self.urls)))]
            convert_futures = [executor.submit(self.convert_worker)
                               for _ in range(self.max_workers)]
            chart_futures = [executor.submit(self.chart_worker)
                             for _ in range(self.max_workers)]

            # Wait until every queue has been fully processed, then signal
            # the workers to stop so the executor can shut down cleanly.
            self.download_queue.join()
            self.convert_queue.join()
            self.chart_queue.join()

if __name__ == "__main__":
    res = ProcessingPipeline(
        urls=["some_url_1", "some_url_2"],
        download_dir="your_data",
        csv_dir="your_files",
        chart_dir="your_charts",
        max_workers=3,
    )
    res.run()
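
The worker methods elided above each follow the same pull/process/push loop, stopping on a None sentinel. Here is a minimal, self-contained sketch of that pattern with two stages; fake_download and fake_convert are placeholder functions standing in for the real download and CSV code, not real APIs:

import queue
from concurrent.futures import ThreadPoolExecutor

def worker(in_q, out_q, process):
    # Pull from in_q, process, push to out_q; stop on a None sentinel.
    while True:
        item = in_q.get()
        if item is None:
            in_q.task_done()
            break
        result = process(item)
        if out_q is not None:
            out_q.put(result)
        in_q.task_done()

# Placeholder stage functions standing in for real download/CSV logic:
def fake_download(url):
    return url + ".raw"

def fake_convert(path):
    return path + ".csv"

download_q, convert_q = queue.Queue(), queue.Queue()

with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(worker, download_q, convert_q, fake_download)
    executor.submit(worker, convert_q, None, fake_convert)

    for url in ["file1", "file2", "file3"]:
        download_q.put(url)

    download_q.put(None)   # no more downloads
    download_q.join()      # wait until every download has been handed off
    convert_q.put(None)    # then stop the convert stage
    convert_q.join()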

Hope this helps.
