Web Scraping hepsiburada, Part 2: a parallel & scalable analyzer with multiple windows
So, with some encouragement from a friend, a rough back-of-the-envelope calculation of the system requirements, and some curiosity, I decided to add scaling capacity to the code base using multiple browser windows. You can find the new modules, along with the old ones, here on GitHub, and Part 1 here.
I added two new modules: parallel-starter and parallel-analyzer.
parallel-starter takes the same output of the get-urls module as input and splits it into a number of input files, depending on the CONCURRENCY_LEVEL setting. Then, using Python's threading module, it starts new threads of parallel-analyzer, again according to CONCURRENCY_LEVEL. After starting the threads, it waits for all of them to finish and then combines their outputs into a single CSV file. I was expecting the thread management to be a little tricky, but the threading module handles it well, at least in my simple scenario.
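The split / run-in-threads / combine flow above can be sketched roughly like this. The function names (`split_input`, `analyze`, `run_parallel`) and file naming are my own assumptions, and `analyze` is only a stand-in for the real Selenium-based parallel-analyzer:

```python
import csv
import threading
from pathlib import Path

CONCURRENCY_LEVEL = 4  # number of parallel browser windows (assumed setting)

def split_input(url_file: str, out_dir: str) -> list:
    """Split the get-urls output into CONCURRENCY_LEVEL chunk files (round-robin)."""
    urls = Path(url_file).read_text().splitlines()
    paths = []
    for i in range(CONCURRENCY_LEVEL):
        p = Path(out_dir) / f"input_{i}.txt"
        p.write_text("\n".join(urls[i::CONCURRENCY_LEVEL]))
        paths.append(p)
    return paths

def analyze(input_path: Path, output_path: Path) -> None:
    """Stand-in for parallel-analyzer: one browser window per thread.
    Here it just writes a dummy row per URL instead of scraping."""
    rows = [[url, len(url)] for url in input_path.read_text().splitlines()]
    with output_path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)

def run_parallel(url_file: str, work_dir: str, combined_csv: str) -> None:
    inputs = split_input(url_file, work_dir)
    outputs = [p.with_suffix(".csv") for p in inputs]
    threads = [threading.Thread(target=analyze, args=(i, o))
               for i, o in zip(inputs, outputs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every window to finish
    # merge the per-thread CSVs into one file
    with open(combined_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for o in outputs:
            with o.open(newline="") as f:
                writer.writerows(csv.reader(f))
```

Because each thread spends most of its time waiting on the browser, plain `threading` is enough here; the GIL isn't a bottleneck for this kind of I/O-bound work.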
So how is the resource consumption? I can say that running 4 windows at once, and hence finishing the analysis of the whole list 4 times quicker, costs you roughly 4 times the memory and CPU. Running multiple windows doesn't help much with the memory consumption of the browser processes. The whole memory footprint, between the Python modules, chromedriver, and Chrome, seemed to tally up at 2.5 GB after running for 5 minutes. You can see the release of memory in the free output, in megabytes.
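If you want to spot-check those numbers per process rather than from `free` alone, here is a small Linux-only helper (my own addition, not part of the modules) that reads a process's resident set size from /proc:

```python
def rss_mb(pid="self") -> float:
    """Resident set size of a process in MB, read from /proc (Linux only).

    A hypothetical helper for spot-checking memory use of the python,
    chromedriver, and chrome processes while the analyzer runs.
    """
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):  # kernel reports this value in kB
                return int(line.split()[1]) / 1024
    return 0.0
```

Summing `rss_mb(pid)` over the Python module, chromedriver, and each Chrome window's PID gives a figure comparable to the ~2.5 GB total mentioned above.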
Though I stumbled upon something odd: one of the browser processes uses significantly less memory than the other three. I don't know if it's just a misinterpretation on my part (or htop's), or whether there is some implicit explanation.
You can find the new modules, along with the old ones, here on GitHub. I think my next move will be using multiple tabs in one browser window, and I definitely expect savings on memory usage this time!