Preprint Article, Version 1. This version is not peer-reviewed.

SpecCA: A Parallel Crawling Approach based on Thread Level Speculation

Version 1 : Received: 29 October 2024 / Approved: 29 October 2024 / Online: 30 October 2024 (02:19:33 CET)

How to cite: Li, Y.; Su, Y. SpecCA:A Parallel Crawling Approach based on Thread Level Speculation. Preprints 2024, 2024102331. https://doi.org/10.20944/preprints202410.2331.v1

Abstract

The World Wide Web is growing at a phenomenal rate, and the choice of crawling approach is vital to the efficiency of web crawling. Existing crawling algorithms on multicore platforms tend to be time-consuming and do not scale well to large data volumes. To exploit the potential parallelism and efficiency of crawling on Spark, this paper proposes a speculative parallel crawling approach (SpecCA) on Apache Spark, based on software thread-level speculation. After analyzing the web crawling process, SpecCA first uses a function to divide the whole crawling process into several subprocesses that can be executed independently, and then spawns a number of threads to execute every subprocess speculatively in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpecCA is more efficient and achieves a higher degree of parallelism by making fuller use of the cluster's resources. Experiments show that, on average, SpecCA achieves a significant speedup over the traditional approach. Additionally, as the number of worker nodes grows, the execution time decreases gradually while the speedup scales linearly. The results indicate that the efficiency of web crawling can be significantly enhanced by adopting this speculative parallel algorithm.
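
The divide, speculate, and merge steps described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation and it does not use Apache Spark: it assumes a hypothetical in-memory link graph (`LINKS`) in place of real HTTP fetches, a round-robin split of the seed URLs, and Python threads in place of cluster workers. All function names (`fetch_links`, `partition`, `speculative_crawl`, `merge`, `spec_crawl`) are illustrative. Each thread crawls its sub-frontier with private state, optimistically assuming that no other thread reaches the same pages; the merge step then commits the results in order and counts the pages that were crawled redundantly.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
    "x": ["y", "d"],
    "y": [],
}


def fetch_links(url):
    """Simulated page fetch: return the outgoing links of `url`."""
    return LINKS.get(url, [])


def partition(seeds, n_parts):
    """Divide the seed list round-robin into `n_parts` sub-frontiers."""
    return [seeds[i::n_parts] for i in range(n_parts)]


def speculative_crawl(sub_seeds):
    """Crawl one sub-frontier breadth-first using only private state.

    The thread speculates that no other thread visits the same pages,
    so it never consults shared state while it runs.
    """
    visited = set(sub_seeds)
    queue = deque(sub_seeds)
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return visited


def merge(partial_results):
    """Commit speculative results in partition order.

    Pages already committed by an earlier partition are mis-speculated
    (redundant) work; they are counted and discarded.
    """
    committed = set()
    wasted = 0
    for visited in partial_results:
        wasted += len(visited & committed)
        committed |= visited
    return committed, wasted


def spec_crawl(seeds, n_threads=2):
    """Divide the crawl, run the parts speculatively in parallel, merge."""
    parts = [p for p in partition(seeds, n_threads) if p]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map returns results in the same order as `parts`.
        partial = list(pool.map(speculative_crawl, parts))
    return merge(partial)


if __name__ == "__main__":
    pages, wasted = spec_crawl(["a", "x"], n_threads=2)
    print(sorted(pages), "redundant pages:", wasted)
```

In this sketch the cost of mis-speculation is only duplicated crawling, which the merge step discards. How SpecCA itself partitions the work and validates speculative results on Spark is not specified in the abstract.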

Keywords

crawling approach; parallel; Apache Spark

Subject

Computer Science and Mathematics, Computer Science
