Version 1: Received: 29 October 2024 / Approved: 29 October 2024 / Online: 30 October 2024 (02:19:33 CET)
How to cite:
Li, Y.; Su, Y. SpecCA: A Parallel Crawling Approach based on Thread Level Speculation. Preprints 2024, 2024102331. https://doi.org/10.20944/preprints202410.2331.v1
APA Style
Li, Y., & Su, Y. (2024). SpecCA: A Parallel Crawling Approach based on Thread Level Speculation. Preprints. https://doi.org/10.20944/preprints202410.2331.v1
Chicago/Turabian Style
Li, Y., and Yaning Su. 2024. "SpecCA: A Parallel Crawling Approach based on Thread Level Speculation." Preprints. https://doi.org/10.20944/preprints202410.2331.v1
Abstract
The World Wide Web is growing at a phenomenal rate, so the crawling approach is of vital importance to the efficiency of web crawling. Existing crawling algorithms on multicore platforms are time-consuming and do not handle large data volumes well. To exploit the potential parallelism and efficiency of crawling on Spark, this paper proposes a speculative parallel crawler approach (SpecCA) on Apache Spark, based on software thread-level speculation. By analyzing the web crawling process, SpecCA first applies a function to divide the whole crawling process into several subprocesses that can be executed independently, and then spawns a number of threads to speculatively execute each subprocess in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpecCA is highly efficient and achieves a high degree of parallelism by fully using the resources of the cluster. Experiments show that SpecCA achieves a significant average speedup over the traditional approach. Additionally, as the number of working nodes grows, the execution time decreases gradually while the speedup scales linearly. These results indicate that the efficiency of web crawling can be significantly enhanced by adopting this speculative parallel algorithm.
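The divide/speculate/merge pipeline the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: it uses Python threads in place of Spark executors, and the function names (`crawl_subprocess`, `spec_crawl`) and the placeholder "crawl" logic are illustrative assumptions, not names from the paper.

```python
import concurrent.futures

def crawl_subprocess(urls):
    # Placeholder "crawl" step: a real crawler would fetch each URL here;
    # we just map each URL to a stand-in result so the sketch is self-contained.
    return {u: f"content-of-{u}" for u in urls}

def spec_crawl(frontier, num_workers=4):
    # Step 1: divide the whole crawling task into independent sub-tasks.
    chunks = [frontier[i::num_workers] for i in range(num_workers)]
    # Step 2: speculatively execute every sub-task in parallel threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as ex:
        partials = ex.map(crawl_subprocess, chunks)
    # Step 3: merge the per-thread speculative results into the final outcome.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged

frontier = [f"http://example.com/page{i}" for i in range(10)]
result = spec_crawl(frontier)
print(len(result))  # prints 10: every page crawled exactly once
```

The round-robin partitioning (`frontier[i::num_workers]`) is one simple way to make the sub-tasks independent; the paper's approach on Spark would distribute these sub-tasks across cluster nodes rather than local threads.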
Keywords
crawling approach; parallel; Apache Spark
Subject
Computer Science and Mathematics, Computer Science
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.