Version 1
: Received: 5 August 2024 / Approved: 6 August 2024 / Online: 6 August 2024 (15:56:32 CEST)
How to cite:
Ejaz, S.; Naseer, A.; Ahmad, A.; Tamoor, M.; Naz, S. A Novel Hashcode-based Duplication Reduction via Thresholding Approach for Large-scale Web Documents. Preprints 2024, 2024080443. https://doi.org/10.20944/preprints202408.0443.v1
APA Style
Ejaz, S., Naseer, A., Ahmad, A., Tamoor, M., & Naz, S. (2024). A Novel Hashcode-based Duplication Reduction via Thresholding Approach for Large-scale Web Documents. Preprints. https://doi.org/10.20944/preprints202408.0443.v1
Chicago/Turabian Style
Ejaz, S., A. Naseer, A. Ahmad, Maria Tamoor and Samina Naz. 2024. "A Novel Hashcode-based Duplication Reduction via Thresholding Approach for Large-scale Web Documents." Preprints. https://doi.org/10.20944/preprints202408.0443.v1
Abstract
Modern search engines face a significant challenge in handling duplicate and near-duplicate web pages, particularly when indexing vast amounts of web content. Duplicates slow search results and inflate the storage required to hold indexes. Various techniques have been proposed to identify similar web pages, yet reliably distinguishing near-duplicate pages remains a long-standing research challenge. In the current study, sentence-level features, i.e., hash codes combined with thresholding, are used to detect nearly identical web pages. An adaptive threshold allows the model to be applied in both large- and small-scale settings. The approach is evaluated on benchmark datasets comprising Shakespeare's collections, free text, job descriptions, and Reuters-21578. With an accuracy of 0.99 and an F1-score of 0.97, the proposed technique outperforms existing methods.
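The general idea described in the abstract, hashing at sentence level and comparing documents against a similarity threshold, can be sketched as follows. This is an illustrative sketch only: the sentence splitting, the MD5 hash, the Jaccard overlap measure, and the fixed default threshold are all assumptions for demonstration, not the paper's exact method (which uses an adaptive threshold).

```python
import hashlib

def sentence_hashes(text):
    # Split on '.' and hash each normalized sentence.
    # Both choices are illustrative, not taken from the paper.
    sentences = [s.strip().lower() for s in text.split(".") if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sentences}

def near_duplicate(doc_a, doc_b, threshold=0.8):
    # Jaccard overlap of sentence-hash sets, compared to a threshold.
    # The paper's adaptive threshold would replace this fixed value.
    ha, hb = sentence_hashes(doc_a), sentence_hashes(doc_b)
    if not ha or not hb:
        return False
    overlap = len(ha & hb) / len(ha | hb)
    return overlap >= threshold
```

Comparing hash sets rather than raw text means each document is reduced to a compact fingerprint once, so pairwise comparison over a large collection touches only the hashes, not the original pages.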
Keywords
Duplicate detection; Hash keys; Information retrieval; Threshold; Web documents; Web pages
Subject
Computer Science and Mathematics, Computer Science
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.