Preprint Article Version 1 This version is not peer-reviewed

A Novel Hashcode-based Duplication Reduction via Thresholding Approach for Large-scale Web Documents

Version 1 : Received: 5 August 2024 / Approved: 6 August 2024 / Online: 6 August 2024 (15:56:32 CEST)

How to cite: Ejaz, S.; Naseer, A.; Ahmad, A.; Tamoor, M.; Naz, S. A Novel Hashcode-based Duplication Reduction via Thresholding Approach for Large-scale Web Documents. Preprints 2024, 2024080443. https://doi.org/10.20944/preprints202408.0443.v1

Abstract

Modern search engines face a significant challenge in handling duplicate and near-duplicate web pages, particularly when indexing vast amounts of web content. Duplicates slow down search and raise costs, because the indexes require additional storage space. Various techniques have been proposed to find similar web pages, yet reliably distinguishing near-duplicate pages from distinct ones remains a long-standing research challenge. In this study, sentence-level features, i.e., hashcodes and thresholding, are used to identify nearly identical web pages. We employ an adaptive threshold that allows the model to be applied in both large- and small-scale settings. The proposed approach is tested on benchmark datasets consisting of Shakespeare’s collections, free text, job descriptions, and Reuters-21578. With an accuracy of 0.99 and an F1-score of 0.97, the proposed technique outperforms existing methods.
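
The general idea of sentence-level hashing with a similarity threshold can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence splitter, the choice of SHA-1, the Jaccard overlap measure, the function names, and the fixed default threshold of 0.8 are all assumptions made for illustration, and the adaptive threshold rule described in the abstract is not reproduced here.

```python
import hashlib
import re


def sentence_hashes(text):
    """Return the set of hashcodes for the normalized sentences of a document."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hashes = set()
    for sentence in sentences:
        # Normalize case and whitespace so trivial formatting changes keep the same hash.
        normalized = " ".join(sentence.lower().split())
        if normalized:
            hashes.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return hashes


def similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' sentence-hash sets, from 0.0 to 1.0."""
    a, b = sentence_hashes(doc_a), sentence_hashes(doc_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag a pair as near-duplicate when the overlap reaches the threshold.

    The fixed default of 0.8 is a placeholder (assumption), not the paper's
    adaptive threshold.
    """
    return similarity(doc_a, doc_b) >= threshold
```

In this sketch, two pages that differ only in letter case or spacing share every sentence hash and are flagged as near-duplicates, while pages sharing only some sentences are flagged or not depending on the chosen threshold.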

Keywords

Duplicate detection; Hash keys; Information retrieval; Threshold; Web documents; Web pages

Subject

Computer Science and Mathematics, Computer Science
