Submitted: 09 October 2024
Posted: 10 October 2024
Abstract
Keywords:
I. Introduction
- (1) Initialize a queue with the starting page.
- (2) Set a maximum depth (if applicable) and initialize the current depth to zero.
- (3) While the queue is not empty and the maximum depth is not exceeded:
  - (a) Dequeue the next page from the queue.
  - (b) If the page has not been visited:
    - (i) Navigate to the page.
    - (ii) Extract the desired data and store it as a node.
    - (iii) Extract all hyperlinks from the page.
    - (iv) Add all unseen and unvisited hyperlinks to the queue.
    - (v) Mark the current page as visited.
  - (c) Increment the depth when moving to a new level.
- (4) Stop when all pages are visited or the maximum depth is reached.
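The traversal above can be sketched in a few lines of Python. This is a minimal illustration of the breadth-first crawl, not the system's implementation; `fetch_page` and `extract_links` are hypothetical helpers standing in for whatever navigation and parsing layer is used.

```python
from collections import deque

def bfs_crawl(start_url, max_depth, fetch_page, extract_links):
    """Breadth-first crawl up to max_depth, returning the visited pages as nodes."""
    queue = deque([(start_url, 0)])    # (url, depth) pairs awaiting a visit
    visited = set()
    nodes = {}

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        page = fetch_page(url)              # navigate to the page (hypothetical helper)
        nodes[url] = page                   # store the extracted data as a node
        visited.add(url)
        for link in extract_links(page):    # hypothetical helper returning hyperlinks
            if link not in visited:
                queue.append((link, depth + 1))
    return nodes
```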
II. Background
III. Methodology
A. State
- Screenshot: A visual capture of the web application’s interface at a particular moment, serving as a reference for the graphical presentation as perceived by the user.
- Page Source: The HTML and Document Object Model (DOM) structure that constitutes the web page. This includes critical elements such as forms, buttons, and interactive components that define the layout and available functionalities.
- Metadata: Ancillary data related to the current web session, including HTTP headers, cookies, and session-specific variables. This metadata provides additional context regarding the state of the application, reflecting conditions like user authentication, session persistence, or dynamic content adjustments.
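As a rough illustration, a state could be represented as a simple container holding these three components. The field names below are assumptions made for the sketch, not the paper’s actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """One observation of the web application (field names are illustrative)."""
    screenshot: bytes                              # raw image capture of the rendered page
    page_source: str                               # HTML / serialized DOM of the page
    metadata: dict = field(default_factory=dict)   # headers, cookies, session variables
```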
B. Action
- Action Type: Actions can vary widely, from simple navigation (e.g., following a hyperlink) to complex interactions (e.g., filling out and submitting a form). These actions are categorized into types based on the nature of the interaction, such as clicks, form submissions, keyboard inputs, or dynamic event triggers (e.g., JavaScript events).
- Action Context: Each action is tied to a specific element in the DOM structure, such as a button, link, or form field. The context includes metadata such as the element’s attributes (e.g., ID, class) and its location within the page hierarchy. This context helps the system understand how the action relates to the structure of the web application.
- Effect on State: Actions are only significant if they result in a state change, meaning they transition the web application from one distinct state to another. The Functionality Inferring Module analyzes the effect of each action on the state, ensuring that only meaningful transitions are captured. For example, submitting a form might transition the user from a login page to a dashboard, whereas clicking a non-interactive element would not result in a state change.
- Action Priority: Not all actions contribute equally to the exploration of the application’s functionality. The system prioritizes actions that lead to new or unexplored states. Actions that produce trivial or redundant transitions (e.g., right-clicking or hovering over an element without causing a meaningful change) are deprioritized by the Re-ranking Module, ensuring that the exploration process is efficient.
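Analogously, an action could be modeled as a small record carrying its type, DOM context, and priority. This is a hedged sketch with assumed field names, mirroring the State container above.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A candidate interaction on the current state (field names are illustrative)."""
    action_type: str                  # e.g., "click", "submit_form", "type_text"
    target_selector: str              # CSS/XPath locator of the DOM element acted upon
    element_attributes: dict = field(default_factory=dict)  # e.g., id, class, role
    priority: float = 0.0             # score assigned by the Re-ranking Module
```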

C. Functionality Inferring Module
- (1) Reasoning Agent: This agent processes the current observation—comprising the page source, screenshot, and metadata—alongside the record of previously explored functionalities. It synthesizes multiple queries to interface with the database, determining what functionalities have already been explored and what actions are possible given the current state and past interactions. The Reasoning Agent outputs a list of possible actions that can be performed based on the current and previous states, ensuring a thorough exploration of the application’s functionalities.
  - (a) Multi-modal LLM for State Understanding: The Reasoning Agent employs a multi-modal LLM to comprehensively understand the current state of the web application. By analyzing various inputs—including the page source, screenshots, and session metadata—the LLM can generate a semantic and structural understanding of the application’s current state.
  - (b) Database Interface for Explored Functionalities: In addition to understanding the current state, the Reasoning Agent interfaces with a database of previously explored functionalities. This ensures that the agent avoids redundant actions and focuses on unexplored areas of the application. The database stores all previously visited states and actions taken, forming a history of interactions with the web application. By querying this database, the agent can identify which actions have already been executed and which states have already been visited, allowing it to prioritize new interactions.
  - (c) Generation of Possible Actions: Based on the current state and the record of explored functionalities, the Reasoning Agent compiles the Possible Actions List, which is then handed to the Re-ranking Module.

Figure 2. Functionality Inferring Module.
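A minimal sketch of the database interaction described above is given below, assuming a SQLite store of visited states and executed actions. The schema, the `executed_actions` table name, and the helper itself are assumptions for illustration, not the paper’s implementation.

```python
import sqlite3

def unexplored_actions(db_path, state_id, candidate_actions):
    """Filter out actions already executed from this state, per the exploration history."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT action_type, target_selector FROM executed_actions WHERE state_id = ?",
            (state_id,),
        ).fetchall()
    finally:
        conn.close()
    already_done = set(rows)
    # Keep only candidates that have not been tried from this state before.
    return [a for a in candidate_actions
            if (a.action_type, a.target_selector) not in already_done]
```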

- (2) Re-ranking Module: Once the Reasoning Agent generates a list of possible actions, the Re-ranking Module evaluates these actions and reorders them based on metrics such as entropy and expected reward. The objective is to prioritize actions that are most likely to uncover new functionalities or lead to significant state transitions, while deprioritizing trivial or redundant interactions (e.g., non-functional actions like right-clicking on an element). This dynamic re-ranking process ensures that the exploration of the application remains focused on discovering meaningful user flows and interactions, ultimately maximizing the system’s efficiency and effectiveness in navigating complex web applications.
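The paper does not specify how entropy and expected reward are combined; the sketch below assumes a simple weighted sum as one plausible scoring rule, with `outcome_probs` and `expected_reward` standing in for models that are not described in detail.

```python
import math

def rerank(actions, outcome_probs, expected_reward, w_entropy=0.5, w_reward=0.5):
    """Reorder candidate actions by a weighted mix of outcome entropy and expected reward.

    outcome_probs(a)   -> probabilities over predicted next states (assumed model)
    expected_reward(a) -> scalar reward estimate for taking action a (assumed model)
    """
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    def score(a):
        return w_entropy * entropy(outcome_probs(a)) + w_reward * expected_reward(a)

    return sorted(actions, key=score, reverse=True)
```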
- (3) Next Actions Prediction Agent: The Next Actions Prediction Agent uses a fine-tuned multi-modal LLM that refines the list of actions. It selects the top-ranked actions from the Possible Actions List and predicts the next best steps to take. This agent combines insights from both the re-ranked list and the system’s understanding of the web application to choose actions that will maximize the system’s overall reward.
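As a hedged illustration of this step, the snippet below takes the top-k re-ranked candidates and queries a prediction model; `predict_with_llm` is a hypothetical stand-in for the fine-tuned multi-modal model, which the paper does not expose as an API.

```python
def choose_next_actions(ranked_actions, state, predict_with_llm, k=5):
    """Refine the top-k ranked candidates into the next steps to execute."""
    top_k = ranked_actions[:k]            # keep only the highest-ranked actions
    # The (hypothetical) model sees the current state and the candidates and returns
    # the subset of actions it expects to maximize the overall reward.
    return predict_with_llm(state=state, candidates=top_k)
```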
D. Action Executor
- (1) Action Execution: The Executor applies selected actions, which may involve single interactions (e.g., clicking a button) or multi-step sequences (e.g., form submissions). It handles various action types, including user interface actions, navigational transitions, and event-driven triggers like JavaScript.
- (2) State Validation: After executing actions, the Executor verifies whether the action resulted in a meaningful state change by capturing the updated page source, screenshots, and metadata. This validation is critical for updating the knowledge graph.
- (3) Error Handling and Recovery: In the case of failed actions due to issues like incorrect inputs or unhandled edge cases, the Executor logs the error and retries or performs recovery actions to restore the application to a stable state.
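A minimal executor loop under these rules might look like the following. It assumes generic `perform` and `capture_state` callables (e.g., wrappers around a browser driver) and the State sketch from Section III-A; the simple page-source comparison is only one possible validation check.

```python
import logging

def execute_with_retries(action, perform, capture_state, max_retries=3):
    """Apply an action, validate the resulting state, and retry on failure.

    perform(action)  -> raises on failure (assumed driver wrapper)
    capture_state()  -> returns a State snapshot (page source, screenshot, metadata)
    """
    before = capture_state()
    for attempt in range(1, max_retries + 1):
        try:
            perform(action)
            after = capture_state()
            # Only a changed page source counts as a meaningful transition here;
            # the full system also compares screenshots and metadata.
            return after if after.page_source != before.page_source else None
        except Exception as exc:
            logging.warning("Action failed (attempt %d/%d): %s", attempt, max_retries, exc)
    return None   # retries exhausted; caller may trigger recovery
```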
E. Reward/Penalty Model
- (1) Rewarding Significant Progress: Actions that lead to new state transitions or the discovery of unexplored functionalities receive positive rewards (closer to +1). For example, navigating from the home page to product listings or accessing the checkout process are high-reward actions, as they reveal critical application behaviors.
- (2) Penalizing Redundant Actions: When actions result in trivial transitions (e.g., reaching a leaf node where no further meaningful actions can be taken), the model assigns penalties (closer to -1). This prevents the system from getting stuck in unproductive states, such as a "Thank you" page after purchase completion.
- (3) Stopping Exploration: If no actions produce a positive reward, the system halts exploration for that path. This ensures resources are not wasted on dead ends and exploration focuses on uncovering valuable transitions.
- (4) Retrials: When an action’s reward score falls close to the defined threshold but is not sufficiently positive, the model initiates a retrial, provided the maximum number of retries has not been exhausted.
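The decision logic implied by these rules can be sketched as follows. The `min_reward` and `max_retries` values correspond to the configuration in Section IV-B, while `retry_margin` and the return labels are assumptions added for illustration.

```python
def decide(reward, retries_used, min_reward=0.0, retry_margin=0.1, max_retries=3):
    """Map a reward score to the next control decision for the current path."""
    if reward > min_reward:
        return "continue"            # meaningful transition: keep exploring this path
    if reward > min_reward - retry_margin and retries_used < max_retries:
        return "retry"               # borderline score: try the action again
    return "stop"                    # dead end: halt exploration of this path
```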
IV. Experiments
A. Traditional Parser Setup
- Max crawl depth: 3 levels
- Follow redirects: Enabled
- Concurrent requests: 8
- User-agent rotation: Implemented to mimic various browsers
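The paper does not name the baseline crawler framework; as one concrete reading of these settings, a Scrapy-style configuration could look like the following (the user-agent list is illustrative, and rotation would be handled by a middleware).

```python
# settings.py — baseline crawler configuration (Scrapy-style; framework assumed)
DEPTH_LIMIT = 3                 # max crawl depth: 3 levels
REDIRECT_ENABLED = True         # follow redirects
CONCURRENT_REQUESTS = 8         # concurrent requests

# A small pool of user agents to rotate through (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/118.0",
]
```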
B. Proposed Solution Setup
- minreward: 0 — The minimum threshold for reward-based transitions, used to eliminate low-value or redundant actions.
- maxleafbranches: 999 — The maximum number of branches a leaf node can have before being pruned.
- maxconsecutiveactions: 5 — The maximum number of consecutive actions allowed within a single state before forcing a transition.
- maxretries: 3 — The maximum number of attempts for each action before considering it a failed interaction.
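Expressed as a configuration object, these settings might be written as below; the dictionary keys simply mirror the parameter names above.

```python
# Exploration parameters for the proposed solution (mirroring the list above).
EXPLORATION_CONFIG = {
    "minreward": 0,               # minimum reward for keeping a transition
    "maxleafbranches": 999,       # branches allowed on a leaf node before pruning
    "maxconsecutiveactions": 5,   # consecutive actions in one state before forcing a transition
    "maxretries": 3,              # attempts per action before marking it failed
}
```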
C. Evaluation Metrics
- (1) State coverage: The number of unique states visited by each method. For traditional parsers, a unique state is typically identified by a unique URL within the domain. Higher state coverage is better because it reflects a more comprehensive exploration of the web application, including all key functionalities and dynamic states.
- (2) Edge complexity: The total number of edges (interactions) captured between states. Ideally, this value is close to n − 1, where n is the total number of states. This indicates minimal distractions between flows, with no unrelated or redundant transitions, suggesting that each transition contributes meaningfully to the functionality being explored.
- (3) Failure recovery: The ratio of actions that failed on the first attempt but succeeded within the maxretries allowed by the system. A higher value indicates better robustness, as the system is able to recover from failures and explore alternative paths or retry actions successfully.
- (4) Time to completion: The total time taken by each method to complete the crawl. Lower values are better, as faster completion means more efficient exploration.
- (5) Graph density: The ratio of actual edges to the total possible edges in the graph. Lower density implies that the graph is not overly crowded with meaningless connections, indicating more structure and clarity in the transitions between states.
- (6) Shortest path length: The average shortest path length between any two nodes in the graph, which measures the overall connectivity. A longer shortest path may indicate more unique states and deeper exploration of the application. For traditional parsers, the path length may be shorter due to fewer unique states being captured, while in our approach it is likely longer because of the broader state coverage and complex interactions.
- (7) Betweenness centrality: This metric measures the importance of nodes in connecting different parts of the graph. A higher value suggests that certain states (nodes) serve as crucial junctions in the web application’s flows. This can be useful in identifying critical pages, such as login screens or checkout processes, that play a significant role in navigating through the application. A higher betweenness centrality is often desirable for identifying key interaction points in the user journey.
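The graph-level metrics (density, average shortest path length, betweenness centrality) can be computed directly from the captured state graph. The sketch below assumes the graph is available as a NetworkX `DiGraph` whose nodes are states and whose edges are recorded transitions, and that the graph is connected.

```python
import networkx as nx

def graph_metrics(G: nx.DiGraph) -> dict:
    """Compute the structural evaluation metrics for a captured state graph."""
    undirected = G.to_undirected()   # path-length metric assumes a connected graph
    return {
        "state_coverage": G.number_of_nodes(),
        "edge_complexity": G.number_of_edges(),   # ideally close to n - 1
        "graph_density": nx.density(G),
        "avg_shortest_path": nx.average_shortest_path_length(undirected),
        "avg_betweenness": sum(nx.betweenness_centrality(G).values()) / G.number_of_nodes(),
    }
```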
V. Results
A. Key Metrics

| Metric | Traditional Parser | Proposed Solution |
|---|---|---|
| State coverage (no. of states) | 24 | 95 |
| Edge complexity (no. of edges) | 86 | 94 |
| Failure recovery rate | N/A | 0.72 |
| Time to completion (seconds) | 300 | 5500 |
| Graph density | 0.72 | 0.15 |
| Shortest path length | 2.1 | 6.4 |
| Betweenness centrality (avg) | 0.59 | 0.02 |
B. Procedurally Generated Test Cases by Graph Traversal
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
