3.2. Filtering Score
Statistical information about the graph is collected to calculate the filtering score used to determine the search order. The collected statistics include the labels and degrees of the graph's vertices. In general, subgraph queries first search for vertices whose labels match those of the query vertices. Vertices whose label occurs rarely in the graph produce relatively fewer candidate result sets than vertices whose label occurs frequently. Therefore, it is efficient to begin the search from the vertices whose label matches the query but occurs least often. It is also efficient to exclude vertices whose degree is smaller than that of the corresponding query vertex, because even if they are searched, they belong to paths that cannot grow into the query. We calculate the probability that a particular degree occurs at a vertex and use it to predict the cost of a path.
Suppose the graph G and the query Q contain vertices with the same label. If matching vertices are searched starting from every such vertex of G, then the number of vertices where the search starts grows linearly with the number of occurrences of the label. In Figure 2, searching for the path starting from the vertices whose label occurs twice requires more searches than starting from the vertex whose label occurs only once, as in Figure 2(a). Therefore, our proposed method collects statistics on the number of occurrences of each label in G, and vertices with rarely occurring labels are selected as the starting vertices.
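The label-count heuristic above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `pick_start_label` and its inputs are hypothetical names:

```python
from collections import Counter

def pick_start_label(graph_labels, query_labels):
    """Choose as starting label the query label that occurs least often
    in the graph, so the search begins from the fewest vertices."""
    counts = Counter(graph_labels)
    return min(query_labels, key=lambda lbl: counts[lbl])

# With one vertex labeled A and two labeled B, A is the better start.
start = pick_start_label(["A", "B", "B", "C"], ["A", "B"])
```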
When vertices matching the query are searched, vertices whose degree is smaller than that of the query vertex do not need to be searched. For example, if the vertices with label A of the query are searched, as shown in Figure 3, two vertices with label A are found, with degrees 4 and 1, respectively. Since the degree of the query vertex with label A is 2, the vertex with degree 4 must be searched, but the vertex with degree 1 has no possibility of becoming part of the query result. Vertices whose degree is smaller than that of the query vertex can thus be filtered out. Therefore, we calculate a filtering score that excludes such vertices from the search based on this degree difference.
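The degree filter described above can be sketched as follows. This is a minimal sketch, not the paper's implementation; `degree_filter` and its tuple layout are assumed for illustration:

```python
def degree_filter(graph_vertices, query_label, query_degree):
    """Keep only candidates that share the query vertex's label and whose
    degree is at least the query vertex's degree; the rest are filtered.
    graph_vertices: list of (vertex_id, label, degree) tuples."""
    return [v for v, label, deg in graph_vertices
            if label == query_label and deg >= query_degree]

# Mirroring the Figure 3 example: degrees 4 and 1 against a query degree of 2.
candidates = degree_filter(
    [("g1", "A", 4), ("g2", "A", 1), ("g3", "B", 3)],
    query_label="A", query_degree=2)
# g1 survives (degree 4 >= 2); g2 is filtered (degree 1 < 2).
```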
When real-world data are modeled as a graph, the distribution of the vertices and their edges shows a particular tendency depending on the data. If the edges are evenly distributed over the vertices, the degrees follow a normal distribution. However, many graph applications show a power-law distribution, where a small number of vertices have many edges [23,49,50]. When the degrees of the graph follow a known distribution, we calculate the probability that a vertex can be excluded from the search and use it to determine which query vertices should be searched with higher priority. The search order is determined by the filtering score, which is calculated in two ways. The first distinguishes a normal distribution, where the degrees are evenly distributed over the vertices, from a power-law distribution, where the degrees are concentrated on a few vertices; the degree of a vertex is then predicted through the probability density function of the corresponding distribution, and the probabilistic search order is determined from the predicted value. The second determines the search order of the vertices through the average degree of the vertices whose label corresponds to the query.
The proposed method defines a filtering score to determine the search order for query processing. The filtering score combines the filtering probability, calculated through a probability density function, with the proportion of a label among all labels obtained in the statistics collection stage. Different probability density functions are considered for the filtering probability because the degree distribution differs with the characteristics of the graph. The first method uses the probability density function of a normal distribution, denoted normalFS. The second uses the probability density function of a power-law distribution, denoted powerFS. The last uses the average degree of a particular label without considering the probabilistic distribution of the data, denoted avgDgFS.
Assuming that the degrees of the graph's vertices follow a normal distribution, Equation (1) shows the probability density function, where the variable is the degree of the vertices with label L, μ is the average degree of the vertices, and σ is the standard deviation of the degrees.
Suppose a vertex g in the graph and a vertex q in the query have the same label. In the probability density function, the values at the degrees of g and q denote the probabilities that those degrees occur, and the area below the degree of q is the probability of filtering, without searching, a vertex g whose degree is smaller than that of q, according to the aforementioned degree difference. Therefore, this area up to the degree of q, used for the selection of the starting vertex, is the filtering probability, as shown in Equation (2).
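The area under the normal density below the query degree is simply the normal cumulative distribution function, which can be computed with the error function. A minimal sketch, assuming Equation (2) is this CDF area (the function name is hypothetical):

```python
import math

def normal_filter_prob(query_degree, mean_degree, std_degree):
    """Area under the normal PDF below the query vertex's degree: the
    probability that a random same-label vertex has a smaller degree
    and can therefore be filtered without searching."""
    z = (query_degree - mean_degree) / (std_degree * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))

# A query degree equal to the mean filters about half of the vertices.
p = normal_filter_prob(query_degree=5, mean_degree=5, std_degree=2)
```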
According to [51], the relation between the vertices and their degrees in a real-world graph follows a power-law distribution, and the probability that a vertex has degree d is generally proportional to d^(-γ), with 2 < γ < 3. Furthermore, [52] showed that a power-law distribution can be expressed as a probability distribution called a zeta distribution or Pareto distribution. If the vertices and edges have an exponential relationship of 2, the variable s in the zeta function can be set to 2. The zeta function then converges to a specific value, and the probability density function of the Pareto distribution can be expressed as in Equations (3) and (4). Therefore, the probability density function of the power-law distribution can be represented by Equation (5).
Let g be a vertex of the graph having the same label as a query vertex q. In the probability density function, the values at the degrees of g and q represent the probabilities that those degrees occur, and the area below the degree of q is the probability of filtering, without searching, a vertex g whose degree is smaller than that of q, based on the aforementioned degree difference. Therefore, this area up to the degree of q, used for the selection of the starting vertex, is the filtering probability, as shown in Equation (6).
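With the exponent fixed at s = 2, the zeta distribution's normalizing constant is ζ(2) = π²/6, and the filtering probability is the cumulative mass below the query degree. A minimal sketch under that assumption (the function name is hypothetical):

```python
import math

def power_filter_prob(query_degree):
    """Probability that a same-label vertex has a degree below the query
    vertex's degree under a zeta distribution with s = 2:
    P(D = d) = d**-2 / zeta(2), where zeta(2) = pi**2 / 6."""
    zeta2 = math.pi ** 2 / 6
    return sum(d ** -2 for d in range(1, query_degree)) / zeta2
```

Because the mass is concentrated on low degrees, even a modest query degree already filters most vertices, which matches the power-law intuition that hub vertices are rare.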
When a query is entered as an input, the filtering score is calculated for all vertices of the query based on the collected statistical information. The filtering score combines the value calculated using one of the three methods explained above with the proportion of the corresponding label among all labels collected in the statistics collection stage. The filtering scores are classified into normalFS, powerFS, and avgDgFS according to the distribution characteristics, as shown in Equations (7)~(9), respectively.
Table 1 shows the description of the parameters in Equations (7)~(9).
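Equations (7)~(9) combine a filtering probability with a label proportion; a plausible minimal sketch, assuming (this is an assumption, not the paper's exact formula) that the two factors are multiplied:

```python
def filtering_score(filter_prob, label_count, total_label_count):
    """Hypothetical combination of the filtering probability with the
    proportion of the label among all labels (cf. Equations (7)~(9)).
    The product form is an assumption for illustration."""
    return filter_prob * (label_count / total_label_count)

# A label covering 2 of 4 labels with filtering probability 0.5 scores 0.25.
score = filtering_score(filter_prob=0.5, label_count=2, total_label_count=4)
```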
3.3. Query Partitioning
The vertex with the highest filtering score among the query vertices is selected as the starting vertex, and the query is partitioned into triplets, each pairing a vertex with the neighbor that has the next highest filtering score. A triplet consists of a tail vertex, an edge, and a head vertex: the tail stores the vertex with the higher filtering score, the head stores the neighbor vertex with the next highest filtering score among the tail's neighbors, and the edge stores the connecting edge information. In the initial triplet for the starting vertex, the tail and edge are set to NULL. Query partitioning is performed for all neighbor vertices connected to the starting vertex. When the partitioning of the neighbor vertices is finished, the same operation is performed at the vertex with the next highest filtering score among those neighbors.
Algorithm 1 shows the query partitioning. Once the starting vertex and the search order are determined, the triplets are registered as rounds in accordance with the search order. A round represents one step of the parallel search performed at each slave node, and the next round is performed only after the previous round finishes. The starting vertex with the highest filtering score is registered as the first round. The vertex with the next highest score among the neighbors of the starting vertex is registered as the next round. When the round registration for all neighbors of the starting vertex is finished, the same operation is performed for those neighbors in turn. As a result, all query vertices are partitioned in order from the highest to the lowest filtering score, and the partitioned triplets are registered in qRound.
Algorithm 1. Query partitioning.
Figure 4 shows the process of dividing a query into triplets. First, we find the vertex with the highest filtering score in the query graph and perform a depth-first search (DFS) from that vertex. Suppose query Q consists of five vertices, and the number next to each vertex is its filtering score. Since the vertex with label A has the highest filtering score, we set it as the starting vertex and add the initial triplet, whose tail and edge are NULL, to qRound. Once the starting vertex is selected, triplets are added to qRound while the DFS continues, guided by the filtering score. From the starting vertex with label A, the neighbor vertex with the highest filtering score is selected as the head of the next triplet and added to qRound. This process is repeated until the DFS finishes, at which point all triplets for the query are registered in qRound.
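The score-guided DFS partitioning can be sketched as follows. This is a minimal sketch, not Algorithm 1 itself; edges are omitted from the triplets and all names are assumptions:

```python
def partition_query(adj, scores):
    """Partition a query graph into (tail, head) rounds by a DFS that
    always expands the highest-filtering-score neighbor first.
    adj: {vertex: set of neighbors}, scores: {vertex: filtering score}."""
    start = max(scores, key=scores.get)
    q_round = [(None, start)]          # initial triplet: tail is NULL
    visited = {start}

    def dfs(tail):
        # visit unvisited neighbors in decreasing filtering-score order
        for head in sorted(adj[tail] - visited, key=scores.get, reverse=True):
            if head in visited:
                continue
            q_round.append((tail, head))
            visited.add(head)
            dfs(head)

    dfs(start)
    return q_round
```

On a small query where A (0.9) neighbors B (0.7) and C (0.5), and B neighbors D (0.3), the rounds come out as (NULL, A), (A, B), (B, D), (A, C): the search dives into the highest-scoring branch before returning for C.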
3.4. Distributed Query Processing
The proposed method uses a master-slave architecture in a Spark environment. The master node calculates the filtering scores through statistical information collection, partitions the query based on the filtering scores, and registers the query processing rounds. The master node then instructs each slave node to perform the query according to the search order. The slave nodes that receive the instruction perform the search in their own partitions and record the results in a result table. Each slave node generates intermediate results according to the search order registered in qRound, and the intermediate results generated by the slaves are combined through join operations to deliver the final results to the master node.
Algorithm 2 represents the distributed query processing performed at each slave. Triplets are stored in qRound in the order in which query processing is performed. Each slave searches the graph according to the triplet order registered in qRound, generating intermediate results with matching labels. When the first round is performed, each slave node searches for all vertices with the label of the starting vertex. In the second round, the neighbor vertices whose label matches the head of the triplet, connected to the vertices found in the first round, are searched and recorded. Each subsequent round is searched through the IDs of the vertices recorded in the previous results. In this way, the search is performed only from the vertices obtained in the previous round, and the processing cost is reduced because vertices discarded in an earlier round are never searched again. Each slave stores a partition produced by a vertex-cut method, and the cut vertices are replicated to the relevant slaves. When a replicated vertex is encountered while searching the graph according to the qRound order, a join tag is set because query processing can proceed no further on the current slave node. The candidate results of the slave nodes are later joined on the vertices to which join tags are assigned.
Algorithm 2. Distributed Query Processing.
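The per-slave round processing can be sketched as below. This is a simplified sketch, not Algorithm 2: triplets carry only labels, the result table is a list of paths, and all names are assumptions:

```python
def run_rounds(partition, q_rounds, replicated):
    """Per-slave round-based search.
    partition: {vertex_id: (label, [neighbor_ids])} local to this slave,
    q_rounds: [(tail_label_or_None, head_label)] in search order,
    replicated: set of vertex ids cut across partitions.
    Returns (candidate paths, paths join-tagged for another slave)."""
    _, start_label = q_rounds[0]
    paths = [[v] for v, (lbl, _) in partition.items() if lbl == start_label]
    join_tagged = []
    for _, head_label in q_rounds[1:]:
        next_paths = []
        for path in paths:
            last = path[-1]
            if last in replicated:
                # the path may continue on another slave: tag it for a join
                join_tagged.append(path)
                continue
            # expand only from vertices kept in the previous round
            for nbr in partition[last][1]:
                if nbr in partition and partition[nbr][0] == head_label:
                    next_paths.append(path + [nbr])
        paths = next_paths
    return paths, join_tagged
```

Because each round expands only the surviving paths of the previous round, the intermediate table shrinks as rounds proceed, matching the cost argument above.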
Figure 5 shows the intermediate search results by round. When a round is performed, the vertices connected to the previous round's results are searched, but some results have no neighbor with the label required by the current round; such results are excluded from the round's output, and the following rounds never search from the excluded vertices again. If a vertex replicated by the graph partitioning is encountered while searching a partition, more vertices may be connected to it in another partition, but this cannot be known in the current partition. Therefore, a join tag is marked on the corresponding vertex in every partition that replicates it, and the search continues. The join tag is used when the search results are exchanged between the slave nodes after the local search; it marks exactly the intermediate results that require joins, so unnecessary intermediate results are not stored. Because a join-tagged vertex indicates that the path continues in another partition, the next round begins again at the vertex with the join tag. As each round is performed only on the results of the previous round, the table recording the results gradually shrinks, and unnecessary searching is avoided.
Figure 6 shows the joining of the search results. When the partitioned query searches are finished, an intermediate result table is created in each partition, and the tables are exchanged between the partitions for the join. Because the search proceeds asynchronously in each partition, the completion times differ. A partition that has already finished its search and is waiting sends its intermediate result table to another partition that has also finished. Since the communication cost grows with the size of the table, transmitting the smaller table reduces the cost. The partition that receives an intermediate result table joins it with its own.
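The final join on the join-tagged vertices can be sketched as follows. A minimal sketch under the assumption (ours, not the paper's) that tables are lists of paths and the shared vertex is the last vertex of one path and the first of the other:

```python
def join_results(local_table, received_table):
    """Join two intermediate result tables on their shared join-tagged
    vertex: the last vertex of a local path must equal the first vertex
    of a received path, and the paths are concatenated."""
    return [left + right[1:]
            for left in local_table
            for right in received_table
            if left[-1] == right[0]]

# Paths meeting at the replicated vertex g5 are merged into one result.
merged = join_results([["g1", "g5"]], [["g5", "g7"]])
```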