In this chapter, an overview of the recent research literature related to the graph attention model is presented; in addition, another critical GNN architecture that inspired this work, the Directional Graph Network (DGN), is also discussed. First, Section 3.1 gives an overview of the "ancestor" of the Graph Attention Network (GAT), namely the Graph Convolutional Network (GCN). The general concept of the attention mechanism, which originated in language models, is then briefly discussed in Section 3.2. Next, Section 3.3 gives a thorough introduction to the GAT model, one of the essential works on which this thesis is based. Lastly, another important work [7], which introduces the notion of direction in graph neural networks and has inspired this thesis, is presented in detail in Section 3.4.
3.1. Graph Convolutional Networks
As briefly discussed in Section 2.5, inspired by the success of Convolutional Neural Networks, researchers attempted to generalize the convolution operation to graph-structured data. This family of graph neural networks based on graph convolution operations is often referred to as Convolutional Graph Neural Networks (ConvGNNs), or simply as Graph Convolutional Networks (GCNs). Akin to the convolutional layers used in CNNs, a graph convolutional layer in a GCN generates a higher-level representation of each graph node $u$ by leveraging the information of its neighbourhood. However, as illustrated in Figure 7, unlike grid-like image data, which has a fixed and regular neighbouring structure, a node's neighbours in a graph are unordered and vary in number, which makes generalizing the convolution operation to graphs much more challenging.
The first GCN, proposed by Bruna et al. [29], is based on spectral graph theory: it uses the graph Fourier transform to map the graph signal into its spectral domain and then performs the graph convolution in that domain. Graph convolutions defined by the spectral method are closely related to filters in the context of graph signal processing, and can be interpreted as operations that remove noise from graph signals (or graph features, in the context of graph representation learning) [29]. However, this method is computationally intensive, since it requires calculating both the graph Fourier transform and its inverse. Furthermore, the graph convolution relies on the eigen-decomposition of the Laplacian matrix, which implies that the learnt convolution operations are domain-dependent and not easily transferable to graphs with different topological properties [35].
ChebNet [36] was then proposed to address those drawbacks: it uses Chebyshev polynomials to approximate the spectral graph convolution operation up to a given order $K$. ChebNet implicitly avoids computing the graph Fourier transform, thus reducing the computational complexity by a large margin. Kipf and Welling further simplified the convolution operation in their work [5] by explicitly setting $K = 1$. Much of the later literature refers to their model simply as GCN; the same parlance will be adopted in the rest of this thesis.
GCN can be understood from the perspective of message passing, which was introduced earlier in Section 2.5. Given a graph $G = (V, E)$ with $n = |V|$ nodes, let $X \in \mathbb{R}^{n \times (d+1)}$ be the feature matrix of $G$, whose first column is the all-ones vector $\mathbf{1}$ reserved for the bias terms in the weight matrix, and whose $u$-th row, spanning from the 2nd to the $(d+1)$-th column, is the transpose of the feature vector of node $u$. To obtain a high-level representation $h_u^{(k)}$ of node $u$ at the $k$-th GCN layer, an intermediate representation $\hat{h}_u^{(k)}$ is first generated by aggregating the representations of $u$'s neighbouring nodes together with its own representation $h_u^{(k-1)}$ produced by the previous layer. This intermediate representation $\hat{h}_u^{(k)}$ is then transformed into the output representation $h_u^{(k)}$ with a one-hidden-layer MLP.
The above process of node representation generation can be expressed in matrix form as:
$$H^{(k)} = \sigma\big(\tilde{A}\, H^{(k-1)}\, W^{(k)}\big), \tag{3.1}$$
where $H^{(k)}$ is the node representation matrix at the $k$-th GCN layer (with an extra dimension for the bias term), such that its $u$-th row is the transpose of the node representation vector $h_u^{(k)}$ of node $u$; the matrix $\tilde{A} = A + I$ is the adjacency matrix with self-loops added in; $W^{(k)}$ is a learnable weight matrix (strictly speaking, a different symbol would be required for notational consistency with Chapter 2, but for simplicity we write $W^{(k)}$ here); and $\sigma$ is an element-wise non-linear activation function. As usual, we take the initial representation matrix to be $H^{(0)} = X$.
One issue with the above expression is that $\tilde{A}$ is not normalized. This may introduce numerical instability and cause the exploding/vanishing gradient problem (e.g., nodes with many neighbours will obtain very large values in their representations) when multiple GCN layers are stacked together [5]. The simplest way to fix this issue is to modify (3.1) as follows:
$$H^{(k)} = \sigma\big(\tilde{D}^{-1}\tilde{A}\, H^{(k-1)}\, W^{(k)}\big), \tag{3.2}$$
where $\tilde{D} = D + I$, and $D$ is the degree matrix of the graph $G$. Adding the term $\tilde{D}^{-1}$ into (3.1) is equivalent to applying a mean operation, which normalizes each row of $\tilde{A}$ according to the degrees of the nodes involved. The node-wise message-passing rule of (3.2) can be expressed as:
$$h_u^{(k)} = \sigma\Big(\frac{1}{|\mathcal{N}(u)| + 1} \sum_{v \in \mathcal{N}(u) \cup \{u\}} W^{(k)\top} h_v^{(k-1)}\Big),$$
where $(h_u^{(k)})^{\top}$ is the $u$-th row of the representation matrix $H^{(k)}$.
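To make the propagation rule concrete, the following is a minimal NumPy sketch of a single GCN layer using the mean normalization of (3.2). It is an illustration only: the bias column is omitted, ReLU is used for $\sigma$, and the function and variable names are ours rather than taken from [5].

```python
import numpy as np

def gcn_layer_mean(A, H, W):
    """One GCN layer with mean normalization, as in (3.2).

    A: (n, n) adjacency matrix (without self-loops)
    H: (n, d_in) node representations from the previous layer
    W: (d_in, d_out) learnable weight matrix
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-loops
    D_tilde = np.diag(A_tilde.sum(axis=1))       # degree matrix of A_tilde
    A_norm = np.linalg.inv(D_tilde) @ A_tilde    # D^{-1} A_tilde: row-wise mean
    return np.maximum(0.0, A_norm @ H @ W)       # ReLU plays the role of sigma

# toy example: a path graph on 3 nodes
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.random.randn(3, 4)        # initial node features
W1 = np.random.randn(4, 8)
H1 = gcn_layer_mean(A, H0, W1)    # new representations, shape (3, 8)
```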
In the GCN model proposed by Kipf and Welling [5], a more sophisticated normalization method, namely symmetric normalization, is adopted:
$$H^{(k)} = \sigma\big(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\, H^{(k-1)}\, W^{(k)}\big), \tag{3.3}$$
and the equivalent node-wise expression can be written as:
$$h_u^{(k)} = \sigma\Big(\sum_{v \in \mathcal{N}(u) \cup \{u\}} \frac{1}{\sqrt{(|\mathcal{N}(u)|+1)(|\mathcal{N}(v)|+1)}}\, W^{(k)\top} h_v^{(k-1)}\Big).$$
The use of symmetric normalization allows a more refined aggregation process to be defined, as it no longer amounts to a simple averaging of the neighbouring nodes. However, both the mean-normalized and the symmetrically normalized GCN models require prior knowledge of the entire graph structure to perform message passing, which makes them less robust and inapplicable to inductive-learning tasks (see Section 4.4 for more details). GraphSAGE [3] addressed this issue by introducing a general inductive framework, which samples a fixed-size set of neighbours of each node and only uses them to aggregate information in the message-passing step. The node representation generated by GraphSAGE can be expressed as:
$$h_u^{(k)} = \sigma\Big(W^{(k)} \cdot \operatorname{CONCAT}\big(h_u^{(k-1)},\ \operatorname{AGGREGATE}^{(k)}\big(\{\,h_v^{(k-1)} : v \in \mathcal{N}_S(u)\,\}\big)\big)\Big),$$
where $\mathcal{N}_S(u)$ is a random sample of the neighbours of node $u$, and $\operatorname{AGGREGATE}^{(k)}$ is the AGGREGATE operation at the $k$-th layer. Figure 8 shows an example of the nodes sampled (red nodes) from the neighbourhood of the central yellow node.
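The sampled-neighbourhood aggregation idea can be illustrated with the following sketch, which uses a mean aggregator and expresses the concatenation as two separate weight blocks. It is a simplified illustration under these assumptions, not the reference GraphSAGE implementation of [3].

```python
import numpy as np

def sage_layer(A, H, W_self, W_neigh, sample_size, rng):
    """One GraphSAGE-style layer with a mean aggregator over sampled neighbours.

    A: (n, n) adjacency matrix
    H: (n, d_in) previous-layer representations
    W_self, W_neigh: (d_in, d_out) weight blocks (together playing the role of W with CONCAT)
    """
    n = A.shape[0]
    H_new = np.empty((n, W_self.shape[1]))
    for u in range(n):
        neigh = np.flatnonzero(A[u])
        if len(neigh) > sample_size:                       # fixed-size random sample N_S(u)
            neigh = rng.choice(neigh, size=sample_size, replace=False)
        agg = H[neigh].mean(axis=0) if len(neigh) else np.zeros(H.shape[1])
        H_new[u] = np.maximum(0.0, H[u] @ W_self + agg @ W_neigh)
    return H_new
```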
3.2. Attention Mechanism
The idea of attention was first proposed in the field of psychology, where it is used to explain the cognitive process of selectively processing certain information in an environment while ignoring the rest [37]. Bahdanau et al. [38] were the first to utilize the attention mechanism in machine learning; more specifically, they applied it to the machine translation task in natural language processing [39]. After their enormous success, researchers began to adopt and integrate the attention mechanism into different sub-domains of machine learning, such as computer vision [40] and recommendation systems [41]. Nowadays, the attention mechanism has become a prevailing concept in machine learning and is an essential component of many neural network architectures.
Before the attention mechanism was used for language modelling, the predominant architecture for natural language processing tasks was the sequence-to-sequence model (or seq2seq model for short) [42]. The seq2seq model consists of two main components, an encoder and a decoder, both of which are recurrent neural networks as introduced in Section 2.4. With this specialized architecture, a seq2seq model can transform an input of arbitrary length into an output of arbitrary length. In particular, the encoder compresses the input sequence of tokens/words into a single fixed-length context vector, which then serves as the decoder's input and is used to generate the output sequence. Figure 9 shows a high-level overview of a seq2seq model that translates the English sentence "They are watching" into the French sentence "Ils regardent".
Despite its prevalence, the traditional seq2seq model faces two major challenges [39]. The first is dealing with long input sequences: since the context vector generated by the encoder has a fixed length regardless of the input sequence size, information may be lost when the input sequence is long [43]. The second is modelling the alignment between the input and output sequences, which is essential for certain tasks such as machine translation and summarization. Intuitively speaking, in sequence-to-sequence tasks, each output token should be influenced more by certain parts of the input sequence than by the rest. For example, in the English-French translation shown in Figure 9, "Ils" in the output sequence should be associated with "They" in the input sequence, while "regardent" should be related to "are watching". Unfortunately, the encoder-decoder architecture utilized in the seq2seq model lacks any mechanism to selectively focus on relevant input tokens while generating each output token [44].
The attention model was proposed to mitigate these two challenges. The fundamental idea behind it is that, instead of generating the context vector solely from the last hidden state of the encoder, attention weights, which reflect the inter-relationship between each output position and every token in the input sequence, are used in the context vector generation. The attention weights are learned jointly with all the other model parameters during training (by using an MLP). The context vector is then computed as the weighted sum of the encoder's hidden states over all input tokens, which avoids information loss. When generating the output sequence, the attention weights prioritize the positions in the input sequence where the relevant information is present [39]. Figure 10 compares the traditional encoder-decoder seq2seq architecture with the attention-based seq2seq architecture.
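The following NumPy sketch illustrates how such a context vector can be computed as an attention-weighted sum of the encoder's hidden states. It is a simplified additive-attention illustration rather than the exact formulation of [38]; all parameter names and shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention_context(enc_states, dec_state, W_enc, W_dec, v):
    """Compute a context vector from attention weights over the input positions.

    enc_states: (T, d_h) encoder hidden states, one per input token
    dec_state:  (d_s,)   current decoder hidden state
    W_enc, W_dec, v: parameters of the small MLP that scores each input position
    """
    # unnormalized alignment score for every input position
    scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v   # (T,)
    weights = softmax(scores)                                      # attention weights
    context = weights @ enc_states                                 # weighted sum of encoder states
    return context, weights
```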
Even though the attention mechanism boosts the performance of the seq2seq model in many language-related tasks, it still has some drawbacks inherited from the recurrent architecture. One drawback is computational efficiency: it is difficult to parallelize the processing of the input sequence, since the tokens are processed sequentially [39]. Another flaw of the seq2seq model is its limited ability to relate tokens within the input/output sequence itself. To address these problems, Vaswani et al. proposed the Transformer architecture in [45], which profoundly impacted later research on neural network architectures. In short, the Transformer eliminates the sequential processing and the recurrent architecture by utilizing the self-attention mechanism, allowing the model to see the entire input sequence simultaneously. As hinted by its name, self-attention is an attention mechanism that computes a representation of a sequence by relating tokens at different positions of the sequence itself. The results in [45] revealed that the Transformer achieves higher accuracy with less training time via parallel processing on the machine translation task, without using any recurrent component [39]. Apart from self-attention, another essential mechanism introduced in [45] is multi-headed attention. Instead of computing the attention only once, the multi-headed mechanism runs the attention computation multiple times in parallel, using different linear transformations of the same input sequence. The outputs are then concatenated and linearly transformed into the expected dimension [39]. Empirical results have shown that the attention weights learned with the multi-headed mechanism can further boost the model's performance.
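The following sketch illustrates multi-headed scaled dot-product self-attention over a token sequence. It is a simplified illustration of the idea in [45] (no masking, biases or dropout), and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over a token sequence (simplified sketch).

    X:  (T, d) input token representations
    Wq, Wk, Wv: (n_heads, d, d_head) per-head projection matrices
    Wo: (n_heads * d_head, d) output projection
    """
    heads = []
    d_head = Wq.shape[-1]
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        Q, K, V = X @ q_proj, X @ k_proj, X @ v_proj   # (T, d_head) each
        scores = Q @ K.T / np.sqrt(d_head)             # every token attends to every token
        heads.append(softmax(scores, axis=-1) @ V)     # (T, d_head)
    return np.concatenate(heads, axis=-1) @ Wo         # concatenate heads, project back
```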
3.3. Graph Attention Network
As mentioned in the preceding section, the attention mechanism frees the Transformer model from the sequential processing of its input; this characteristic allows the idea to be easily extended to data structures other than sequences, such as graph-structured data. Graph-structured data extracted from the real world are often large and chaotic; using attention can help highlight the elements of the graph that are most relevant to the main task. Moreover, forcing the model to focus on the most important parts of the graph potentially allows it to filter out noise, thus improving the signal-to-noise ratio [46]. Another benefit of using attention is interpretability: the learned attention weights are a potential tool for interpreting the results obtained from the model [47]. The attention mechanism on a graph can be defined at different levels, namely the node level, edge level, or sub-graph level; in this thesis, the focus is on node-level attention.
Given a graph $G = (V, E)$, let $u \in V$ be a node and let $\mathcal{N}(u)$ be the set of neighbouring nodes of $u$. Attention on a graph is defined as a function $f : V \times V \to \mathbb{R}$ that maps the node $u$ and any node in $\mathcal{N}(u)$ to a relevance score, which defines how much attention the target node should give to each of its neighbours. Moreover, it is often assumed that the relevance scores over the neighbourhood of $u$ sum to one, i.e. $\sum_{v \in \mathcal{N}(u)} f(u, v) = 1$ [48]. Several different types of graph attention mechanisms exist, though they all share the same principle and only differ in how the attention function $f$ is defined.

One may quickly notice some similarities between the attention mechanism and the symmetrically normalized adjacency matrix used in GCN (3.3). Both indicate the strength of the relationship between a pair of connected nodes. Intuitively speaking, the elements of the symmetrically normalized adjacency matrix can be viewed as relevance scores that determine the importance of a given node and its neighbouring nodes during message passing. The most significant distinction between the relevance scores used by the two models is that the edge weights in the attention mechanism are learnt implicitly during training, whereas in GCN they are fixed by the graph structure.
The graph attention network (GAT) [4] extends the GCN by leveraging an explicit self-attention mechanism. As the name suggests, GAT introduces the attention mechanism when aggregating the neighbouring nodes' features, substituting it for the normalized convolution operation. How important a node is to another node is determined jointly with the other parameters during training.

Let $u$ and $v$ be two connected nodes of $G$, and let $h_u^{(k-1)}$ and $h_v^{(k-1)}$ respectively denote the representation vectors of the two nodes generated by the $(k-1)$-th layer. At the $k$-th GAT layer, the amount of attention that node $u$ should give to node $v$, based on their representation vectors, is computed via a shared attention mechanism $\operatorname{att}$ as:
$$e_{uv}^{(k)} = \operatorname{att}\big(W^{(k)} h_u^{(k-1)},\ W^{(k)} h_v^{(k-1)}\big),$$
where $W^{(k)}$ is a learnable weight matrix shared over all nodes.

In different variations of the GAT model, $\operatorname{att}$ is computed differently. In the original GAT, the attention mechanism is a one-layer MLP parameterized by a weight vector $a^{(k)}$ (of dimension twice the output dimension of $W^{(k)}$), followed by the LeakyReLU non-linearity:
$$e_{uv}^{(k)} = \operatorname{LeakyReLU}\Big(a^{(k)\top}\big[\,W^{(k)} h_u^{(k-1)} \,\Vert\, W^{(k)} h_v^{(k-1)}\,\big]\Big), \tag{3.4}$$
where $\Vert$ denotes vector concatenation.
Brody et al. proposed GATv2 in [49], which modifies the original GAT model by defining $e_{uv}^{(k)}$ as:
$$e_{uv}^{(k)} = a^{(k)\top} \operatorname{LeakyReLU}\Big(W^{(k)}\big[\,h_u^{(k-1)} \,\Vert\, h_v^{(k-1)}\,\big]\Big). \tag{3.5}$$
In particular, if we write $a^{(k)} = [\,a_1^{(k)} \,\Vert\, a_2^{(k)}\,]$, then the score in (3.4) decomposes as $\operatorname{LeakyReLU}\big(a_1^{(k)\top} W^{(k)} h_u^{(k-1)} + a_2^{(k)\top} W^{(k)} h_v^{(k-1)}\big)$; since $a^{(k)}$ is shared across all nodes in the graph, if there exists a node $v^{*}$ such that $a_2^{(k)\top} W^{(k)} h_{v^{*}}^{(k-1)}$ is maximal among all nodes, then for every node $u$, the node $v^{*}$ will always obtain the highest attention score. As shown in (3.5), the GATv2 model alleviates this problem by applying $a^{(k)}$ after the non-linearity, so that the ranking of the attention scores is no longer the same for every node $u$. As shown in (3.4), the first step performed in GAT is to transform the feature vectors linearly by $W^{(k)}$; the two transformed vectors are then concatenated and mapped by $a^{(k)}$ to an attention score $e_{uv}^{(k)}$.
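The difference between the two scoring functions can be illustrated with the following NumPy sketch, which computes the raw score for a single pair of connected nodes. It is an illustration only; the shapes and names are ours, not those of [4] or [49].

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_score(h_u, h_v, W, a):
    """Raw GAT score (3.4): LeakyReLU(a^T [W h_u || W h_v]).
    W: (d_out, d_in), a: (2 * d_out,)."""
    return leaky_relu(a @ np.concatenate([W @ h_u, W @ h_v]))

def gatv2_score(h_u, h_v, W, a):
    """Raw GATv2 score (3.5): a^T LeakyReLU(W [h_u || h_v]).
    Here W acts on the concatenated vector, so W: (d_out, 2 * d_in), a: (d_out,)."""
    return a @ leaky_relu(W @ np.concatenate([h_u, h_v]))
```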
GAT adopts masked attention to preserve the structural information: the attention scores are only computed between a node and its neighbouring nodes (in other words, GAT only computes $e_{uv}^{(k)}$ for $v \in \mathcal{N}(u) \cup \{u\}$), whereas in the most general form of graph attention every pair of nodes may attend to each other, which implies that the graph structure is dropped completely. In particular, a mask matrix $M$ is used to enforce the structural information of the input graph. The most commonly used mask matrix is defined by converting the zero entries of the adjacency matrix (with self-loops added) into $-\infty$, i.e.
$$M_{uv} = \begin{cases} 0 & \text{if } \tilde{A}_{uv} \neq 0, \\ -\infty & \text{otherwise.} \end{cases}$$
Let $E^{(k)}$ denote the matrix of raw attention scores at the $k$-th layer. To enforce the graph structural information in $E^{(k)}$, we update it using $M$: $E^{(k)} \leftarrow E^{(k)} + M$. Thus, the raw scores of non-adjacent node pairs are set to $-\infty$ and are eliminated by the normalization described next.
Furthermore, in order to make the computed attention scores easily comparable across different nodes, they are normalized using the softmax function:
$$\alpha_{uv}^{(k)} = \operatorname{softmax}_v\big(e_{uv}^{(k)}\big) = \frac{\exp\big(e_{uv}^{(k)}\big)}{\sum_{w \in \mathcal{N}(u) \cup \{u\}} \exp\big(e_{uw}^{(k)}\big)}.$$
After applying the softmax function, all attention scores $\alpha_{uv}^{(k)}$ lie in the range $[0, 1]$. This process is illustrated in Figure 11 (left).
Once the attention scores of node $u$ and its neighbouring nodes are obtained, a new representation $h_u^{(k)}$ of node $u$ can be computed by:
$$h_u^{(k)} = \sigma\Big(\sum_{v \in \mathcal{N}(u) \cup \{u\}} \alpha_{uv}^{(k)}\, W^{(k)} h_v^{(k-1)}\Big), \tag{3.7}$$
where $\sigma$ is a non-linear activation function and the weighted sum inside $\sigma$ is the aggregated representation for node $u$. This process is illustrated in Figure 11 (right).
The matrix form of (3.7) can be written as:
$$H^{(k)} = \sigma\big(\mathcal{A}^{(k)}\, H^{(k-1)}\, W^{(k)\top}\big),$$
where row $u$ of $H^{(k)}$ is $(h_u^{(k)})^{\top}$, and $\mathcal{A}^{(k)}$ is the attention matrix with $\mathcal{A}^{(k)}_{uv} = \alpha_{uv}^{(k)}$ for $v \in \mathcal{N}(u) \cup \{u\}$ and $0$ otherwise.
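Putting the pieces together, the following sketch computes one single-head GAT layer, covering the raw scores of (3.4), the masking step, the softmax normalization and the aggregation of (3.7). It is an illustration under the conventions above, not the reference implementation of [4].

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a):
    """Single-head GAT layer.

    A: (n, n) adjacency matrix
    H: (n, d_in) previous-layer node representations
    W: (d_out, d_in) shared linear transformation
    a: (2 * d_out,) attention vector
    """
    n = A.shape[0]
    Z = H @ W.T                                        # transformed features, (n, d_out)
    d_out = Z.shape[1]
    s_src = Z @ a[:d_out]                              # a_1^T W h_u for every u
    s_dst = Z @ a[d_out:]                              # a_2^T W h_v for every v
    E = leaky_relu(s_src[:, None] + s_dst[None, :])    # raw scores e_uv, (n, n)
    mask = np.where(A + np.eye(n) > 0, 0.0, -np.inf)   # masked attention: neighbours + self only
    E = E + mask
    E = E - E.max(axis=1, keepdims=True)               # numerical stability before softmax
    alpha = np.exp(E)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # normalized scores alpha_uv
    return np.maximum(0.0, alpha @ Z)                  # aggregation (3.7), ReLU as sigma
```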
Furthermore, GAT utilizes a multi-head attention approach similar to the one used by the Transformer, a popular architecture in NLP [45]. Running multi-head attention is equivalent to running multiple attention mechanisms in parallel and aggregating the results. Multi-head attention allows the network to learn a richer representation of the input data and can help stabilize the learning process of self-attention [4]. Concretely, $M$ independent attention mechanisms are executed in parallel following equation (3.7), and each of these attention mechanisms is referred to as an "attention head". The resulting representations are then concatenated to form the output feature representation:
$$h_u^{(k)} = \Big\Vert_{m=1}^{M}\, \sigma\Big(\sum_{v \in \mathcal{N}(u) \cup \{u\}} \alpha_{uv}^{(k,m)}\, W^{(k,m)} h_v^{(k-1)}\Big), \tag{3.8}$$
where $\alpha_{uv}^{(k,m)}$ is the normalized attention score computed by the $m$-th attention head at the $k$-th layer, and $W^{(k,m)}$ is the corresponding weight matrix. The matrix form of (3.8) can be written as:
$$H^{(k)} = \Big\Vert_{m=1}^{M}\, \sigma\big(\mathcal{A}^{(k,m)}\, H^{(k-1)}\, W^{(k,m)\top}\big).$$
Note that when multi-head attention is performed on the final layer of the network, concatenation is no longer appropriate, and averaging is employed instead:
$$h_u^{(K)} = \sigma\Big(\frac{1}{M} \sum_{m=1}^{M} \sum_{v \in \mathcal{N}(u) \cup \{u\}} \alpha_{uv}^{(K,m)}\, W^{(K,m)} h_v^{(K-1)}\Big),$$
or equivalently,
$$H^{(K)} = \sigma\Big(\frac{1}{M} \sum_{m=1}^{M} \mathcal{A}^{(K,m)}\, H^{(K-1)}\, W^{(K,m)\top}\Big),$$
where $K$ is the total number of GAT layers. Figure 11 illustrates the aggregation performed by a multi-head graph attention layer. Every neighbour of node 1 (as well as node 1 itself) is associated with three attention scores, denoted by differently coloured lines in the figure; these are then used to compute and aggregate the next-level representation of node 1.
GAT has significantly outperformed GCN on the node classification task on several citation networks. However, recent studies [9, 10, 50] have shown that applying it, or other popular GNN architectures (including GCN), to heterophilous datasets can lead to a significant loss of performance.
3.4. Direction in Graph Neural Network
As mentioned in the preceding section, most GNNs choose between the mean, max or sum function as their AGGREGATE operation. However, this choice of AGGREGATE operation may lead to low discriminative power, since all neighbours are treated equally [7]. Furthermore, the use of such operations can also trigger more serious issues from which many GNNs suffer, namely the over-smoothing and over-squashing problems [51]. In particular, over-smoothing refers to the problem that node representations become indistinguishable as the number of layers in a GNN increases, whereas over-squashing refers to the inability of a GNN to propagate information effectively between distant nodes.

The most natural way to alleviate the aforementioned issues is to introduce a mechanism into the aggregation step that allows the model to distinguish the messages passed from different neighbours. The GAT model introduced in the previous section adopts an attention mechanism based on the node features and utilizes the attention scores to discern incoming messages. In particular, the weights defined by the node features can be considered a form of local directional flow that guides the message passing in the aggregation step [7]. In this section, we introduce the global directional flow over general graphs proposed by Beaini et al. [7].
3.4.1. Vector fields in a graph
In vector calculus and physics, a vector field is an assignment of a vector to each point in a subset of space. In order to establish a sense of direction in a graph, Beaini et al. [7] introduced the notion of a vector field on a graph. Given a graph $G = (V, E)$ with $n = |V|$ nodes, define the vector space $L^2(V)$ as the set of functions $f : V \to \mathbb{R}$, together with the $L^1$-norm, $L^2$-norm and scalar product:
$$\|f\|_{L^1} = \sum_{i \in V} |f_i|, \qquad \|f\|_{L^2} = \Big(\sum_{i \in V} f_i^2\Big)^{1/2}, \qquad \langle f, g\rangle_{L^2(V)} = \sum_{i \in V} f_i\, g_i.$$
Similarly, the vector space $L^2(E)$ is defined as the set of functions $F : V \times V \to \mathbb{R}$ with $F_{i,j} = 0$ whenever $(i, j) \notin E$, together with the $L^1$-norm, $L^2$-norm and scalar product:
$$\|F\|_{L^1} = \sum_{(i,j) \in E} |F_{i,j}|, \qquad \|F\|_{L^2} = \Big(\sum_{(i,j) \in E} F_{i,j}^2\Big)^{1/2}, \qquad \langle F, G\rangle_{L^2(E)} = \sum_{(i,j) \in E} F_{i,j}\, G_{i,j}.$$
Here $L^2(E)$ can be regarded as the set of "vector fields" on the space $V$: for $F \in L^2(E)$, each row $F_{i,:}$ represents a vector at node $i$, and each element $F_{i,j}$ is the component of that vector going from node $i$ to node $j$ through the edge $(i, j)$. Note that $F_{i,j} = 0$ for nodes $i$ and $j$ that are not connected; furthermore, if no self-loops are added to the graph $G$, then $F_{i,i} = 0$ for all nodes $i$.

Let $|F|$ denote the entry-wise absolute value of $F$ (i.e., $|F|_{i,j} = |F_{i,j}|$), and let $\|F_{i,:}\|_{L^1}$ denote the $L^1$-norm of the $i$-th row of $F$. The positive/negative part $F^{\pm}$ of $F$ then defines the forward/backward directional flow.
The pointwise scalar product is defined as the map $L^2(E) \times L^2(E) \to L^2(V)$ that takes two vector fields and returns their inner product at each node in $V$; the value at node $i$ is defined as:
$$\langle F, G\rangle_i = \sum_{j : (i,j) \in E} F_{i,j}\, G_{i,j}.$$
The gradient $\nabla$ is defined as a mapping $L^2(V) \to L^2(E)$:
$$(\nabla f)_{i,j} = f_j - f_i \quad \text{for } (i,j) \in E, \tag{3.9}$$
and the divergence $\operatorname{div}$ of $F$ is defined as a mapping $L^2(E) \to L^2(V)$:
$$(\operatorname{div} F)_i = \sum_{j : (i,j) \in E} F_{i,j}. \tag{3.10}$$
Using (3.9) and (3.10), the directional derivative of a function $f \in L^2(V)$ in the direction of the vector field $F$ can be defined as:
$$D_F f(i) = \sum_{j : (i,j) \in E} \hat{F}_{i,j}\, \big(f_j - f_i\big), \tag{3.11}$$
where each row of $\hat{F}$ is the corresponding row of $F$ normalized by its $L^1$-norm, $\hat{F}_{i,:} = F_{i,:}\,\big/\,\big(\|F_{i,:}\|_{L^1} + \epsilon\big)$ for all $i \in V$, with $\epsilon$ a small positive constant that avoids division by zero. The directional derivative can be interpreted as the instantaneous rate of change of the function $f$ moving through each node with the velocity specified by $\hat{F}$.
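The three operators above can be written compactly for a dense adjacency matrix, as in the following sketch. It is an illustration only, under the conventions and the small constant $\epsilon$ assumed above.

```python
import numpy as np

def gradient(A, f):
    """Graph gradient (3.9): (grad f)_{ij} = f_j - f_i on every edge (i, j)."""
    return A * (f[None, :] - f[:, None])

def divergence(F):
    """Graph divergence (3.10): (div F)_i = sum_j F_{ij}."""
    return F.sum(axis=1)

def directional_derivative(A, F, f, eps=1e-8):
    """Directional derivative (3.11): D_F f(i) = sum_j F_hat_{ij} (f_j - f_i),
    with each row of F normalized by its L1-norm."""
    F_hat = F / (np.abs(F).sum(axis=1, keepdims=True) + eps)
    return (F_hat * (f[None, :] - f[:, None])).sum(axis=1)
```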
3.4.2. Directional smoothing and derivative operations
Having defined the vector field $F$ and established a sense of direction in graphs, we now introduce two weighted aggregation matrices, namely the directional average matrix $B_{av}$ and the directional derivative matrix $B_{dx}$, that use these concepts to guide the information propagation in the graph.

The directional average matrix $B_{av}(F)$ is defined as
$$B_{av}(F)_{i,:} = \frac{|F_{i,:}|}{\|F_{i,:}\|_{L^1} + \epsilon}. \tag{3.12}$$
As shown in the above equation, $B_{av}$ is a weighted aggregation matrix with non-negative weights; furthermore, all non-zero rows of $B_{av}$ have their $L^1$-norm equal to 1. It assigns a large weight to the elements in the forward or backward direction of the field, while assigning a small weight to the other elements, with a total weight of one [7].

Let $X$ (note that we omit the bias term here) denote the feature matrix of $G$; its $m$-th column $x_{:,m}$ is then the vector consisting of the $m$-th feature of all nodes in $V$. The directional smoothing aggregation of $x_{:,m}$ is defined as
$$y_{av}^{(m)} = B_{av}(F)\, x_{:,m}. \tag{3.13}$$
Note that the $i$-th element of $y_{av}^{(m)}$ can be viewed as a weighted average of the $m$-th feature over the neighbouring nodes of node $i$, weighted by the direction and amplitude of $F$.
The directional derivative matrix $B_{dx}(F)$ is defined as
$$B_{dx}(F) = \hat{F} - \operatorname{diag}\Big(\sum_{j} \hat{F}_{:,j}\Big). \tag{3.14}$$
The aggregator $B_{dx}$ works by taking the difference between the projected forward and backward messages (similarly to a centred derivative), with an additional diagonal term to balance both directions [7]. The directional derivative aggregation of the $m$-th feature vector $x_{:,m}$ is defined as
$$y_{dx}^{(m)} = B_{dx}(F)\, x_{:,m}, \tag{3.15}$$
which is essentially the centred directional derivative of $x_{:,m}$ in the direction of $F$. It is easy to show that (3.15) can also be expressed using (3.11):
$$\big(y_{dx}^{(m)}\big)_i = D_F\, x_{:,m}(i) = \sum_{j : (i,j) \in E} \hat{F}_{i,j}\, \big(x_{j,m} - x_{i,m}\big). \tag{3.16}$$
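A compact sketch of the two aggregation matrices, under the row-wise $L^1$ normalization assumed above, is given below; it is illustrative only and not the reference implementation of [7].

```python
import numpy as np

def directional_aggregation_matrices(F, eps=1e-8):
    """Directional smoothing matrix B_av (3.12) and directional derivative matrix B_dx (3.14)."""
    F_hat = F / (np.abs(F).sum(axis=1, keepdims=True) + eps)   # L1-normalized rows
    B_av = np.abs(F_hat)                                       # non-negative weights, row sums close to 1
    B_dx = F_hat - np.diag(F_hat.sum(axis=1))                  # centred-derivative aggregator
    return B_av, B_dx

# Aggregating one feature column x over all nodes:
#   y_av = B_av @ x   (directional smoothing, (3.13))
#   y_dx = B_dx @ x   (directional derivative, (3.15)); note (B_dx @ x)_i = sum_j F_hat_ij (x_j - x_i)
```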
Figure 12 illustrates an example of how the directional aggregation works on a single node $i$ with three neighbouring nodes $j_1$, $j_2$ and $j_3$. Let $x_i$ and $x_{j_1}, x_{j_2}, x_{j_3}$ denote the $m$-th feature of nodes $i$ and $j_1, j_2, j_3$, respectively. The directional smoothing aggregation of the graph in Figure 12, centred at node $i$, can be written as:
$$\big(y_{av}^{(m)}\big)_i = B_{av}(F)_{i,j_1}\, x_{j_1} + B_{av}(F)_{i,j_2}\, x_{j_2} + B_{av}(F)_{i,j_3}\, x_{j_3}.$$
Similarly, the directional derivative aggregation of the graph in Figure 12, centred at node $i$, can be written as:
$$\big(y_{dx}^{(m)}\big)_i = \hat{F}_{i,j_1}\,(x_{j_1} - x_i) + \hat{F}_{i,j_2}\,(x_{j_2} - x_i) + \hat{F}_{i,j_3}\,(x_{j_3} - x_i),$$
where the components $\hat{F}_{i,j_1}$, $\hat{F}_{i,j_2}$ and $\hat{F}_{i,j_3}$ of the vector field take the values indicated in Figure 12.
3.4.3. Using gradient of the Laplacian eigenvectors as vector fields
With everything discussed in the preceding sections, one may wonder what a reasonable choice of vector field would be for general graphs. In [7], Beaini et al. proposed using the gradients of the low-frequency eigenvectors of the graph Laplacian as such vector fields. The eigenvectors of the common types of Laplacian matrices (namely the unnormalized, degree-normalized and symmetric normalized Laplacian matrices) have been studied intensively in the field of spectral graph theory [52] and used extensively in graph signal processing [53]. They are known to capture many imperative properties of graphs, making them a sensible choice for defining directional message passing (some theory is provided in Section 3.4.5). In particular, if we let $\phi_k$ denote the $k$-th eigenvector of the Laplacian matrix of the input graph $G$, then its gradient, used as a vector field, is defined via (3.9) as:
$$F_k = \nabla \phi_k, \qquad (F_k)_{i,j} = \phi_k(j) - \phi_k(i) \quad \text{for } (i,j) \in E. \tag{3.17}$$
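The following sketch computes such a vector field, assuming the unnormalized Laplacian and a connected graph; it is an illustration only (`numpy.linalg.eigh` returns the eigenvalues in ascending order, so index 1 corresponds to the smallest non-trivial eigenvalue).

```python
import numpy as np

def lowest_nontrivial_eigenvector(A):
    """Eigenvector of the unnormalized Laplacian L = D - A associated with the
    smallest non-trivial (positive) eigenvalue, for a connected graph."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return eigvecs[:, 1]                      # index 0 is the trivial constant eigenvector

def eigenvector_field(A, phi):
    """Vector field F = grad(phi), as in (3.17): F_ij = phi_j - phi_i on each edge."""
    return A * (phi[None, :] - phi[:, None])
```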
3.4.4. Directional Graph Network
Based on the theoretical motivation discussed in the preceding sections, Beaini et al. designed the Directional Graph Network (DGN), which utilizes the directional smoothing and directional derivative aggregators in the message-passing step. In contrast to GAT and the other previously mentioned message-passing GNNs, DGN demonstrates a distinct advantage on heterophilous datasets due to its inherently anisotropic nature.

Utilizing the DGN model involves two phases, namely the pre-computation phase and the GNN phase [7]. During the pre-computation phase, we compute the set of $L$ eigenvectors corresponding to the $L$ smallest positive eigenvalues of the Laplacian matrix of the input graph. From now on, we denote the eigenvector associated with the $l$-th smallest positive eigenvalue $\lambda_l$ by $\phi_l$. We then calculate the gradients of these eigenvectors as defined in (3.17); the computed gradients are used as the vector fields, which we denote by $F_l = \nabla \phi_l$. Lastly, we construct the directional smoothing aggregation matrix $B_{av}(F_l)$ and the directional derivative aggregation matrix $B_{dx}(F_l)$ for each $F_l$, as shown in (3.12) and (3.14) respectively.
In the GNN phase, the DGN model takes the graph feature matrix $X$, the adjacency matrix $A$, and the set of directional smoothing and directional derivative aggregation matrices $\{B_{av}(F_l),\ B_{dx}(F_l)\}_{l=1}^{L}$ as inputs (where $F_l = \nabla \phi_l$ for each $l$). Then, this set of aggregation matrices is used jointly with other aggregation operators, such as the simple mean operator $\tilde{D}^{-1}\tilde{A}$ (the same normalization operation used in the GCN introduced in Section 3.1), to aggregate the features $X$ of the input graph:
$$Y = \Big[\ \tilde{D}^{-1}\tilde{A}X\ \big\Vert\ B_{av}(F_1)X\ \big\Vert\ \big|B_{dx}(F_1)X\big|\ \big\Vert\ \cdots\ \big\Vert\ B_{av}(F_L)X\ \big\Vert\ \big|B_{dx}(F_L)X\big|\ \Big],$$
where $Y$ is the concatenation of all directional and non-directional aggregations of the node features. Note that for the directional derivative aggregation $B_{dx}$, the absolute value is taken in order to avoid the sign ambiguity of the eigenvectors. Then, similarly to GCN, a one-hidden-layer MLP is applied to the aggregated node features to generate the new node representations:
$$H = \sigma\big(Y\, W\big),$$
where $W$ is a learnable weight matrix. At the $k$-th DGN layer, the node representation matrix is therefore computed as:
$$H^{(k)} = \sigma\Big(\Big[\ \tilde{D}^{-1}\tilde{A}H^{(k-1)}\ \big\Vert\ B_{av}(F_1)H^{(k-1)}\ \big\Vert\ \big|B_{dx}(F_1)H^{(k-1)}\big|\ \big\Vert\ \cdots\ \Big]\, W^{(k)}\Big).$$
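The following sketch illustrates one such layer with a single vector field and the mean aggregator. It is a simplified illustration of the scheme above rather than the reference DGN implementation of [7], which supports several aggregators, degree scalers and eigenvectors.

```python
import numpy as np

def dgn_layer(A, H, B_av, B_dx, W):
    """One DGN-style layer: concatenate a mean aggregation with the directional
    aggregations, then apply a shared linear map and non-linearity.

    A: (n, n) adjacency matrix, H: (n, d) node representations,
    B_av, B_dx: (n, n) directional aggregation matrices for one vector field,
    W: (3 * d, d_out) learnable weight matrix.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    mean_agg = (A_tilde / A_tilde.sum(axis=1, keepdims=True)) @ H      # D^{-1} A_tilde H
    Y = np.concatenate([mean_agg, B_av @ H, np.abs(B_dx @ H)], axis=1)  # abs avoids sign ambiguity
    return np.maximum(0.0, Y @ W)
```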
In the DGN model, the number of eigenvectors to be used is regarded as a hyperparameter. However, Beaini et al. showed both empirically and theoretically in [7] that taking the eigenvector associated with the smallest non-trivial eigenvalue is enough. Furthermore, the type of Laplacian matrix used by the DGN model is also a hyperparameter, the options being the unnormalized Laplacian $L = D - A$, the degree-normalized (random-walk) Laplacian $L_{rw} = D^{-1}L$, and the symmetric normalized Laplacian $L_{sym} = D^{-1/2} L D^{-1/2}$.
3.4.5. Theoretical Analysis
In this section, we review the theoretical analysis in [7], which justifies the choice of the eigenvector $\phi_1$ corresponding to the smallest non-trivial eigenvalue $\lambda_1$ of a graph Laplacian to define the global directional flow in a graph. This analysis also serves as an important theoretical foundation for our work in this thesis. The theorem in [7] states that, by following the gradient of this eigenvector, the diffusion distance between a pair of nodes on a graph can be reduced effectively.

Let $P^{j} = (D^{-1}A)^{j}$ be the transition matrix (also called the random-walk matrix) of a $j$-step Markov process on a graph $G = (V, E)$, where the transition probability of the graph nodes at each step is defined by $D^{-1}A$. A continuous-time random walk can then be defined on the same graph, with the transition probability from node $i$ to node $j$:
$$p_t(i, j) = e^{-t} \sum_{k=0}^{\infty} \frac{t^k}{k!}\, p^{(k)}(i, j),$$
where $t$ represents continuous time and $p^{(k)}(i, j)$ is the probability of transiting from node $i$ to node $j$ in $k$ steps. For instance, if $k = 0$, then $p^{(0)}(i, j) = 1$ if $i = j$ and $0$ otherwise. This transition probability is also referred to as the continuous heat kernel.
First, we outline the following lemma, based on the results from [54].

Lemma 3.1. The transition probabilities of the continuous-time random walk on a graph can be written in matrix form as:
$$P_t = e^{-t L_{rw}},$$
where $L_{rw} = I - D^{-1}A$ is the random-walk normalized Laplacian matrix and $(P_t)_{ij} = p_t(i, j)$.
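Assuming, as above, that the single-step transition matrix is $D^{-1}A$, the matrix form follows directly from the series definition of the heat kernel:
$$P_t \;=\; \sum_{k=0}^{\infty} e^{-t}\,\frac{t^{k}}{k!}\,\big(D^{-1}A\big)^{k} \;=\; e^{-t}\, e^{\,t\,D^{-1}A} \;=\; e^{-t\left(I - D^{-1}A\right)} \;=\; e^{-t\,L_{rw}}.$$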
Next, we give the following definitions of the diffusion distance and the gradient step from [7].

Definition 3.1 (Diffusion distance). The diffusion distance at time $t$ between the nodes $i$ and $j$ is
$$d_t(i, j) = \Big(\sum_{k \in V} \big(p_t(i, k) - p_t(j, k)\big)^2\Big)^{1/2}.$$
Note that the diffusion distance is small when there is a high probability that a random walk starting at node $i$ will meet a random walk starting at node $j$ at time $t$. In graph representation learning, the diffusion distance is often used to model how node $i$ influences node $j$ [7].
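For illustration, and assuming the unweighted form of the definition given above together with the matrix form of Lemma 3.1, the diffusion distance can be evaluated numerically as follows (SciPy's matrix exponential is used for the heat kernel; the graph is assumed to have no isolated nodes).

```python
import numpy as np
from scipy.linalg import expm

def diffusion_distance(A, i, j, t):
    """Diffusion distance d_t(i, j) between nodes i and j of a graph with adjacency A."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    L_rw = np.eye(A.shape[0]) - D_inv @ A          # random-walk normalized Laplacian
    P_t = expm(-t * L_rw)                          # continuous heat kernel (Lemma 3.1)
    return np.linalg.norm(P_t[i] - P_t[j])         # L2 distance between the two rows
```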
Definition 3.2 (Gradient step). Let $\phi$ denote an eigenvector corresponding to a non-trivial eigenvalue of a Laplacian matrix. Suppose that $i' \in \mathcal{N}(i)$ and
$$\phi(i') = \max_{k \in \mathcal{N}(i)} \phi(k);$$
then we say that $i'$ is obtained from $i$ by taking a step in the direction of the gradient $\nabla\phi$.
Lastly, we outline Theorem 2.3 of [7] here.

Theorem 3.1 (Gradient steps reduce diffusion distance). Let $i$ and $j$ be two nodes such that $\phi_1(i) < \phi_1(j)$, where $\phi_1$ is the eigenvector corresponding to the smallest non-trivial eigenvalue $\lambda_1$ of the graph Laplacian. Let $i'$ be the node obtained from $i$ by taking one step in the direction of $\nabla\phi_1$ (as defined in Definition 3.2). Then there is a constant $C$ such that for all $t > C$,
$$d_t(i', j) < d_t(i, j),$$
with the reduction in distance being proportional to $e^{-t\lambda_1}$.
The theorem presented in [7] considers only the eigenvector $\phi_1$ of one particular type of Laplacian matrix for the DGN model. However, besides this Laplacian, [7] also utilized the eigenvectors of the two other types of Laplacian matrices in their experimental setting. In particular, the DGN using the direction defined by the eigenvector corresponding to the smallest non-trivial eigenvalue of one of these alternative Laplacians yielded the best results (though this is not shown directly in [7], the choice of hyperparameters can be found in the authors' GitHub repository [55]). Therefore, it is imperative to extend the theorem to encompass the case in which that Laplacian is used.