One major problem of the existing GCNs is their low expressive power, limited by their shallow learning mechanisms [61,66]. There are mainly two reasons why an architecture that is scalable in depth has not been achieved yet. First, the problem is difficult: considering graph convolution as a special form of Laplacian smoothing [32], a network with many convolutional layers suffers from an over-smoothing problem that makes the representations of even distant nodes indistinguishable [66]. Second, some argue that it is unnecessary: for example, [4] states that the label information does not need to traverse the entire graph, and that one can operate on a multi-scale coarsened input graph to obtain the same flow of information as a GCN with more layers. Acknowledging the difficulty, we hold on to the objective of deepening GCNs, since the desired compositionality will yield easy articulation and consistent performance across problems of different scales.
In Section 2.1, we first analyze the limits that over-smoothing and the activation functions impose on deep GCNs. Then, we show that any graph convolution with a well-defined analytic spectral filter can be written as the product of a block Krylov matrix and a learnable parameter matrix in a special form. Based on this, we propose two GCN architectures that leverage multi-scale information in different ways and are scalable in depth, with stronger expressive power and the ability to extract richer representations of graph-structured data. For empirical validation, we test different instances of the proposed architectures on multiple node classification tasks. The results show that even the simplest instance of the architectures achieves state-of-the-art performance, and the more complex ones achieve surprisingly higher performance. In Section 2.2, we take a closer look at the over-smoothing problem and outline some ideas for addressing it.
2.1. A Stronger Multi-Scale Deep GNN with Truncated Krylov Architecture
Suppose we deepen GCN in the same way as [24,32]; we then have
$$Y = \operatorname{softmax}\!\big(\hat{L}\,\mathrm{ReLU}\big(\cdots \hat{L}\,\mathrm{ReLU}\big(\hat{L}\,\mathrm{ReLU}(\hat{L} X W_0)\,W_1\big)W_2 \cdots\big)W_n\big), \qquad (3)$$
where $\hat{L}$ denotes the renormalized adjacency matrix used in [24], $X$ the input node feature matrix, and $W_0, \dots, W_n$ the learnable weight matrices. For this architecture, without considering the ReLU activation function, [32] shows that $Y$ will converge to a space spanned by the eigenvectors of $\hat{L}$ associated with eigenvalue 1. Taking the activation function into consideration, our analyses of (3) can be summarized in the following theorems (see the proofs in the appendix of [44]).
Theorem 1. Suppose that the graph $\mathcal{G}$ has $k$ connected components. Let $X$ be any feature matrix and let the $W_j$ be any non-negative parameter matrices with $\|W_j\|_2 \le 1$ for $j = 0, \dots, n$. If $\mathcal{G}$ has no bipartite components, then in (3), as $n \to \infty$, $\operatorname{rank}(Y) \le k$.
Theorem 2. Suppose the $n$-dimensional vectors $x$ and $y$ are independently sampled from a continuous distribution and the activation function $\operatorname{Tanh}$ is applied to $[x, y]$ pointwisely; then
$$P\big(\operatorname{rank}\operatorname{Tanh}([x, y]) = \operatorname{rank}[x, y]\big) = 1.$$
Theorem 1 shows that, even with ReLU taken into account, if we simply deepen GCN as in (3), the extracted features degrade under certain conditions, i.e., $Y$ only contains the stationary information of the graph structure and loses all the local information in the nodes because of smoothing. In addition, the proof shows that the pointwise ReLU transformation is a conspirator in this degradation. Theorem 2 tells us that Tanh is better at keeping linear independence among column features. We design a numerical experiment on synthetic data to test, under a 100-layer GCN architecture, how the activation function affects the rank of the output of each hidden layer during the feedforward process. As Figure 1a shows, the rank of the hidden features decreases rapidly with ReLU, while fluctuating only slightly under Tanh; even the identity function performs better than ReLU. We therefore propose to replace ReLU with Tanh.
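As an illustration, the following is a minimal NumPy sketch of this kind of experiment: a plainly deepened GCN in the spirit of (3) with random, untrained weights, where we track the rank of the hidden features across 100 layers for ReLU, Tanh, and the identity. The graph size, edge density, feature dimension, and weight scaling are our own illustrative choices, not the exact setup of [44].

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, layers = 500, 64, 100

# Random undirected graph and its renormalized adjacency L_hat = D~^{-1/2} (A + I) D~^{-1/2}.
A = (rng.random((n, n)) < 0.02).astype(float)
A = np.triu(A, 1)
A = A + A.T
A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
L_hat = A_tilde / np.sqrt(np.outer(d, d))

activations = {
    "ReLU": lambda Z: np.maximum(Z, 0.0),
    "Tanh": np.tanh,
    "identity": lambda Z: Z,
}

X = rng.standard_normal((n, f))
for name, act in activations.items():
    H = X
    ranks = []
    for _ in range(layers):
        W = rng.standard_normal((f, f)) / np.sqrt(f)  # random, untrained weights
        H = act(L_hat @ H @ W)                        # one propagation step of (3)
        ranks.append(np.linalg.matrix_rank(H))
    print(f"{name}: rank of hidden features after {layers} layers = {ranks[-1]}")
```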
Besides the activation function, to find a way to deepen GCN, we first show that any graph convolution with a well-defined analytic spectral filter defined on $\hat{L}$ can be written as the product of a block Krylov matrix and a learnable parameter matrix in a specific form. Based on this, we propose the snowball network and the truncated Krylov network.
We take $\mathbb{S} = \mathbb{R}^{F \times F}$. Given a set of block vectors $\{X_k\}_{k=1}^{m} \subseteq \mathbb{R}^{N \times F}$, the $\mathbb{S}$-span of $\{X_k\}_{k=1}^{m}$ is defined as
$$\operatorname{span}^{\mathbb{S}}\{X_1, \dots, X_m\} := \Big\{ \textstyle\sum_{k=1}^{m} X_k C_k : C_k \in \mathbb{S} \Big\}.$$
Then, the order-$m$ block Krylov subspace with respect to the matrix $A \in \mathbb{R}^{N \times N}$, the block vector $B \in \mathbb{R}^{N \times F}$ and the vector space $\mathbb{S}$, and its corresponding block Krylov matrix, are respectively defined as
$$\mathcal{K}_m^{\mathbb{S}}(A, B) := \operatorname{span}^{\mathbb{S}}\{B, AB, \dots, A^{m-1}B\}, \qquad K_m(A, B) := [B, AB, \dots, A^{m-1}B] \in \mathbb{R}^{N \times mF}.$$
It is shown in [11,15] that there exists a smallest $m$ such that for any $k \ge m$, $A^k B \in \mathcal{K}_m^{\mathbb{S}}(A, B)$, where $m$ depends on $A$ and $B$.
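For concreteness, the block Krylov matrix $K_m(A, B) = [B, AB, \dots, A^{m-1}B]$ can be built by repeatedly applying $A$ to the block vector. The helper below is a plain NumPy sketch, not an optimized block Krylov routine.

```python
import numpy as np

def block_krylov_matrix(A, B, m):
    """Return K_m(A, B) = [B, AB, ..., A^{m-1}B] with shape (N, m*F)."""
    blocks = []
    P = B
    for _ in range(m):
        blocks.append(P)
        P = A @ P  # next power of A applied to the block vector
    return np.concatenate(blocks, axis=1)
```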
Let $\rho(\hat{L})$ denote the spectral radius of $\hat{L}$ and suppose $\rho(\hat{L}) < R$, where $R$ is the radius of convergence of a real analytic scalar function $g$. Based on the above definitions and conclusions, the graph convolution can be written as
$$g(\hat{L})X = [X, \hat{L}X, \dots, \hat{L}^{m-1}X]\big[(\Gamma_0^{\mathbb{S}})^{T}, (\Gamma_1^{\mathbb{S}})^{T}, \dots, (\Gamma_{m-1}^{\mathbb{S}})^{T}\big]^{T} \equiv K_m(\hat{L}, X)\,\Gamma^{\mathbb{S}},$$
where $\Gamma_i^{\mathbb{S}} \in \mathbb{R}^{F \times F}$ for $i = 0, \dots, m-1$ are parameter matrix blocks and $\Gamma^{\mathbb{S}} \in \mathbb{R}^{mF \times F}$. Then, a graph convolutional layer can generally be written as
$$g(\hat{L})X W' = K_m(\hat{L}, X)\,\Gamma^{\mathbb{S}} W' \equiv K_m(\hat{L}, X)\,W^{\mathbb{S}},$$
where $W^{\mathbb{S}} \in \mathbb{R}^{mF \times O}$ is a parameter matrix and $W' \in \mathbb{R}^{F \times O}$. The essential number of learnable parameters is $mF \times O$.
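As a sketch of how a single layer of the form $g(\hat{L})XW' = K_m(\hat{L}, X)W^{\mathbb{S}}$ might look in code, the snippet below truncates the Krylov expansion at order `m` and uses Tanh, consistent with the earlier discussion; the shape of `W_S` and the activation choice are illustrative rather than the exact formulation of [44].

```python
import numpy as np

def truncated_krylov_layer(L_hat, X, W_S, m):
    """One graph convolutional layer in block Krylov form:
    H = Tanh(K_m(L_hat, X) @ W_S), where W_S has shape (m*F, O)."""
    blocks = []
    P = X
    for _ in range(m):
        blocks.append(P)
        P = L_hat @ P
    K = np.concatenate(blocks, axis=1)  # K_m(L_hat, X), shape (N, m*F)
    return np.tanh(K @ W_S)             # output shape (N, O)
```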
The block Krylov form provides insight into why an architecture that concatenates multi-scale features in each layer boosts the expressive power of GCN. Based on this idea, we propose the snowball and truncated block Krylov architectures [44] shown in Figure 2, where we stack multi-scale information in each layer. From the performance comparison on semi-supervised node classification tasks with different label percentages in Table 1, we can see that the proposed models consistently perform better than the state-of-the-art models, especially when fewer nodes are labeled. See the detailed experimental results in [44].
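To give a flavor of the snowball idea, here is a minimal NumPy sketch in which each layer receives the concatenation of all previously computed feature blocks, so multi-scale information accumulates as the network deepens. The weight shapes, Tanh activation, and final softmax classifier are illustrative choices; the precise definitions of both architectures are given in [44].

```python
import numpy as np

def snowball_forward(L_hat, X, weights, classifier):
    """Snowball-style stacking: layer l maps the concatenation
    [X_0, X_1, ..., X_l] (smoothed by L_hat) to the next block X_{l+1};
    weights[l] must have input width equal to that concatenation."""
    features = [X]
    for W in weights:
        H = np.concatenate(features, axis=1)      # [X_0, X_1, ..., X_l]
        features.append(np.tanh(L_hat @ H @ W))   # X_{l+1}
    H_all = np.concatenate(features, axis=1)
    logits = L_hat @ H_all @ classifier
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)   # row-wise softmax over classes
```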
Table 1. Accuracy without validation. For each column, the greener the cell, the better the performance; the redder, the worse. If our methods achieve better performance than all others, the corresponding cell is in bold.