How to define a dataset¶
This guide shows you how to define datasets of graphs, which are the inputs of \(\mu\mathcal{G}\) models. Dataset
s are containers for one or more Graph
objects, and they are implemented in Spektral. libmg imports
their implementation, and in fact you will see that this guide is very similar to the one linked above. Datasets are defined by subclassing them and
overriding the read
and download
methods. Then they can be instantiated by providing a name. In the following sections, we will go over these steps.
Defining the dataset¶
We start by importing the Dataset
and Graph
classes from libmg
. We will also be importing os
, numpy
and scipy
which will be useful later.
We can define a new dataset by subclassing Dataset
. In the __init__
method we are only required to pass a string name to the parent class, but as
usual we can also provide additional arguments that we may need.
class MyDataset(Dataset):
def __init__(self, name, arg1, arg2, ...):
super().__init__(name)
self.arg1 = arg1
self.arg2 = arg2
...
def read(self):
pass
def download(self):
pass
Dataset
is instantiated, the __init__
will call download
first (if necessary, see below), followed by read
. The list of Graph
objects
returned by read
will constitute the contents of our dataset.
The download
method is supposed to create the raw data of the dataset. It is called if a directory named ~/spektral/datasets/[ClassName]
is missing. In such
directory the download
method should store the data, so that in future instantiations read
can load this data without calling download
again. Thus, the
download
method will usually create this directory and save some data there, e.g. .npz
files, .csv
files, etc.
The read
method is called on every instantiation and must return a list
of Graph
objects. If we defined a download
method, these Graph
objects will
usually come from the files we saved in the ~/spektral/datasets/[ClassName]
directory. If we didn't define a download
method, we also have the possibility of
generating these graphs on-the-fly.
Defining a Graph
¶
A Graph
object can be instantiated using the constructor Graph(x=None, a=None, e=None, y=None)
. All these four arguments are optional, but
in \(\mu\mathcal{G}\) you should always at least provide x
and a
.
The node features matrix X
¶
Each node in the graph is assigned a vector of features. For example, when considering a citation network, each node represents a paper and will be assigned a vector of floating-point numbers that encode the contents of that paper.
The node features matrix X
stores all these vectors, such that in row \(i\) is stored the feature vector for node \(i\). Therefore, this matrix will have rows
equal to the number of nodes in the graph, and columns equal to the length of the feature vectors (which will all have the same length).
When creating a Graph
, the x
argument that encodes the node features matrix should be passed in as a NumPy array (np.array
).
The adjacency matrix A
¶
The adjacency matrix encodes the connections of a graph. In \(\mu\mathcal{G}\) adjacency matrices are binary, i.e. they only contain zeros and ones. A value of 1 at row \(i\) and column \(j\) means that there exists a directed edge going from node \(i\) to node \(j\). A value of 0 means that there is no such edge.
The adjacency matrix is supposed to be created as a SciPy sparse matrix in COOrdinate format
(coo_matrix
from scipy.sparse
). A sparse matrix only contains the coordinates of the non-zero elements. In this format the adjacency matrix is specified
from three arrays: the row indices, the column indices, and the values. The values will be an array of 1s equal to the number of edges in the graph. The row and
column indices are the indices of the nodes that the edges connect: the row index is the source node of the edge and the column index is the target node of
the edge. The indices should be in row-major order, that is, they are ordered according to the rows first and to the columns second.
The edge features matrix E
¶
As is the case for the nodes, edges can have features as well. The edge features matrix E
has a row for each edge in the graph and columns equal to the length
of their feature vectors. In this format, the feature vector in row \(i\) corresponds to the \(i\)-th edge of the graph. The \(i\)-th edge of the graph is the
edge corresponding to the \(i\)-th row index and column index of the adjacency matrix \(A\).
When creating a Graph
, the e
argument that encodes the edge features matrix should be passed in as a NumPy array (np.array
).
The true labels matrix Y
¶
Usually in machine learning we are also provided the true labels to be used for training or testing models. In \(\mu\mathcal{G}\) the labels are always meant to be node labels, that is, we are given a label for each node of the graph.
The true labels features matrix Y
stores the true labels vector, such that in row \(i\) is stored the true labels vector for node \(i\). Therefore, this matrix
will have rows equal to the number of nodes in the graph, and columns equal to the length of the true label vectors.
When creating a Graph
, the y
argument that encodes the true labels matrix should be passed in as a NumPy array (np.array
).
Overriding download
¶
The download
method will usually create the ~/spektral/datasets/[ClassName]
(available through self.path
) directory and populate it with data. The data
can be generated according to some specification or downloaded from the web. So the general structure of a download
method will be:
def download(self):
os.mkdir(self.path)
data = ... # Obtain the data
# Save the data
np.savez('mydata', ...)
Overriding read
¶
The read
method will either load up the data in ~/spektral/datasets/[ClassName]
or create it on-the-fly. What it matters is that it returns a list
of
Graph
objects.
def read(self):
output = []
mydata = np.load(os.path.join(self.path, 'mydata.npz'))
...
X = np.array(...)
A = coo_matrix(...)
output.append(Graph(x=X, a=A))
...
return output
Instantiating the dataset¶
The dataset can now be instantiated by calling the constructor.