Example4: Using side informationΒΆ

Latent data fusion methods, such as Matrix Factorization or Entity-Relation learning, learn a latent representation for the entities corresponding to the objects described by the rows and columns of each matrix/Relation.

Clearly, in these settings, if a row or a column of the matrix is completely empty, no optimization of the corresponding latent variables can be performed. A possible solution to this problem is to add some explicit variables to the model, analogous to the conventional features used in regular ML methods. These feature vectors are called side information in the MF/ER data fusion context.
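As a minimal illustration (plain NumPy with made-up numbers, not NXTfusion code): a matrix with a completely empty column cannot inform the latent variables of that column, while a per-column feature matrix (the side information) still describes it.

import numpy as np

# Toy 4x3 interaction matrix (e.g. proteins x compounds), NaN = unobserved.
# Column 2 is completely empty: its latent variables cannot be fitted
# from the matrix alone ("cold start").
toyMat = np.array([[5.1, 6.0, np.nan],
                   [4.8, np.nan, np.nan],
                   [np.nan, 5.5, np.nan],
                   [5.0, 6.2, np.nan]])
print(np.isnan(toyMat[:, 2]).all())  # True

# Side information: one explicit feature vector per column entity
# (e.g. a chemical fingerprint per compound). The empty column still
# has features, so a model using side information can describe it.
toySide = np.random.rand(toyMat.shape[1], 8)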

In examples/exampleSide.py we show an example of how side information can be introduced into an NXTfusion model.

In this example we will use the following datasets:

wget http://homes.esat.kuleuven.be/~jsimm/chembl-IC50-346targets.mm
wget http://homes.esat.kuleuven.be/~jsimm/chembl-IC50-compound-feat.mm
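If wget is not available, the same two files can be downloaded from Python with the standard library (a sketch equivalent to the commands above):

from urllib.request import urlretrieve

#same files as the wget commands above
urlretrieve("http://homes.esat.kuleuven.be/~jsimm/chembl-IC50-346targets.mm",
            "chembl-IC50-346targets.mm")
urlretrieve("http://homes.esat.kuleuven.be/~jsimm/chembl-IC50-compound-feat.mm",
            "chembl-IC50-compound-feat.mm")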

First we read the datasets and transpose the target matrix to make sure that it is in the prot-drug format.

WARNING: the ECFP side information is quite large and, due to the currently missing support for sparse side information, it requires 12 GB of RAM. For this reason we read it, but run the example on a smaller dataset.

Please note that sparsity support IS present for matrices/Relations in NXTfusion, so the library scales quite well to very large matrices.

from scipy.io import mmread
import numpy as np

ic50 = mmread("chembl-IC50-346targets.mm").transpose()
shape = ic50.shape
#read the side information (features)
#requires 12 GB of RAM, so we propose a smaller (randomly generated) alternative
#(sparse support for side information is currently missing)
ecfp = mmread("chembl-IC50-compound-feat.mm")
ecfp = np.random.rand(ecfp.shape[0], 50)
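A quick sanity check (our addition, not part of the example script) confirms the orientation after the transpose:

#ic50 should now be proteins x compounds; ecfp has one (random) feature row per compound
print("ic50 (prot x drug):", shape)
print("ecfp (drug features):", ecfp.shape)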

We define the Entities as usual, and we transform the input data into the DataMatrix format. In this case we also transform the side information raw data into a NXTfusion-compatible format using the SideInfo class.

protEnt = NX.Entity("proteins", list(range(0,shape[0])), np.int16)
drugEnt = NX.Entity("compounds", list(range(0,shape[1])), np.int16)
ic50DrugMat = DM.DataMatrix("ic50", protEnt, drugEnt, ic50)
ecfpSideMat = DM.SideInfo("drugSide", drugEnt, ecfp)
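Since ecfpSideMat is attached to drugEnt, the rows of ecfp are expected to line up with the compounds; a defensive check along these lines (our assumption, not something required by the library) catches mismatches early:

#one side-information row per compound, i.e. as many rows as columns in the (transposed) ic50 matrix
assert ecfp.shape[0] == shape[1], "side info rows must match the number of compounds"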

We build the MetaRelation and Relation as usual. The only difference is that ecfpSideMat, which contains the side information, is passed as an argument to the MetaRelation, in order to specify that side information is available for the drugEnt NXTfusion.NXTfusion.Entity (ent2).

protDrugRel = NX.MetaRelation("prot-drug", protEnt, drugEnt, None, ecfpSideMat)
protDrugRel.append(NX.Relation("drugInteraction", protEnt, drugEnt, ic50DrugMat, "regression", protDrugLoss, relationWeight=1))
ERgraph = NX.ERgraph([protDrugRel])

Training and testing are performed as usual.

model = example1Model(ERgraph, "mod1")
wrapper = NNwrapper(model, dev = DEVICE, ignore_index = IGNORE_INDEX)
wrapper.fit(ERgraph, epochs=5)

For prediction, we need to specify the side information again. This is done by just passing it to the .predict() method.

X, Y, corresp = buildPytorchFeats(ic50DrugMat)
Yp = wrapper.predict(ERgraph, X, "prot-drug", "drugInteraction", None, ecfpSideMat)
print("Final MSE: ", (np.sum((np.array(Yp) - np.array(Y))**2))/float(len(Yp)))

#we do the same, but taking the coo_matrix as input instead
X, Y, corresp = buildPytorchFeats(ic50, protEnt, drugEnt)
Yp = wrapper.predict(ERgraph, X, "prot-drug", "drugInteraction", None, ecfpSideMat)
print("Final MSE: ", (np.sum((np.array(Yp) - np.array(Y))**2))/float(len(Yp)))

In this example we compute the predictions twice to show that, thanks to method overloading, the buildPytorchFeats function can build the input X vector starting either from DataMatrix objects or from other formats such as scipy.sparse.coo_matrix.