What is Data Fusion?¶
- As summarized by Wikipedia:
Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
In the era of big data, many scientific disciplines are producing enormous amounts of heterogeneous data from which we want to infer reliable predictive or descriptive models. There is thus a pressing need for powerful, scalable algorithms that integrate multiple sources of information and learn complex patterns from this multi-faceted and interconnected data. To face this challenge, we propose a novel data fusion approach for non-linear inference over arbitrary entity-relation graphs.
What is NXTfusion?¶
NXTfusion is a Neural Network-based data fusion method that extends the classical Matrix Factorization paradigm by allowing non-linear inference over arbitrarily connected Entity-Relation graphs (ER graphs).
What is an Entity-Relation graph?¶
An ER graph is an abstract data structure, similar to a relational database, that models classes of objects (Entities) and the relations between them (Relations).
The ER formalism is a generalization of the well-known Matrix Factorization formalism. Every data fusion problem can be described in terms of entity-relation (ER) models, where entities are classes of objects belonging to a particular domain and relations describe the interactions between entities. Such a data fusion model is completely general and could allow inference on an extremely broad class of problems. Moreover, the ease with which entities can be connected through relations allows the inclusion of data sets that are only loosely related to the problem under investigation.
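To make the description above concrete, here is a minimal sketch of an ER graph as a plain Python data structure. The class names (`Entity`, `Relation`, `ERGraph`), their methods, and the toy values are all hypothetical and chosen for illustration only; they are not the NXTfusion API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A class of objects, e.g. all proteins or all drugs (hypothetical helper)."""
    name: str
    size: int  # number of distinct objects in this class


@dataclass
class Relation:
    """A sparsely observed matrix linking the objects of two entities."""
    name: str
    row_entity: Entity
    col_entity: Entity
    # Only observed cells are stored: (row index, column index) -> value.
    observed: dict = field(default_factory=dict)

    def add(self, i, j, value):
        if not (0 <= i < self.row_entity.size and 0 <= j < self.col_entity.size):
            raise IndexError(f"({i}, {j}) is outside relation {self.name!r}")
        self.observed[(i, j)] = value


@dataclass
class ERGraph:
    """Entities are the nodes, relations are the edges."""
    relations: list = field(default_factory=list)

    def entities(self):
        # Collect each entity once, in order of first appearance.
        seen = {}
        for r in self.relations:
            seen.setdefault(r.row_entity.name, r.row_entity)
            seen.setdefault(r.col_entity.name, r.col_entity)
        return list(seen.values())

    def relations_of(self, entity):
        return [r for r in self.relations if entity in (r.row_entity, r.col_entity)]


# Toy, made-up example: three entities connected by two relations.
protein = Entity("Protein", size=4)
drug = Entity("Drug", size=3)
disease = Entity("Disease", size=2)

binds = Relation("binds", protein, drug)
binds.add(0, 2, 1.0)
binds.add(3, 1, 0.0)
involved_in = Relation("involved_in", protein, disease)
involved_in.add(1, 0, 1.0)

graph = ERGraph([binds, involved_in])
entity_names = [e.name for e in graph.entities()]
protein_relations = [r.name for r in graph.relations_of(protein)]
print(entity_names, protein_relations)
```

In this sketch, Protein takes part in both relations, so anything learned about proteins from one relation can inform the other. A loosely related data set would enter the graph as one more `Relation` attached to an existing entity.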
NXTfusion approach generalizes existing data fusion methods¶
In general, each relation corresponds to a possibly sparsely observed matrix, and the entities are the objects represented as the rows and columns of that matrix.
In the classical Matrix Factorization paradigm, usually only a single matrix Y = UV is factorized into two latent matrices U and V, meaning that a single interaction (Y) between two entities (of which U and V are the latent representations) is considered.
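As an illustration of the classical setting, the following sketch factorizes one sparsely observed matrix Y with plain stochastic gradient descent, using only the observed cells. The data, the latent dimension `k`, the learning rate, and the regularization weight are made-up values for this example, and the code is a generic textbook-style factorization, not part of NXTfusion.

```python
import random

random.seed(0)

# Toy, made-up observations of Y: (row i, column j) -> value.
# Cells that are missing from the dict are unobserved.
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0,
            (2, 1): 1.0, (2, 2): 5.0, (3, 0): 1.0, (3, 2): 4.0}
n_rows, n_cols, k = 4, 3, 2  # k is the latent dimension (assumed)

# U holds one k-dimensional latent vector per row object.
# V is stored with one latent vector per column object, so the
# reconstruction below is U times the transpose of this V.
U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_rows)]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_cols)]


def predict(i, j):
    """Reconstructed entry (i, j): dot product of the two latent vectors."""
    return sum(U[i][f] * V[j][f] for f in range(k))


def mse():
    """Mean squared error over the observed cells only."""
    return sum((y - predict(i, j)) ** 2 for (i, j), y in observed.items()) / len(observed)


initial_error = mse()
lr, reg = 0.02, 0.01  # learning rate and L2 regularization (assumed values)
for epoch in range(2000):
    for (i, j), y in observed.items():
        err = y - predict(i, j)
        for f in range(k):
            u, v = U[i][f], V[j][f]
            # Gradient step on the regularized squared error of this cell.
            U[i][f] += lr * (err * v - reg * u)
            V[j][f] += lr * (err * u - reg * v)
final_error = mse()

print(f"MSE on observed cells: {initial_error:.3f} -> {final_error:.3f}")
print(f"prediction for unobserved cell (1, 1): {predict(1, 1):.2f}")
```

Once U and V are fitted, any unobserved cell can be predicted from the latent vectors of its row and column, which is how a single relation is completed in this paradigm.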
An extension of this is Tensor Factorization (e.g. https://arxiv.org/abs/1512.00315), where multiple matrices/relations between the same two entities are factorized at the same time.
Real-world data is nevertheless richer than this: a problem might be characterized by many relations between many pairs of objects, thus forming a complex graph of entities (the nodes) connected by relations (the edges).
Here we further extend the field of data fusion by building a Neural Network-based data fusion framework for non-linear inference over completely arbitrary ER graphs, as we showed in https://doi.org/10.1093/bioinformatics/btab092.
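The idea of non-linear inference over several relations at once can be sketched in PyTorch as follows. This is a rough illustration under stated assumptions, not the NXTfusion API or the architecture from the paper: the class `ToyERModel`, the embedding size, the per-relation MLP, and the toy data are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyERModel(nn.Module):
    """Hypothetical model: one shared embedding table per entity,
    one small non-linear network per relation."""

    def __init__(self, entity_sizes, relations, dim=8):
        super().__init__()
        # entity_sizes: {"Protein": 5, ...}
        # relations: {"binds": ("Protein", "Drug"), ...}
        self.relations = relations
        self.embed = nn.ModuleDict(
            {name: nn.Embedding(n, dim) for name, n in entity_sizes.items()}
        )
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(2 * dim, 16), nn.ReLU(), nn.Linear(16, 1))
            for name in relations
        })

    def forward(self, relation, rows, cols):
        row_entity, col_entity = self.relations[relation]
        # Concatenate the latent vectors of the two objects in each pair.
        pair = torch.cat(
            [self.embed[row_entity](rows), self.embed[col_entity](cols)], dim=-1
        )
        return self.heads[relation](pair).squeeze(-1)  # one logit per pair


model = ToyERModel(
    entity_sizes={"Protein": 5, "Drug": 4, "Disease": 3},
    relations={"binds": ("Protein", "Drug"), "involved_in": ("Protein", "Disease")},
)

# Made-up observed cells per relation: (row indices, column indices, targets).
data = {
    "binds": (torch.tensor([0, 1, 2, 4]), torch.tensor([0, 3, 1, 2]),
              torch.tensor([1.0, 0.0, 1.0, 0.0])),
    "involved_in": (torch.tensor([0, 3, 4]), torch.tensor([2, 0, 1]),
                    torch.tensor([1.0, 1.0, 0.0])),
}

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()
losses = []
for step in range(200):
    optimizer.zero_grad()
    # Summing the per-relation losses lets the shared Protein embeddings
    # receive gradients from both relations in the same update.
    loss = sum(loss_fn(model(name, rows, cols), y)
               for name, (rows, cols, y) in data.items())
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"joint loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The design choice to note is the sharing: each entity has a single embedding table no matter how many relations it participates in, while each relation gets its own non-linear network. Adding a relation to the graph then means adding one entry to `relations` and one set of observed cells.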
Examples from the scientific world¶
A few examples, in a non-exhaustive list:
- drug-protein interaction prediction, in which Protein and Drug are the entities and the relation between them indicates which drugs interact with which proteins (https://arxiv.org/abs/1512.00315)
- gene prioritization, in which Gene and Disease are the entities and the relation "gene u is involved in disease v" between them is modeled (https://doi.org/10.1093/bioinformatics/bty289)
- protein-protein interaction prediction, including tensor factorization and inference over arbitrary Entity-Relation graphs (https://doi.org/10.1093/bioinformatics/btab092)
What is this repository for?¶
The code here contains a PyTorch-based Python 3 library that should allow anyone to use our Entity-Relation data fusion framework on their data science problem of choice.
An example of its application to protein-protein interaction prediction is available at https://bitbucket.org/eddiewrc/nxtppi/src/master/ and has been published at https://doi.org/10.1093/bioinformatics/btab092.