What is Data Fusion?

As summarized by Wikipedia:

Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.

In the era of big data, many scientific disciplines produce enormous amounts of heterogeneous data from which we want to infer reliable predictive or descriptive models. There is thus a pressing need for powerful, scalable algorithms that integrate multiple sources of information and learn complex patterns from this multi-faceted, interconnected data. To address this challenge, we propose a novel data fusion approach for nonlinear inference over arbitrary entity-relation graphs.

What is NXTfusion?

NXTfusion is a neural network-based data fusion method that extends the classical Matrix Factorization paradigm by allowing non-linear inference over arbitrarily connected Entity-Relation graphs (ER graphs).

What is an Entity-Relation graph?

An ER graph is an abstract data structure, similar to a relational database, that models classes of objects (Entities) and the relations between them (Relations).

The ER formalism generalizes the well-known Matrix Factorization formalism: every data fusion problem can be described in terms of entity-relation (ER) models, where entities are classes of objects belonging to a particular domain and relations describe the interactions between entities. Such a data fusion model is completely general and could allow inference on an extremely broad class of problems. Moreover, the ease with which entities can be connected through relations allows the inclusion of data sets that are only loosely related to the problem under investigation.
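To make the idea concrete, here is a minimal sketch of an ER graph as a plain Python data structure. It is not the NXTfusion API: the class names (`Entity`, `Relation`, `ERGraph`), the method names, and the toy protein/drug data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    """A class of objects, e.g. all proteins or all drugs (hypothetical names)."""
    name: str
    objects: List[str]


@dataclass
class Relation:
    """A sparsely observed matrix between two entities.

    Observed cells are stored as {(row_object, col_object): value};
    any pair missing from the dict is simply unobserved.
    """
    name: str
    rows: Entity
    cols: Entity
    observed: Dict[Tuple[str, str], float] = field(default_factory=dict)


@dataclass
class ERGraph:
    """Entities are the nodes of the graph, relations are its edges."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_entity(self, name, objects):
        self.entities[name] = Entity(name, list(objects))
        return self.entities[name]

    def add_relation(self, name, rows, cols, observed):
        rel = Relation(name, self.entities[rows], self.entities[cols], dict(observed))
        self.relations.append(rel)
        return rel

    def relations_of(self, entity_name):
        """Names of all relations that touch the given entity."""
        return [r.name for r in self.relations
                if entity_name in (r.rows.name, r.cols.name)]


# Toy data, invented for this example.
graph = ERGraph()
graph.add_entity("protein", ["P1", "P2", "P3"])
graph.add_entity("drug", ["D1", "D2"])
graph.add_relation("protein-protein", "protein", "protein",
                   {("P1", "P2"): 1.0, ("P2", "P3"): 0.0})
graph.add_relation("drug-target", "drug", "protein", {("D1", "P1"): 1.0})
print(graph.relations_of("protein"))
```

Adding a loosely related data set amounts to one more `add_relation` call that shares an entity with the rest of the graph.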

NXTfusion approach generalizes existing data fusion methods

In general, each relation corresponds to a possibly sparsely observed matrix, and the entities are the objects represented by the rows and columns of that matrix.

In the classical Matrix Factorization paradigm, usually a single matrix Y = UV is factorized into two latent matrices U and V, meaning that only a single interaction (Y) between two entities (whose latent representations are U and V) is considered. An extension of this is Tensor Factorization (e.g. https://arxiv.org/abs/1512.00315), where multiple matrices/relations between two entities are factorized at the same time.
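As an illustration of the classical single-relation case, the following sketch factorizes a sparsely observed matrix Y into U and V by stochastic gradient descent on the observed cells only. It is a generic textbook-style example in pure Python, not code from this library; the function names, hyperparameters, and toy matrix are assumptions made for the example.

```python
import random


def factorize(observed, n_rows, n_cols, rank=1, lr=0.05, epochs=1000, seed=0):
    """Fit Y ~ U V using only the observed cells {(i, j): y_ij}.

    U holds one latent vector per row object. V is stored transposed,
    i.e. one latent vector per column object, so the prediction for
    cell (i, j) is the dot product of U[i] and V[j].
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    cells = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(cells)
        for (i, j), y in cells:
            err = sum(U[i][k] * V[j][k] for k in range(rank)) - y
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on the squared error of this single cell.
                U[i][k] -= lr * err * v
                V[j][k] -= lr * err * u
    return U, V


def predict(U, V, i, j):
    """Predicted value of cell (i, j), observed or not."""
    return sum(u * v for u, v in zip(U[i], V[j]))


# Toy rank-1 matrix with one hidden cell, invented for this example.
u_true, v_true = [1.0, 2.0, 3.0], [1.0, 0.5, 2.0]
observed = {(i, j): u_true[i] * v_true[j]
            for i in range(3) for j in range(3) if (i, j) != (2, 2)}
U, V = factorize(observed, n_rows=3, n_cols=3, rank=1)
# Cell (2, 2) was never shown to the model; its true value is 3.0 * 2.0 = 6.0.
print(predict(U, V, 2, 2))
```

Because the toy matrix has rank 1 and the model has rank 1, the hidden cell is determined by the observed ones; with real data the rank and learning rate would need tuning.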

Real-world data is nevertheless richer than this: a problem might be characterized by many relations between many pairs of objects, forming a complex graph of entities (the nodes) connected by relations (the edges).

Here we further extend the field of data fusion by building a neural network-based data fusion framework for non-linear inference over completely arbitrary ER graphs, as described in https://doi.org/10.1093/bioinformatics/btab092.
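The general idea of non-linear inference over several relations that share entities can be sketched in PyTorch as follows. This is not the NXTfusion implementation or its API: the class name `TwoRelationModel`, the choice of concatenating embeddings and feeding them to a small MLP per relation, the layer sizes, and the toy data are all assumptions made for illustration.

```python
import torch
from torch import nn


class TwoRelationModel(nn.Module):
    """Shared entity embeddings plus one small non-linear head per relation.

    Hypothetical toy graph: a protein-protein relation and a drug-protein
    relation, both touching the same protein embeddings.
    """

    def __init__(self, n_proteins, n_drugs, dim=8):
        super().__init__()
        self.protein = nn.Embedding(n_proteins, dim)
        self.drug = nn.Embedding(n_drugs, dim)
        self.heads = nn.ModuleDict({
            "protein_protein": self._head(dim),
            "drug_protein": self._head(dim),
        })

    @staticmethod
    def _head(dim):
        # The non-linearity replaces the plain dot product of matrix factorization.
        return nn.Sequential(nn.Linear(2 * dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, relation, rows, cols):
        left = self.drug(rows) if relation == "drug_protein" else self.protein(rows)
        right = self.protein(cols)
        return self.heads[relation](torch.cat([left, right], dim=-1)).squeeze(-1)


torch.manual_seed(0)
model = TwoRelationModel(n_proteins=4, n_drugs=3)

# Observed cells per relation: (row indices, column indices, values). Toy data.
data = {
    "protein_protein": (torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3]),
                        torch.tensor([1.0, 0.0, 1.0])),
    "drug_protein": (torch.tensor([0, 1, 2, 0]), torch.tensor([0, 1, 3, 2]),
                     torch.tensor([1.0, 1.0, 0.0, 0.0])),
}

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
losses = []
for step in range(200):
    optimizer.zero_grad()
    # Summing the per-relation losses trains the shared embeddings jointly.
    loss = sum(loss_fn(model(name, r, c), y) for name, (r, c, y) in data.items())
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the protein embeddings receive gradients from both relations, information from the drug-protein data can influence protein-protein predictions, which is the intuition behind fusing relations over a shared graph.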

Examples from the scientific world

A few examples from the scientific world (non-exhaustive):

What is this repository for?

This repository contains a PyTorch-based Python 3 library that should allow anyone to apply our Entity-Relation data fusion framework to their data science problem of choice. An example of its application to protein-protein interaction is available at https://bitbucket.org/eddiewrc/nxtppi/src/master/, and it has been published at https://doi.org/10.1093/bioinformatics/btab092.