INTRODUCTION
Paper2vec is a neural-network-embedding-based approach for creating scientific paper representations that makes use of both textual and graph-based information. It performs unsupervised feature learning from both the graph structure and the text features of scientific papers. The contributions of this paper are:
- Introduce Paper2vec, a novel neural-network-based embedding for representing scientific papers, and propose two novel techniques to incorporate textual information into citation networks to create richer paper embeddings.
- Curate a large collection of almost half a million academic research papers with full text and citation information.
- Conduct experiments on three real world datasets of varied scales and discuss the performance achieved in two evaluation tasks.
METHODOLOGY
A citation dataset is an undirected, unweighted graph G(V, E) whose nodes are scientific papers and whose edges are citation links between pairs of papers. Let f : V → R^D be the mapping from nodes to representations that we wish to learn, where D is the dimensionality of the latent space and |V| is the total number of nodes in the graph (including unconnected ones). f is thus a matrix of |V| × D parameters.
The documents are denoted by {d1, d2, ..., d|V|}, with dk the document vector of paper k and wi a word vector from the corpus. A fixed context window c1 ∈ C1 is slid over the words of each sentence throughout the corpus. For every possible context in C1, the model is trained on each word wi in the context to predict every other word wj, given the document vector dk and the word vector wi itself.
The average log probability is maximized as follows:
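Written out (a reconstruction in the usual doc2vec form, consistent with the softmax formulation of Pr(wj | wi, dk) used in the methodology; the paper's exact averaging may differ):

```latex
\max_{f} \; \frac{1}{|C_1|} \sum_{c_1 \in C_1} \; \sum_{\substack{w_i, w_j \in c_1 \\ j \neq i}} \log \Pr\!\left( w_j \mid w_i, d_k \right)
```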
where the softmax function defines Pr(wj | wi, dk).
This objective can be trained with the Skip-gram algorithm to learn the word and document embeddings. In practice it is approximated with negative sampling, using observed (positive) word-context pairs and randomly sampled negative word-context pairs.
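As an illustration, a single negative-sampling update can be sketched in plain Python (a minimal sketch of the generic SGNS gradient step, not the paper's actual implementation; vector sizes and the learning rate are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_update(w, c, negatives, lr=0.1):
    """One stochastic gradient step of skip-gram with negative sampling:
    pull the observed (word, context) pair together and push the
    randomly sampled negative contexts away from the word vector."""
    # Positive pair: gradient of -log sigma(w . c) w.r.t. each vector.
    g_pos = sigmoid(dot(w, c)) - 1.0          # always <= 0
    w_grad = [g_pos * ci for ci in c]
    c_new = [ci - lr * g_pos * wi for ci, wi in zip(c, w)]
    # Negative pairs: gradient of -log sigma(-w . n).
    neg_new = []
    for n in negatives:
        g_neg = sigmoid(dot(w, n))            # always >= 0
        w_grad = [wg + g_neg * ni for wg, ni in zip(w_grad, n)]
        neg_new.append([ni - lr * g_neg * wi for ni, wi in zip(n, w)])
    w_new = [wi - lr * wg for wi, wg in zip(w, w_grad)]
    return w_new, c_new, neg_new
```

With no negatives the step reduces to plain gradient ascent on log σ(w·c); repeating it over all sampled contexts is what full implementations do internally.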
Let G0 be our input graph. We first define the notion of context, or neighbourhood, inside G0: a valid context c2 ∈ C2 for a node vi is the collection of all nodes vj that are at most h hops away from vi, where h is determined by the window size of c2. Note that (vi, vj) pairs are treated identically whether they are connected by citations or by text-based links. We obtain C2 by sliding a window over random-walk sequences started from every node vi, i ∈ V, in G0, and try to maximize the corresponding likelihood function. As before, this objective is approximated by taking sets of positive and negative (vi, vj) pairs.
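A minimal sketch of how such contexts can be generated (generic DeepWalk-style truncated random walks over an adjacency dict; function names and parameters here are illustrative, not the paper's code):

```python
import random

def random_walks(adj, walk_len=10, walks_per_node=5, seed=0):
    """Truncated random walks over an adjacency dict {node: [neighbours]}."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def contexts(walk, window=2):
    """Slide a window over one walk: every emitted (v_i, v_j) pair is
    at most `window` positions apart along the walk, i.e. h = window."""
    pairs = []
    for i, vi in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((vi, walk[j]))
    return pairs
```

Feeding these (vi, vj) pairs to the same Skip-gram machinery used for text is what makes the graph and word objectives interchangeable.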
EXPERIMENTS AND RESULTS
Paper2vec is evaluated on three academic citation datasets of increasing scale (small, medium, large):
1. CORA ML subset: 2,708 papers, 7 classes, 5,249 citation links.
2. CORA full dataset: 51,905 papers, 7 broad categories, 132,968 citation links.
3. DBLP citation network: 465,355 papers, 2,301,292 citation links.
Node classification: In this task the class or label of a scientific paper is determined given its representation.
Link prediction: Given a random pair of node representations, it is determined whether a citation link exists between them or not.
The final link-prediction dataset contains 20,000, 60,000 and 440,000 examples for the small, medium and large datasets respectively, and this binary classification is evaluated with 5-fold cross-validation.
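The evaluation loop can be sketched as follows (a toy, self-contained version: pairs are scored by a fixed dot-product threshold rather than a classifier trained per fold as in the paper, and the example count is assumed to divide evenly by five):

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def five_fold_accuracy(examples, score, threshold=0.0):
    """5-fold cross-validation for binary link prediction.
    `examples` is a list of ((vec_u, vec_v), label) pairs; the 'model'
    here is a fixed dot-product threshold, so there is no training step."""
    random.Random(0).shuffle(examples)
    fold = len(examples) // 5
    accs = []
    for k in range(5):
        test = examples[k * fold:(k + 1) * fold]
        correct = sum(
            1 for (u, v), y in test
            if (score(u, v) > threshold) == (y == 1)
        )
        accs.append(correct / len(test))
    return sum(accs) / len(accs)
```

Swapping the threshold scorer for a real classifier fit on the other four folds recovers the standard protocol.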
Support Vector Machines (SVMs) are used for all the node classification tasks.
- The neural-network-based techniques are generally able to outperform the matrix-factorization-based methods.
- In the link prediction task on the CORA-ML (small) dataset, Paper2vec improves by over 11% on the considered baseline, DeepWalk.
- The unsupervised representations surpass prior work based on semi-supervised models such as Collective Classification.
REFERENCES
- Ganguly, Soumyajit, and Vikram Pudi. "Paper2vec: combining graph and text information for scientific paper representation." In European Conference on Information Retrieval, pp. 383-395. Springer, Cham, 2017.
- Chakraborty, Tanmoy, Sandipan Sikdar, Vihar Tammana, Niloy Ganguly, and Animesh Mukherjee. "Computer science fields as ground-truth communities: Their impact, rise and fall." In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 426-433. ACM, 2013.