INTRODUCTION
Paper2vec is a neural-network-embedding-based approach for creating scientific paper representations that makes use of both textual and graph-based information. It performs unsupervised feature learning from both the graph structure and the text features of scientific papers. The contributions of this paper are:
- Introduce Paper2vec, a novel neural-network-based embedding for representing scientific papers, and propose two novel techniques to incorporate textual information into citation networks to create richer paper embeddings.
- Curate a large collection of almost half a million academic research papers with full text and citation information.
- Conduct experiments on three real world datasets of varied scales and discuss the performance achieved in two evaluation tasks.
METHODOLOGY
A citation dataset is an undirected, unweighted graph G(V, E) whose nodes are scientific papers and whose edges are citation links between pairs of papers. Let f : V → R^D be the mapping from nodes to representations that we wish to learn, where D is the dimensionality of the latent space and |V| is the total number of nodes in the graph (including unconnected ones). f is thus a matrix of |V| × D parameters.
The documents are denoted by {d1, d2, ..., d|V|}, with dk the document vector of paper k and wi a word vector from the corpus. A fixed context window c1 ∈ C1 is slid over the words of each sentence throughout the corpus. For every possible context in C1, the model is trained on each word wi in the context to predict every other word wj, given the document vector dk and the word vector wi itself.
The average log probability is maximized as follows:
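Written out (a reconstruction in the usual doc2vec form, consistent with the softmax formulation of Pr(wj | wi, dk) used in the methodology; the paper's exact averaging may differ):

```latex
\max_{f} \; \frac{1}{|C_1|} \sum_{c_1 \in C_1} \; \sum_{\substack{w_i, w_j \in c_1 \\ j \neq i}} \log \Pr\!\left( w_j \mid w_i, d_k \right)
```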
where the softmax function defines Pr(wj | wi, dk).
This objective can be trained with the Skip-gram algorithm to learn the word and document embeddings. In practice it is approximated with negative sampling, using observed (positive) word-context pairs and randomly sampled negative word-context pairs.
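As an illustration, a single negative-sampling update can be sketched in plain Python (a minimal sketch of the generic SGNS gradient step, not the paper's actual implementation; vector sizes and the learning rate are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_update(w, c, negatives, lr=0.1):
    """One stochastic gradient step of skip-gram with negative sampling:
    pull the observed (word, context) pair together and push the
    randomly sampled negative contexts away from the word vector."""
    # Positive pair: gradient of -log sigma(w . c) w.r.t. each vector.
    g_pos = sigmoid(dot(w, c)) - 1.0          # always <= 0
    w_grad = [g_pos * ci for ci in c]
    c_new = [ci - lr * g_pos * wi for ci, wi in zip(c, w)]
    # Negative pairs: gradient of -log sigma(-w . n).
    neg_new = []
    for n in negatives:
        g_neg = sigmoid(dot(w, n))            # always >= 0
        w_grad = [wg + g_neg * ni for wg, ni in zip(w_grad, n)]
        neg_new.append([ni - lr * g_neg * wi for ni, wi in zip(n, w)])
    w_new = [wi - lr * wg for wi, wg in zip(w, w_grad)]
    return w_new, c_new, neg_new
```

With no negatives the step reduces to plain gradient ascent on log σ(w·c); repeating it over all sampled contexts is what full implementations do internally.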
Let G0 be our input graph. We first define the notion of context, or neighbourhood, inside G0: a valid context c2 ∈ C2 for a node vi is the collection of all nodes vj that are at most h hops away from vi, where h is determined by the window size of c2. Note that (vi, vj) pairs are treated identically whether they are connected by citations or by text-based links. We obtain C2 by sliding a window over random-walk sequences started from every node vi, i ∈ V, in G0, and try to maximize the corresponding likelihood function. As before, this objective is approximated by taking sets of positive and negative (vi, vj) pairs.
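A minimal sketch of how such contexts can be generated (generic DeepWalk-style truncated random walks over an adjacency dict; function names and parameters here are illustrative, not the paper's code):

```python
import random

def random_walks(adj, walk_len=10, walks_per_node=5, seed=0):
    """Truncated random walks over an adjacency dict {node: [neighbours]}."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def contexts(walk, window=2):
    """Slide a window over one walk: every emitted (v_i, v_j) pair is
    at most `window` positions apart along the walk, i.e. h = window."""
    pairs = []
    for i, vi in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((vi, walk[j]))
    return pairs
```

Feeding these (vi, vj) pairs to the same Skip-gram machinery used for text is what makes the graph and word objectives interchangeable.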
EXPERIMENTS AND RESULTS
Paper2vec is evaluated on three academic citation datasets of increasing scale (small, medium, large):
1. CORA ML subset: 2,708 papers, 7 classes, 5,249 citation links.
2. CORA full dataset: 51,905 papers, 7 broad categories, 132,968 citation links.
3. DBLP citation network: 465,355 papers, 2,301,292 citation links.
Node classification: In this task the class or label of a scientific paper is determined given its representation.
Link prediction: Given a random pair of node representations, it is determined whether a citation link exists between them or not.
The final link-prediction dataset contains 20,000, 60,000 and 440,000 examples for the small, medium and large datasets respectively, and this binary classification is evaluated with 5-fold cross-validation.
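The evaluation loop can be sketched as follows (a toy, self-contained version: pairs are scored by a fixed dot-product threshold rather than a classifier trained per fold as in the paper, and the example count is assumed to divide evenly by five):

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def five_fold_accuracy(examples, score, threshold=0.0):
    """5-fold cross-validation for binary link prediction.
    `examples` is a list of ((vec_u, vec_v), label) pairs; the 'model'
    here is a fixed dot-product threshold, so there is no training step."""
    random.Random(0).shuffle(examples)
    fold = len(examples) // 5
    accs = []
    for k in range(5):
        test = examples[k * fold:(k + 1) * fold]
        correct = sum(
            1 for (u, v), y in test
            if (score(u, v) > threshold) == (y == 1)
        )
        accs.append(correct / len(test))
    return sum(accs) / len(accs)
```

Swapping the threshold scorer for a real classifier fit on the other four folds recovers the standard protocol.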
Support Vector Machines (SVMs) are used for all the node classification tasks.
- The neural-network-based techniques are generally able to outperform the matrix-factorization-based methods.
- In the link prediction task on the CORA-ML (small) dataset, Paper2vec improves by over 11% on the considered baseline, DeepWalk.
- The unsupervised representations surpass prior work based on semi-supervised models such as Collective Classification.
REFERENCES
- Ganguly, Soumyajit, and Vikram Pudi. "Paper2vec: combining graph and text information for scientific paper representation." In European Conference on Information Retrieval, pp. 383-395. Springer, Cham, 2017.
- Chakraborty, Tanmoy, Sandipan Sikdar, Vihar Tammana, Niloy Ganguly, and Animesh Mukherjee. "Computer science fields as ground-truth communities: Their impact, rise and fall." In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 426-433. ACM, 2013.