Pairwise Earth Mover Distance Across All Documents (word2vec Representations)
Solution 1:
The "Word Mover's Distance" (earth-mover's distance applied to groups of word-vectors) is a fairly involved optimization calculation dependent on every word in each document.
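For reference, the optimization is usually written as the transport problem below. This is a sketch of the standard formulation, not something stated in the answer; the symbols are notation assumed here: d_i and d'_j are the normalized word frequencies of the two documents (with n and m distinct words), x_i and x'_j are the corresponding word vectors, and T_ij is how much of word i's weight is moved to word j.

```latex
\[
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, \lVert x_i - x'_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{m} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j .
\]
```

Every word of both documents appears in the constraints, which is why each pair of documents needs its own solve.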
I'm not aware of any tricks that would help it go faster when calculating many at once – even many distances to the same document.
So all you need to calculate pairwise distances is a pair of nested loops that considers each unique (order-ignoring) pairing.
For example, assuming your list of documents (each a list-of-words) is docs, a gensim word-vector model is in model, and numpy is imported as np, you could calculate the array of pairwise distances D with:
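D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue  # self-distance is 0.0
        if i > j:
            D[i, j] = D[j, i]  # re-use earlier calc
            continue  # skip recomputing the mirrored pair
        # if model is a full Word2Vec model, recent gensim versions expose this as model.wv.wmdistance
        D[i, j] = model.wmdistance(docs[i], docs[j])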
It may take a while, but you'll then have all pairwise distances in array D.
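The same loop can be sketched with itertools.combinations, which yields each unordered pair exactly once, so no mirrored-pair bookkeeping is needed. This is a minimal sketch, not part of the original answer: pairwise_distances and toy_distance are hypothetical names, the toy distance is only a stand-in so the example runs without gensim, and the result is a plain nested list rather than a numpy array.

```python
from itertools import combinations


def pairwise_distances(docs, distance):
    """Return a symmetric n x n nested list of distance(docs[i], docs[j])."""
    n = len(docs)
    D = [[0.0] * n for _ in range(n)]  # diagonal stays 0.0 (self-distance)
    # combinations() yields each unordered pair (i < j) exactly once
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = distance(docs[i], docs[j])
    return D


def toy_distance(a, b):
    """Hypothetical stand-in for WMD: number of words found in only one document."""
    return float(len(set(a) ^ set(b)))


docs = [["a", "b"], ["b", "c"], ["a", "b"]]
D = pairwise_distances(docs, toy_distance)
```

With gensim available, passing model.wmdistance as the distance argument should give the same D as the loop above, and np.array(D) converts the result to a numpy array.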
Solution 2:
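In addition to the accepted answer, you may want to use wmd-relax, a faster WMD library.
The distance line of the example could then be adjusted as follows, assuming docs holds spaCy Doc objects produced by a pipeline with wmd-relax's similarity hook installed: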
D[i, j] = docs[i].similarity(docs[j])