A Practical Example Of GSDMM In Python?
Solution 1:
I finally compiled my code for GSDMM and will put it here from scratch for others' use. Hope this helps. I have tried to comment the important parts:
# Imports
import random
import numpy as np
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess
# data
data = ...
# stop words
stop_words = ...
# turning sentences into words
data_words = []
for doc in data:
    doc = doc.split()
    data_words.append(doc)
# create vocabulary
vocabulary = ...
# Removing stop Words
stop_words.extend(['from', 'rt'])
def remove_stopwords(texts):
    return [
        [
            word
            for word in simple_preprocess(str(doc))
            if word not in stop_words
        ]
        for doc in texts
    ]
data_words_nostops = remove_stopwords(vocabulary)
# building bi-grams
bigram = Phrases(vocabulary, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)
print('done!')
# Form Bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]
# lemmatization with spaCy (loading the small English model here is an assumption;
# the original snippet uses `nlp` without defining it)
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
pos_to_use = ['NOUN', 'ADJ', 'VERB', 'ADV']
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append(
        [token.lemma_ for token in doc if token.pos_ in pos_to_use]
    )
docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)
# Train a new model
random.seed(1000)
# Init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)
n_terms = len(vocab)
n_docs = len(docs)
# Fit the model on the data given the chosen seeds
y = mgp.fit(docs, n_terms)
def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(
            cluster_word_distribution[cluster].items(),
            key=lambda k: k[1],
            reverse=True,
        )[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-' * 20)
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)
# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*'*20)
# Show the top 10 words in term frequency for each cluster
top_words(mgp.cluster_word_distribution, top_index, 10)
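If you also want to map each document back to its most likely cluster, a minimal sketch could look like this (it assumes the rwalk gsdmm package used above, whose MovieGroupProcess exposes a choose_best_label method; check your installed version):
# Assign each lemmatized document to its most likely cluster
# choose_best_label returns a (label, probability) pair
doc_labels = [mgp.choose_best_label(doc) for doc in docs]
print(doc_labels[:5])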
Hope this helps!
EDIT:
Links:
gensim modules (Python library)
gsdmm (https://github.com/rwalk/gsdmm)
Solution 2:
I am experimenting with GSDMM as well and ran into the same problem: there is just not much online about it (I was unable to find more than you did, apart from a few papers that use it). If you look at the code of the GSDMM GitHub repo, you can see that it is a pretty small repo with only a few functionalities. These are basically all used in the tutorial on towardsdatascience, so I don't think you are missing out on anything.
If you have a specific question, feel free to ask!
Edit: If you follow the tutorial on towardsdatascience, you will realize that it is an inconsistent and unfinished project. Some helper functions are missing and the algorithm is not used correctly. The author runs it with K=10 and ends up with 10 clusters. If you increase K (and you should), the number of clusters ends up higher than 10, so there is a little bit of cheating happening.
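One way to check how many clusters are actually populated after fitting (rather than just reporting K) is to count the non-empty entries of cluster_doc_count; a minimal sketch, assuming a fitted MovieGroupProcess named mgp:
# Count clusters that actually received at least one document
populated = sum(1 for count in mgp.cluster_doc_count if count > 0)
print('K =', len(mgp.cluster_doc_count), '| populated clusters =', populated)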
Solution 3:
GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text clustering model. It is essentially a modified LDA (Latent Dirichlet Allocation) which assumes that a document, such as a tweet or any other short text, covers a single topic.
Address: github.com/da03/GSDMM
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find
import math
class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta

    def fit(self, X):
        alpha = self.alpha
        beta = self.beta
        D, V = X.shape
        K = self.n_topics
        # 1-D integer array of document lengths (needed for range() below)
        N_d = np.asarray(X.sum(axis=1)).flatten().astype(np.int32)
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d, :])[1]
        # initialization
        N_k = np.zeros(K)
        M_k = np.zeros(K)
        N_k_w = lil_matrix((K, V), dtype=np.int32)
        # integer dtype so the labels can be used as indices
        K_d = np.zeros(D, dtype=np.int32)
        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0 / K] * K)[0]
            K_d[d] = k
            M_k[k] = M_k[k] + 1
            N_k[k] = N_k[k] + N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] = N_k_w[k, w] + X[d, w]
        for it in range(self.n_iter):
            print('iter', it)
            for d in range(D):
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d, w]
                # sample k_new
                log_probs = [0] * K
                for k in range(K):
                    log_probs[k] += math.log(alpha + M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d, w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k, w] + beta + j)
                    for i in range(N_d[d]):
                        log_probs[k] -= math.log(N_k[k] + beta * V + i)
                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs / np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]
                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d, w]
        self.topic_word_ = N_k_w.toarray()
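A minimal usage sketch for this class; building the document-term matrix with scikit-learn's CountVectorizer and the toy texts are my assumptions, not part of the original snippet:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["cats purr and sleep", "dogs bark loudly", "cats and dogs play"]
X = CountVectorizer().fit_transform(texts)   # sparse document-term count matrix
model = GSDMM(n_topics=5, n_iter=10)
model.fit(X)
print(model.topic_word_.shape)               # (n_topics, vocabulary_size)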
Solution 4:
As I understand it, you have the code (https://github.com/rwalk/gsdmm), but you need to decide how to apply it.
How does it work?
You can download the paper A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering; it shows that the cluster search is equivalent to a game of table choosing. Imagine you have a group of students and want to group them onto tables by their movie interests. Every student (= item) switches in each round to a table (= cluster) that has students with similar movies and that is popular. Alpha controls how easily a table gets removed when it is empty (low alpha = fewer tables). A small beta means that a table is chosen based more on similarity to the students already sitting there than on the table's popularity. For short text clustering you take words instead of movies.
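A minimal sketch of applying the rwalk/gsdmm package to tokenized short texts (the toy documents and parameter values are illustrative assumptions; fit returns the cluster label per document, as used as y in Solution 1):
from gsdmm import MovieGroupProcess

docs = [["cat", "purr"], ["dog", "bark"], ["cat", "dog", "play"]]   # tokenized short texts
vocab_size = len(set(word for doc in docs for word in doc))

mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=20)
labels = mgp.fit(docs, vocab_size)   # one cluster label per document
print(labels)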
Alpha, beta, number of iterations
Therefore, a low alpha results in many clusters with single words, while a high alpha results in fewer clusters with more words. A high beta results in popular clusters, while a low beta results in clusters of similar documents (which are not strongly populated). Which parameters you need depends on the dataset. The number of clusters can mostly be controlled with beta, but alpha also has (as described) an influence. The number of iterations seems to be stable after 20 iterations, but 10 is also ok.
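To get a feel for these parameters on your own data, a small sweep that counts how many clusters actually end up populated can help; a rough sketch, reusing docs and vocab_size from the previous sketch:
from gsdmm import MovieGroupProcess

for alpha in (0.01, 0.1, 0.5):
    for beta in (0.01, 0.1, 0.5):
        mgp = MovieGroupProcess(K=20, alpha=alpha, beta=beta, n_iters=20)
        mgp.fit(docs, vocab_size)
        populated = sum(1 for c in mgp.cluster_doc_count if c > 0)
        print('alpha=%s, beta=%s -> %d populated clusters' % (alpha, beta, populated))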
Data preparation process
Before you train the algorithm, you will need to create a clean data set. For this you convert every text to lower case, remove non-ASCII characters and stop words, and apply stemming or lemmatisation. You will also need to apply the same process when you run the model on a new sample.
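A rough preprocessing sketch along these lines (the NLTK stop-word list and Porter stemmer are my choices, not prescribed by the answer; the NLTK resources must be downloaded first):
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\x00-\x7f]', ' ', text)   # drop non-ASCII characters
    tokens = re.findall(r'[a-z]+', text)        # keep alphabetic tokens only
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

docs = [preprocess(t) for t in ["This is a tweet!", "GSDMM clusters short texts."]]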