Converting A Text Corpus To A Text Document With Vocabulary_id And Respective Tfidf Score
I have a text corpus with, say, 5 documents, each separated from the next by a newline (\n). I want to assign an id to every word in the document and calculate its respective tfi
Solution 1:
I guess this is what you need. Here, corpus is a collection of documents.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus)  # corpus is a collection of documents
print(vectorizer.vocabulary_)  # vocabulary terms and their indices
print(x)  # tf-idf weight of each term in each document
This prints:
{'vectorization': 5, 'text': 4, 'over': 1, 'flow': 0, 'stack': 3, 'scikit': 2}
(0, 2) 0.33195438857 # first document, word = scikit
(0, 5) 0.33195438857 # word = vectorization
(0, 4) 0.33195438857 # word = text
(0, 0) 0.472376562969 # word = flow
(0, 1) 0.472376562969 # word = over
(0, 3) 0.472376562969 # word = stack
(1, 0) 0.57735026919 # second document
(1, 1) 0.57735026919
(1, 3) 0.57735026919
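If you want to check where these numbers come from, here is a rough sketch that recomputes them by hand. It assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', no sublinear tf), so idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it uses plain whitespace splitting, which only happens to match the default tokenizer for this corpus. The helper name tfidf is just for illustration.

```python
import math
from collections import Counter

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]

# Assumption: plain whitespace splitting, which matches TfidfVectorizer's
# default tokenizer for this particular corpus only.
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return {word: weight} for one tokenized document (illustrative helper)."""
    counts = Counter(doc)
    # Smoothed idf, assumed to mirror smooth_idf=True: ln((1 + n) / (1 + df)) + 1
    raw = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in counts.items()}
    # L2-normalize so each document vector has unit length (norm='l2').
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = [tfidf(doc) for doc in docs]
print("{:.4f} {:.4f} {:.4f}".format(
    weights[0]["stack"], weights[0]["scikit"], weights[1]["stack"]))
```

For this corpus it prints 0.4724 0.3320 0.5774, which agrees with the weights listed above.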
From this information, you can represent the documents in your desired way as follows:
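cx = x.tocoo()  # COO format exposes parallel row, col and data arrays
doc_id = -1  # row index of the previously printed entry
for i, j, v in zip(cx.row, cx.col, cx.data):
    if doc_id == -1:
        # very first entry: nothing printed yet
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    else:
        if doc_id != i:
            print()  # new document, so start a new line
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    doc_id = i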
This prints:
2:0.3320 5:0.3320 4:0.3320 0:0.4724 1:0.4724 3:0.4724
0:0.5774 1:0.5774 3:0.5774
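Since the goal is a text document, here is one possible way to write the same id:score pairs to a file, one line per document. This is only a sketch: the file name tfidf_output.txt and the helper format_rows are made up for illustration, and it sorts each line by vocabulary id, which the loop above does not do. It walks the CSR arrays (indptr, indices, data) directly instead of converting to COO.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]
x = TfidfVectorizer().fit_transform(corpus).tocsr()

def format_rows(matrix):
    """Return one 'id:score id:score ...' string per row (illustrative helper)."""
    lines = []
    for r in range(matrix.shape[0]):
        # indptr[r]:indptr[r + 1] delimits the stored entries of row r
        start, end = matrix.indptr[r], matrix.indptr[r + 1]
        pairs = sorted(zip(matrix.indices[start:end], matrix.data[start:end]))
        lines.append(" ".join("{}:{:.4f}".format(int(j), float(v)) for j, v in pairs))
    return lines

lines = format_rows(x)

# Hypothetical output path; change it to wherever the file should go.
with open("tfidf_output.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Each line of the file then corresponds to one document of the corpus, in the same order.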