
Converting A Text Corpus To A Text Document With Vocabulary_id And Respective Tfidf Score

I have a text corpus with, say, 5 documents; each document is separated from the next by \n. I want to assign an id to every word in the documents and calculate its respective tfi

Solution 1:

I guess this is what you need. Here corpus is a collection of documents.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus)  # corpus is a collection of documents
print(vectorizer.vocabulary_)  # vocabulary terms and their indices
print(x)  # tf-idf weight of each term in each document

This prints:

{'vectorization': 5, 'text': 4, 'over': 1, 'flow': 0, 'stack': 3, 'scikit': 2}
  (0, 2)    0.33195438857 # first document, word = scikit
  (0, 5)    0.33195438857 # word = vectorization
  (0, 4)    0.33195438857 # word = text
  (0, 0)    0.472376562969 # word = flow
  (0, 1)    0.472376562969 # word = over
  (0, 3)    0.472376562969 # word = stack
  (1, 0)    0.57735026919 # second document
  (1, 1)    0.57735026919
  (1, 3)    0.57735026919
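
To see where these numbers come from, here is a minimal sketch that recomputes them by hand with the standard library only. It assumes TfidfVectorizer's default settings (smoothed idf, i.e. ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document) and uses a plain whitespace split, which is only a stand-in for the vectorizer's tokenizer but is sufficient for this toy corpus. The names docs, term_id, idf and tfidf_row are made up for this illustration.

```python
import math
from collections import Counter

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]

# Assumption: whitespace tokenisation (good enough for this toy corpus).
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Terms get ids in alphabetical order, which matches the vocabulary printed above.
vocab = sorted({term for doc in docs for term in doc})
term_id = {term: i for i, term in enumerate(vocab)}

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf, assuming the vectorizer's default smooth_idf=True.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return {term_id: weight} for one tokenised document, L2-normalised."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term_id[term]: w / norm for term, w in raw.items()}

weights = [tfidf_row(doc) for doc in docs]
for i, row in enumerate(weights):
    for j in sorted(row):
        print(i, j, vocab[j], round(row[j], 4))
```

In the second document every term occurs once and appears in both documents, so all three weights are equal and normalise to 1/sqrt(3), which is the 0.57735 shown above.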

From this information, you can represent the documents in your desired format as follows:

cx = x.tocoo()
doc_id = -1
for i, j, v in zip(cx.row, cx.col, cx.data):
    if doc_id == -1:
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    else:
        if doc_id != i:
            print()
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    doc_id = i

This prints:

2:0.3320 5:0.3320 4:0.3320 0:0.4724 1:0.4724 3:0.4724
0:0.5774 1:0.5774 3:0.5774
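
If you would rather collect the lines as strings (for example to write them to a file) than print them as you go, the same grouping can be sketched like this. The helper format_rows is hypothetical, not part of scikit-learn; it accepts any iterable of (document, term id, weight) triplets, so with the real matrix you would pass zip(cx.row, cx.col, cx.data). The triplets below are simply copied from the printed weights above so that the example runs on its own.

```python
from collections import defaultdict

def format_rows(triplets):
    """Group (doc, term_id, weight) triplets into one 'id:score' line per document."""
    by_doc = defaultdict(list)
    for doc, term_id, weight in triplets:
        by_doc[int(doc)].append(f"{int(term_id)}:{float(weight):.4f}")
    # One line per document, in document order.
    return [" ".join(by_doc[doc]) for doc in sorted(by_doc)]

# Values copied from the printed tf-idf matrix; with the real data use
# format_rows(zip(cx.row, cx.col, cx.data)) instead.
triplets = [
    (0, 2, 0.33195438857), (0, 5, 0.33195438857), (0, 4, 0.33195438857),
    (0, 0, 0.472376562969), (0, 1, 0.472376562969), (0, 3, 0.472376562969),
    (1, 0, 0.57735026919), (1, 1, 0.57735026919), (1, 3, 0.57735026919),
]

lines = format_rows(triplets)
print("\n".join(lines))
```

Because the entries are grouped by document id in a dictionary, this version does not depend on the triplets arriving in row order.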
