
Python Pandas: Generate Document-term Matrix From Whitespace Delimited '.dat' File

I'm using Python to attempt to rank documents using an Okapi BM25 model. I think that I can calculate some of the terms required for the Score(D,Q) such as the IDF (Inverse Documen

Solution 1:

Your answer works if each document ID appears on only one line of the file. Otherwise, a later line for the same ID overwrites the earlier record in the dict d.

I think the following would be more general:

import numpy as np
import pandas as pd

fname = 'example.txt'

full_list = []
with open(fname, "r") as f:
    for line in f:
        arr = line.strip(" \n").split(" ")
        for chunk in arr[1:]:
            # converting numbers to ints:
            int_pair = [int(x) for x in chunk.split(":")]
            full_list.append([arr[0], *int_pair])

df = pd.DataFrame(full_list)

df2 = df.pivot_table(values = 2, index = 0, columns = 1, aggfunc = np.sum, fill_value = 0)

How it works:

>>> cat 'example.txt'
D1 1:3 2:2 3:3
D2 1:4 2:7
D2 7:1
D1 2:4 4:2
D1 4:1 4:3

>>> full_list
Out[37]: 
[['D1', 1, 3],
 ['D1', 2, 2],
 ['D1', 3, 3],
 ['D2', 1, 4],
 ['D2', 2, 7],
 ['D2', 7, 1],
 ['D1', 2, 4],
 ['D1', 4, 2],
 ['D1', 4, 1],
 ['D1', 4, 3]]
>>> df
Out[38]: 
    0  1  2
0  D1  1  3
1  D1  2  2
2  D1  3  3
3  D2  1  4
4  D2  2  7
5  D2  7  1
6  D1  2  4
7  D1  4  2
8  D1  4  1
9  D1  4  3

>>> df2
Out[39]: 
1   1  2  3  4  7
0                
D1  3  6  3  6  0
D2  4  7  0  0  1
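
The question mentions needing the IDF for the BM25 score. As a rough sketch of how a document-term matrix like df2 could feed into that, the snippet below derives the document frequency, an IDF value, and the document lengths. The matrix values are copied from the df2 output above. The names dtm, doc_freq, idf, doc_len and avg_doc_len are illustrative, and the IDF variant used is an assumption, since the question's exact formula is not shown.

```python
import numpy as np
import pandas as pd

# Document-term matrix copied from the df2 output above
# (rows: document IDs, columns: term IDs, values: term frequencies).
dtm = pd.DataFrame(
    [[3, 6, 3, 6, 0],
     [4, 7, 0, 0, 1]],
    index=["D1", "D2"],
    columns=[1, 2, 3, 4, 7],
)

n_docs = len(dtm)                  # N: number of documents
doc_freq = (dtm > 0).sum(axis=0)   # n(t): number of documents containing each term

# Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)), which is never negative.
idf = np.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

doc_len = dtm.sum(axis=1)          # |D|: total term count per document
avg_doc_len = doc_len.mean()       # avgdl: average document length
```

Here (dtm > 0) turns the counts into a presence/absence table, so summing down each column counts the documents that contain the term.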

Solution 2:

I managed to accomplish this by reading the file into a list of lists, converting that into a dictionary that maps each document ID to a dictionary of term frequencies, and passing the result straight to a DataFrame. Any improvements are very welcome!

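import pandas as pd

def term_matrix(fname):
    # Split each line into [doc_id, "term:freq", ...]; split() ignores stray whitespace.
    with open(fname, "r") as f:
        l = [x.split() for x in f]

    d = dict()

    for i in l:
        # Map each document ID to a {term: frequency} dict (values stay strings).
        d[i[0]] = dict(t.split(":") for t in i[1:])

    # Documents as rows, terms as columns.
    return pd.DataFrame(d).transpose()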
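
Solution 1 points out that this approach overwrites a document's record when its ID appears on more than one line. One possible way to keep the dictionary-based approach while summing repeated entries is sketched below using collections.Counter. The function name term_matrix_merged and its signature are hypothetical, and the sketch assumes every token after the ID has the form term:freq with integer values.

```python
from collections import Counter, defaultdict

import pandas as pd


def term_matrix_merged(lines):
    """Build a document-term matrix, summing counts for repeated document IDs.

    `lines` is any iterable of strings, for example an open file.
    """
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        doc_id, pairs = parts[0], parts[1:]
        for pair in pairs:
            term, freq = pair.split(":")
            # Counter adds to any count already stored for this term.
            counts[doc_id][int(term)] += int(freq)

    # Rows are documents, columns are term IDs; missing terms become 0.
    plain = {doc: dict(c) for doc, c in counts.items()}
    df = pd.DataFrame.from_dict(plain, orient="index")
    return df.fillna(0).astype(int).sort_index().sort_index(axis=1)


sample = ["D1 1:3 2:2 3:3", "D2 1:4 2:7", "D2 7:1", "D1 2:4 4:2", "D1 4:1 4:3"]
matrix = term_matrix_merged(sample)
```

With a file on disk, the same function could be called as term_matrix_merged(open(fname)), ideally inside a with block so the file is closed afterwards.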
