Python Pandas: Generate Document-term Matrix From Whitespace Delimited '.dat' File
I'm using Python to attempt to rank documents using an Okapi BM25 model. I think that I can calculate some of the terms required for the Score(D,Q) such as the IDF (Inverse Documen
Solution 1:
Your answer seems to be OK if each document appears only once in the file. Otherwise, the code will overwrite some records in the dict d.
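A minimal sketch of that overwrite, assuming two made-up input lines that share the same document ID (the `lines` list is hypothetical and stands in for the file contents):

```python
# Two lines for the same document ID, as in a file where D1 appears twice.
lines = ["D1 1:3 2:2", "D1 2:4 4:2"]

d = {}
for line in lines:
    doc_id, *pairs = line.split()
    # Plain assignment replaces whatever was stored for doc_id before,
    # so the terms from the first D1 line are lost.
    d[doc_id] = dict(pair.split(":") for pair in pairs)

print(d)  # {'D1': {'2': '4', '4': '2'}}
```

Term 1 from the first line is gone, and the count for term 2 comes only from the second line instead of being summed.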
I think the following would be more general:
import pandas as pd

fname = 'example.txt'
full_list = []
with open(fname, "r") as f:
    for line in f:
        # first field is the document ID, the rest are term:frequency pairs
        arr = line.split()
        for chunk in arr[1:]:
            # converting numbers to ints:
            int_pair = [int(x) for x in chunk.split(":")]
            full_list.append([arr[0], *int_pair])

# columns: 0 = document ID, 1 = term ID, 2 = frequency
df = pd.DataFrame(full_list)
# sum the frequencies of repeated (document, term) pairs; missing pairs become 0
df2 = df.pivot_table(values=2, index=0, columns=1, aggfunc="sum", fill_value=0)
How it works:
>>> cat 'example.txt'
D1 1:3 2:2 3:3
D2 1:4 2:7
D2 7:1
D1 2:4 4:2
D1 4:1 4:3
>>> full_list
Out[37]:
[['D1', 1, 3],
['D1', 2, 2],
['D1', 3, 3],
['D2', 1, 4],
['D2', 2, 7],
['D2', 7, 1],
['D1', 2, 4],
['D1', 4, 2],
['D1', 4, 1],
['D1', 4, 3]]
>>> df
Out[38]:
    0  1  2
0  D1  1  3
1  D1  2  2
2  D1  3  3
3  D2  1  4
4  D2  2  7
5  D2  7  1
6  D1  2  4
7  D1  4  2
8  D1  4  1
9  D1  4  3
>>> df2
Out[39]:
1   1  2  3  4  7
0
D1  3  6  3  6  0
D2  4  7  0  0  1
Solution 2:
I managed to accomplish this by reading the file into a list of lists, converting that into a dictionary that maps each ID to a dictionary of term frequencies, and then passing it straight to a DataFrame. Any improvements are very welcome!
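import pandas as pd

def term_matrix(fname):
    # each line looks like "ID term:freq term:freq ..."
    with open(fname, "r") as f:
        l = [x.split() for x in f if x.strip()]
    d = dict()
    for i in l:
        # note: a repeated ID replaces the earlier entry for that ID
        d[i[0]] = dict(t.split(":") for t in i[1:])
    # documents become rows, terms become columns
    return pd.DataFrame(d).transpose()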
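One possible improvement, sketched under the assumption that a document ID may appear on several lines and that its frequencies should then be summed: accumulate the counts instead of assigning them. The helper name `term_counts` and the `sample` list are hypothetical; the sample mirrors the example.txt contents shown in Solution 1.

```python
from collections import Counter, defaultdict

def term_counts(lines):
    """Accumulate term frequencies per document ID (hypothetical helper)."""
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        doc_id, pairs = parts[0], parts[1:]
        for pair in pairs:
            term, freq = pair.split(":")
            # += adds to any count already stored for this document and term
            counts[doc_id][int(term)] += int(freq)
    # convert to plain nested dicts: {doc_id: {term: total_frequency}}
    return {doc: dict(c) for doc, c in counts.items()}

sample = ["D1 1:3 2:2 3:3", "D2 1:4 2:7", "D2 7:1", "D1 2:4 4:2", "D1 4:1 4:3"]
result = term_counts(sample)
print(result["D1"])  # {1: 3, 2: 6, 3: 3, 4: 6}
```

Passing the result through `pd.DataFrame(result).T.fillna(0).astype(int)` should then give a matrix with one row per document and one column per term, though the column order may differ from the pivot_table output.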