Create A Frequency Matrix For Bigrams From A List Of Tuples, Using Numpy Or Pandas

I am very new to Python. I have a list of tuples, where I created bigrams. This question is pretty close to my needs my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'),

Solution 1:

You can create a frequency data frame and address its cells by word:

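import pandas as pd

# Collect every distinct word that appears in any bigram, sorted alphabetically
words = sorted(set(item for t in my_list for item in t))
df = pd.DataFrame(0, columns=words, index=words)
# Rows are the first word of each bigram, columns the second
for first, second in my_list:
    df.at[first, second] += 1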

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

Note that in this one, the order in the bigram matters. If you don't care about order, you should sort the tuples by their content first, using this:

my_list = [tuple(sorted(i)) for i in my_list]
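To illustrate what that sorting step does, here is a minimal sketch with a made-up two-element list (the name `pairs` and its contents are hypothetical, not from the question). Once each tuple is sorted, the same two words in either order map to the same key:

```python
# Hypothetical sample: the same pair of words in both orders
pairs = [('we', 'consider'), ('consider', 'we')]

# Sorting the words inside each tuple removes the distinction between orders
unordered = [tuple(sorted(p)) for p in pairs]
print(unordered)  # both entries become ('consider', 'we')
```

Note that sorting puts every count in the cell whose row word comes alphabetically before its column word, so the resulting matrix is upper triangular rather than symmetric.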

Another way is to use Counter to do the counting, though I expect similar performance (again, if the order within bigrams matters, remove sorted from frequency_list):

from collections import Counter

frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words=sorted(list(set([item for t in my_list for item in t])))
df = pd.DataFrame(0, columns=words, index=words)
for k,v in frequency_list.items():
  df.at[k[0],k[1]] = v

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0
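As a possible alternative to the explicit loops in this solution, pandas can tabulate the pairs directly with `pd.crosstab`. This is only a sketch and not part of the original answer: the column names `first` and `second` are made up for illustration, and `my_list` is assumed to hold the four bigrams that appear in the outputs above. Order within each bigram is kept, so the result should match the first (ordered) table.

```python
import pandas as pd

# Assumed input: the four bigrams that appear in the outputs above
my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# One row per bigram; 'first' and 'second' are illustrative column names
pairs = pd.DataFrame(my_list, columns=['first', 'second'])

# crosstab counts how often each (first, second) combination occurs
counts = pd.crosstab(pairs['first'], pairs['second'])

# Reindex to a square matrix over the full vocabulary, filling missing cells with 0
words = sorted(set(pairs['first']) | set(pairs['second']))
square = counts.reindex(index=words, columns=words, fill_value=0)
print(square)
```

The `reindex` step is only needed if you want the same word list on both axes; without it, `crosstab` returns just the words that actually occur in each position.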

Solution 2:

If speed is not a major concern, you could use a for loop.

import pandas as pd
import numpy as np
from itertools import product

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

index = pd.DataFrame(my_list)[0].unique()
columns = pd.DataFrame(my_list)[1].unique()
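# Rows are first words and columns are second words, so the shape is (rows, columns)
df = pd.DataFrame(np.zeros(shape=(len(index), len(columns))),
                  columns=columns, index=index, dtype=int)

for idx, col in product(index, columns):
    # A single .at indexer avoids chained assignment, which may not write through
    df.at[idx, col] = my_list.count((idx, col))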

print(df)

Output:

       consider  to  the  of
we            1   0    0   0
what          0   1    0   0
use           0   0    1   0
words         0   0    0   1
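The loop above calls `my_list.count` once per cell, so it rescans the whole list for every row and column combination. If that becomes too slow, one option is to count in a single pass with NumPy. The following is a sketch of a different technique, not the answer's method: `pd.factorize` assigns each word an integer code in order of first appearance, and `np.add.at` increments the matching cells, including repeated bigrams.

```python
import numpy as np
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

pairs = pd.DataFrame(my_list)

# factorize returns integer codes plus the unique values in order of appearance
row_codes, index = pd.factorize(pairs[0])
col_codes, columns = pd.factorize(pairs[1])

# Start from zeros and add 1 at each (row, column) position;
# np.add.at accumulates correctly when the same position occurs more than once
matrix = np.zeros((len(index), len(columns)), dtype=int)
np.add.at(matrix, (row_codes, col_codes), 1)

df = pd.DataFrame(matrix, index=index, columns=columns)
print(df)
```

For the four bigrams here this should give the same table as the loop, with rows and columns in the same order of appearance.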
