
What Is The Way To Add An Index Column In Dask When Reading From A Csv?

I'm trying to process a fairly large dataset that doesn't fit into memory when loaded all at once with Pandas, so I'm using Dask. However, I'm having difficulty in adding a unique

Solution 1:

Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... if the dataset spans multiple partitions.
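To make the difficulty concrete, here is a plain-Python sketch (no Dask involved) of what a global index needs: the row count of every earlier partition. The `partitions` list is a hypothetical stand-in for CSV chunks, not anything Dask exposes under that name.

```python
from itertools import accumulate

# Hypothetical partitions: each inner list stands for the rows of one CSV chunk.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

# Pass 1: count the rows in every partition (this is the part that
# requires reading through the data).
lengths = [len(part) for part in partitions]

# The starting offset of a partition is the total row count of all
# partitions before it.
offsets = [0] + list(accumulate(lengths))[:-1]

# Pass 2: number rows within each partition, shifted by that offset.
indexed = [
    [(offset + i, row) for i, row in enumerate(part)]
    for offset, part in zip(offsets, partitions)
]
```

The second pass can run independently per partition, but only once the first pass has finished, which is why a running index is not free.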

One approach would be to create a column of ones:

df["idx"] = 1

and then call cumsum on it:

df["idx"] = df["idx"].cumsum()

But note that this adds a number of dependencies to the task graph that backs your dataframe, because each partition's cumulative sum depends on the partitions before it, so some operations might not be as parallel as they were before.
