
Numpy's Genfromtxt Returns Different Structured Data Depending On Dtype Parameters

I have the following: from numpy import genfromtxt seg_data1 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype='|S5') seg_data2 = genfromtxt('./datasets/segmenta

Solution 1:

With dtype="|S5" you import every column as a 5-character byte string, so longer values are truncated. The result is a 2d array with rows like

['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']

With dtype=["|S5"] + ["float" for n in range(19)] you specify a dtype for each column, and the result is a structured array. It is 1d with 20 fields. You access the fields by name (look at seg_data2.dtype), not by column number.

An element, or record, of this array is displayed as a tuple, and includes a string and 19 floats:

('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)

# the initial character column

print(seg_data2['f0'])

Specifying dtype=None should produce the same thing, possibly with some integer columns instead of all floats.

It is also possible to specify a dtype with 2 fields, one the string column, and the other the 19 floats. I'd have to check the docs and run a few test cases to be sure of the format.

I think you read enough of genfromtxt docs to see that you could specify a compound dtype, but not enough to understand the results.
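The difference between the two results can be sketched on a small in-memory sample. The two lines of data here are made up to stand in for the segmentation file, and I use unicode ("U5") rather than byte ("S5") strings, so treat this as an illustration under those assumptions and not as the exact output for your data:

```python
import numpy as np

# Hypothetical two-row, three-column sample (not the real segmentation data).
lines = ["BRICK,140.0,125.0", "GRASS,10.5,20.25"]

# One string dtype for everything: a plain 2d array of strings.
as_strings = np.genfromtxt(lines, delimiter=",", dtype="U5")
print(as_strings.shape)        # (2, 3)

# One dtype per column: a 1d structured array with fields f0, f1, f2.
as_records = np.genfromtxt(lines, delimiter=",", dtype="U5,f8,f8")
print(as_records.shape)        # (2,)
print(as_records.dtype.names)  # ('f0', 'f1', 'f2')
print(as_records["f1"])        # a numeric column, selected by field name
```

Indexing as_records[0] gives one whole record, while as_records["f1"] gives one whole column.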

=================

Example of importing csv with text and numbers:

In [139]: txt=b"""one 1 2 3
     ...: two 4 5 6
     ...: """

default: all floats (the text column cannot be converted, so it becomes nan)

In [140]: np.genfromtxt(txt.splitlines())
Out[140]: 
array([[ nan,   1.,   2.,   3.],
       [ nan,   4.,   5.,   6.]])

automatic dtype selection - 4 fields

In [141]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[141]: 
array([(b'one', 1, 2, 3), (b'two', 4, 5, 6)], 
      dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

user specified field dtypes

In [142]: np.genfromtxt(txt.splitlines(),dtype='str,int,float,int')
Out[142]: 
array([('', 1, 2.0, 3), ('', 4, 5.0, 6)], 
      dtype=[('f0', '<U'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])
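The empty strings in that last output come from the string field having no length: a bare str dtype has an itemsize of zero, so there is no room to store any characters. A short check, with the width of 5 picked by me for this sample:

```python
import numpy as np

# A bare str dtype has no room for characters; "U5" holds up to 5.
print(np.dtype(str).itemsize)   # 0
print(np.dtype("U5").itemsize)  # 20 (4 bytes per character)

lines = ["one 1 2 3", "two 4 5 6"]

# Same call as above, but with an explicit string width for the first column.
fixed = np.genfromtxt(lines, dtype="U5,i8,f8,i8")
print(fixed["f0"])              # ['one' 'two']
```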

Compound dtype, with column count for the numeric field (and correction to string column)

In [145]: np.genfromtxt(txt.splitlines(),dtype='S5,(3)int')
Out[145]: 
array([(b'one', [1, 2, 3]), (b'two', [4, 5, 6])], 
      dtype=[('f0', 'S5'), ('f1', '<i4', (3,))])

In [146]: _['f0']
Out[146]: 
array([b'one', b'two'], 
      dtype='|S5')

In [149]: _['f1']
Out[149]: 
array([[1, 2, 3],
       [4, 5, 6]])

If you need to do math across the numeric fields, this last case (or something more elaborate) might be most convenient.
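As a rough sketch of what that math could look like, the numeric field comes back as an ordinary 2d array, so the usual axis arguments apply. The sample text matches the one above, except that I pass str lines with a "U5" string field and an explicit "i8" integer type:

```python
import numpy as np

lines = ["one 1 2 3", "two 4 5 6"]

# One string field plus one 3-wide integer field.
data = np.genfromtxt(lines, dtype="U5,(3)i8")

nums = data["f1"]              # plain 2d integer array, shape (2, 3)
row_sums = nums.sum(axis=1)    # one sum per record
col_means = nums.mean(axis=0)  # one mean per numeric column

print(row_sums)                # [ 6 15]
print(col_means)               # [2.5 3.5 4.5]
```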

To generate something more complicated, it may be best to build the dtype in a separate expression (dtype syntax can be tricky):

In [172]: dt=np.dtype([('f0','|S5'),('f1',[('f10',int),('f11',float,(2))])])

In [173]: np.genfromtxt(txt.splitlines(),dtype=dt)
Out[173]: 
array([(b'one', (1, [2.0, 3.0])), (b'two', (4, [5.0, 6.0]))], 
      dtype=[('f0', 'S5'), ('f1', [('f10', '<i4'), ('f11', '<f8', (2,))])])
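Nested fields like these are reached by chaining the field names. A minimal sketch with the same layout, again assuming str input lines and a "U5" string field in place of the byte string:

```python
import numpy as np

lines = ["one 1 2 3", "two 4 5 6"]

# Same layout as above: a string field, then a nested field holding
# one integer and a 2-wide float subarray.
dt = np.dtype([("f0", "U5"),
               ("f1", [("f10", int), ("f11", float, (2,))])])
data = np.genfromtxt(lines, dtype=dt)

inner = data["f1"]           # still structured, with fields f10 and f11
print(inner["f10"])          # [1 4]
print(data["f1"]["f11"])     # 2d float array, shape (2, 2)
```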
