Skip to content Skip to sidebar Skip to footer

Inserting Null Columns Into A Scipy Sparse Matrix In A Specific Order

I have a sparse matrix with M rows and N columns, to which I want to concatenate K additional NULL columns so my objects will have now M rows and (N+K) columns. The tricky part is

Solution 1:

As I deduced in the comments, the memory error was produced in the

null_mat[:, indeces] = x

line because the lil__setitem__ method, does a x.toarray(), that is, it first converts x to a dense array. Mapping the sparse matrix onto the index lil directly might be more space efficient, but a lot more work to code. And lil is optimized for iterative assignment, not this large scale matrix mapping.

sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo, and then combines their attributes into a new set, building the new matrix from those.

direct coo matrix construction

After quite a bit of playing around, I found that the following simple operation works:

In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))

In [480]: z1
Out[480]: 
<5000x12 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in COOrdinate format>

Compare this with the x and null_mat:

In [481]: x
Out[481]: 
<5000x10 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]: 
<5000x12 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in LInked Listformat>

Testing the equality of sparse matrices can be tricky. coo values in particular can occur in any order, such as in x which was produced by sparse.random.

But the csr format orders the rows, so this comparison of the indptr attribute is a pretty good equality test:

In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True

A time test:

In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of7 runs, 10000 loops each)

In [478]: 
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of7 runs, 100 loops each)

matrix multiplication approach

csr format indexing with lists is done with matrix multiplication. It constructs an extractor matrix, and applies that. Matrix multiplication is a csr_matrix strong point.

We can perform the reordering in the same way:

In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]: 
<10x12 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [496]: w1=x*I

Comparing the dense equivalents of these matrices:

In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True


In [499]: %%timeit
     ...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
     ...: 12))
     ...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's better than the lil indexing approach, though still much slower than the direct coo matrix construction. Though to be fair, we should construct a csr matrix from the coo style inputs. That conversion takes some time:

In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
     ...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

error traceback

The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:

Signature: null_mat.__setitem__(index, x)
Source:   
    def__setitem__(self, index, x):
       ....
       if isspmatrix(x):
           x = x.toarray()
       ...

Signature: x.toarray(order=None, out=None)
Source:   
    deftoarray(self, order=None, out=None):
        """See the docstring for `spmatrix.toarray`."""
        B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:   
    def_process_toarray_args(self, order, out):
        ...
        return np.zeros(self.shape, dtype=self.dtype, order=order)

I found this by doing a code search on the scipy github, for the np.zeros calls.

Post a Comment for "Inserting Null Columns Into A Scipy Sparse Matrix In A Specific Order"