Inserting Null Columns Into A Scipy Sparse Matrix In A Specific Order
Solution 1:
As I deduced in the comments, the memory error was produced in the
null_mat[:, indeces] = x
line because the lil
__setitem__
method, does a x.toarray()
, that is, it first converts x
to a dense array. Mapping the sparse matrix onto the index lil
directly might be more space efficient, but a lot more work to code. And lil
is optimized for iterative assignment, not this large scale matrix mapping.
sparse.hstack
uses sparse.bmat
to join sparse matrices. This converts all inputs to coo
, and then combines their attributes into a new set, building the new matrix from those.
direct coo matrix construction
After quite a bit of playing around, I found that the following simple operation works:
In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
In [480]: z1
Out[480]:
<5000x12 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in COOrdinate format>
Compare this with the x
and null_mat
:
In [481]: x
Out[481]:
<5000x10 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]:
<5000x12 sparse matrix of type'<class 'numpy.float64'>'with5000 stored elements in LInked Listformat>
Testing the equality of sparse matrices can be tricky. coo
values in particular can occur in any order, such as in x
which was produced by sparse.random
.
But the csr
format orders the rows, so this comparison of the indptr
attribute is a pretty good equality test:
In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
A time test:
In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of7 runs, 10000 loops each)
In [478]:
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of7 runs, 100 loops each)
matrix multiplication approach
csr
format indexing with lists is done with matrix multiplication. It constructs an extractor
matrix, and applies that. Matrix multiplication is a csr_matrix
strong point.
We can perform the reordering in the same way:
In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [496]: w1=x*I
Comparing the dense equivalents of these matrices:
In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True
In [499]: %%timeit
...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
...: 12))
...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's better than the lil
indexing approach, though still much slower than the direct coo
matrix construction. Though to be fair, we should construct a csr
matrix from the coo
style inputs. That conversion takes some time:
In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
error traceback
The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:
Signature: null_mat.__setitem__(index, x)
Source:
def__setitem__(self, index, x):
....
if isspmatrix(x):
x = x.toarray()
...
Signature: x.toarray(order=None, out=None)
Source:
deftoarray(self, order=None, out=None):
"""See the docstring for `spmatrix.toarray`."""
B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:
def_process_toarray_args(self, order, out):
...
return np.zeros(self.shape, dtype=self.dtype, order=order)
I found this by doing a code search on the scipy
github, for the np.zeros
calls.
Post a Comment for "Inserting Null Columns Into A Scipy Sparse Matrix In A Specific Order"