
Efficiently Determining If Large Sorted Numpy Array Has Only Unique Values

I have a very large numpy array that I want to sort and then test for uniqueness. I'm aware of numpy.unique, but it sorts the array again to achieve this.

Solution 1:

You can check whether two or more equal values sit next to each other (non-unique values in a sorted array) by comparing the differences between consecutive elements to 0:

numpy.any(numpy.diff(slices) == 0)

Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
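For illustration, a minimal sketch that wraps this check in a helper function (the name is_sorted_unique is just for this example and not part of the answer):

import numpy as np

def is_sorted_unique(arr):
    # A sorted array has only unique values iff no adjacent pair is equal,
    # i.e. no consecutive difference is zero.
    return not np.any(np.diff(arr) == 0)

slices = np.sort(np.random.randint(0, 100, 20))
print(is_sorted_unique(slices))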

Solution 2:

Here's an approach using slicing: instead of computing the actual differences, we can simply compare each element against the previous one, like so -

~((slices[1:] == slices[:-1]).any())
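As a quick self-contained illustration of the same idea (the function name has_only_unique_values is hypothetical, chosen for this sketch):

import numpy as np

def has_only_unique_values(sorted_arr):
    # Compare each element to its predecessor; any True means a duplicate.
    return not (sorted_arr[1:] == sorted_arr[:-1]).any()

slices = np.sort(np.array([1, 2, 2, 5, 9]))
print(has_only_unique_values(slices))  # False, because 2 appears twice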

Runtime test -

In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))

# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop

# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop

# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
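If you want to reproduce these timings outside IPython, a rough sketch using the standard timeit module follows (absolute numbers will of course vary by machine and NumPy version):

import numpy as np
import timeit

# Same setup as the IPython session above: 10 million sorted random ints.
slices = np.sort(np.random.randint(0, 100_000_000, 10_000_000))

# Time each variant over 10 runs and print total seconds.
print(timeit.timeit(lambda: ~np.any(np.diff(slices) == 0), number=10))
print(timeit.timeit(lambda: np.diff(slices).all(), number=10))
print(timeit.timeit(lambda: ~((slices[1:] == slices[:-1]).any()), number=10))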
