dracule

July 4th, 2009, 10:52 PM

Currently I need to find the most common elements across thousands of arrays held within one large array (around 2 million instances with ~70k unique elements).

So I set up a dictionary to handle the counting: as I iterate, I increment the count on the corresponding dictionary entry. I then iterate through the dictionary and find the 25 most common elements. The elements are initially held in arrays within an array, so I am just trying to find the most common elements among all the arrays contained in one large array.

my current code looks something like this:

d = {}
for arr in my_array:
    for i in arr:
        # elements are numpy integers and thus are not accepted as dictionary keys
        d[int(i)] = d.get(int(i), 0) + 1

Then I filter things down, but with my algorithm that only takes about 1 second, so I don't need to show it here since that isn't the problem.
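For comparison, the same counting can be written with collections.Counter from the standard library, which also handles the final top-25 selection; a sketch assuming my_array is the list of numpy arrays above:

```python
from collections import Counter

def top25(my_array):
    """Count element IDs across all sub-arrays and return the 25 most common.

    Equivalent to the dict-based loop above; Counter.most_common does the
    top-N selection internally with a heap.
    """
    counts = Counter()
    for arr in my_array:
        # int(i) so numpy integer scalars are used as plain-int keys
        counts.update(int(i) for i in arr)
    return counts.most_common(25)
```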

But there has to be something better. I have to do this many, many times, and it seems silly to iterate through 2 million things just to get 25. The element IDs are integers and are currently held in numpy arrays inside a larger array; this ID is what makes up the key to the dictionary.

It currently takes about 5 seconds to accomplish this with my current algorithm.

So does anyone know the best solution or algorithm? I think the trick lies in matrix intersections but I do not know.

The common elements do not reside in ALL arrays.
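Since the element IDs are integers, one direction is to keep the counting inside numpy itself; a sketch, assuming the IDs are nonnegative and not too sparse (both are assumptions about the data):

```python
import numpy as np

def top25_bincount(arrays, n=25):
    """Top-n (count, id) pairs via bincount, assuming nonnegative int IDs.

    np.bincount counts every ID in one pass in C; np.argpartition then
    finds the n largest counts without fully sorting ~70k entries.
    """
    flat = np.concatenate([np.asarray(a).ravel() for a in arrays])
    counts = np.bincount(flat)
    n = min(n, np.count_nonzero(counts))          # don't ask for more than exist
    top = np.argpartition(counts, -n)[-n:]        # indices of the n biggest counts
    top = top[np.argsort(counts[top])[::-1]]      # order those n descending
    return [(int(counts[i]), int(i)) for i in top]
```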

I asked around on Usenet and came up with this solution:

import numpy as np
import heapq

def frequency(arr):
    '''Generate frequency-value pairs from a numpy array'''
    clustered = arr.flatten()        # copy (so we can safely sort)
    clustered.sort()                 # bring identical values together
    scanner = iter(clustered)
    last = next(scanner)
    count = 1
    for el in scanner:
        if el == last:
            count += 1
        else:
            yield count, last
            last = el
            count = 1
    yield count, last

def most_frequent(arr, N):
    '''Return the top N (freq, val) elements in arr'''
    counted = frequency(arr)         # get an iterator for freq-val pairs
    heap = []
    # First, just fill up the heap with the first N distinct values
    for i in range(N):
        try:
            heap.append(next(counted))
        except StopIteration:
            break                    # if we run out here, no need for a heap
    else:
        # More to go: switch to a min-heap, and replace the least
        # element every time we find something better
        heapq.heapify(heap)
        for pair in counted:
            if pair > heap[0]:
                heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True)  # put most frequent first
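For reference, the sort-and-scan generator above follows the pattern that itertools.groupby expresses directly; a compact, self-contained equivalent:

```python
import itertools

def frequency_groupby(values):
    """Yield (count, value) pairs from an iterable of values.

    Same idea as frequency() above: sorting brings duplicates together,
    then groupby counts each run of equal values.
    """
    for val, run in itertools.groupby(sorted(values)):
        yield sum(1 for _ in run), val
```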

But that is still much too slow. I need to do this operation over 450,000 times, and spending 5-15 seconds on each is simply not possible.

Any help would be appreciated.
