13 December, 2009

Random Subset of Combinations

I was solving an interesting problem today and thought I'd write about it and post some nice code.
Imagine you have a list of fields with a discrete set of possible values each. Example:


one of [John, Paul, Marie, ….]
one of [Turner, Conner, Lennon…]
a date between 01 January and 31 December
one of [USA, Canada, Germany, France...]


The number of possible combinations is equal to the multiplication of the number of possible values for each field and grows very fast.
So now imagine that you want a way to randomly generate a small number of combinations, where there are no repetitions (like 100 random, unique people from the example above) and you want it done in an optimal way.
My solution (I have the source in Python so don't freak out from my attempt at explaining):
1. Map every combination to an integer in the range between 0 and the total number of combinations (so the first possible person has id of 0 and the last one has id of TOTAL - 1).
2. Generating a random combination is equivalent to generating a random id between 0 and TOTAL - 1 and then building the corresponding combination.
3. The mapping is done by representing the combinations as points in a multidimensional space using each field as a dimension. The index of the value of the field in the set of all possible value is the coordinate in this dimension.
4. Lets say that we have N fields each with a list of possible values, L1, …, LN and sizes Size1, …, SizeN.
The number of all combinations in a space is the multiplication of all the sizes:
TOTAL = Size1*Size2…*SizeN
5. If we remove the Nth field we are left with a space with a lower dimension and a total of:
TotalN-1 = Size1*Size2*…SizeN-1
We can call this the Subtotal of N or SubN. With some calculations it's clear that
SubN = SizeN-1*SubN-1
and we set Sub1 = 1
6. Lets name the index of the value of the i-th field of a combination Xi. Then the id of a combination can be calculated like this:
ID = XN*SubN + XN-1*SubN-1 + …+ X1 * Sub1
7. If we have the ID we can calculate X1, …XN like this:
XN = ID/SubN (/ is integer division, i.e we ignore the remainder)
XN-1 = (ID % SubN)/SubN-1 (% is modulo i.e. the remainder of the division)

X1 = ((ID % SubN)%SubN-1)…%Sub2
8. So this gives us a way to quickly generate a random combination, but still doesn't solve the initial problem of getting a whole subset of non-repeating, random combinations.
This is equivalent to generating a list of random, non-repeating numbers in the range 0 - (TOTAL - 1).
9. I wasn't sure if this can be done optimally, but after some reading (Great article on k-permutations) it seemed that there is a very efficient algorithm.
For a subset of size K from N numbers it is possible to do it in O(K) both for space and speed. After I implemented the algorithm I found out that Python has a method that can do this (random.sample) so I'm using it in my code, but my implementation is also there (shuffle_range) if you need to do it in a different language or would like to appreciate how short and beautiful it is ;-)
10. The idea of the algorithm is to take all the numbers from 1 to N (bare with me, it gets optimized later) and shuffle them. Then get the first K numbers and throw the rest.
The shuffling is done by putting the numbers in a big array, iterating through them and for each position generating a random number between 0 and n-1 then swapping the value at the current position with the value at the random position.
The optimization that I mention allows this to be reduced from an O(N) to an O(K) algorithm both for space and speed.
Instead of creating an array of size N we use a hash table with the rule that if the table contains a value for a position this is the value at that position, else the value is equal to the index of the position. Then we only iterate from 0 to K-1 and generate the result using the above rule.

And after this long explanation, which will either put you to sleep or make you have nightmares in the next few nights, here is the source code, which I hope will clarify things a bit:

"""Generation of a random subset of all possible combinations for a list
of list of options.

This code is released in the public domain, meaning it is free for any
use.
2009 Plamen Dragozov
"""
import random

class Combinator(object):
"""Manages a list of lists and generates combinations of the elements of these lists.
"""
def __init__(self, lists):
"""Initialize with the given lists and pre-calculate the subtotals.

Subtotals are the number of combinations of all lists without the last one,
so subtotal3 is equal to Size2*Size1.
Total is the number of all possible combinations.
"""
subtotals = []
total = 1
for l in lists:
subtotals.append(total)
total = total * len(l)

self.total = total
self.lists = lists
self.subtotals = subtotals

def combination(self, n):
""" A combination is a list where the element at position 'i' is one of the elements in list 'i'.

All possible combinations (with a count of Total) can be mapped to the first Total integers.
This method returns the n-th combination, calculating all element indices in the lists.
"""
result = []
for i in range(len(self.lists) - 1, -1, -1):#iterate backwards, from the last to the 1st
subtotal = self.subtotals[i]
pos = n / subtotal
result.append(self.lists[i][pos])
n = n % subtotal
result.reverse()
return result

def random(self, count):
"""Generate 'count' random combinations without repetition.
"""
ids = random.sample(xrange(self.total), min(count, self.total))
return [self.combination(i) for i in ids]

def shuffle_range(n, k = None):
"""Creates a random sub-permutation of size k for the first n integers.

This is pretty much equivalent to random.sample.
"""
if (not k) or (k > n):
k = n
hashT = {}
for i in xrange(k):
j = random.randint(0, n - 1)
#swap whats at i with what's at j and vice versa
hashT[i], hashT[j] = hashT.get(j, j), hashT.get(i, i)

return [hashT.get(i, i) for i in xrange(k)]

No comments: