Archive

Archive for May, 2014

How they work, Cryptocurrencies Like Bitcoin

May 8th, 2014 No comments

This is a well written explanation of Cryptocurrencies like Bitcoin based on Minimum Viable Block Chain.

http://www.igvita.com/

Categories: Uncategorized Tags:

de Bruijn sequence: Part 1

May 7th, 2014 No comments

I was thinking about genetic sequences and the problem of assembly. I have followed Titus Brown‘s use of Bloom Filters. To learn more about Bloom Filters and their application to genetic sequence assembly check out Handling ridiculous amounts of data with probabilistic data structures by Titus Brown. Part of the problem is both the amount of data and the number of combinations or sequence samples. I am interested in the most compact for to store all possible combination of sequences of a given length. The solution is a de Bruijn sequence

Below are a few different functions I experimented with to produce a de Bruijn sequence.

In [2]:
import memory_profiler
import timeit
import string

def de_bruijn_1(k, n):
    """
    De Bruijn sequence for alphabet size k (0,1,2...k-1)
    and subsequences of length n.
    From wikipedia Sep 22 2013:
        Modified last line to return a string rather than a list
    """
    a = [0] * k * n
    sequence = []
    def db(t, p,):
        if t > n:
            if n % p == 0:
                for j in range(1, p + 1):
                    sequence.append(a[j])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(int(a[t - p]) + 1, k):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    #return sequence  #original
    return ''.join([str(i) for i in sequence])

################
def de_bruijn_strings(k, n):
    """
    De Bruijn sequence for alphabet size k (0,1,2...k-1)
    and subsequences of length n.
    Modifed wikipedia Sep 22 2013 to use strips
    """
    global sequence
    global a
    a = '0' * k * n
    sequence = ''
    def db(t, p):
        global sequence
        global a
        if t > n:
            if n % p == 0:
                for j in range(1, p + 1):
                    sequence = sequence + a[j]
        else:
            a = a[:t] + a[t - p]  + a[t+1:]
            db(t + 1, p)
            for j in range(int(a[t - p]) + 1, k):
                a = a[:t] + str(j)  + a[t+1:]
                db(t + 1, t)
        return sequence
    db(1, 1)
    return sequence

################
_mapping = bytearray(b"?")*256
_mapping[:10] = b"0123456789"

def de_bruijn_bytes(k, n):
    """
    By Peter Otten on python-list
    """
    a = k * n * bytearray([0])
    sequence = bytearray()
    extend = sequence.extend
    def db(t, p):
        if t > n:
            if n % p == 0:
                extend(a[1: p+1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    return sequence.translate(_mapping).decode("ascii")

d1 = de_bruijn_1(4, 9)
d2 = de_bruijn_strings(4, 9)
d3 = de_bruijn_bytes(4, 9)

print('They are the same?', d1==d2==d3)

# from de_bruijn import *
%memit de_bruijn_1(4, 10)
%memit de_bruijn_strings(4, 10)
%memit de_bruijn_bytes(4, 10)
('They are the same?', True)
peak memory: 43.86 MiB, increment: 17.18 MiB
peak memory: 28.51 MiB, increment: 0.53 MiB
peak memory: 33.89 MiB, increment: 5.03 MiB

As you can see above the memory requirments for the de_bruijn_strings and de_bruijn_bytes and significantly less. This is really the limiting factor in increasing the word size. On the machine I am testing on I think I can go up to de_bruijn_bytes(4, 17) before I fill the memory.

In [3]:
!python -m timeit -s 'from de_bruijn import de_bruijn_1' 'de_bruijn_1(4, 10)'
!python -m timeit -s 'from de_bruijn import de_bruijn_strings' 'de_bruijn_strings(4, 10)'
!python -m timeit -s 'from de_bruijn import de_bruijn_bytes' 'de_bruijn_bytes(4, 10)'
10 loops, best of 3: 585 msec per loop
10 loops, best of 3: 59.5 sec per loop
10 loops, best of 3: 168 msec per loop

Notice That although the de_bruijn_strings uses low memory it is really slow. de_bruijn_bytes is my far a better solution.

 

Categories: Uncategorized Tags: