Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Rust function as slow as its python counterpart

I am trying to speed up Python programs using Rust, a language in which I am a total beginner. I wrote a function that counts the occurrences of each possible string of length n within a larger string. For instance, if the main string is "AAAAT" and n=3, the outcome would be a hashmap {"AAA":2,"AAT":1}. I use pyo3 to call the Rust function from Python. The code of the Rust function is:

fn count_nmers(seq : &str,n :usize) -> PyResult<HashMap<&str,u64>> {
    let mut current_pos : usize =0;
    let mut counts : HashMap<&str,u64> =HashMap::new();
    while current_pos+n <= seq.len() {
        //print!("{}\n", &seq[current_pos..current_pos+n]);
        match counts.get(&seq[current_pos..current_pos+n]) {
            Some(repeats) => counts.insert(&seq[current_pos..current_pos+n],repeats+1),
            None => counts.insert(&seq[current_pos..current_pos+n],1)
        };
        current_pos +=1;
    }
    //print!("{:?}",counts)
    Ok(counts)
}

When I use small values for n (n<10), Rust is about an order of magnitude faster than Python, but as the length of n increases, the gap tends to zero with both functions having the same speed by n=200. (see graph)
[Times to count for different n-mer lengths (Python black, rust red)][1]

I must be doing something wrong with the strings, but I can’t find the mistake.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

Edit: The python code is:

def nmer_freq_table(sequence,nmer_length=6):
    nmer_dict=dict()
    for nmer in seq_win(sequence,window_size=nmer_length):
        if str(nmer) in nmer_dict.keys():
            nmer_dict[str(nmer)]=nmer_dict[str(nmer)]+1
        else:
            nmer_dict[str(nmer)]=1
    return nmer_dict

def seq_win(seq,window_size=2):
    length=len(seq)
    i=0
    while i+window_size <= length:
        yield seq[i:i+window_size]
        i+=1

Edit: changed the picture to reflect improved code and larger n.

>Solution :

You are computing hash function multiple times, this may matter for large n values. Try using entry function instead of manual inserts:

    while current_pos+n <= seq.len() {
        let en = counts.entry(&seq[current_pos..current_pos+n]).or_default();
        *en += 1;
        current_pos +=1;
    }

Complete code here

Next, make sure you are running --release compiled code like cargo run --release.

And finally, on large data, most of time is spent in HashMap/dict internals which are not a python, but compiled code. So don’t expect it to scale well.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading