memory mapping #33

Open
mcolpus opened this issue Dec 9, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

mcolpus commented Dec 9, 2024

Thanks for the great software. I wanted to check whether memory mapping (as with Kraken2) works with Sylph, or whether it doesn't make sense here. The index is much smaller than Kraken2's, but it is still about 14 GB.

I tried putting the database in /dev/shm, and that didn't seem to affect the running time.
The longest step was obtaining sketches, which took about 20 seconds either way. Looking at htop, though, the limiting step seems to be loading the database into RAM.

ubuntu@mcolpus-main:~/pipelines/gatekeeper_pipeline$ sylph profile /mnt/block_data/sylph_dbs/gtdb-r220-c200-dbv1.syldb -t 20 -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o out
2024-12-09T11:08:17.811Z INFO  [sylph::contain] Obtaining sketches...
2024-12-09T11:08:37.584Z INFO  [sylph::contain] Finished obtaining genome sketches.
2024-12-09T11:08:41.640Z INFO  [sylph::contain] reads_1.fastq.gz taxonomic profiling; reassigning k-mers for 1 genomes...
2024-12-09T11:08:41.685Z INFO  [sylph::contain] reads_1.fastq.gz has 1 genomes passing profiling threshold. 
2024-12-09T11:08:41.685Z INFO  [sylph::contain] Finished paired sample reads_1.fastq.gz.
2024-12-09T11:08:41.685Z INFO  [sylph::contain] sylph finished.

ubuntu@mcolpus-main:~/pipelines/gatekeeper_pipeline$ sylph profile /dev/shm/gtdb-r220-c200-dbv1.syldb -t 20 -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o out
2024-12-09T11:09:02.694Z INFO  [sylph::contain] Obtaining sketches...
2024-12-09T11:09:23.187Z INFO  [sylph::contain] Finished obtaining genome sketches.
2024-12-09T11:09:27.228Z INFO  [sylph::contain] reads_1.fastq.gz taxonomic profiling; reassigning k-mers for 1 genomes...
2024-12-09T11:09:27.273Z INFO  [sylph::contain] reads_1.fastq.gz has 1 genomes passing profiling threshold. 
2024-12-09T11:09:27.273Z INFO  [sylph::contain] Finished paired sample reads_1.fastq.gz.
2024-12-09T11:09:27.273Z INFO  [sylph::contain] sylph finished.

If I change to using -t 1 then it still takes 20 seconds to obtain sketches, but subsequent steps take longer as expected.
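
The "Obtaining sketches" durations can be read directly off the log timestamps above. A minimal Python sketch of that arithmetic follows; the helper name `step_seconds` and the labels "block storage" and "/dev/shm" are illustrative choices, not part of sylph.

```python
from datetime import datetime

def step_seconds(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps of the form shown above."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# Start and end of "Obtaining sketches" in each run, copied from the logs.
runs = {
    "block storage": ("2024-12-09T11:08:17.811Z", "2024-12-09T11:08:37.584Z"),
    "/dev/shm": ("2024-12-09T11:09:02.694Z", "2024-12-09T11:09:23.187Z"),
}

for label, (start, end) in runs.items():
    print(f"{label}: {step_seconds(start, end):.3f} s")
```

Both runs come out at roughly 20 seconds, which matches the observation that moving the database to /dev/shm made no measurable difference.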

bluenote-1577 (Owner) commented Dec 9, 2024

Hi @mcolpus, I have not written any memory-mapping-aware code, so I don't anticipate it working.

The way to "get around this" is to profile multiple samples at once:

sylph profile -1 A1.fq B1.fq C1.fq -2 A2.fq B2.fq C2.fq ...

This loads the database only once and is more efficient.
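
For pipelines that collect samples programmatically, the batched call can be assembled in one place. This is only a sketch: the function name `build_profile_command` and its defaults are hypothetical, and it uses just the flags that appear in this thread (`-t`, `-1`, `-2`, `-o`).

```python
import shlex

def build_profile_command(db, pairs, threads=20, out="out"):
    """Assemble a single `sylph profile` call covering every (R1, R2) pair.

    `pairs` is a list of (first_mate, second_mate) file paths. The order is
    preserved so that the i-th file after -1 matches the i-th file after -2.
    """
    firsts = [r1 for r1, _ in pairs]
    seconds = [r2 for _, r2 in pairs]
    return ["sylph", "profile", db, "-t", str(threads),
            "-1", *firsts, "-2", *seconds, "-o", out]

# Hypothetical sample names, mirroring the A/B/C example above.
cmd = build_profile_command(
    "gtdb-r220-c200-dbv1.syldb",
    [("A1.fq", "A2.fq"), ("B1.fq", "B2.fq"), ("C1.fq", "C2.fq")],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, so the database load is paid once per batch rather than once per sample.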

I'll look at including memory mapping in the future, depending on how easy it is to implement.
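
For readers unfamiliar with the term, here is a generic illustration of memory mapping using Python's standard `mmap` module. It is not sylph code and says nothing about how sylph stores or would map its database; the file contents and the helper `read_slice_mapped` are made up for the example.

```python
import mmap
import os
import tempfile

def read_slice_mapped(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset` through a read-only memory map."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; the OS pages data in on access
        # instead of copying the entire file into process memory up front.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]

# Hypothetical stand-in for a large database file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header" + bytes(1024) + b"payload")
    demo_path = tmp.name

print(read_slice_mapped(demo_path, 0, 6))
os.remove(demo_path)
```

Whether this pattern helps depends on the on-disk format: it pays off when the file can be used in place, and much less when the data must be deserialized into in-memory structures anyway.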

bluenote-1577 added the enhancement label on Dec 10, 2024