Graphtyper copying the same files multiple times if using "--region_file" #159

Open
sroener opened this issue Jan 7, 2025 · 2 comments
Comments

@sroener

sroener commented Jan 7, 2025

Hi,

thank you for writing and maintaining graphtyper.

I noticed that graphtyper copies the same CRAM files multiple times when I use the "--region_file" option. My log contains messages like the following:

[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:29:36.836] Running with up to 72 threads.
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:36.836] Temporary folder is /tmp/graphtyper_241122_002936_chr2_001010000.iWGl68
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.497] Padded region is: chr2:1009000-1422700
[2024-11-22 00:29:39.497] Constructing graph.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling. Thread work: 5/2/4/3/3/3/4/2/3/4/3/3/3/3/2/3/4/4/4/3/3/3/2/3/3/3/2/3/4/4/2/2/3/3/4/4/3/4/3/2/3/2/4/4/3/3/3/2/3/4/4/2/2/3/3/4/4/3/2/2/3/2/3/2/2/3/3/2/3/3/3/2
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz

[2024-11-22 00:30:50.219] SV genotyping region chr2:1223900-1594700
[2024-11-22 00:30:50.219] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:30:50.219] Running with up to 72 threads.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:50.219] Temporary folder is /tmp/graphtyper_241122_003050_chr2_001223900.wcbtZp
[2024-11-22 00:30:50.219] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:30:52.815] Padded region is: chr2:1222900-1795700
[2024-11-22 00:30:52.815] Constructing graph.
[2024-11-22 00:30:52.853] Calculating contig offsets.
[2024-11-22 00:32:04.971] Finished calling. Thread work: 4/3/3/3/3/4/3/3/3/3/4/3/4/3/3/4/3/3/3/3/2/3/3/4/3/4/3/3/3/3/3/3/3/2/3/3/4/4/3/3/3/3/3/3/4/3/3/3/4/4/3/2/3/3/3/3/2/3/3/3/3/3/3/2/2/2/2/2/2/3/2/2
[2024-11-22 00:32:04.972] Merging output VCFs.
[2024-11-22 00:32:10.079] Cleaning up temporary files.

I interpret the logs as follows: the software iterates over the regions in the region_file and repeats the same steps for each one. These steps include copying the input CRAMs and the reference genome to a distinct temporary directory and cleaning it up after calling. This creates a lot of I/O overhead that doesn't seem necessary from my perspective, since the CRAM files and the reference genome do not change between regions.
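
To put numbers on the individual steps, the time between consecutive log lines can be read off the timestamps. Below is a minimal Python sketch that does this. The function name `step_durations` and the embedded `SAMPLE` are my own illustration and not part of graphtyper; the sample lines are taken from the first region above, with the long "Thread work" list left out.

```python
import re
from datetime import datetime

# Matches lines such as "[2024-11-22 00:29:36.836] Merging output VCFs."
LOG_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\] (.*)$")


def step_durations(log_text):
    """Return (message, seconds until the next log line) for each log line but the last."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            entries.append((stamp, match.group(2)))
    return [
        (message, (next_stamp - stamp).total_seconds())
        for (stamp, message), (next_stamp, _) in zip(entries, entries[1:])
    ]


# Lines copied from the first region in the log above (thread-work list omitted).
SAMPLE = """\
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling.
"""

for message, seconds in step_durations(SAMPLE):
    print(f"{seconds:8.3f} s  {message}")
```

For this excerpt the sketch reports about 2.7 s between the copy messages and the start of genotype calling, and about 68 s between "Calculating contig offsets." and "Finished calling.". These are wall-clock times only; they do not show how many bytes were read or written.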

The main question here would be: does graphtyper internally copy the whole data/CRAM file or only parts of it? If the latter, would it be possible to first load the data for all regions and then start processing?

My suggestion would be to move the copying/cleaning of the temporary directory outside of the "loop". This could save a lot of I/O and probably make calling multiple regions faster.
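
As a rough illustration of what I mean, here is a small sketch in Python rather than C++. The functions `stage_region` and `genotype` are hypothetical stand-ins for the copy and calling steps, and the control flow is only my reading of the log, not graphtyper's actual code. Both variants return the results, the number of temporary folders created, and the peak number of staged files held on disk at once.

```python
import tempfile
from pathlib import Path


def stage_region(tmp_dir, region):
    """Hypothetical stand-in for copying the data needed for one region."""
    path = Path(tmp_dir) / (region.replace(":", "_") + ".staged")
    path.write_text(region)
    return path


def genotype(staged_path):
    """Hypothetical stand-in for the calling step; it only reads the staged data back."""
    return staged_path.read_text()


def per_region_tmpdir(regions):
    """One temporary folder per region, created and cleaned inside the loop."""
    results, dirs_created = [], 0
    for region in regions:
        with tempfile.TemporaryDirectory(prefix="sketch_") as tmp_dir:
            dirs_created += 1
            results.append(genotype(stage_region(tmp_dir, region)))
    return results, dirs_created, 1  # only one staged file exists at any time


def shared_tmpdir(regions):
    """One temporary folder for all regions, created and cleaned outside the loop."""
    with tempfile.TemporaryDirectory(prefix="sketch_") as tmp_dir:
        staged = [stage_region(tmp_dir, region) for region in regions]  # stage everything first
        results = [genotype(path) for path in staged]
    return results, 1, len(staged)  # all staged files coexist until cleanup


# Region names taken from the log above.
REGIONS = ["chr2:1010000-1221700", "chr2:1223900-1594700"]
print(per_region_tmpdir(REGIONS))
print(shared_tmpdir(REGIONS))
```

Both variants produce the same results. The shared folder is set up and cleaned only once, but it holds the staged data for every region at the same time, so its peak temporary disk usage grows with the number of regions.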

I assume the changes would have to be done in genotype_sv.cpp.

Please let me know if my suggestions are feasible.

@hannespetur
Member

Hey, only parts of the reference and CRAM files are copied for each region. There is currently no option to load the data for all regions first and then start processing. I get your point that it would reduce I/O calls, so I am fine with having it as an option, but the downside is that it would also require a lot more temporary disk space, so I don't think I'd find that option useful in my SV genotyping analysis.

Best, Hannes

@sroener
Author

sroener commented Mar 5, 2025

Hi @hannespetur,

thank you for answering my questions! Knowing that it is only downloading the necessary chunks is very valuable!

I am not entirely sure how to understand your response regarding the download. Would the option be feasible to implement, or are you not considering it because it is not useful for your analysis?

Best,

Sebastian
