I noticed that graphtyper copies the same CRAM files multiple times when I use the "--region_file" option. My log contains messages like the following:
[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:29:36.836] Running with up to 72 threads.
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:36.836] Temporary folder is /tmp/graphtyper_241122_002936_chr2_001010000.iWGl68
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.497] Padded region is: chr2:1009000-1422700
[2024-11-22 00:29:39.497] Constructing graph.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling. Thread work: 5/2/4/3/3/3/4/2/3/4/3/3/3/3/2/3/4/4/4/3/3/3/2/3/3/3/2/3/4/4/2/2/3/3/4/4/3/4/3/2/3/2/4/4/3/3/3/2/3/4/4/2/2/3/3/4/4/3/2/2/3/2/3/2/2/3/3/2/3/3/3/2
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz
[2024-11-22 00:30:50.219] SV genotyping region chr2:1223900-1594700
[2024-11-22 00:30:50.219] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:30:50.219] Running with up to 72 threads.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:50.219] Temporary folder is /tmp/graphtyper_241122_003050_chr2_001223900.wcbtZp
[2024-11-22 00:30:50.219] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:30:52.815] Padded region is: chr2:1222900-1795700
[2024-11-22 00:30:52.815] Constructing graph.
[2024-11-22 00:30:52.853] Calculating contig offsets.
[2024-11-22 00:32:04.971] Finished calling. Thread work: 4/3/3/3/3/4/3/3/3/3/4/3/4/3/3/4/3/3/3/3/2/3/3/4/3/4/3/3/3/3/3/3/3/2/3/3/4/4/3/3/3/3/3/3/4/3/3/3/4/4/3/2/3/3/3/3/2/3/3/3/3/3/3/2/2/2/2/2/2/3/2/2
[2024-11-22 00:32:04.972] Merging output VCFs.
[2024-11-22 00:32:10.079] Cleaning up temporary files.
I read the logs as graphtyper iterating over the regions in the region file and repeating the same steps for each one: copying the input CRAMs and the reference genome to a distinct temporary directory, then cleaning it up after calling. This creates a lot of I/O overhead that seems unnecessary from my perspective, since the CRAM files and the reference genome do not change between regions.
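One way to put a number on that overhead is to diff the timestamps in the log. The following is a minimal Python sketch, not part of graphtyper. It assumes that the gap between the "Copying data from" message and "Genotype calling step starting." approximates the copy time, which may also include other setup work. The function name `phase_durations` is made up for this example, and it only looks at the first region in the text it is given.

```python
from datetime import datetime

# Lines copied verbatim from the first region in the log above.
LOG = """\
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz
"""

# (phase name, message prefix that marks the start of that phase).
# The mapping from messages to phases is an assumption of this sketch.
MARKERS = [
    ("copy", "Copying data from"),
    ("calling", "Genotype calling step starting."),
    ("merge", "Merging output VCFs."),
    ("cleanup", "Cleaning up temporary files."),
    ("end", "Finished!"),
]


def parse_line(line):
    """Split '[timestamp] message' into a datetime and the message text."""
    stamp, _, message = line.partition("] ")
    when = datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S.%f")
    return when, message


def phase_durations(log_text):
    """Return seconds spent in each phase of the first region in log_text."""
    times = {}
    for line in log_text.splitlines():
        if not line.startswith("["):
            continue
        when, message = parse_line(line)
        for name, prefix in MARKERS:
            # Keep only the first occurrence of each marker.
            if message.startswith(prefix) and name not in times:
                times[name] = when
    names = [name for name, _ in MARKERS]
    return {
        start: (times[end] - times[start]).total_seconds()
        for start, end in zip(names, names[1:])
    }


if __name__ == "__main__":
    for phase, seconds in phase_durations(LOG).items():
        print(f"{phase}: {seconds:.3f} s")
```

Under that assumption, the first region spends about 2.7 s before calling starts and about 68 s in the calling step itself, so the timestamps alone do not show where the reads for the region are actually fetched.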
The main question here is: does graphtyper internally copy the whole CRAM file or only parts of it? In the latter case, would it be possible to first load the data for all regions and then start processing?
My suggestion would be to move the copying and cleaning of the temporary directory outside of the per-region "loop". This could save a lot of I/O and probably make calling multiple regions faster.
I assume the changes would have to be made in genotype_sv.cpp.
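To make the suggestion concrete, here is a small Python sketch of the two control flows. It is only an illustration of the idea and does not reflect how genotype_sv.cpp is actually structured. `genotype_region`, `run_per_region`, and `run_shared` are hypothetical names, and the "work" is a placeholder file write.

```python
import shutil
import tempfile
from pathlib import Path

# Regions taken from the log above.
REGIONS = ["chr2:1010000-1221700", "chr2:1223900-1594700"]


def genotype_region(region, workdir):
    """Hypothetical stand-in for the per-region work: writes a marker file."""
    name = region.replace(":", "_") + ".txt"
    (workdir / name).write_text(region)


def run_per_region(regions):
    """Pattern seen in the log: set up and clean a temp dir for every region."""
    setups = 0
    for region in regions:
        workdir = Path(tempfile.mkdtemp(prefix="gt_sketch_"))
        setups += 1
        try:
            genotype_region(region, workdir)
        finally:
            shutil.rmtree(workdir)
    return setups


def run_shared(regions):
    """Suggested pattern: one temp dir set up before the loop, cleaned after."""
    workdir = Path(tempfile.mkdtemp(prefix="gt_sketch_"))
    setups = 1
    try:
        for region in regions:
            genotype_region(region, workdir)
    finally:
        shutil.rmtree(workdir)
    return setups


if __name__ == "__main__":
    print("per-region setups:", run_per_region(REGIONS))
    print("shared setups:", run_shared(REGIONS))
```

The shared variant performs the setup and cleanup once regardless of how many regions are listed, at the cost of keeping everything in the temporary directory until the last region is done.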
Please let me know if my suggestions are feasible.
Hey, only parts of the reference and CRAM files are copied for each region. There is currently no option to load the data for all regions first and then start processing. I get your point that it would reduce I/O calls, so I am fine with having it as an option. The downside is that it would also require a lot more temporary disk space, so I don't think I'd find that option useful in my SV genotyping analysis.
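The disk-space trade-off can be illustrated with the padded regions from the log. The Python sketch below uses the covered base-pair span as a rough proxy for temporary disk usage, which is an assumption: real usage depends on read depth and on the number of input files. The function names are made up for this example, and lengths are computed as end minus start.

```python
# Padded regions as reported in the log ("Padded region is: ...").
PADDED = ["chr2:1009000-1422700", "chr2:1222900-1795700"]


def parse_region(text):
    """Parse 'contig:start-end' into (contig, start, end)."""
    contig, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)


def peak_per_region(regions):
    """Largest single span: the peak when regions are copied one at a time."""
    return max(end - start for _, start, end in regions)


def union_length(regions):
    """Total span with overlaps merged: the peak when all regions are loaded first."""
    total = 0
    current = None
    for contig, start, end in sorted(regions):
        if current and current[0] == contig and start <= current[2]:
            # Overlaps the running interval, so extend it.
            current = (contig, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (contig, start, end)
    if current:
        total += current[2] - current[1]
    return total


if __name__ == "__main__":
    regions = [parse_region(r) for r in PADDED]
    print("peak span, one region at a time:", peak_per_region(regions))
    print("span held when loading everything first:", union_length(regions))
```

For these two regions the per-region peak is 572,800 bp, while loading both first would hold 786,700 bp at once even after merging their overlap. With a full region file the second number grows with every region, while the first stays bounded by the largest one.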
Thank you for answering my questions! Knowing that it only downloads the necessary chunks is very valuable!
I am not entirely sure how to read your response about the option. Would it be feasible to implement, or are you not considering it because it is not useful for your analysis?