Description of feature
Hi, and thanks for providing this nice pipeline. It really saves me a lot of time and headache when downloading large amounts of published data. However, I recently ran into some hiccups, which left me wondering whether a check for this could be included in the pipeline.

So here is the problem:

When downloading files with `sratools`, it sometimes happens that the data stream fetched by `prefetch` is corrupt, but `prefetch` just continues as if nothing happened and the process completes as usual. Unfortunately, `fasterq-dump` later complains and fails with "SRA ID not found" (in another case it was "qualities empty" or something similar). I guess this could easily be solved by simply retrying, or by using FTP or Aspera to download these particular files, so on its own it is not a big concern. However, when using SRA experiment IDs (SRX) for download, this can lead to a situation where a sample is only partially downloaded: if an SRX contains multiple SRRs and one of them fails, I am left with an incomplete download and have to manually check the failed processes to see whether other runs of the same SRX did finish. Not only is parsing the `.nextflow.log` file additional work, it is also cumbersome for downstream processing (in my case via `rnaseq`) if one is not aware of this behaviour.
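For what it's worth, the retry workaround mentioned above could look roughly like the sketch below. It only illustrates the idea: `prefetch` and `vdb-validate` are standard `sra-tools` commands, but the wrapper itself (names, retry count, assumed output layout) is hypothetical and not part of the pipeline.

```python
#!/usr/bin/env python
"""Sketch of a retry-and-validate wrapper around prefetch.

Assumes prefetch writes its output to ./<accession>/ (the default layout),
so vdb-validate can be pointed at that location afterwards.
"""

import subprocess
import sys


def prefetch_with_validation(run_accession: str, max_retries: int = 3) -> bool:
    """Return True once prefetch succeeds AND vdb-validate confirms the download."""
    for attempt in range(1, max_retries + 1):
        prefetch = subprocess.run(["prefetch", run_accession])
        if prefetch.returncode == 0:
            # vdb-validate checks the integrity of the downloaded archive;
            # a corrupt stream that prefetch missed should be caught here.
            validate = subprocess.run(["vdb-validate", run_accession])
            if validate.returncode == 0:
                return True
        print(f"Attempt {attempt} for {run_accession} failed, retrying...", file=sys.stderr)
    return False


if __name__ == "__main__":
    if not prefetch_with_validation(sys.argv[1]):
        sys.exit(f"ERROR: no valid download for {sys.argv[1]} after retries")
```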
Possible solution:

Connected to #270, I would suggest checking for completeness based on the retrieved metadata, i.e. simply verifying that the number of downloaded SRRs matches the number listed in the ENA table, and failing if it does not. This way a partial download can still happen, but it does not pollute the results directory.
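A minimal sketch of such a completeness check, assuming the ENA metadata is available as a TSV with `experiment_accession` and `run_accession` columns (as in the ENA read_run report) and that the pipeline can pass in the run accessions it actually managed to download. The script name, function names and command-line interface are made up for illustration:

```python
#!/usr/bin/env python
"""Sketch: fail if an experiment (SRX) is missing runs (SRRs) listed in the ENA table."""

import csv
import sys
from collections import defaultdict


def expected_runs(ena_tsv: str) -> dict:
    """Map each experiment accession to the set of run accessions ENA lists for it."""
    runs = defaultdict(set)
    with open(ena_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            runs[row["experiment_accession"]].add(row["run_accession"])
    return runs


def check_completeness(ena_tsv: str, downloaded: list) -> None:
    """Exit non-zero if any experiment has fewer downloaded runs than ENA lists."""
    downloaded_set = set(downloaded)
    failed = False
    for srx, srrs in expected_runs(ena_tsv).items():
        missing = sorted(srrs - downloaded_set)
        if missing:
            failed = True
            print(f"ERROR: {srx} is incomplete, missing: {', '.join(missing)}", file=sys.stderr)
    if failed:
        sys.exit(1)
    print("All experiments contain the expected number of runs.")


if __name__ == "__main__":
    # Hypothetical usage: check_completeness.py ena_metadata.tsv SRR0000001 SRR0000002 ...
    check_completeness(sys.argv[1], sys.argv[2:])
```

Hooked in after the download step, something like this would make the whole sample fail loudly instead of silently publishing a partial set of files.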