remove partially failed samples from output #340

Open
dmalzl opened this issue Mar 19, 2025 · 0 comments
Labels
enhancement Improvement for existing functionality

Comments

dmalzl commented Mar 19, 2025

Description of feature

Hi, and thanks for providing this nice pipeline. It really saves me a lot of time and headaches when downloading large amounts of published data. However, I recently ran into a hiccup and wondered whether handling for it could be included in the pipeline.

So here is the problem:
When downloading files with sratools, the data stream fetched by prefetch sometimes seems to be corrupt, but prefetch just continues as if nothing happened and the process completes as usual. Unfortunately, fasterq-dump later complains and fails with "SRA ID not found" (another message was "qualities empty" or something similar). I guess this could easily be solved by retrying or by using FTP or Aspera to download these particular files, so on its own it is not much of a concern. However, when using SRA experiment IDs (SRX) for download, this can leave a sample only partially downloaded: if an SRX contains multiple SRRs and one of them fails, I am left with an incomplete download and have to manually check the failed processes to see whether other runs of the same SRX finished. Not only is it additional work to parse the .nextflow.log file, it is also cumbersome for downstream processing (in my case via rnaseq) if one is not aware of this behaviour.

Possible solution:
Connected to #270, I would suggest checking for completeness based on the retrieved metadata, i.e. a simple check that the number of downloaded SRRs is consistent with the number listed in the ENA table, and simply fail if not. This way, a partial download can happen without polluting the results directory.
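
To illustrate the idea, here is a minimal Python sketch of such a completeness check. It is not taken from the pipeline: the function name `find_incomplete_experiments`, the shape of the inputs, and the placeholder accessions are all assumptions made for illustration. It assumes the metadata is available as rows with `experiment_accession` and `run_accession` columns, and that the list of successfully downloaded runs is known.

```python
from collections import defaultdict


def find_incomplete_experiments(metadata_rows, downloaded_runs):
    """Return {experiment: [missing runs]} for experiments that lack runs.

    metadata_rows: iterable of dicts with the (assumed) keys
        'experiment_accession' and 'run_accession'.
    downloaded_runs: iterable of run accessions that downloaded successfully.
    """
    # Collect the runs each experiment is expected to contain.
    expected = defaultdict(set)
    for row in metadata_rows:
        expected[row["experiment_accession"]].add(row["run_accession"])

    downloaded = set(downloaded_runs)

    # An experiment is incomplete if any expected run is absent.
    return {
        experiment: sorted(runs - downloaded)
        for experiment, runs in expected.items()
        if not runs <= downloaded
    }


# Hypothetical placeholder accessions, not real records.
metadata = [
    {"experiment_accession": "SRX0000001", "run_accession": "SRR0000001"},
    {"experiment_accession": "SRX0000001", "run_accession": "SRR0000002"},
    {"experiment_accession": "SRX0000002", "run_accession": "SRR0000003"},
]

incomplete = find_incomplete_experiments(metadata, ["SRR0000001", "SRR0000003"])
```

An experiment that appears in the returned mapping could then be failed or excluded from the results directory, rather than being passed downstream with only some of its runs.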

dmalzl added the enhancement (Improvement for existing functionality) label Mar 19, 2025