Description of feature
Hi, and thanks for providing this nice pipeline. It really saves me a lot of time and headache when downloading large amounts of published data. However, I recently ran into some hiccups, which left me wondering whether a check for this could be included in the pipeline.

So here is the problem:

When downloading files with `sratools`, it sometimes happens that the data stream fetched by `prefetch` is corrupt, but `prefetch` just continues as if nothing happened and the process completes as usual. Unfortunately, `fasterq-dump` later complains and fails with "SRA ID not found" (in another case it was "qualities empty" or something similar). I guess this could easily be solved by simply retrying, or by using FTP or Aspera to download these particular files, so on its own it is not a big concern. However, when using SRA experiment IDs (SRX) for download, this can lead to a situation where a sample is only partially downloaded: if an SRX contains multiple SRRs and one of them fails, I am left with an incomplete download and have to manually check the failed processes to see whether other runs of the same SRX did finish. Not only is parsing the `.nextflow.log` file additional work, it is also cumbersome for downstream processing (in my case via `rnaseq`) if one is not aware of this behaviour.
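For what it's worth, the retry workaround mentioned above could look roughly like the sketch below. It only illustrates the idea: `prefetch` and `vdb-validate` are standard `sra-tools` commands, but the wrapper itself (names, retry count, assumed output layout) is hypothetical and not part of the pipeline.

```python
#!/usr/bin/env python
"""Sketch of a retry-and-validate wrapper around prefetch.

Assumes prefetch writes its output to ./<accession>/ (the default layout),
so vdb-validate can be pointed at that location afterwards.
"""

import subprocess
import sys


def prefetch_with_validation(run_accession: str, max_retries: int = 3) -> bool:
    """Return True once prefetch succeeds AND vdb-validate confirms the download."""
    for attempt in range(1, max_retries + 1):
        prefetch = subprocess.run(["prefetch", run_accession])
        if prefetch.returncode == 0:
            # vdb-validate checks the integrity of the downloaded archive;
            # a corrupt stream that prefetch missed should be caught here.
            validate = subprocess.run(["vdb-validate", run_accession])
            if validate.returncode == 0:
                return True
        print(f"Attempt {attempt} for {run_accession} failed, retrying...", file=sys.stderr)
    return False


if __name__ == "__main__":
    if not prefetch_with_validation(sys.argv[1]):
        sys.exit(f"ERROR: no valid download for {sys.argv[1]} after retries")
```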
Possible solution:

Connected to #270, I would suggest checking for completeness based on the retrieved metadata, i.e. simply verifying that the number of downloaded SRRs matches the number listed in the ENA table, and failing if it does not. This way a partial download can still happen, but it does not pollute the results directory.
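A minimal sketch of such a completeness check, assuming the ENA metadata is available as a TSV with `experiment_accession` and `run_accession` columns (as in the ENA read_run report) and that the pipeline can pass in the run accessions it actually managed to download. The script name, function names and command-line interface are made up for illustration:

```python
#!/usr/bin/env python
"""Sketch: fail if an experiment (SRX) is missing runs (SRRs) listed in the ENA table."""

import csv
import sys
from collections import defaultdict


def expected_runs(ena_tsv: str) -> dict:
    """Map each experiment accession to the set of run accessions ENA lists for it."""
    runs = defaultdict(set)
    with open(ena_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            runs[row["experiment_accession"]].add(row["run_accession"])
    return runs


def check_completeness(ena_tsv: str, downloaded: list) -> None:
    """Exit non-zero if any experiment has fewer downloaded runs than ENA lists."""
    downloaded_set = set(downloaded)
    failed = False
    for srx, srrs in expected_runs(ena_tsv).items():
        missing = sorted(srrs - downloaded_set)
        if missing:
            failed = True
            print(f"ERROR: {srx} is incomplete, missing: {', '.join(missing)}", file=sys.stderr)
    if failed:
        sys.exit(1)
    print("All experiments contain the expected number of runs.")


if __name__ == "__main__":
    # Hypothetical usage: check_completeness.py ena_metadata.tsv SRR0000001 SRR0000002 ...
    check_completeness(sys.argv[1], sys.argv[2:])
```

Hooked in after the download step, something like this would make the whole sample fail loudly instead of silently publishing a partial set of files.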