How to improve performance by setting suitable parameters when applying Clair3-RNA to Nanopore cDNA data #16
Comments
Hi Longxin,
Thanks for using Clair3-RNA! For your purpose, you can increase the following parameters to achieve higher precision, at the cost of recall: --snp_min_af, --indel_min_af, and --min_coverage. As a recommendation, set the allele frequency (AF) thresholds to 0.2 or higher, but no more than 0.5, because you are using the variants for genotyping, which requires retaining heterozygous variants. For --min_coverage, a value of 10 or higher should work well.
Just to clarify, we focus on high-confidence regions to ensure a fair and reasonable comparison, not to obtain better results, since not all genomic regions are transcribed.
Cheers,
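The recommendation above can be sketched as a small Python helper that validates the thresholds and builds the extra arguments to append to an existing Clair3-RNA command. Only the three flag names come from this thread; the helper name `precision_flags`, the `--flag=value` spelling, and the range checks are assumptions made for illustration, so check them against the Clair3-RNA usage text before relying on them.

```python
def precision_flags(snp_min_af=0.2, indel_min_af=0.2, min_coverage=10):
    """Build the precision-oriented arguments suggested in this thread.

    Hypothetical helper: it only formats and sanity-checks values, it does
    not run Clair3-RNA. The `--flag=value` form is an assumption.
    """
    for name, af in (("snp_min_af", snp_min_af), ("indel_min_af", indel_min_af)):
        # Above 0.5 a heterozygous variant would be filtered out,
        # which defeats the purpose of genotyping.
        if not 0.0 < af <= 0.5:
            raise ValueError(f"{name} must be in (0, 0.5], got {af}")
    if min_coverage < 1:
        raise ValueError(f"min_coverage must be >= 1, got {min_coverage}")
    return [
        f"--snp_min_af={snp_min_af}",
        f"--indel_min_af={indel_min_af}",
        f"--min_coverage={min_coverage}",
    ]


if __name__ == "__main__":
    # Append these to whatever command you already use to run Clair3-RNA.
    print(" ".join(precision_flags()))
```

Raising the values within the allowed range trades recall for precision, so it may be worth trying a few settings (for example AF 0.2, 0.3, 0.4) and comparing the number of true positives retained.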
Hi Xian @xianyu0623,
Hi Longxin,
That makes sense. In our results, you'll notice another threshold labeled "AD", which stands for allele depth. For example, if a position has an A>G SNP with 10 reads covering it and the AD threshold is set to 4, we only consider this variant during benchmarking if 4 or more reads support the G allele. You can use src/calculate_overall_metrics.py to obtain similar metrics. If you use hap.py directly, it does not take allele depth (AD) into account.
Additionally, ONT cDNA 9.4.1 data tends to have a higher error rate than Iso-Seq, which may explain the less satisfactory results. You can find a detailed comparison in our supplementary data file. As I mentioned, if you still want higher precision, set --snp_min_af and --indel_min_af higher, for example to 0.2.
Feel free to reach out if you have any further questions!
Best,
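To make the AD threshold concrete, here is a minimal Python sketch of the filter described above, applied to a single VCF record. It is not the logic of src/calculate_overall_metrics.py; it assumes a single-sample VCF whose FORMAT column contains an AD field written as "ref_depth,alt_depth[,...]". The function names and the example record are hypothetical.

```python
def alt_allele_depths(vcf_line):
    """Return the read depths of the ALT alleles from the sample's AD field."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")    # e.g. GT:GQ:DP:AD:AF
    sample_values = fields[9].split(":")  # first (only) sample column
    sample = dict(zip(format_keys, sample_values))
    depths = sample["AD"].split(",")      # first entry is the REF depth
    return [int(d) for d in depths[1:] if d != "."]


def passes_min_ad(vcf_line, min_ad):
    """True if at least one ALT allele is supported by min_ad or more reads."""
    alt = alt_allele_depths(vcf_line)
    return bool(alt) and max(alt) >= min_ad


# Made-up record mirroring the example above: an A>G SNP covered by
# 10 reads, 4 of which support the G allele.
EXAMPLE = "chr1\t1000\t.\tA\tG\t20\tPASS\t.\tGT:GQ:DP:AD:AF\t0/1:20:10:6,4:0.4"

if __name__ == "__main__":
    print(passes_min_ad(EXAMPLE, 4))  # kept at AD >= 4
    print(passes_min_ad(EXAMPLE, 5))  # dropped at AD >= 5
```

Applying such a filter to the call set before running hap.py is one way to approximate an AD threshold, since hap.py itself ignores allele depth.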
Dear Xian,
Yes, benchmarking without restricting to high-confidence regions can lead to a decrease in precision. Variant calling outside high-confidence regions is particularly challenging because of issues such as duplications and low-complexity sequence. We also benchmarked these regions separately, using only high-quality datasets including Iso-Seq, MAS-Seq, and ONT dRNA004 (see Supplementary Table 6, "Performance by genomic context on PacBio and ONT datasets", in our manuscript).
Hi, I really appreciate this variant caller designed for long-read RNA-seq.
I have read the Clair3-RNA paper carefully and found that you predefine high-confidence benchmark regions and then use hap.py to compare the variant calls against the ground truth, achieving better precision, recall, and F1 score. I am using the Nanopore cDNA data from the paper, and I do not define high-confidence benchmark regions. I want the highest possible precision with as many TPs as possible, and I do not care about recall. Can you recommend a suitable set of parameters for Clair3-RNA?
I aim to obtain a partial genotype from the RNA-seq data, so I want to keep variant calling precision high.
I look forward to your reply. @zhengzhenxian @xianyu0623
cheers,
Longxin