Skip to content

How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
xhuashen opened this issue Apr 23, 2025 · 5 comments

Comments

@xhuashen
Copy link

Hi, I am very appreciate the variant caller designed for long read RNA seq.
I have a good read the Clair3_RNA literature, I found that you pre define a high confident benchmark region, and then use Hap.py comparing the variant calling result to ground truth. and achieve outperformance in precision, recall , F1 score. I use the cDNA Nanopore data used in the paper, and I don't define high confident benchmark region, I want to achieve the highest precision, don't care recall, and as many as the number of TP as possible, can you recommend a set of suitable parameters used in Clair3_RNA.
I aim to get the partial genotype from the RNA Seq, so I want keep the higher variant calling precision.
I am looking forward your early reply. @zhengzhenxian @xianyu0623
cheers,
Longxin

@xianyu0623
Copy link
Collaborator

Hi Longxin,

Thanks for using Clair3-RNA! For your purpose, you can increase the following parameters to achieve higher precision, though it will come at the cost of recall: --snp_min_af, --indel_min_af, and --min_coverage. As a recommendation, you can set the allele frequency (AF) to 0.2 or higher (but no more than 0.5, since you're using the variants for genotyping, which requires heterozygous variants). For --min_coverage, a value of 10 or higher should work well.

Just to clarify, we focus on high-confidence regions to ensure a fair and reasonable comparison rather than better results, as not all genomic regions are transcribed.

Cheers,
Xian.

@xhuashen
Copy link
Author

Hi Xian @xianyu0623,
I am glad for your rapid reply, I have implement the parameter with --min_coverage 10 for the Nanopore cDNA data HG005, unfortunately, the precision is about 0.722 analysis only encompass chr22 and not define high confident region.
The result of Hap.py is
Image
the precision is lower than 87.19 and 92.15%(DP=4 and DP=10 respectively) you mentioned in your literature, although you pre define a high confident region. So the result is not feasible for genotype association analysis. I have done the identical thing for iso seq long read RNA seq, the precision is about 0.824, it is feasible mildly for genotype association analysis.
My confusion is whether this precision and TP are reasonable for Nanopore cDNA data HG005.
Cheers,
Longxin.

@xianyu0623
Copy link
Collaborator

Hi Longxin,

That makes sense. In our results, you'll notice another threshold labeled “AD,” which stands for allele depth. For example, if a position has an A>G SNP with 10 reads covering it, and the AD is set as 4, we only consider this variant during benchmarking if there are 4 or more reads supporting the G allele. You can use src/calculate_overall_metrics.py to obtain similar metrics. If you're using hap.py directly, it does not take allele depth (AD) into account.

Additionally, ONT cDNA 9.4.1 tends to have higher error rates compared to Iso-Seq, which may explain the less satisfactory results. You can find a detailed comparison in our supplementary data file.

As I mentioned, if you still want higher precision, set --snp_min_af, --indel_min_af higher like 0.2.

Feel free to reach out if you have any further questions!

Best,
Xian.

@xhuashen
Copy link
Author

Dear Xian,
@xianyu0623 thanks very much, I try to filter out the variant of AD< 2 with the Clair3 output. and then comparing to ground truth , the result is , precision is about 0.71, it is still lower than you mentioned in your publish. I guess this is because benchmark region

Image
My mean is the my benchmark region is the RNA aligned reads with DP>=10, the precision is 0.71. but your benchmark region is intersected region incorporating GIAB high-confidence region information. in the two case, the gap is between 0.71 and 0.8719, it is hard to understand.
I think all of GIAB region should be relatively high-confidence.
Maybe I have something error when I implement the pipeline. however, I compared it with other methods in some situation , it still outperformance, thanks for your developing the high-profile tools.
Cheers
Longxin

@xianyu0623
Copy link
Collaborator

xianyu0623 commented Apr 24, 2025

Yes, excluding high-confidence regions can lead to a decrease in precision. Variant calling in regions outside of high-confidence areas is particularly challenging due to issues such as duplications and low complexity. We also conducted benchmarking in these regions separately, despite using only high-quality datasets including Iso-Seq, MAS-Seq, and ONT dRNA004 (You can have a look at Supplementary Table 6. Performance by genomic context on PacBio and ONT datasets in our manuscript).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants