How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

xhuashen · 2025-04-23T09:08:39Z

Hi, I am very appreciate the variant caller designed for long read RNA seq.
I have a good read the Clair3_RNA literature, I found that you pre define a high confident benchmark region, and then use Hap.py comparing the variant calling result to ground truth. and achieve outperformance in precision, recall , F1 score. I use the cDNA Nanopore data used in the paper, and I don't define high confident benchmark region, I want to achieve the highest precision, don't care recall, and as many as the number of TP as possible, can you recommend a set of suitable parameters used in Clair3_RNA.
I aim to get the partial genotype from the RNA Seq, so I want keep the higher variant calling precision.
I am looking forward your early reply. @zhengzhenxian @xianyu0623
cheers,
Longxin

xianyu0623 · 2025-04-23T09:47:30Z

Hi Longxin,

Thanks for using Clair3-RNA! For your purpose, you can increase the following parameters to achieve higher precision, though it will come at the cost of recall: --snp_min_af, --indel_min_af, and --min_coverage. As a recommendation, you can set the allele frequency (AF) to 0.2 or higher (but no more than 0.5, since you're using the variants for genotyping, which requires heterozygous variants). For --min_coverage, a value of 10 or higher should work well.

Just to clarify, we focus on high-confidence regions to ensure a fair and reasonable comparison rather than better results, as not all genomic regions are transcribed.

Cheers,
Xian.

xhuashen · 2025-04-23T13:19:10Z

Hi Xian @xianyu0623,
I am glad for your rapid reply, I have implement the parameter with --min_coverage 10 for the Nanopore cDNA data HG005, unfortunately, the precision is about 0.722 analysis only encompass chr22 and not define high confident region.
The result of Hap.py is

the precision is lower than 87.19 and 92.15%(DP=4 and DP=10 respectively) you mentioned in your literature, although you pre define a high confident region. So the result is not feasible for genotype association analysis. I have done the identical thing for iso seq long read RNA seq, the precision is about 0.824, it is feasible mildly for genotype association analysis.
My confusion is whether this precision and TP are reasonable for Nanopore cDNA data HG005.
Cheers,
Longxin.

xianyu0623 · 2025-04-24T05:24:59Z

Hi Longxin,

That makes sense. In our results, you'll notice another threshold labeled “AD,” which stands for allele depth. For example, if a position has an A>G SNP with 10 reads covering it, and the AD is set as 4, we only consider this variant during benchmarking if there are 4 or more reads supporting the G allele. You can use src/calculate_overall_metrics.py to obtain similar metrics. If you're using hap.py directly, it does not take allele depth (AD) into account.

Additionally, ONT cDNA 9.4.1 tends to have higher error rates compared to Iso-Seq, which may explain the less satisfactory results. You can find a detailed comparison in our supplementary data file.

As I mentioned, if you still want higher precision, set --snp_min_af, --indel_min_af higher like 0.2.

Feel free to reach out if you have any further questions!

Best,
Xian.

xhuashen · 2025-04-24T10:14:58Z

Dear Xian,
@xianyu0623 thanks very much, I try to filter out the variant of AD< 2 with the Clair3 output. and then comparing to ground truth , the result is , precision is about 0.71, it is still lower than you mentioned in your publish. I guess this is because benchmark region

My mean is the my benchmark region is the RNA aligned reads with DP>=10, the precision is 0.71. but your benchmark region is intersected region incorporating GIAB high-confidence region information. in the two case, the gap is between 0.71 and 0.8719, it is hard to understand.
I think all of GIAB region should be relatively high-confidence.
Maybe I have something error when I implement the pipeline. however, I compared it with other methods in some situation , it still outperformance, thanks for your developing the high-profile tools.
Cheers
Longxin

xianyu0623 · 2025-04-24T10:25:07Z

Yes, excluding high-confidence regions can lead to a decrease in precision. Variant calling in regions outside of high-confidence areas is particularly challenging due to issues such as duplications and low complexity. We also conducted benchmarking in these regions separately, despite using only high-quality datasets including Iso-Seq, MAS-Seq, and ONT dRNA004 (You can have a look at Supplementary Table 6. Performance by genomic context on PacBio and ONT datasets in our manuscript).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

xhuashen commented Apr 23, 2025

xianyu0623 commented Apr 23, 2025

xhuashen commented Apr 23, 2025

xianyu0623 commented Apr 24, 2025

xhuashen commented Apr 24, 2025

xianyu0623 commented Apr 24, 2025 •

edited

Loading

How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

How to improve the performance by setting suitable parameter when implement in Nanopore cDNA data #16

Comments

xhuashen commented Apr 23, 2025

xianyu0623 commented Apr 23, 2025

xhuashen commented Apr 23, 2025

xianyu0623 commented Apr 24, 2025

xhuashen commented Apr 24, 2025

xianyu0623 commented Apr 24, 2025 • edited Loading

xianyu0623 commented Apr 24, 2025 •

edited

Loading