Variant to disease/phenotype predicates #1545

Open
EvanDietzMorris opened this issue Jan 10, 2025 · 20 comments
Comments

@EvanDietzMorris
Collaborator

The model is generally lacking good ways to represent sequence variant to disease relationships. Gene associated with condition exists, but as far as I can tell there is no way to provide more granularity or specifics. Variant to disease association exists, but I don't think there are predicates for those associations yet.

This is a known issue, but it has become timely, as we (ROBOKOP) are currently working with colleagues at ClinGen to bring these kinds of edges into ROBOKOP (and for Translator), so I wanted to get the ball rolling again.

I personally think it would be nice to have two ways to represent these edges:

  • gene to disease/phenotype edges with qualifiers providing more granularity
  • variant to disease/phenotype edges

In the short term though, we are specifically interested in predicates like "is pathogenic for" for variant to disease edges.

An example would be:
CAID:CA115937 is pathogenic for MONDO:0016419
https://erepo.clinicalgenome.org/evrepo/ui/classification/4b6c7f5f-b13d-435d-bba1-0d501ef69489

Predicates like "is likely pathogenic for" or "may be pathogenic for" would also be helpful, representing the various Clinical Validity Classifications in ClinGen. I'm not sure if we'd also want ones for Benign associations, or if negation would be better for those.
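
For illustration, the example edge above could be serialized roughly as follows. This is a minimal sketch, not an agreed model: the predicate CURIEs `biolink:is_pathogenic_for` and `biolink:is_likely_pathogenic_for` are hypothetical (they are what this issue is asking for), and the `make_edge` helper and the `source_record` key are made up for the example.

```python
# Hypothetical predicate CURIEs: these are the predicates this issue requests,
# not predicates that are known to exist in Biolink.
PROPOSED_PREDICATES = {
    "Pathogenic": "biolink:is_pathogenic_for",
    "Likely Pathogenic": "biolink:is_likely_pathogenic_for",
}


def make_edge(variant_id, disease_id, classification, source_record):
    """Build a KGX-style edge dict for one ClinGen variant classification."""
    if classification not in PROPOSED_PREDICATES:
        raise ValueError(f"no proposed predicate for {classification!r}")
    return {
        "subject": variant_id,
        "predicate": PROPOSED_PREDICATES[classification],
        "object": disease_id,
        "category": "biolink:VariantToDiseaseAssociation",
        # 'source_record' is an illustrative key, not a Biolink slot.
        "source_record": source_record,
    }


edge = make_edge(
    "CAID:CA115937",
    "MONDO:0016419",
    "Pathogenic",
    "https://erepo.clinicalgenome.org/evrepo/ui/classification/"
    "4b6c7f5f-b13d-435d-bba1-0d501ef69489",
)
```

Benign classifications are deliberately left out of the lookup, since whether they get their own predicate or a negation is still an open question here.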

@bpow @Vibhorgupta31

@bpow

bpow commented Jan 14, 2025

For additional context, the VariantToDiseaseAssociation class has, in its documentation page, an example predicate of "is pathogenic for", so at least someone else at some point thought that it was a good idea.

For the specific nomenclature of variant-to-disease/phenotype, I think that "pathogenic" is good terminology to use. It's the long-standing terminology that multiple professional groups (American College of Medical Genetics and Genomics, Association for Molecular Pathology, ClinVar, ClinGen, etc.) use. We could potentially also have an expressly negated predicate is_benign_for to indicate that, not only is there not sufficient evidence for pathogenicity, but there is expressly evidence to refute pathogenicity. Alternatively, we could potentially address this with appropriate qualifiers. Current ACMG/AMP recommendations have a 5-valued set of categories (Benign, Likely Benign, Variant of Uncertain Significance, Likely Pathogenic, Pathogenic), but if that's too many additional predicates then we could probably address that sort of thing in qualifiers (but I'd be interested in suggested ways to map this specific terminology, which has a rather formal domain-specific definition, to more-general qualifiers).

For gene-to-disease / gene-to-phenotype edges, I'd advocate for being careful about how we represent the predicates. "Gene is associated with disease" is often a true statement, but I often hear people colloquially say that a "gene causes a disease" or a "gene causes a phenotype", where in general, it is rather a variant allele of a gene which, through inactivation, reduced/increased activity, or novel effect can be said to be the cause of a disease or phenotype.

@EvanDietzMorris
Collaborator Author

@sierra-moxon @mbrush I know formal modeling efforts through Translator are in flux right now, but is there any way we could fast track this issue? Adding "is pathogenic for" and "is benign for" would really help move these sources along in the short term, even if we end up changing them later. Is there anything we can do to help? Would a PR be appropriate, or could we try to schedule a meeting with other folks that may be interested in these predicates?

@Vibhorgupta31 was also asking if there is anything like the biolink help desk we could attend, if that would help.

@sierra-moxon
Member

@kevinschaper @AO33 - can you please comment on Monarch's ingest of ClinGen pathogenic, benign edges like this one:

CAID:CA115937 is pathogenic for MONDO:0016419
https://erepo.clinicalgenome.org/evrepo/ui/classification/4b6c7f5f-b13d-435d-bba1-0d501ef69489

@AO33

AO33 commented Mar 11, 2025

@sierra-moxon - I believe that particular variant should be included in our graph in the following way...

SequenceVariant (should be represented as a single node)
VariantToDiseaseAssociation (1 edge here between the variant and that mondo id (MONDO:0016419))
VariantToGeneAssociation (1 edge here between the variant and the ATM gene)

We only include "Pathogenic" or "likely pathogenic" variants.
Because the expert review panel overwrote that variant to Pathogenic, then as long as that new annotation made it into our ingest file it should be represented... Hopefully this answers your question

@sierra-moxon
Member

Thanks @AO33 - this is great! @EvanDietzMorris - do you have some use cases you could enumerate for bringing in "benign" edges?

@AO33 Is this an opportunity to distribute the modular KGX files that result from this ingest so that they can be reused by RENCI?

@sierra-moxon
Member

sierra-moxon commented Mar 11, 2025

@AO33 @kevinschaper - oh also, would you be able to summarize the reasoning for Monarch to bring in "only include "Pathogenic" or "likely pathogenic" variants"? (my memory is that it was a two-fold decision: first, timing/priorities related, second, that "likely benign" had fewer, less clear use cases w/re to answering user queries...but that feels pretty vague in my memory).

@kevinschaper
Collaborator

I would very much love to share this! We're still relying too much on filtering early, so we could definitely be producing output for the benign edges as well, and just not including them downstream in our KG. I think from the POV of our existing ingest code, we'd probably do something like producing a single kgx with all of the output, and then maybe split that output so that there's a kgx file for each predicate, and then we'd bring in two of those files to our KG.
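
The "one kgx for everything, then one file per predicate" idea can be sketched in a few lines. This is only an illustration under assumptions: it treats the output as a tab-separated edge table with a `predicate` column, the function names are invented here, and the `EX:` identifiers and `example:benign_placeholder` predicate are placeholders rather than real data or real Biolink terms.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["subject", "predicate", "object"]

# Placeholder rows; identifiers and the benign predicate are illustrative only.
SAMPLE_TSV = "\n".join([
    "subject\tpredicate\tobject",
    "EX:variant1\tbiolink:causes\tEX:disease1",
    "EX:variant2\tbiolink:causes\tEX:disease2",
    "EX:variant3\texample:benign_placeholder\tEX:disease1",
]) + "\n"


def split_edges_by_predicate(tsv_text):
    """Group the rows of an edge TSV by the value of their 'predicate' column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["predicate"]].append(row)
    return dict(groups)


def to_tsv(rows, fieldnames=FIELDS):
    """Serialize rows back to TSV text, one output per predicate group."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


groups = split_edges_by_predicate(SAMPLE_TSV)
per_predicate_files = {pred: to_tsv(rows) for pred, rows in groups.items()}
```

A downstream KG build could then pick up only the per-predicate outputs it wants, while the others remain available for reuse.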

Another interesting difference is that we used ClinVar:3029 as the subject rather than CAID:CA115937

@kevinschaper
Collaborator

I realized that it wasn't linked, here is our ClinGen ingest (/parser!): https://github.com/monarch-initiative/clingen-ingest

Oh! and we are using CAID records, we just aren't using the CAID prefix 🤦 I'm going to do a very quick PR to fix that.

@AO33

AO33 commented Mar 11, 2025

@sierra-moxon - Your understanding is correct in why we chose only 'Pathogenic' and 'likely pathogenic'.
At the time of our decisions, we seemed to be most interested in only the variants that actually cause disease and this produced a very clean set of information combined across clinvar and clingen.

@kevinschaper - Regarding reporting the ClinVarID.... Corey designed the ingest to use clinvarID if available, at least in part, because of potential overlap with clinvar. So we wouldn't add the same node twice, just with a different identifier. (There's probably another reason too). But then if a clinvarID is not available, then a fallback solution is used which I think is the clingen registry ID (or something like this)
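
The fallback described above amounts to something like the following sketch. It is not the clingen-ingest code: the function name and arguments are hypothetical, and the `ClinVar:` / `CAID:` prefixes simply follow the identifiers quoted earlier in this thread.

```python
def choose_variant_id(clinvar_id=None, caid=None):
    """Prefer a ClinVar identifier; fall back to the ClinGen Allele Registry ID."""
    if clinvar_id:
        return f"ClinVar:{clinvar_id}"
    if caid:
        return f"CAID:{caid}"
    raise ValueError("record has neither a ClinVar ID nor a CAID")


# Same variant, with and without a ClinVar ID available.
with_clinvar = choose_variant_id(clinvar_id="3029", caid="CA115937")
without_clinvar = choose_variant_id(caid="CA115937")
```

Note that the two calls return different node identifiers for the same allele, so graphs built with different preferences need a normalization step before they can be merged.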

@sierra-moxon
Member

Thank you @AO33 and @kevinschaper! I took a look at the ingest you linked Kevin, and I was quickly able to figure out what you had done, which files you used, and the biolink mappings, etc. - Awesome!

I noticed here: https://github.com/monarch-initiative/clingen-ingest/blob/b24ce99b5ccc6a95179747578c1a2faadac7c0f0/src/clingen_ingest/transform.py#L20 that you are using two predicates for this transform:
genetically_associated_with
and associated with increased likelihood

@EvanDietzMorris

@sierra-moxon
Member

sierra-moxon commented Mar 12, 2025

These are fairly broad/comprehensive predicates.

If we need/want to be more specifically reflective of ClinGen, then I think we have a few directions to pursue:

  • add the necessary EPC metadata that ClinGen provides as justification for the "short cut" call of "pathogenic" (there is a nice table on that link that Evan provided, but I did not do the next step of figuring out if we can get that EPC data so we can add it as edge attributes or mini-provenance graphs for the edge).
  • add an edge property - something like "ClinGen clinical validity classification" (this is what ClinGen calls it) to store pathogenic/benign/likely.../etc.
  • we store this information in statement qualifier (or a more specific subslot that we create) - this is analogous to the 'causal mechanism qualifier' that we use for chemical->gene statements (but we should not literally use "causal mechanism qualifier") -- maybe something like 'clinical significance qualifier' (this is not a good name, primarily because I don't know how I would define this in a general way to be used for any other kind of data).

But first, we need to understand the use cases around bringing in the records that are not pathogenic or likely pathogenic.
And, we need to understand what kind of provenance we can get from ClinGen about these classifications.

If you are ok with re-using this ingest Evan, then I will leave this investigation for later.

@EvanDietzMorris
Collaborator Author

Wow, thanks a lot everyone for moving this forward so quickly.

Predicate wise, I think genetically_associated_with and associated_with_increased_likelihood aren't perfect for these specific relationships, because genetically associated with is too vague (at least without qualifiers) but more importantly the descriptions of both indicate they are referring to statistical associations, whereas relationships with the designation of Pathogenic or Benign by ClinGen are highly curated and usually based on several different kinds of evidence (I think?). I'll defer to Bradford Powell (@bpow) on the science and provenance we could pull in - he is a member/collaborator with ClinGen. If we are opposed to minting new predicates, qualifiers could work and @sierra-moxon's suggestions make sense to me.

Regarding identifiers, in ORION we normalize all sequence variants using the Clingen Allele Registry. Any dbsnp, clinvar, hgvs etc identifiers get normalized to canonical allele identifiers (CAID).

As far as re-using the monarch ingest, unfortunately we've already potentially duplicated effort. Working with Bradford, Vibhor Gupta (@Vibhorgupta31) has already implemented an ingest in ORION for ClinGen Variant Pathogenicity. I am unsure of how much it actually overlaps with what you have already. This would be a good use case for comparing and consolidating them, but we are only trying to decide on predicates before Vibhor's is ready to use, so it'd be nice to go ahead and finish it.

@sierra-moxon
Member

sierra-moxon commented Mar 14, 2025

@EvanDietzMorris - thanks! you bring up lots of good points here.

It would be great to understand the provenance aspect a bit more @bpow. It's wonderful that you can fast-track our landscape analysis of the resource. :)

I know at Translator, we pushed the "associated with" hierarchy specifically for statistical associations. Still, in the definitions of these predicates in the model, the associated_with hierarchy goes back and forth between being specifically statistical and "typically" statistical, e.g., Expresses a relationship between two named things where the relationship is typically generated statistically (though not in all cases), and is weaker than its child, 'correlated with,' but stronger than its parent, 'related to'. This is on the modeling team to fix, and I will. It will take some discussion. "allele frequency" as evidence for the assertion of causal pathogenicity may walk a bit of a fine line with these existing definitions.

Capturing the correct knowledge level and agent type for this could be the better place to store the fact that this is a highly curated assignment -- though I can see the complication here: if the edge is statistical, can it ever be the agent_type equivalent of "curated" and vice versa? (The answer is probably yes, and this doc might help answer my question for me -- noting from the reference doc in TRAPI on implementing "agent type": "if a human curator concludes that a particular gene variant causes a medical condition - based on their interpretation of information produced by computational modeling tools, automated statistical analyses, and robotic laboratory assay systems - the agent type for this statement is "manual_agent" despite all of the evidence being created by automated agents".)

@EvanDietzMorris—Regarding evidence, is your team comfortable with a "stronger" predicate, something like "causes" for cases where the variant "is pathogenic for" the disease? If I remember correctly, Monarch was more comfortable with a conservative approach in ingesting these, in particular with the "likely" keyword in many assertions. e.g., we could use "causes" for "pathogenic" and "associated with increased likelihood" for "likely pathogenic"...

I took a look at the download file for variant-disease edges and they do provide "applied evidence codes (Met)", and "applied evidence codes (not Met)" columns with values such as: "PM3, PP4_Moderate, PM2, PS3" and "PVS1"
And I got to the API documentation which showed me all about its SEPIO connections: https://brl-bcm.stoplight.io/docs/erepo/8bf791e983634-get-full-sepio-interpretation ... and defines the relationship in this way: "Variant information and condition/syndrome info that the variant is known to be associated with as well as Outcome that conveys whether a variant is pathogenic, likely pathogenic, benign, etc for a given genetic condition."
Outcome happens to be another modeling paradigm in Biolink that we need to examine if we do go forward with a change here.
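
As a rough illustration of what the "applied evidence codes" columns contain, the sketch below splits a cell such as "PM3, PP4_Moderate, PM2, PS3" into individual codes. The token format (`<code>` with an optional `_<strength>` suffix) is inferred only from the values quoted above, the function and class names are hypothetical, and the default strengths follow the usual ACMG/AMP reading of the code prefixes (PVS, PS, PM, PP, BA, BS, BP).

```python
import re
from typing import NamedTuple

# Default strength implied by each ACMG/AMP code prefix.
DEFAULT_STRENGTH = {
    "PVS": "Very Strong",
    "PS": "Strong",
    "PM": "Moderate",
    "PP": "Supporting",
    "BA": "Stand Alone",
    "BS": "Strong",
    "BP": "Supporting",
}

# Assumed token shape, e.g. "PS3" or "PP4_Moderate".
CODE_PATTERN = re.compile(r"^(PVS|PS|PM|PP|BA|BS|BP)(\d+)(?:_(.+))?$")


class EvidenceCode(NamedTuple):
    code: str       # e.g. "PP4"
    direction: str  # "pathogenic" or "benign", taken from the prefix letter
    strength: str   # explicit suffix if present, otherwise the prefix default


def parse_evidence_codes(cell):
    """Parse a comma-separated evidence-code cell into EvidenceCode tuples."""
    parsed = []
    for token in cell.split(","):
        token = token.strip()
        if not token:
            continue
        match = CODE_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognized evidence code: {token!r}")
        prefix, number, modifier = match.groups()
        direction = "pathogenic" if prefix.startswith("P") else "benign"
        strength = modifier or DEFAULT_STRENGTH[prefix]
        parsed.append(EvidenceCode(prefix + number, direction, strength))
    return parsed


met = parse_evidence_codes("PM3, PP4_Moderate, PM2, PS3")
not_met = parse_evidence_codes("PVS1")
```

Something like this would only be a convenience for attaching the codes as edge attributes; it is not a proposal to model the codes themselves in Biolink.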

We need to reconcile the notion of duplicative ingests. I understand the timing here, though. Can we commit to being consistent in the two ingests in the short term? We can always refactor.

@sierra-moxon
Member

Exploring the SEPIO return: this is a very rich evidence graph, I love it. (Attached image: fullsepiointerpretation)

Does the ask for new predicates, then, come down to an understanding of "level of certainty"?

Do we want predicates to hold the disambiguation of "certainty level", or do we want to capture evidence/certainty more granularly, as ClinGen does?

This particular bit, probably warrants a discussion.

@EvanDietzMorris
Collaborator Author

Good point about knowledge level and agent type, I agree that manual agent seems right here, and that it should help clarify the nature of the knowledge source.

It's hard not to wade into some pretty philosophical waters here regarding the nature of the evidence. I think that's why I leaned towards deferring to the nomenclature used by ClinGen, specifically pathogenic for. I haven't looked at it much, but it looks like the usage of SEPIO does a lot to address these issues. Maybe the next step should be to set up a meeting with Bradford and Vibhor and anyone else interested? Seems like there's a lot to discuss here.

I definitely agree about being consistent with our ingests. FWIW I think the parser Vibhor wrote also only includes pathogenic relationships and not benign ones, but it seems smart to consider all of the possibilities while we're looking at the topic.

@bpow

bpow commented Apr 10, 2025

Re-starting these discussions-- I apologize for the delays on responding, @Vibhorgupta31 (who has also been working on this) and I were traveling at opposite times, but are both back and had a chance to talk some with @EvanDietzMorris yesterday, so this is partially a summary of those discussions.

  • I agree that the duplication of effort is unfortunate, but as we probably all realize, transforming the data is often the easy part relative to matching things to appropriate predicates/qualifiers/etc. And we're still at the right place to get that right :)
  • I advocate for recording and communicating benign and likely benign variants. Knowing that a variant has been evaluated and determined to be benign is very important information clinically.
  • Regarding the types of evidence used to decide pathogenicity, some of it could be considered as statistical association (e.g., variant occurrence in affected individuals and tracking with phenotype in a family, but not present in controls), but other data are more related to function/mechanism. That's basically the reason for all of the PVS1/PM2/PS3, etc. codes: they indicate different types of evidence and their different levels of strength. Those codes are part of the ACMG/AMP classification guidelines that have been used directly or indirectly by just about all clinical labs and many research labs since about 2013. We should be careful about tying any structure too closely to those codes, though, since those guidelines are under revision.
  • There is a lot more underlying evidence and provenance collected by ClinGen than is available in the tabular downloads. We've been looking into transforms using the json-ld representation (which unfortunately complicates acquisition and versioning of the upstream data).
  • As a clinical geneticist, the use of "is associated with" feels odd to me here because when I hear "association", I think about things like GWAS which can only really establish association but not causation. In the case of variants in genes related to monogenic conditions evaluated under the ACMG/AMP criteria, a pathogenic or likely pathogenic variant in a gene associated with a dominant condition (or combination of P/LP variants in trans in a gene associated with a recessive condition) may be clinically considered a "molecular diagnosis" of that condition.
  • I'd be extremely cautious about the predicates currently selected in the clingen-ingest transform, and I think this bears more discussion (flagging @kevinschaper to make sure this is noticed). In short, I think the current predicates used there conflate confidence in the assessment of causality/pathogenicity with the likelihood that a particular genetic variant will lead to a specific phenotypic feature.

Right now, you're using:

  • Pathogenic => biolink:causes
  • Likely Pathogenic => biolink:associated_with_increased_likelihood
  • Variant of Uncertain Significance => biolink:genetically_associated_with

From the original framing of the benign / likely benign / variant of uncertain significance / likely pathogenic / pathogenic spectrum, these were envisioned as corresponding approximately to <1% chance of being pathogenic / 1-<10% / 10-<90% / 90-<99% / >= 99% assessment of being a pathogenic variant. That's different from genetic penetrance, which expresses a conditional probability (the chance that someone might develop colon cancer given a molecular diagnosis of MSH2-related Lynch syndrome (MONDO:0007356)). A pathogenic variant in MSH2 causes MONDO:0007356, and is associated_with_increased_likelihood of colon cancer (and endometrial/stomach/ovarian cancers).
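
To make the distinction concrete, the five tiers can be read as bands on the probability that the variant is pathogenic, which is a statement about classification confidence and says nothing about penetrance. The sketch below encodes only the approximate bands listed above; the function name is hypothetical and this is not how ClinGen or ACMG/AMP actually assign classifications.

```python
def tier_for_probability(p_pathogenic):
    """Map an approximate probability of pathogenicity to the five-tier label."""
    if not 0.0 <= p_pathogenic <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_pathogenic < 0.01:
        return "Benign"
    if p_pathogenic < 0.10:
        return "Likely Benign"
    if p_pathogenic < 0.90:
        return "Variant of Uncertain Significance"
    if p_pathogenic < 0.99:
        return "Likely Pathogenic"
    return "Pathogenic"


tiers = [tier_for_probability(p) for p in (0.005, 0.05, 0.5, 0.95, 0.995)]
```

Penetrance would need a separate number attached to a different edge (variant or diagnosis to phenotype), which is the conflation being pointed out here.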

Again, to me associated_with_increased_likelihood is best used for things like the APOE ε4 allele having an elevated likelihood ratio for Alzheimer disease.

For the VUS, I really don't like the use of "genetically_associated_with". These are the (unfortunately not-infrequent) variants that cannot be clearly classified as benign or pathogenic. In some genes, VUS are a few percent or more of variants... Clinically we usually consider these to be negative from the standpoint of medical management, in part based on historical data among VUS that are later able to be reclassified, with ~90% eventually being reclassified as negative. Since we consider these negative clinically, I'd rather they have the predicate negated (although with less "certainty" than the negation of a benign or likely benign variant).

All this being said, "associated_with" is an accurate, if incomplete statement for the P/LP variants. I'd like to further explore the use of predicate qualifiers, as I am less familiar with their use and how they could be standardized.

@mbrush
Collaborator

mbrush commented Apr 11, 2025

Hi all. Finally had time to make it through the entire thread here . . . .where you all re-created years of ClinGen / GA4GH modeling conversations in the matter of a few weeks! In the Variant Annotation GA4GH work, I have been working with Larry Babb and others from ClinGen to define a model to represent Pathogenicity Classification data from sources like ClinVar and ClinGen. Our initial model is targeted at the simpler and more variable ClinVar data, and we plan to use an expanded version of the same model to represent richer ClinGen data in the near future.

The GA4GH models are all based on the approach used by SEPIO and Biolink - where the semantics of the claims put forth are explicitly structured using SPOQ fields (subject = a variant, object = a disease/condition, predicate = the relation asserted to hold between them, and qualifier(s) report additional detail/context). Additional fields can be used to report the strength/confidence and direction of the claim (see below).

I am happy to discuss recommendations for representing the variant-disease data from sources like ClinVar and ClinGen, based on our GA4GH work. I'll share some general thoughts below, but suggest a call at some point to get into the details.

  • For pathogenic / likely pathogenic variants, we use a causalFor predicate , as experts told us that at the end of the day, 'pathogenic' just means "causal for" a mendelian disease. We felt there was no need to create a pathogenicFor predicate that is distinguished from causalFor only in its narrower domain and range.
  • We capture confidence concepts (e.g. 'likely') using a separate strength field that reports how confident / how much evidence exists for the stated claim. I feel strongly that confidence level info should not be baked into predicates. This can be captured in an EPC edge property.
  • For benign / likely benign variants, we use the same predicate (causalFor) with negation - specifically we assign a direction of "disputes" to the statement. This was done to avoid creating negated predicates like notCausalFor (aka benignFor), as there was no perceived benefit to doing this. However, in a graph context if it is useful to be able to traverse edges representing this idea, we might consider it. Confidence for these benign classifications also uses the strength field.
  • For variants of uncertain significance, we use the direction field in the Statement as well - which is set to "none" in this case, to indicate that the possible pathogenicity of the variant is neither supported nor disputed.
  • The GA4GH VA Statement model also includes a classification field - which is used to capture the final classification of the variant in familiar terms where it is ok to combine confidence and direction with the asserted relation/conclusion in one value. This is where we capture guideline-based values like likely_pathogenic, benign, or uncertain significance, which consumers in the community of practice are familiar with, and wanted to see somewhere in the data. In this way, we get an explicitly structured representation of what the Statement is reporting in the SPOQ-DS fields, and a more concise, user-friendly representation of the final outcome in the classification field. Something similar could be done with your data if useful.

Based on the above, it may make sense for Automat/Monarch to similarly use a single predicate for ClinGen/ClinVar data (causalFor), and a direction field to indicate if causality is "supported", "disputed" or neither ("none") in the case of a VUS. I agree with Bradford that genetically_associated_with is not appropriate for VUSs - but feel like there may not be a strong enough use case to warrant inclusion of these statements in Monarch anyway.
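
The pattern described above (one predicate, with direction and strength carried separately and the familiar label kept in a classification field) might look like the following in code. This is a sketch under assumptions, not the GA4GH or Biolink schema: the `VariantStatement` class and `to_statement` helper are hypothetical, "supports" is assumed as the counterpart of "disputes", and the strength values "definitive" and "likely" are invented placeholders, since no enumeration has been agreed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantStatement:
    subject: str             # variant identifier
    predicate: str           # always "causalFor" in this pattern
    object: str              # disease/condition identifier
    direction: str           # "supports", "disputes", or "none"
    strength: Optional[str]  # placeholder vocabulary; None when not applicable
    classification: str      # guideline-based label kept for consumers


# classification -> (direction, strength); strength terms are placeholders.
_DIRECTION_AND_STRENGTH = {
    "Pathogenic": ("supports", "definitive"),
    "Likely Pathogenic": ("supports", "likely"),
    "Variant of Uncertain Significance": ("none", None),
    "Likely Benign": ("disputes", "likely"),
    "Benign": ("disputes", "definitive"),
}


def to_statement(variant_id, disease_id, classification):
    """Express a five-tier classification as a single-predicate statement."""
    direction, strength = _DIRECTION_AND_STRENGTH[classification]
    return VariantStatement(
        subject=variant_id,
        predicate="causalFor",
        object=disease_id,
        direction=direction,
        strength=strength,
        classification=classification,
    )


pathogenic = to_statement("CAID:CA115937", "MONDO:0016419", "Pathogenic")
```

With this shape, a consumer interested only in disease-causing variants filters on `direction == "supports"` rather than on a family of predicates.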

I'll save thoughts on representing evidence/provenance for later - as this topic seemed secondary here. Similarly for gene-disease associations (where I concur with Bradford that we do not want to use predicates that imply genes 'cause' disease).

I also agree that work is needed to clarify associated_with predicates - for the reasons outlined above. Similarly for the has_phenotype predicate - which I do not like as it is overloaded and imprecise (but it is so entrenched in Monarch resources that it may be too hard to eradicate).

Finally, we can also have a conversation about how to represent the variant subjects of these statements - GA4GH and its VRS specification has done a lot of work in this space, and has approaches and tools that we might consider using.

Ok, that is all for now. Let's continue some conversation here, and move to have a call if/when we think it is needed/useful.


As an aside, in case anyone looks at the GA4GH models, note that they complicate things a bit by using the notion of a 'Proposition' to encapsulate the SPOQ semantics of possible facts that 'Statements' can assess and/or assert to be true. This pattern works well in traditional JSON structures/use cases - but doesn't translate as naturally to graph formalisms. That said, SEPIO does support patterns that collapse the Proposition SPOQ back into the Statement, making it more amenable to representing in graphs. This is the type of model I presume will be useful for Automat and Monarch given their basis in Biolink and graph-based representations.

@Vibhorgupta31

Hey all,

Dr. Powell and I went through the comments, and one key question that came up is regarding the strength field. While we agree that strength should be part of the edge, do we have specific values in mind for it? From the current description, it seems to encompass both quantitative and qualitative aspects—any guidance or clarification here would be really helpful.

Another point of concern is from an implementation perspective: if someone from the BioLink team could provide insight on the expected timeline for getting this up and running, that would be greatly appreciated.

Would it make sense to set up a quick meeting to go over these points? I'm happy to help coordinate.

Thanks

@mbrush
Collaborator

mbrush commented Apr 24, 2025

re: the comment about strength "encompassing qualitative and quantitative aspects" - in the GA4GH VA Spec model strength is meant to capture a qualitative term, but the spec has not yet recommended a specific enumeration of values here. This is left for implementers to define and coordinate amongst themselves. Future versions may recommend or require specific values here. Note that the GA4GH spec does provide a score field that is the quantitative version of strength.

@bpow

bpow commented Apr 28, 2025

@mbrush: I noticed in some ClinGen gene-validity data provided by @tnavatar that there are some direction elements with the 3 values being used of Supports, Contradicts, and Inconclusive. Do you have any thoughts on this as a domain of a direction element (as opposed to the supported / disputed / none you discussed above)? Trying to get consistency earlier rather than later might help avoid further revisions down the line.
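
If both vocabularies stay in circulation, an ingest could normalize them at load time. The sketch below assumes one possible alignment (Supports ~ supported, Contradicts ~ disputed, Inconclusive ~ none); that alignment has not been confirmed by anyone in this thread, and the function name and canonical output terms are hypothetical.

```python
# Assumed alignment between the two direction vocabularies; not confirmed.
_DIRECTION_ALIASES = {
    "supports": "supports",
    "supported": "supports",
    "contradicts": "disputes",
    "disputes": "disputes",
    "disputed": "disputes",
    "inconclusive": "none",
    "none": "none",
}


def normalize_direction(value):
    """Map a direction term from either vocabulary onto one canonical term."""
    key = value.strip().lower()
    if key not in _DIRECTION_ALIASES:
        raise ValueError(f"unknown direction term: {value!r}")
    return _DIRECTION_ALIASES[key]


normalized = [normalize_direction(v) for v in ("Supports", "Contradicts", "Inconclusive")]
```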


8 participants