Variant to disease/phenotype predicates #1545
Comments
For additional context, the VariantToDiseaseAssociation class has, in its documentation page, an example predicate of "is pathogenic for", so at least someone else at some point thought that it was a good idea. For the specific nomenclature of variant-to-disease/phenotype, I think that "pathogenic" is good terminology to use. It's the long-standing terminology that multiple professional groups (American College of Medical Genetics and Genomics, Association for Molecular Pathology, ClinVar, ClinGen, etc.) use. We could potentially also have an expressly negated predicate For gene-to-disease / gene-to-phenotype edges, I'd advocate for being careful about how we represent the predicates. "Gene is associated with disease" is often a true statement, but I often hear people colloquially say that a "gene causes a disease" or a "gene causes a phenotype", where in general it is rather a variant allele of a gene which, through inactivation, reduced/increased activity, or novel effect, can be said to be the cause of a disease or phenotype. |
@sierra-moxon @mbrush I know formal modeling efforts through Translator are in flux right now, but is there any way we could fast track this issue? Adding "is pathogenic for" and "is benign for" would really help move these sources along in the short term, even if we end up changing them later. Is there anything we can do to help? Would a PR be appropriate, or could we try to schedule a meeting with other folks that may be interested in these predicates? @Vibhorgupta31 was also asking if there is anything like the biolink help desk we could attend, if that would help. |
@kevinschaper @AO33 - can you please comment on Monarch's ingest of ClinGen pathogenic, benign edges like this one: CAID:CA115937 is pathogenic for MONDO:0016419 |
@sierra-moxon - I believe that particular variant should be included in our graph in the following way... SequenceVariant (should be represented as a single node) We only include "Pathogenic" or "likely pathogenic" variants. |
Thanks @AO33 - this is great! @EvanDietzMorris - do you have some use cases you could enumerate for bringing in "benign" edges? @AO33 Is this an opportunity to distribute the modular KGX files that result from this ingest so that they can be reused by RENCI? |
@AO33 @kevinschaper - oh also, would you be able to summarize the reasoning for Monarch to bring in "only include "Pathogenic" or "likely pathogenic" variants"? (my memory is that it was a two-fold decision: first, timing/priorities related, second, that "likely benign" had fewer, less clear use cases w/re to answering user queries...but that feels pretty vague in my memory). |
I would very much love to share this! We're still relying too much on filtering early, so we could definitely be producing output for the benign edges as well, and just not including them downstream in our KG. I think from the POV of our existing ingest code, we'd probably do something like producing a single kgx with all of the output, and then maybe split that output so that there's a kgx file for each predicate, and then we'd bring in two of those files to our KG. Another interesting difference is that we used ClinVar:3029 as the subject rather than CAID:CA115937 |
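A minimal sketch of the "one KGX file per predicate" idea described above, assuming a KGX-style edges TSV with `subject`, `predicate`, and `object` columns. The function name and the predicate labels `is_pathogenic_for` / `is_benign_for` are hypothetical (they are the predicates proposed in this thread, not existing Biolink predicates), and the second row uses placeholder identifiers.

```python
import csv
import io
from collections import defaultdict


def split_edges_by_predicate(edges_tsv: str) -> dict:
    """Split a KGX-style edges TSV into one TSV string per predicate."""
    reader = csv.DictReader(io.StringIO(edges_tsv), delimiter="\t")
    fieldnames = reader.fieldnames or []

    # Group rows by the value in their "predicate" column.
    rows_by_predicate = defaultdict(list)
    for row in reader:
        rows_by_predicate[row["predicate"]].append(row)

    # Re-serialize each group with the original header.
    outputs = {}
    for predicate, rows in rows_by_predicate.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        outputs[predicate] = buffer.getvalue()
    return outputs


# Hypothetical input: the first row is the example edge from this thread,
# the second row uses placeholder identifiers.
EXAMPLE_EDGES = (
    "subject\tpredicate\tobject\n"
    "CAID:CA115937\tis_pathogenic_for\tMONDO:0016419\n"
    "EXAMPLE:variant2\tis_benign_for\tEXAMPLE:disease2\n"
)

split_edges = split_edges_by_predicate(EXAMPLE_EDGES)
```

A downstream KG build could then pick up only the per-predicate outputs it wants, which keeps the filtering decision out of the ingest itself.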
I realized that it wasn't linked, here is our ClinGen ingest (/parser!): https://github.com/monarch-initiative/clingen-ingest Oh! and we are using CAID records, we just aren't using the CAID prefix 🤦 I'm going to do a very quick PR to fix that. |
@sierra-moxon - Your understanding is correct in why we chose only 'Pathogenic' and 'likely pathogenic'. @kevinschaper - Regarding reporting the ClinVarID.... Corey designed the ingest to use clinvarID if available, at least in part, because of potential overlap with ClinVar. So we wouldn't add the same node twice, just with a different identifier. (There's probably another reason too). But then if a clinvarID is not available, then a fallback solution is used which I think is the ClinGen registry ID (or something like this) |
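The fallback described above can be sketched roughly as follows. This is an illustration only, not the actual clingen-ingest code: the function name and arguments are hypothetical, and the prefixes simply follow the `ClinVar:3029` and `CAID:CA115937` forms used earlier in this thread.

```python
from typing import Optional


def choose_variant_id(clinvar_id: Optional[str], caid: Optional[str]) -> str:
    """Prefer a ClinVar identifier; fall back to the ClinGen allele registry ID."""
    if clinvar_id:
        return f"ClinVar:{clinvar_id}"
    if caid:
        return f"CAID:{caid}"
    raise ValueError("record has neither a ClinVar ID nor a ClinGen registry ID")


preferred = choose_variant_id("3029", "CA115937")
fallback = choose_variant_id(None, "CA115937")
```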
Thank you @AO33 and @kevinschaper! I took a look at the ingest you linked Kevin, and I was quickly able to figure out what you had done, which files you used, and the biolink mappings, etc. - Awesome! I noticed here: https://github.com/monarch-initiative/clingen-ingest/blob/b24ce99b5ccc6a95179747578c1a2faadac7c0f0/src/clingen_ingest/transform.py#L20 that you are using two predicates for this transform: |
These are fairly broad/comprehensive predicates. If we need/want to be more specifically reflective of ClinGen, then I think we have a few directions to pursue:
But first, we need to understand the use cases around bringing in the records that are not pathogenic or likely pathogenic. If you are ok with re-using this ingest Evan, then I will leave this investigation for later. |
Wow, thanks a lot everyone for moving this forward so quickly. Predicate-wise, I think genetically_associated_with and associated_with_increased_likelihood aren't perfect for these specific relationships, because genetically associated with is too vague (at least without qualifiers) but more importantly the descriptions of both indicate they are referring to statistical associations, whereas relationships with the designation of Pathogenic or Benign by ClinGen are highly curated and usually based on several different kinds of evidence (I think?). I'll defer to Bradford Powell (@bpow) on the science and provenance we could pull in - he is a member/collaborator with ClinGen. If we are opposed to minting new predicates, qualifiers could work and @sierra-moxon's suggestions make sense to me. Regarding identifiers, in ORION we normalize all sequence variants using the ClinGen Allele Registry. Any dbSNP, ClinVar, HGVS, etc. identifiers get normalized to canonical allele identifiers (CAID). As far as re-using the Monarch ingest, unfortunately we've already potentially duplicated effort. Working with Bradford, Vibhor Gupta (@Vibhorgupta31) has already implemented an ingest in ORION for ClinGen Variant Pathogenicity. I am unsure of how much it actually overlaps with what you have already. This would be a good use case for comparing and consolidating them, but we are only trying to decide on predicates before Vibhor's is ready to use, so it'd be nice to go ahead and finish it. |
@EvanDietzMorris - thanks! you bring up lots of good points here. It would be great to understand the provenance aspect a bit more @bpow. It's wonderful that you can fast-track our landscape analysis of the resource. :) I know at Translator, we pushed the "associated with" hierarchy specifically for statistical associations. Still, in the definitions of these predicates in the model, the Capturing the correct knowledge level and agent type for this could be the better place to store the fact that this is a highly curated assignment -- though I can see the complication here: if the edge is statistical, can it ever be the agent_type equivalent of "curated" and vice versa? (the answer is probably yes, and this doc might help answer my question for me -- noting from the reference doc in TRAPI on implementing "agent type": @EvanDietzMorris—Regarding evidence, is your team comfortable with a "stronger" predicate, something like "causes" for cases where the variant "is pathogenic for" the disease? If I remember correctly, Monarch was more comfortable with a conservative approach in ingesting these, in particular with the "likely" keyword in many assertions. e.g., we could use "causes" for "pathogenic" and "associated with increased likelihood" for "likely pathogenic"... I took a look at the download file for variant-disease edges and they do provide "applied evidence codes (Met)", and "applied evidence codes (not Met)" columns with values such as: "PM3, PP4_Moderate, PM2, PS3" and "PVS1" We need to reconcile the notion of duplicative ingests. I understand the timing here, though. Can we commit to being consistent in the two ingests in the short term? We can always refactor. |
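As a rough illustration of what could be pulled out of the "applied evidence codes" columns mentioned above, here is a sketch that parses a cell such as "PM3, PP4_Moderate, PM2, PS3". The default strengths follow the ACMG/AMP code prefixes (PVS = very strong, PS = strong, PM = moderate, PP = supporting, and the benign counterparts BA/BS/BP). Treating a suffix like `_Moderate` as a strength override is an assumption based on the example values, and the function and field names are hypothetical.

```python
import re

# Default strength implied by each ACMG/AMP code prefix.
DEFAULT_STRENGTH = {
    "PVS": "very_strong",
    "PS": "strong",
    "PM": "moderate",
    "PP": "supporting",
    "BA": "stand_alone",
    "BS": "strong",
    "BP": "supporting",
}

# Prefix, criterion number, and an optional "_Modifier" suffix.
CODE_PATTERN = re.compile(r"^(PVS|PS|PM|PP|BA|BS|BP)(\d+)(?:_(\w+))?$")


def parse_evidence_codes(cell: str) -> list:
    """Parse a comma-separated evidence-code cell into structured records."""
    parsed = []
    for token in cell.split(","):
        token = token.strip()
        if not token:
            continue
        match = CODE_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognized evidence code: {token!r}")
        prefix, number, modifier = match.groups()
        parsed.append(
            {
                "code": f"{prefix}{number}",
                # Codes starting with "P" argue for pathogenicity, "B" for benignity.
                "direction": "pathogenic" if prefix.startswith("P") else "benign",
                # Assumption: a suffix overrides the default strength of the prefix.
                "strength": modifier.lower() if modifier else DEFAULT_STRENGTH[prefix],
            }
        )
    return parsed


met_codes = parse_evidence_codes("PM3, PP4_Moderate, PM2, PS3")
```

Keeping the codes as structured edge properties would let either ingest expose the curated evidence without encoding it in the predicate.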
Exploring the SEPIO return: this is a very rich evidence graph, I love it. [image: fullsepiointerpretation] Does the ask for new predicates, then, come down to an understanding of "level of certainty"? Do we want predicates to hold the disambiguation of "certainty level", or do we want to capture evidence/certainty more granularly, as ClinGen does? This particular bit probably warrants a discussion. |
Good point about knowledge level and agent type, I agree that manual agent seems right here, and that it should help clarify the nature of the knowledge source. It's hard not to wade into some pretty philosophical waters here regarding the nature of the evidence. I think that's why I leaned towards deferring to the nomenclature used by ClinGen, specifically pathogenic for. I haven't looked at it much, but it looks like the usage of SEPIO does a lot to address these issues. Maybe the next step should be to set up a meeting with Bradford and Vibhor and anyone else interested? Seems like there's a lot to discuss here. I definitely agree about being consistent with our ingests. FWIW I think the parser Vibhor wrote also only includes pathogenic relationships and not benign ones, but it seems smart to consider all of the possibilities while we're looking at the topic. |
Re-starting these discussions-- I apologize for the delays on responding, @Vibhorgupta31 (who has also been working on this) and I were traveling at opposite times, but are both back and had a chance to talk some with @EvanDietzMorris yesterday, so this is partially a summary of those discussions.
Right now, you're using:
From the original framing of the benign / likely benign / variant of uncertain significance / likely pathogenic / pathogenic spectrum, these were envisioned as corresponding approximately to <1% chance of being pathogenic / 1-<10% / 10-<90% / 90-<99% / >= 99% assessment of being a pathogenic variant. That's different from genetic penetrance, which expresses a conditional probability (chance that someone might develop colon cancer given a molecular diagnosis of MSH2-related Lynch syndrome (MONDO:0007356)). A pathogenic variant in MSH2 Again, for the For the VUS, I really don't like the use of "genetically_associated_with". These are the (unfortunately not-infrequent) variants that cannot be clearly classified as benign or pathogenic. In some genes, VUS are a few percent or more of variants... Clinically we usually consider these to be negative from the standpoint of medical management, in part based on historical data among VUS that are later able to be reclassified, with ~90% eventually being reclassified as negative. Since we consider these negative clinically, I'd rather them have the predicate negated (although with less "certainty" than the negation of a benign or likely benign variant). All this being said, "associated_with" is an accurate, if incomplete, statement for the P/LP variants. I'd like to further explore the use of predicate qualifiers, as I am less familiar with their use and how they could be standardized. |
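To make the approximate probability bands above concrete, here is a small sketch that maps a probability of pathogenicity onto the five tiers. It only illustrates the envisioned correspondence; it is not how ClinGen assigns classifications, and the function name is hypothetical.

```python
def classify_by_probability(p: float) -> str:
    """Map a probability of pathogenicity onto the five-tier spectrum."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p < 0.01:
        return "benign"
    if p < 0.10:
        return "likely benign"
    if p < 0.90:
        return "uncertain significance"
    if p < 0.99:
        return "likely pathogenic"
    return "pathogenic"


tiers = [classify_by_probability(p) for p in (0.005, 0.01, 0.5, 0.9, 0.99)]
```

Note that this is a probability that the classification is correct, which is a separate quantity from penetrance as described above.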
Hi all. Finally had time to make it through the entire thread here ... where you all re-created years of ClinGen / GA4GH modeling conversations in a matter of a few weeks! In the Variant Annotation GA4GH work, I have been working with Larry Babb and others from ClinGen to define a model to represent Pathogenicity Classification data from sources like ClinVar and ClinGen. Our initial model is targeted at the simpler and more variable ClinVar data, and we plan to use an expanded version of the same model to represent richer ClinGen data in the near future. The GA4GH models are all based on the approach used by SEPIO and Biolink - where the semantics of the claims put forth are explicitly structured using SPOQ fields (subject = a variant, object = a disease/condition, predicate = the relation asserted to hold between them, and qualifier(s) report additional detail/context). Additional fields can be used to report the strength/confidence and direction of the claim (see below). I am happy to discuss recommendations for representing the variant-disease data from sources like ClinVar and ClinGen, based on our GA4GH work. I'll share some general thoughts below, but suggest a call at some point to get into the details.
Based on the above, it may make sense for Automat/Monarch to similarly use a single predicate for ClinGen/ClinVar data ( I'll save thoughts on representing evidence/provenance for later - as this topic seemed secondary here. Similarly for gene-disease associations (where I concur with Bradford that we do not want to use predicates that imply genes 'cause' disease). I also agree that work is needed to clarify Finally, we can also have a conversation about how to represent the variant subjects of these statements - GA4GH and its VRS specification have done a lot of work in this space, and have approaches and tools that we might consider using. OK, that is all for now. Let's continue the conversation here, and move to a call if/when we think it is needed/useful. As an aside, in case anyone looks at the GA4GH models, note that they complicate things a bit by using the notion of a 'Proposition' to encapsulate the SPOQ semantics of possible facts that 'Statements' can assess and/or assert to be true. This pattern works well in traditional JSON structures/use cases - but doesn't translate as naturally to graph formalisms. That said, SEPIO does support patterns that collapse the Proposition SPOQ back into the Statement, making it more amenable to representing in graphs. This is the type of model I presume will be useful for Automat and Monarch given their basis in Biolink and graph-based representations. |
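A sketch of the single-predicate pattern described above, with the SPOQ fields collapsed directly onto the statement and separate direction and strength fields. Everything here is illustrative: the class, the predicate label `is_causal_for`, and the direction/strength vocabulary are assumptions made for this example, not values taken from the GA4GH or Biolink models.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical single predicate shared by all five classification tiers.
SINGLE_PREDICATE = "is_causal_for"

# Assumed mapping from classification tier to (direction, strength).
TIER_TO_DIRECTION_STRENGTH = {
    "pathogenic": ("supports", "definitive"),
    "likely pathogenic": ("supports", "likely"),
    "uncertain significance": ("neutral", None),
    "likely benign": ("disputes", "likely"),
    "benign": ("disputes", "definitive"),
}


@dataclass
class VariantDiseaseStatement:
    """SPOQ fields plus the direction and strength of the claim."""

    subject: str
    predicate: str
    object: str
    direction: str
    strength: Optional[str]
    qualifiers: dict = field(default_factory=dict)


def statement_from_classification(
    variant: str, disease: str, classification: str
) -> VariantDiseaseStatement:
    """Build a statement whose predicate is fixed and whose tier sets direction/strength."""
    direction, strength = TIER_TO_DIRECTION_STRENGTH[classification]
    return VariantDiseaseStatement(
        subject=variant,
        predicate=SINGLE_PREDICATE,
        object=disease,
        direction=direction,
        strength=strength,
    )


example_statement = statement_from_classification(
    "CAID:CA115937", "MONDO:0016419", "pathogenic"
)
```

With this shape, benign and pathogenic classifications differ only in direction and strength, so no classification-specific predicates are needed. How a VUS should be represented (neutral, as here, or negated with lower certainty, as suggested earlier in the thread) is left open.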
Hey all, Dr. Powell and I went through the comments, and one key question that came up is regarding the strength field. While we agree that strength should be part of the edge, do we have specific values in mind for it? From the current description, it seems to encompass both quantitative and qualitative aspects—any guidance or clarification here would be really helpful. Another point of concern is from an implementation perspective: if someone from the BioLink team could provide insight on the expected timeline for getting this up and running, that would be greatly appreciated. Would it make sense to set up a quick meeting to go over these points? I'm happy to help coordinate. Thanks |
re: the comment about |
@mbrush: I noticed in some ClinGen gene-validity data provided by @tnavatar that there are some |
The model is generally lacking good ways to represent sequence variant to disease relationships. Gene associated with condition exists, but as far as I can tell there is no way to provide more granularity or specifics. Variant to disease association exists, but I don't think there are predicates for those associations yet.
This is a known issue, but it has become timely, as we (ROBOKOP) are currently working with colleagues at ClinGen to bring these kinds of edges into ROBOKOP (and for Translator), so I wanted to get the ball rolling again.
I personally think it would be nice to have two ways to represent these edges:
In the short term though, we are specifically interested in predicates like "is pathogenic for" for variant to disease edges.
An example would be:
CAID:CA115937 is pathogenic for MONDO:0016419
https://erepo.clinicalgenome.org/evrepo/ui/classification/4b6c7f5f-b13d-435d-bba1-0d501ef69489
Predicates like "is likely pathogenic for" or "may be pathogenic for" would also be helpful, representing the various Clinical Validity Classifications in ClinGen. I'm not sure if we'd also want ones for Benign associations, or if negation would be better for those.
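The two options for benign associations (a dedicated predicate versus negating the pathogenic one) can be sketched as edge records like this. The predicate labels are the ones proposed in this issue and do not exist in the model yet; the function name is hypothetical, and `negated` is used as a plain boolean edge property.

```python
# Proposed (not yet existing) predicates, keyed by classification.
PROPOSED_PREDICATES = {
    "pathogenic": "is_pathogenic_for",
    "likely pathogenic": "is_likely_pathogenic_for",
    "benign": "is_benign_for",
}


def build_edge(
    subject: str, obj: str, classification: str, benign_as_negation: bool = False
) -> dict:
    """Build a variant-to-disease edge for a classification."""
    if classification == "benign" and benign_as_negation:
        # Option 2: reuse the pathogenic predicate and negate the edge.
        return {
            "subject": subject,
            "predicate": PROPOSED_PREDICATES["pathogenic"],
            "object": obj,
            "negated": True,
        }
    # Option 1: one predicate per classification.
    return {
        "subject": subject,
        "predicate": PROPOSED_PREDICATES[classification],
        "object": obj,
        "negated": False,
    }


pathogenic_edge = build_edge("CAID:CA115937", "MONDO:0016419", "pathogenic")
# Placeholder identifiers for the benign case.
benign_edge = build_edge(
    "EXAMPLE:variant", "EXAMPLE:disease", "benign", benign_as_negation=True
)
```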
@bpow @Vibhorgupta31