New slot subject_lineage, object_lineage, predicate_lineage #1549

Open
matentzn opened this issue Feb 11, 2025 · 10 comments

@matentzn

Is your feature request related to a problem? Please describe.

I would like to request slots for subject_lineage, object_lineage and predicate_lineage to complement the original_subject (etc.) slots. The idea is to be able to capture the transformations edge data undergoes at every step of the way, from source extraction to the final KG (aggregator or otherwise).

For example, I want to be able to express this:

  1. The original source says Alzheimer
  2. The first integrator grounds it to DOID:123
  3. The second integrator normalises it to MONDO:123
  4. The third integrator normalises it to ICD10:EXX999

What working group (or team) did this request originate from?

Monarch Initiative, Every Cure

Describe the solution you'd like

I don't know yet what this would look like. I can see two obvious approaches: simple lists or complex lists.

Simple example:

subject_lineage: 
  - Alzheimer
  - DOID:123
  - MONDO:123
  - ICD10:EXX999

The advantage would be that we could express this easily in a KGX TSV file like:

Alzheimer|DOID:123|MONDO:123|ICD10:EXX999
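
For comparison, a minimal LinkML sketch of how this flat variant might be declared as a slot (the slot name, description and range here are assumptions for discussion, not a settled proposal):

slots:
  subject_lineage:
    description: >-
      Ordered list of values the subject passed through, from the original
      source value to the identifier used on the current edge.
    multivalued: true
    range: string  # assumption; could be uriorcurie if free-text labels are excluded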

Complex example:

The simple example lacks a lot of detail and, most importantly, extensibility (making it much less future-proof). A more complex solution would look like this:

subject_lineage: 
  - value: Alzheimer
  - value: DOID:123
  - value: MONDO:123
  - value: ICD10:EXX999

The advantage is that this could be modelled with arbitrary depth, like this:

subject_lineage: 
  - value: Alzheimer
    creator_id: infores:ChatGPT
    was_generated_by: some:provActivity
  - value: DOID:123
    creator_id: orcid:123
    was_generated_by: some:provActivity2
  - value: MONDO:123
    creator_id: ncats:nodeNormaliser
  - value: ICD10:EXX999
    creator_id: orcid:123
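
A rough LinkML sketch of this nested variant, assuming a dedicated class per lineage step (the class name LineageStep and its attributes are placeholders for discussion):

classes:
  LineageStep:
    attributes:
      value:
        range: string
      creator_id:
        range: uriorcurie
      was_generated_by:
        range: uriorcurie

slots:
  subject_lineage:
    multivalued: true
    inlined_as_list: true
    range: LineageStep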

Additional information to support this request (optional)

I don't know what the best way to do this is, but I am certain this request would greatly help elevate Biolink to take its role in making AI outputs explainable/transparent. We often wonder how "good" our data is, but if we do not record the normalisation lineage, we lose information every time the edges are integrated into a new context.

Tag relevant members for discussion

@kevinschaper @sierra-moxon @cmungall @cbizon

@mbrush
Collaborator

mbrush commented Feb 11, 2025

Provenance of node normalization is something we’ve talked about needing in the NCATS Translator project as well, but don’t yet have a model for. It would be nice to work on a shared solution here.

@matentzn
Author

Absolutely! Most of the data we use is NCATS KG data anyway. How would you like to proceed?

@mbrush
Collaborator

mbrush commented Feb 11, 2025

@gaurav is the node norm guy for Translator. Might be good for you two to meet and discuss use cases and requirements? Thoughts @cbizon?

@cbizon
Collaborator

cbizon commented Feb 12, 2025

I think it's important. I'm pretty sure we added this to the DOGSLED proposal, but I don't remember the when/where of it. I think this would be a great feature to add to Biolink.

NodeNorm probably doesn't much care, because the interface there is just CURIE in / stuff out, and while it returns e.g. Biolink categories and Biolink prefixes, the output structure is not Biolink itself (maybe a problem?). The caller then smushes that into whatever Biolink message/dataset they are making.

@gaurav
Contributor

gaurav commented Feb 12, 2025

TLDR: I don't think this is relevant to NodeNorm itself, but I think it could be useful to users of NodeNorm, so I agree that it would be useful.

NodeNorm groups identifiers using "glomming", where we obtain ID-ID pairs from different sources and then combine them into a single clique. For example, if source 1 asserts A = B and source 2 asserts B = C, then we will create a combined clique of A, B, C.

We did propose some NodeNorm provenance work for DOGSLED, but I believe that's been pushed past year one. I have, however, sketched out a plan to implement provenance in Babel with minimal extra work. It won't be able to tell you exactly where a particular ID-ID pair came from, but it would tell you all the provenance that went into a particular compendium. So when looking at the A, B, C clique in NodeNorm, you wouldn't be able to tell that sources 1 and 2 were involved in creating this specific clique, only that sources 1, 2, 3, 4 and so on were used to create all the cliques for a particular Biolink type.

So, on the one hand, NodeNorm wouldn't be able to produce something like "DOID:123 = MONDO:123 as per source 1" for an [x]_lineage field. On the other hand, I do like the idea of someone using NodeNorm being able to say "we got A, translated it to B, then NodeNorm normalized it to C, which we transformed to D". So I don't have any problems with putting this into Biolink (and/or SSSOM).
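
For illustration only, a consumer of NodeNorm might then record that A → B → C → D chain in the proposed field roughly like this (the field name and the infores values are made up for the sketch):

subject_lineage:
  - value: A
  - value: B
    creator_id: infores:example-upstream-source   # hypothetical
  - value: C
    creator_id: infores:example-node-normalizer   # hypothetical
  - value: D
    creator_id: infores:example-downstream-etl    # hypothetical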

@matentzn
Author

Thank you @gaurav!

Does anyone have an opinion on the modelling?

@sierra-moxon
Member

@matentzn - are there parallels here that we can leverage from SSSOM?
e.g., should we think of the lineage as a kind of mapping?

@vdancik
Collaborator

vdancik commented Feb 13, 2025

I believe we need a better name for the concepts you call subject_lineage, object_lineage, predicate_lineage, like node_id_provenance and edge_id_provenance.

@cbizon
Collaborator

cbizon commented Feb 13, 2025

One thing I'm not sure how to handle is that often a grounding or normalization will be to a clique, even if it is represented by a clique leader. Is that something important to track?

@matentzn
Author

matentzn commented Feb 14, 2025

> One thing I'm not sure how to handle is that often a grounding or normalization will be to a clique, even if it is represented by a clique leader. Is that something important to track?

I don't think this is needed - all that is needed is the fact that the subject was one thing before normalisation and became another after; whether there are other IDs in the clique doesn't, I think, matter much to the normalisation lineage. Unless you think so?

> I believe we need a better name for the concepts you call subject_lineage, object_lineage, predicate_lineage, like node_id_provenance and edge_id_provenance.

What I am proposing are lineage fields specifically on the subject, object and predicate slots in the edge model - of course this will apply to other fields as well. I thought of lineage because data lineage seems to reflect the idea, but I am not set on this term. Other alternatives could be:

  1. *_provenance
  2. *_history
  3. *_lineage
  4. *_transformation

> @matentzn - are there parallels here that we can leverage from SSSOM? e.g., should we think of the lineage as a kind of mapping?

The idea is good, but the final model would look quite verbose:

subject_history: 
  - subject_label: Alzheimer
    subject_type: rdf:Literal
    predicate_id: skos:exactMatch
    object_id: DOID:123
    creator_id: infores:ChatGPT
  - subject_id: DOID:123
    predicate_id: skos:exactMatch
    object_id: MONDO:123   
    creator_id: orcid:123
  - subject_id: MONDO:123
    predicate_id: skos:exactMatch
    object_id: ICD10:EXX999
    mapping_tool: ncats:nodeNormaliser

Of course I personally like this, but I don't think anyone would ever go to the trouble of implementing this?
