New slot subject_lineage, object_lineage, predicate_lineage #1549

Open
matentzn opened this issue Feb 11, 2025 · 10 comments

@matentzn

Is your feature request related to a problem? Please describe.

I would like to request slots for subject_lineage, object_lineage and predicate_lineage to complement the original_subject (etc.) slots. The idea is to be able to capture the transformations edge data undergoes at every step of the way, from source extraction to the final KG (aggregator or otherwise).

For example, I want to be able to express this:

  1. The original source says Alzheimer
  2. The first integrator grounds it to DOID:123
  3. The second integrator normalises it to MONDO:123
  4. The third integrator normalises it to ICD10:EXX999

What working group (or team) did this request originate from?

Monarch Initiative, Every Cure

Describe the solution you'd like

I don't know yet what this would look like. I can see two obvious approaches: simple lists or complex lists.

Simple example:

subject_lineage: 
  - Alzheimer
  - DOID:123
  - MONDO:123
  - ICD10:EXX999

The advantage would be that we could express this easily in a KGX TSV file like:

Alzheimer|DOID:123|MONDO:123|ICD10:EXX999
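
For comparison, a minimal LinkML sketch of how this flat variant might be declared as a slot (the slot name, description and range here are assumptions for discussion, not a settled proposal):

slots:
  subject_lineage:
    description: >-
      Ordered list of values the subject passed through, from the original
      source value to the identifier used on the current edge.
    multivalued: true
    range: string  # assumption; could be uriorcurie if free-text labels are excluded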

Complex example:

The simple example lacks a lot of detail and, most importantly, extensibility (making it much less future-proof). A more complex solution would look like this:

subject_lineage: 
  - value: Alzheimer
  - value: DOID:123
  - value: MONDO:123
  - value: ICD10:EXX999

The advantage is that this could be modelled with arbitrary depth, like this:

subject_lineage: 
  - value: Alzheimer
    creator_id: infores:ChatGPT
    was_generated_by: some:provActivity
  - value: DOID:123
    creator_id: orcid:123
    was_generated_by: some:provActivity2
  - value: MONDO:123
    creator_id: ncats:nodeNormaliser
  - value: ICD10:EXX999
    creator_id: orcid:123
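
A rough LinkML sketch of this nested variant, assuming a dedicated class per lineage step (the class name LineageStep and its attributes are placeholders for discussion):

classes:
  LineageStep:
    attributes:
      value:
        range: string
      creator_id:
        range: uriorcurie
      was_generated_by:
        range: uriorcurie

slots:
  subject_lineage:
    multivalued: true
    inlined_as_list: true
    range: LineageStep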

Additional information to support this request (optional)

I don't know what the best way to do this is, but I am certain this request would greatly help elevate Biolink to take its role in making AI outputs explainable/transparent. We often wonder how "good" our data is, but if we do not record the normalisation lineage, we lose information every time the edges are integrated into a new context.

Tag relevant members for discussion

@kevinschaper @sierra-moxon @cmungall @cbizon

@mbrush
Collaborator

mbrush commented Feb 11, 2025

Provenance of node normalization is something we’ve talked about needing in the NCATS Translator project as well, but don’t yet have a model for. It would be nice to work on a shared solution here.

@matentzn
Author

Absolutely! Most of the data we use is NCATS KG data anyway. How would you like to proceed?

@mbrush
Collaborator

mbrush commented Feb 11, 2025

@gaurav is the node norm guy for Translator. Might be good for you two to meet and discuss use cases and requirements? Thoughts @cbizon?

@cbizon
Collaborator

cbizon commented Feb 12, 2025

I think it's important. I'm pretty sure we added this to the DOGSLED proposal, but I don't remember the when/where of it. I think this would be a great feature to add to Biolink.

NodeNorm probably doesn't much care, because the interface there is just CURIE in / stuff out, and while it returns e.g. Biolink categories and Biolink prefixes, the output structure is not Biolink itself (maybe a problem?). The caller then smushes that into whatever Biolink message/dataset they are making.

@gaurav
Contributor

gaurav commented Feb 12, 2025

TLDR: I don't think this is relevant to NodeNorm itself, but I think it could be useful to users of NodeNorm, so I agree that it would be useful.

NodeNorm groups identifiers using "glomming", where we obtain ID-ID pairs from different sources and then combine them into a single clique. For example, if source 1 asserts A = B and source 2 asserts B = C, then we will create a combined clique of A, B, C.

We did propose some NodeNorm provenance work for DOGSLED, but I believe that's been pushed past year one. I have, however, sketched out a plan to implement provenance in Babel with minimal extra work. It won't be able to tell you exactly where a particular ID-ID pair came from, but it would tell you all the provenance that went into a particular compendium. So when looking at the A, B, C clique in NodeNorm, you wouldn't be able to tell that sources 1 and 2 were involved in creating this specific clique, only that sources 1, 2, 3, 4 and so on were used to create all the cliques for a particular Biolink type.

So, on the one hand, NodeNorm wouldn't be able to produce something like "DOID:123 = MONDO:123 as per source 1" for an [x]_lineage field. On the other hand, I do like the idea of someone using NodeNorm being able to say "we got A, translated it to B, then NodeNorm normalized it to C, which we transformed to D". So I don't have any problems with putting this into Biolink (and/or SSSOM).
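
For illustration only, a consumer of NodeNorm might then record that A → B → C → D chain in the proposed field roughly like this (the field name and the infores values are made up for the sketch):

subject_lineage:
  - value: A
  - value: B
    creator_id: infores:example-upstream-source   # hypothetical
  - value: C
    creator_id: infores:example-node-normalizer   # hypothetical
  - value: D
    creator_id: infores:example-downstream-etl    # hypothetical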

@matentzn
Author

Thank you @gaurav!

Does anyone have an opinion on the modelling?

@sierra-moxon
Member

@matentzn - are there parallels here that we can leverage from SSSOM?
e.g., should we think of the lineage as a kind of mapping?

@vdancik
Collaborator

vdancik commented Feb 13, 2025

I believe we need a better name for the concepts you call subject_lineage, object_lineage, predicate_lineage, like node_id_provenance and edge_id_provenance.

@cbizon
Collaborator

cbizon commented Feb 13, 2025

One thing I'm not sure how to handle is that often a grounding or normalization will be to a clique, even if it is represented by a clique leader. Is that something important to track?

@matentzn
Author

matentzn commented Feb 14, 2025

> One thing I'm not sure how to handle is that often a grounding or normalization will be to a clique, even if it is represented by a clique leader. Is that something important to track?

I don't think this is needed - all that is needed is the fact that the subject was one thing before normalisation and became another after; whether there are other IDs in the clique doesn't, I think, matter much to the normalisation lineage. Unless you think so?

> I believe we need a better name for the concepts you call subject_lineage, object_lineage, predicate_lineage, like node_id_provenance and edge_id_provenance.

What I am proposing are lineage fields specifically on the subject, object and predicate slots in the edge model - of course this will apply to other fields as well. I thought of lineage because data lineage seems to reflect the idea, but I am not set on this term. Other alternatives could be:

  1. *_provenance
  2. *_history
  3. *_lineage
  4. *_transformation

> @matentzn - are there parallels here that we can leverage from SSSOM? e.g., should we think of the lineage as a kind of mapping?

The idea is good, but the final model would look quite verbose:

subject_history: 
  - subject_label: Alzheimer
    subject_type: rdf:Literal
    predicate_id: skos:exactMatch
    object_id: DOID:123
    creator_id: infores:ChatGPT
  - subject_id: DOID:123
    predicate_id: skos:exactMatch
    object_id: MONDO:123   
    creator_id: orcid:123
  - subject_id: MONDO:123
    predicate_id: skos:exactMatch
    object_id: ICD10:EXX999
    mapping_tool: ncats:nodeNormaliser

Of course I personally like this, but I don't think anyone would ever go to the trouble of implementing this?
