-
Notifications
You must be signed in to change notification settings - Fork 84
fix: improve contextual chunk headings #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…plication of header in body
…sts for correct header inheritance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
src/raglite/_insert.py:42
- Iterating over each heading part to remove it from the beginning of the chunk body may result in unintended removal when multiple, potentially overlapping, heading parts are present. Consider using a regular expression to match and remove the full sequence of heading parts as a single occurrence to ensure accuracy.
for heading_part in heading_parts:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request fixes the bug where new Markdown chunks were unintentionally inheriting the previous chunk’s heading and adds tests to ensure that the header inheritance is correctly updated. Key changes include:
- Updating the logic in _insert.py to clear previous headings when a new heading is detected.
- Adding unit tests in tests/test_insert.py to verify the header correction.
- Updating type hints in _mcp.py and pinning the llama-cpp-python dependency in pyproject.toml.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/test_insert.py | Added comprehensive tests for correct heading assignment |
src/raglite/_mcp.py | Updated type annotations for FastMCP |
src/raglite/_insert.py | Modified heading inheritance logic to fix the bug |
pyproject.toml | Pinned llama-cpp-python to version 0.3.4 to avoid dependency issues |
…hon always builds
litellm 1.67.2 leads to the error in this test run: https://github.com/superlinear-ai/raglite/actions/runs/14637706773/job/41072598692?pr=118
…tead of creating new one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @SimonJasansky!
In my review I already addressed a few comments:
- The necessary compilers for llama-cpp-python are already present in the image, but needed to be explicitly configured in the
Dockerfile
. By doing so we can avoid installing clang. - I addressed the bug that newer versions of LiteLLM seem to introduce. In fact, the bug was on our end in the llama-cpp-python chat template.
- I removed the upper bounds on the dependencies—in general upper bounds are to be avoided because that limits the intallability of the package. This was only possible after (1) and (2).
- I moved the new logic from
_insert.py
further upstream to_database.py
, so that there's more downstream opportunity to benefit from it (i.e., not only in insertion). - I discovered and fixed an issue with the new logic: if the chunk contains headings in the middle of the chunk, that would also reset the contextual headings. But actually, you wouldn't want that. We only want to reset the contextual headings for leading headings in the chunk.
Subject: Fix: Correct Heading Inheritance in Chunking & Add Insertion Tests
Changes made
llama-cpp-python
source builds, as encountered in this failing test run.litellm
version to>=1.60.2,<1.67.0
to avoid error with most recent version1.67.2
encountered in this failing test run.>=2.0.0
Bug Fix: Heading Inheritance
Problem:
The previous logic caused chunks containing a new heading to incorrectly retain the heading from the previous chunk. For example, if a document transitioned from
# Section 1
to# Section 2
, the chunk containing# Section 2
still has# Section 1
as its heading. This flaw is visible in indices 7, 13, and 17 in the "Behavior Before" image.Expected Behavior:
When a chunk contains a new heading, the inherited heading context should be updated or cleared appropriately for that chunk. Some cases:
# H1
-># New H1
: Old# H1
context is cleared.## H2
->## New H2
: Old## H2
context is cleared (parent# H1
retained).## H2
-># New H1
: `## H2 is cleared.### H3
-># New H1
: All old context is cleared.Solution:
The
_create_chunk_records
function now explicitly checks if a newly extracted heading differs from the inherited one at the same or higher level. If it differs, the relevant parts of the inherited heading string are removed.Testing
Added tests to
tests/test_insert.py
to verify the correct heading assignment during document insertion, ensuring this bug does not reoccur.Debugging
During CI, various bugs were encountered, leading to the build changes mentioned above. A detailed description and resolution strategies are described in this PDF.
Images
Behavior Before
Behavior After