Skip to content

fix: improve contextual chunk headings #118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 5, 2025
Merged

Conversation

SimonJasansky
Copy link
Contributor

@SimonJasansky SimonJasansky commented Apr 17, 2025

Subject: Fix: Correct Heading Inheritance in Chunking & Add Insertion Tests

Changes made

  • Fixed Bug to prevent chunks with new headings inheriting old headings
  • Added tests to check for this behavior
  • Build changes:
    • Install Clang and use Clang instead of GCC to avoidCMake errors during llama-cpp-python source builds, as encountered in this failing test run.
    • Restrict litellm version to >=1.60.2,<1.67.0 to avoid error with most recent version 1.67.2 encountered in this failing test run.
    • Upgrade fastmcp version requirement to >=2.0.0

Bug Fix: Heading Inheritance

Problem:
The previous logic caused chunks containing a new heading to incorrectly retain the heading from the previous chunk. For example, if a document transitioned from # Section 1 to # Section 2, the chunk containing # Section 2 still has # Section 1 as its heading. This flaw is visible in indices 7, 13, and 17 in the "Behavior Before" image.

Expected Behavior:

When a chunk contains a new heading, the inherited heading context should be updated or cleared appropriately for that chunk. Some cases:

  • # H1 -> # New H1: Old # H1 context is cleared.
  • ## H2 -> ## New H2: Old ## H2 context is cleared (parent # H1 retained).
  • ## H2 -> # New H1: `## H2 is cleared.
  • ### H3 -> # New H1: All old context is cleared.

Solution:
The _create_chunk_records function now explicitly checks if a newly extracted heading differs from the inherited one at the same or higher level. If it differs, the relevant parts of the inherited heading string are removed.

Testing

Added tests to tests/test_insert.py to verify the correct heading assignment during document insertion, ensuring this bug does not reoccur.

Debugging

During CI, various bugs were encountered, leading to the build changes mentioned above. A detailed description and resolution strategies are described in this PDF.

Images

Behavior Before

image

Behavior After

image

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/raglite/_insert.py:42

  • Iterating over each heading part to remove it from the beginning of the chunk body may result in unintended removal when multiple, potentially overlapping, heading parts are present. Consider using a regular expression to match and remove the full sequence of heading parts as a single occurrence to ensure accuracy.
for heading_part in heading_parts:

@SimonJasansky SimonJasansky changed the title Fix: inherit previous header and header in body Fix: inherit previous header when new header arrives Apr 18, 2025
@SimonJasansky SimonJasansky changed the title Fix: inherit previous header when new header arrives Fix: remove previous header when new header arrives Apr 18, 2025
@SimonJasansky SimonJasansky requested review from lsorber, StijnGoossens and Copilot and removed request for StijnGoossens April 18, 2025 13:51
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request fixes the bug where new Markdown chunks were unintentionally inheriting the previous chunk’s heading and adds tests to ensure that the header inheritance is correctly updated. Key changes include:

  • Updating the logic in _insert.py to clear previous headings when a new heading is detected.
  • Adding unit tests in tests/test_insert.py to verify the header correction.
  • Updating type hints in _mcp.py and pinning the llama-cpp-python dependency in pyproject.toml.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/test_insert.py Added comprehensive tests for correct heading assignment
src/raglite/_mcp.py Updated type annotations for FastMCP
src/raglite/_insert.py Modified heading inheritance logic to fix the bug
pyproject.toml Pinned llama-cpp-python to version 0.3.4 to avoid dependency issues

@SimonJasansky SimonJasansky marked this pull request as ready for review April 18, 2025 13:56
@SimonJasansky SimonJasansky changed the title Fix: remove previous header when new header arrives fix: remove previous header when new header arrives Apr 18, 2025
@lsorber lsorber marked this pull request as draft April 28, 2025 09:09
@SimonJasansky SimonJasansky marked this pull request as ready for review April 29, 2025 06:24
Copy link
Member

@lsorber lsorber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @SimonJasansky!

In my review I already addressed a few comments:

  1. The necessary compilers for llama-cpp-python are already present in the image, but needed to be explicitly configured in the Dockerfile. By doing so we can avoid installing clang.
  2. I addressed the bug that newer versions of LiteLLM seem to introduce. In fact, the bug was on our end in the llama-cpp-python chat template.
  3. I removed the upper bounds on the dependencies—in general upper bounds are to be avoided because that limits the intallability of the package. This was only possible after (1) and (2).
  4. I moved the new logic from _insert.py further upstream to _database.py, so that there's more downstream opportunity to benefit from it (i.e., not only in insertion).
  5. I discovered and fixed an issue with the new logic: if the chunk contains headings in the middle of the chunk, that would also reset the contextual headings. But actually, you wouldn't want that. We only want to reset the contextual headings for leading headings in the chunk.

@lsorber lsorber changed the title fix: remove previous header when new header arrives fix: improve contextual chunk headings May 5, 2025
@lsorber lsorber merged commit 024dbf4 into main May 5, 2025
4 checks passed
@lsorber lsorber deleted the fix/inherit-previous-header branch May 5, 2025 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants