Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support gds var length label pruning #4868

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ted-wq-x
Copy link
Contributor

@ted-wq-x ted-wq-x commented Feb 7, 2025

Description

Fixes #4396

Contributor agreement

@ray6080 ray6080 requested a review from andyfengHKU February 7, 2025 07:42
auto satisfyForwardPruning = forwardPruningFunc(srcTableID, dstTableID);
if (directionType == RelDirectionType::BOTH) {
if (satisfyForwardPruning ||
(dstTableIDSet.contains(srcTableID) && srcTableIDSet.contains(dstTableID))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have a backwardPruningFunc in order to be consistent.

if (rel.isRecursive()) {
auto nodeTableIDs = getNodeTableIDs();
// there is no label on both sides
if (rel.getUpperBound() == 0 || (srcTableIDSet.size() == dstTableIDSet.size() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check srcTableIDSet.size() == dstTableIDSet.size() && dstTableIDSet.size() == nodeTableIDs.size() necessary? Seems your logic would still work in this case. So we would we return early?

if (currentLength >= upperBound) {
return;
}
if (graph.contains(curTableID)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be KU_ASSERT. Is there a case where graph not contains curTableID?

@@ -102,6 +106,7 @@ class RelExpression final : public NodeOrRelExpression {
recursiveInfo = std::move(recursiveInfo_);
}
const RecursiveInfo* getRecursiveInfo() const { return recursiveInfo.get(); }
RecursiveInfo* getRecursiveInfo() { return recursiveInfo.get(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename -> getRecursiveInfoUnsafe

@@ -45,6 +47,14 @@ PathsOutputWriterInfo RJBindData::getPathWriterInfo() const {
return info;
}

std::vector<table_id_set_t> RJBindData::getStepActiveRelTableIDs() const {
if (flipPath) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good time to rename flipPath to extendLeftToRight. Otherwise reading this part is a bit confusing.

}
}
rel.setEntries(prunedEntries);
return stepActiveTableIDs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to summarize the logic here and pls correct me if I'm wrong.

Consider topology Person - knows -> Person and Person - livesIn -> City and query

MATCH (a:Person) - [*2..2] -> (b:City)

WLOG, we assume going from left to right.

The dfs part will first generate all size 2 path

<<knows>, <knows>>
<<knows>, <livesIn>>

and because <<knows>, <knows>> dst is not b:City it will be pruned.

So eventually we only compute <<knows>, <livesIn>>.

If my understanding is correct, can u also give a benchmark number in the PR description to make sure we actually get performance benefit.


// There may be multiple rel type between src and dst
using PATH = std::vector<table_id_vector_t>;
std::vector<std::vector<PATH>> paths(upperBound + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to comment

std::vector<PATH> are paths of the same length.

And the outer-most std::vector are paths of different length.

In fact, it might be easier to read if just use std::vector<PATH> here.

const table_id_set_t srcTableIDSet, const table_id_set_t dstTableIDSet, size_t lowerBound,
size_t upperBound, RelDirectionType relDirectionType) const {
// src-->[dst,[rels]]
std::unordered_map<table_id_t, std::unordered_map<table_id_t, table_id_vector_t>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to use but I would use std::unrdered_map<table_id_t, STRUCT>

and STRUCT contains

table_id_t dstTableID
std::vector<tabel_id_t> relTableIDs

because we always access by srcTableID instead of src&dstTableID

// todo we need reset rel entries?
auto temp = mergeTableIDs(stepFromLeftTableIDs, stepFromRightTableIDs);
std::vector<TableCatalogEntry*> newRelEntries{temp.begin(), temp.end()};
if (!isSameTableCatalogEntryVector(newRelEntries, rel.getEntries())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we do l284-l312 as a separate PR because

  • it can be viewed as another optimization after pruning recursive rel labels; and
  • we will also need to get a benchmark number to make sure there are performance improvements for node pruning.

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 99.30070% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.83%. Comparing base (b89dd36) to head (c5130de).
Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
src/binder/query/query_graph_label_analyzer.cpp 99.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4868      +/-   ##
==========================================
+ Coverage   85.80%   85.83%   +0.02%     
==========================================
  Files        1397     1397              
  Lines       60504    60614     +110     
  Branches     7451     7473      +22     
==========================================
+ Hits        51917    52026     +109     
- Misses       8406     8407       +1     
  Partials      181      181              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ted-wq-x
Copy link
Contributor Author

I found gds join performance worse than before, there are so many memory minor fault. This opt does not provide any performance improvement in current code.

Under recursive join ,this opt can get 900ms -->300ms in sf100,"match (a:person {id:24189255912498})-[f*2]-(b) return count(1)" (Previous test results);

I will provide performance benchmark, when i resolve the current performance problem.

@ted-wq-x ted-wq-x force-pushed the recurse-label-pruning branch from c5130de to 78f3840 Compare February 18, 2025 13:02
@ted-wq-x ted-wq-x force-pushed the recurse-label-pruning branch from 78f3840 to 2ae6a39 Compare February 19, 2025 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants