Skip to content

perf: Add performance tracing capability #1706

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented May 2, 2025

Which issue does this PR close?

Closes #1705

Rationale for this change

This feature makes it possible to visualize the flow of calls during query execution, along with JVM and native memory usage.

2025-05-06_10-29

What changes are included in this PR?

Adds optional feature to enable trace logging using Chrome's Trace Event Format.

How are these changes tested?

I tested manually and confirmed that there is no performance impact when this feature is not enabled.

@codecov-commenter
Copy link

codecov-commenter commented May 2, 2025

Codecov Report

Attention: Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.

Project coverage is 58.69%. Comparing base (f09f8af) to head (bb388dc).
Report is 180 commits behind head on main.

Files with missing lines Patch % Lines
...ain/scala/org/apache/comet/CometExecIterator.scala 58.33% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1706      +/-   ##
============================================
+ Coverage     56.12%   58.69%   +2.56%     
- Complexity      976     1142     +166     
============================================
  Files           119      129      +10     
  Lines         11743    12637     +894     
  Branches       2251     2363     +112     
============================================
+ Hits           6591     7417     +826     
- Misses         4012     4044      +32     
- Partials       1140     1176      +36     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove changed the title [experiment] Generate event log in Chrome tracing format feat: Add performance tracing capability May 2, 2025
@andygrove andygrove marked this pull request as ready for review May 2, 2025 15:02
@andygrove andygrove changed the title feat: Add performance tracing capability perf: Add performance tracing capability May 2, 2025
@andygrove andygrove requested review from parthchandra and comphead May 2, 2025 15:09
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove btw did you try the samply
https://datafusion.apache.org/library-user-guide/profiling.html#profiling-using-samply-cross-platform-profiler

I experimented with it a little bit and documented in DF

UPD: it works with binaries profiling, but I'm not sure how to use it in hybrid env like Comet through

@andygrove
Copy link
Member Author

UPD: it works with binaries profiling, but I'm not sure how to use it in hybrid env like Comet through

I was looking for a solution where we can choose which sections of code to trace, both in JVM and native, and see the flow between these layers. There are tracing and tracing-chrome crates that may also be interesting to look into, but seems a bit more complex to implement.

@andygrove andygrove marked this pull request as draft May 6, 2025 14:56
@@ -767,3 +750,20 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_traceEnd(
) {
// no implementation
}

struct TraceGuard<'a> {
Copy link
Contributor

@comphead comphead May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can have it as generic function?

pub fn trace_time<T, F>(label: &str, mut f: F) -> T
where
    F: FnMut() -> T,
{
    let _ = TraceGuard::new("writeSortedFileNative");
    let result = f();
    result
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like that idea. I will try and implement this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into some issues around lifetimes for captured variables with this approach, so perhaps it is simpler to stick with the TraceGuard.

@andygrove
Copy link
Member Author

We can trace native memory usage as well and display it next to the running tasks:

2025-05-06_09-29

@andygrove
Copy link
Member Author

@Kontinuation you may be interested in this

@andygrove andygrove marked this pull request as ready for review May 6, 2025 19:18
}

fn log_event(&self, name: &str, ph: &str) {
println!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it planned to output with println or some other method like tracing https://docs.rs/tracing/latest/tracing/ ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an epic where we can track items like this.

#1718

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it would be nicer to use tracing (and even better to have the tracing output compatible with log like this perhaps? https://docs.rs/tracing/latest/tracing/#for-log-users ).

@@ -133,29 +133,25 @@ class CometExecIterator(

def getNextBatch(): Option[ColumnarBatch] = {
assert(partitionIndex >= 0 && partitionIndex < numParts)
try {
nativeLib.traceBegin("CometExecIterator_getNextBatch")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be inside if (memoryProfilingEnabled ?

@andygrove andygrove marked this pull request as draft May 6, 2025 22:38
@andygrove
Copy link
Member Author

I moved this to draft because it is still quite experimental. I am now working on adding tracing for JVM off-heap usage in CometUnifiedShuffleMemoryAllocator and CometTaskMemoryManager.

@andygrove
Copy link
Member Author

I can now see the off-heap memory being used by CometUnsafeShuffleWriter, which was 4GB in this case (500MB per concurrent task).

2025-05-07_10-15

}

fn log_event(&self, name: &str, ph: &str) {
println!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it would be nicer to use tracing (and even better to have the tracing output compatible with log like this perhaps? https://docs.rs/tracing/latest/tracing/#for-log-users ).

self.log_event(name, "E")
}

pub fn log_counter(&self, name: &str, value: usize) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be u64 and so not be system dependent? Not sure how useful it might be but some counters (say bytes read) could be u64 but usize may be smaller.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for event tracing for visualizing where time is spent during execution
4 participants