Updating FlinkDeployment interpreter to display error status, improving health interpreter #6073

Open
wants to merge 1 commit into
base: master

mszacillo
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:
While load testing FlinkDeployment failover (which looks good overall, though a couple of edge cases may still need addressing), I found that the interpreter is missing one of the ephemeral states that FlinkDeployments can transition through.

Occasionally on failover, the FlinkDeployment will transition from RECONCILING -> INITIALIZING -> CREATED, before finally ending on RUNNING. Additionally, we can make use of the status.error field to further improve the health interpretation.

In this PR I've added:

  1. INITIALIZING state as an ephemeral state which we should check during health interpretation.
  2. A check for status.error != nil. If the deployment has published an error, we treat it as healthy, since this indicates that the job failed due to user error. In the future, we may consider adding a list of errors that should still trigger failover.
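
A minimal sketch of the two checks above, written in Go purely for illustration. It is not the interpreter change in this PR: the `FlinkStatus` type, its field names, and the `interpretHealth` function are hypothetical, and treating CREATED and RECONCILING as the other ephemeral states is an assumption based on the transition sequence described above.

```go
package main

import "fmt"

// FlinkStatus is a hypothetical, trimmed-down view of FlinkDeployment.status.
// Only the two fields discussed in this PR are modelled.
type FlinkStatus struct {
	JobState string // stands in for status.jobStatus.state
	Error    string // stands in for status.error; empty means unset
}

// ephemeralStates lists states a deployment passes through before RUNNING.
// INITIALIZING is the state this PR adds; the other two are assumed.
var ephemeralStates = map[string]bool{
	"CREATED":      true,
	"RECONCILING":  true,
	"INITIALIZING": true,
}

// interpretHealth reports whether the deployment should be considered healthy.
func interpretHealth(s *FlinkStatus) bool {
	if s == nil || (s.JobState == "" && s.Error == "") {
		return false // no status reported yet
	}
	if s.Error != "" {
		// A published error is treated as a user error, so the
		// deployment is reported healthy rather than failed over.
		return true
	}
	// Without an error, the deployment is healthy only once it has
	// left the ephemeral states.
	return !ephemeralStates[s.JobState]
}

func main() {
	samples := []FlinkStatus{
		{JobState: "INITIALIZING"},
		{JobState: "RUNNING"},
		{JobState: "RECONCILING", Error: "user error"},
	}
	for _, s := range samples {
		fmt.Printf("state=%s error=%q healthy=%v\n", s.JobState, s.Error, interpretHealth(&s))
	}
}
```

Checking the error before the state means a deployment stuck in an ephemeral state with a published error is not reported as unhealthy, which matches the intent described in item 2.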

Which issue(s) this PR fixes:
Fixes #6023

Does this PR introduce a user-facing change?:

FlinkDeployment health interpreter improvements, adding status.error to reflected status

@karmada-bot added the kind/feature label (categorizes issue or PR as related to a new feature) on Jan 21, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Kubernetes Code Review Process.


@karmada-bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Jan 21, 2025
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.34%. Comparing base (820fd06) to head (f705b26).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6073   +/-   ##
=======================================
  Coverage   48.33%   48.34%           
=======================================
  Files         666      666           
  Lines       54858    54858           
=======================================
+ Hits        26518    26520    +2     
+ Misses      26616    26614    -2     
  Partials     1724     1724           
Flag Coverage Δ
unittests 48.34% <ø> (+<0.01%) ⬆️

@RainbowMango
Member

@yike21 @chaunceyjiang can you take a look?

@yike21
Member

yike21 commented Jan 22, 2025

> @yike21 @chaunceyjiang can you take a look?

Ok, I'll take a look at it ASAP :-)

@RainbowMango
Member

Here is the FlinkDeployment reference where you can find the definition of the status:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/

Development

Successfully merging this pull request may close these issues.

FlinkDeployment health interpreter does not account for ImagePullBackOff Error
5 participants