Skip to content

Spec: Add SigV4 Auth Support for Catalog Federation #1506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

XJDKC
Copy link
Member

@XJDKC XJDKC commented May 1, 2025

Context

Prior Catalog Federation work:

Reference

Below are all the reference mentioned in this PR:

Description

This PR adds the SigV4 Auth Support so that Polaris can federate to Glue IRC / S3Tables / catalog server hosted behind the AWS API Gateway.

Details

User-provided properties:

  • roleArn: the AWS IAM role urn, polaris will assume the role to get a tmp aws session credential
  • roleSessionName: The role session name to be used by the SigV4 protocol for signing requests
  • externalId: An optional external id used to establish a trust relationship with AWS in the trust policy
  • signingRegion: Region to be used by the SigV4 protocol for signing requests
  • signingName: The service name to be used by the SigV4 protocol for signing requests, the default signing name is "execute-api" is if not provided
    • AWS Glue: glue
    • AWS S3Tables: s3tables
    • AWS API Gateway: execute-api

Service-provided properties:

  • userArn: The aws user arn used to assume the aws role, this represents the polaris service itself, overall steps:
    • Step 1: Polaris will use an user aws credential to assume the IAM role, this aws credential is configured for polaris, usually polaris will read it from environment variable
    • Step 2: After assuming the role, polaris will get the tmp aws credential, polaris will put these credentials into the prop map, this prop map will be used to initialize the REST Catalog Client, properties can be found in AwsProperties:
      • rest.signing-region
      • rest.signing-name
    • Step 3: RESTSigV4Signer will sign the requests based on SigV4 protocol

Reference

properties:
roleArn:
type: string
description: The aws IAM role arn assume when signing requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this assume the use of STS?

Mere SigV4 can be done with just a keyId/keySecret pair.. why is it not an option? AFAIK, there are S3 implementations that do not have STS, but support SigV4.

Copy link
Member Author

@XJDKC XJDKC May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this assumes the use of STS. While SigV4 can technically work with just a keyID/keySecret pair, that's not how it works in Polaris.

Let me break it down a bit:

Polaris acts as the service provider, and it has its own IAM user with an AWS credential (key ID and key secret). However, this IAM user is owned by the Polaris service, not by the Polaris user. Since IAM doesn't allow one AWS account to grant privileges directly to another IAM user belong to another AWS account, Polaris wouldn't be able to access the polaris user's Glue catalog that way.

So how does Polaris access a user's Glue catalog? By assuming the IAM role provided by the Polaris user. This lets the Polaris service temporarily inherit the permissions tied to that role, essentially gaining the necessary access without the user having to expose long-lived credentials.

Now, why are there S3 implementations that don't use STS?
That's very common on the query engine side. In that case, the query engine is fully managed by the Polaris user themselves. They can create an IAM user + access key pair, grant that IAM user privileges to their storage, and configure their engine to use that credential directly. No need for STS in that scenario.

So here's the key difference:

  1. AWS User credentials (key ID/secret) are long-lived and tied to a user. IAM roles aren't real credentials, they're just a set of permissions. To access resources, Polaris assumes a role and obtains short-lived temporary credentials via STS. This is much more secure
  2. In the query engine use case, both the engine and the storage are owned by the same identity (the Polaris user), so they're free to use long-lived user credentials if they want. Bur for polaris service provider and polaris users, they come from two different groups.

The IAM role assumption model is actually consistent with how Polaris accesses S3 storage in general. In Polaris S3 storage configs, users provide an IAM role, Polaris assumes that role, and gets subscoped temporary credentials via STS to access the storage.

Glue Catalog also recommends to use IAM role to manage the access.
https://docs.aws.amazon.com/glue/latest/dg/configure-iam-for-glue.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info. Re: Polaris acts as the service provider ... while I agree that it's a possible use case, I do not think Apache Polaris needs to be restricted to just that use case. I can imagine users who'd want to deploy and use Polaris in a controlled environment and have ownership both of the AWS storage and Polaris servers.

That, of course, does not invalidate the security best practices that you referenced.

Given that the new connection config is apparently meant to enable connections to AWS services and not to generic S3 implementations, I'd propose to rename the config type to something like AWSSTSAuthenticationParameters to avoid ambiguity (with type code AWSSTS). WDYT?

Copy link
Member Author

@XJDKC XJDKC May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it consistent with the Iceberg SDK: AuthProperties.java#30

Also, getting the tmp AWS credential from STS is just the first step. The actual operation involves signing the request using the SigV4 protocol. Given that, I'd prefer we stick with using SIGV4 as the type name, it better reflects the authentication mechanism being used.

But I agree that Apache Polaris might be deployed in a controlled environment where users own both the AWS storage and the Polaris servers. In that setup, using an IAM role isn't strictly necessary. Long-lived credentials are often read from environment variables or command-line configs.

That said, we shouldn't recommend storing these long-lived credentials directly in the catalog entity and persisting them in the metastore. It’s a security risk, and it's generally best practice to keep credentials out of persistent metadata wherever possible. (Maybe we can leverage the UserSecretsManager added by @dennishuo and use a UserSecretReference to refer to that long-lived credential).

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my point on naming: indeed, the actual authentication mechanism on the wire is going to be SigV4, however, the user-facing configuration and requirements involve STS and role assumption. This is more specific than SigV4. In other words, users who configure this authentication method have to be aware of STS and roles. The basic inputs into SigV4 are ID/secret/session-token, but this auth config is higher level.

Therefore, from my POV using a name that highlights reliance on higher-level concepts like STS would be beneficial and provide clarity to users.

I do not feel too strongly about this, though :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or it's better to have a subtype for SigV4? e.g. STS or STATIC_CREDS? WDYT?
This will cover both scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied on the dev list as well, but I think for STATIC_CREDS we can probably follow the same convention we started with SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION for StorageConfig, where we can be more opinionated about subscoping design for the "managed multi-tenant" use cases, while allowing a simpler approach for open-ended "self-managed, single-tenant" use cases to just delegate to environment variables and/or default client SDK behaviors.

In particular, we can take stand on whether we ever actually want Polaris to run in a mode that mixes some catalogs that do subscoping with other catalogs that just use static credentials through environment variables or similar. The problem is that the subscoping is trying to promise certain guarantees of isolation, and mixing the two might fundamentally violate its assumptions. So it might be best to really bifurcate the two use cases by essentially making the behavior choice a top-level server config (e.g. SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION) instead of trying to express it as an "option" in the ConnectionConfigType.

That said, I don't feel too strongly whether to make the auth type more sts-specific since in theory there could still be other non-static-creds based flows for sigv4 that also don't use STS. Like it could be SIGV4_STS or STS_SIGV4. But for now still not adding any particular type for static creds.

Alternatively, if anyone feels strongly that even in the static creds case we need an explicit ConnectionType for it (maybe in addition to a server-level config), we at the very least need to add a SUPPORTED_CONNECTION_AUTH_TYPES just like how we have a SUPPORTED_CATALOG_STORAGE_TYPES because in multi-tenant configurations it'll be important to block the ability for catalogs to be configured to use open-ended static creds

Copy link
Member Author

@XJDKC XJDKC May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Dennis, thanks for bringing this up, I didn't realize there was already a solution in place for storage config.

We can definitely follow the same pattern for connection config: skip assuming the role and just read the credentials from the command line. I also looked into the Iceberg SDK, if we don't explicitly pass credentials in the catalog properties when initializing the RESTCatalog, the SDK falls back to AWS SDK's DefaultCredentialsProvider, which picks up credentials from the environment variables.

Code pointers:
RESTSigV4Signer.java#L91
AwsProperties.java#L417-L435

Agree that there could be other non-static-creds (e.g. For Top Secret Region, we need to call CAP service to get the credentials, it's a replacement to the STS). Instead of creating many top-level types for SigV4, let's not adding any particular type for SigV4 auth.

I will add SUPPORTED_CONNECTION_AUTH_TYPES in a follow up PR.

Mind giving this PR a stamp?
cc: @dimas-b @dennishuo

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same comments as @dimas-b and @singhpk234 - but this should be good for now!

@XJDKC
Copy link
Member Author

XJDKC commented May 5, 2025

Hi folks, any other concerns about this spec change?

There is one remaining thing we haven't made a decision:
For non-STS based SigV4 authentication, how do we plan to support it?

1. From env vars or server config, e.g.:

  • POLARIS_IAM_USER_AWS_ACCESS_KEY_ID
  • POLARIS_IAM_USER_AWS_SECRET_ACCESS_KEY
  • POLARIS_IAM_USER_ARN
    In this case, roleArn would not be required.

2. Configured via the Polaris Management API: Stick to SigV4AuthenticationParameters

If we stick with the existing SigV4AuthenticationParameters type, we could:

  • Make roleArn optional
  • Add iamUserAwsAccessKeyId and iamUserAwsSecretAccessKey as optional fields

3. Configured via the Polaris Management API: Add new auth type

We could create a new type to distinguish clearly:

  • New AuthenticationType enum: SIGV4_STS, SIGV4_STATIC_CREDS

4. Configured via the Polaris Management API: Add new auth types

We could create a new sub type to distinguish clearly:
e.g. new subtype under SigV4AuthenticationParameters: STS, CREDS

Do you have any perference or is the current proposal good to go?

@adnanhemani
Copy link
Collaborator

@XJDKC - sorry, I know I had previously approved this proposal but reading through some of the comments, I'm questioning something:

Why do we require a IAM User for Polaris? My understanding is that today, Polaris already has access to some IAM Role credentials (through the Default Credential Provider) - why do we need to introduce the new complexity of an IAM User? If we assume that Polaris has access to the IAM Role, then we no longer need the workaround for cross-account access as well.

Can you clarify on this bit?

On non-STS SigV4 authN, I don't really have a strong preference - and I don't think adopting any of the options constitutes this PR as a one-way door. So I'm okay to punt on it for now.

@XJDKC
Copy link
Member Author

XJDKC commented May 5, 2025

@XJDKC - sorry, I know I had previously approved this proposal but reading through some of the comments, I'm questioning something:

Why do we require a IAM User for Polaris? My understanding is that today, Polaris already has access to some IAM Role credentials (through the Default Credential Provider) - why do we need to introduce the new complexity of an IAM User? If we assume that Polaris has access to the IAM Role, then we no longer need the workaround for cross-account access as well.

Can you clarify on this bit?

On non-STS SigV4 authN, I don't really have a strong preference - and I don't think adopting any of the options constitutes this PR as a one-way door. So I'm okay to punt on it for now.

Hey @adnanhemani, there is an on-going discussion about this PR: https://lists.apache.org/thread/rlbxvw0xmzvlfm7pdh97bs3xvq7o8lmy

This thread also contains some details about it: #1506 (comment)

In short, to get an IAM Role credentials, polaris needs to use a long-lived credential (represents the polaris service identity) to assume the role (provided by polaris users) and get a temp credential. This is the solution provided by AWS IAM to grant permissions to AWS account B (polaris service provider) from AWS account A (polaris users).

@dimas-b
Copy link
Contributor

dimas-b commented May 5, 2025

I believe the STS credentials concern is not specific to this PR. We can probably deal with that in another PR.

Comment on lines 969 to 972
userArn:
type: string
description: The aws user arn used to assume the aws role, this represents the polaris service itself
example: "arn:aws:iam::123456789001:user/polaris-service-user"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe userArn is not part of the Assume Role API call in STS... unless I'm missing something. So this parameter seems redundant in the connection config.

Copy link
Member Author

@XJDKC XJDKC May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This property is used to expose the Polaris service's userArn to Polaris users, so they can update the trust relationship of their IAM role to allow Polaris to assume the role they’ve provided.

AWS doc for editing trust relationships: Editing the trust relationship for an IAM role

From the user's perspective, the flow looks like this:

  1. Create a Polaris Catalog entity that includes a connection config specifying their IAM roles.
  2. Polaris injects its IAM user ARN into the connection config.
  3. Users call the getCatalog API to retrieve catalog details, including the Polaris service identity (userArn).
  4. Users update the trust relationship on their IAM roles, allowing Polaris to assume those roles.
    • For storage config: the IAM role needs permissions to access the bucket.
    • For connection config: the IAM role needs permissions to access their catalog (e.g., Glue).
  5. Polaris assumes the IAM role and gets temporary credentials with the role's permissions.
  6. Polaris uses those temporary credentials to sign and send requests to the Glue Catalog (or other catalog service).

Storage config also has this property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, @XJDKC : Is it not the user who invokes the Admin API to save this config? The caller of the API must already know the value, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storage config also has this property.

I believe this property is equally misleading in the storage config too 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the service owner needs to inform users about userArn, I believe we need to find another mechanism. Adding properties to configuration objects that are meant to affect Polaris behaviour, but have parameters not used by Polaris code is not a sound approach, I think.

Copy link
Contributor

@dennishuo dennishuo May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that flow is indeed how the StorageConfigInfo works today for managed Polaris services that bind their own service-owned IAM User at the time of the CreateCatalog call; in general the binding of the IAM User to a StorageConfig (or a ConnectionConfig) may not be known until the CreateCatalog actually runs.

I think the problems with trying to do it at step 1 are:

  • The binding might be dynamic and transactionally done performed during the CreateCatalog
  • There's no separate API to express pre-planning for intending to create a Federated Catalog

It's true that if the service provider has very static base IAM Users or if the user interaction is very manual (e.g. if it's a small-scale/informal use case where an internal "user" DMs the "service provider" on Slack), then the userArn can be known already before creating the catalog.

Looks like there was some discussion about this already in #1191

I know @XJDKC has proposed having a stronger delineation between user-provided and service-provided properties, and I agree it's worth cleaning up.

The "not used by Polaris" might just be a shortcoming of the current default implementations -- in a way we should be enforcing consistency of the userArn when setting up the stsClient in

default Supplier<StsClient> stsClientSupplier() {
  return Suppliers.memoize(
      () -> {
        StsClientBuilder stsClientBuilder = StsClient.builder();
        if (awsAccessKey().isPresent() && awsSecretKey().isPresent()) {
          LoggerFactory.getLogger(StorageConfiguration.class)
              .warn("Using hard-coded AWS credentials - this is not recommended for production");
          StaticCredentialsProvider awsCredentialsProvider =
              StaticCredentialsProvider.create(
                  AwsBasicCredentials.create(awsAccessKey().get(), awsSecretKey().get()));
          stsClientBuilder.credentialsProvider(awsCredentialsProvider);
        }
        return stsClientBuilder.build();
      });
}

But the default implementation is just kept simple by basically taking a single root IAM User from the runtime environment that ends up being the userArn used implicitly for all StorageConfigurations in the service.

One step to expand on that architecture in a way that better illustrates how the fields actually should be used by the code would be if we allow something like setting per-realm root IAM Users or IAM Roles in the Quarkus config for being the "service identity" that will assumeRole on the user-specified catalog-level roleArn. Then the stsClient Supplier would be initially authenticating as the realm-specific "service identity" that matches what's in the Catalog's properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the concept of "read-only" or "write-only" fields are actually first-class in OpenAPI: https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.2.md#fixed-fields-20

readOnly:

Relevant only for Schema "properties" definitions. Declares the property as "read only". This means that it MAY be sent as part of a response but SHOULD NOT be sent as part of the request. If the property is marked as readOnly being true and is in the required list, the required will take effect on the response only. A property MUST NOT be marked as both readOnly and writeOnly being true. Default value is false.

writeOnly:

Relevant only for Schema "properties" definitions. Declares the property as "write only". Therefore, it MAY be sent as part of a request but SHOULD NOT be sent as part of the response. If the property is marked as writeOnly being true and is in the required list, the required will take effect on the request only. A property MUST NOT be marked as both readOnly and writeOnly being true. Default value is false.

Maybe we should experiment with some of those additional annotations? Interestingly, writeOnly conveys well other patterns we've had such as inline secrets.

And readOnly would be fields used for communicating things decided by the server backend back to the user.

Copy link
Contributor

@dimas-b dimas-b May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, @dennishuo !

It looks like this is not so much about using a two-stage Assume Role approach in Polaris, but more about the source and target audience of userArn. Right now it is clear that the user is not the source but a consumer of this value.

I believe it would clearer if this was represented in another config object, for example ConnectionConfigInfo.accessDeclarations to contain userArn.

Whether or not Polaris does a two-stage AssumeRole is not directly relevant to the AuthenticationParameters. These parameters define what Polaris needs to do in order to gain access to user's resources. userArn is a declaration Polaris makes about the identity it will use for accessing the external resources. Whether Polaris is authorized to use userArn is beyond the scope of this config. It is declared to the user by Polaris in order for the user to make appropriate grants.

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Dmitri and Dennis, thanks for the suggestions! Extracting the service-provided properties into a nested field and annotating it with readOnly both make a lot of sense to me. I'll update the spec to reflect these changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

dennishuo
dennishuo previously approved these changes May 6, 2025
Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM given the consistency with StorageConfig, though I agree we should probably refine the way we convey "output-only" concepts like the userArn, and get agreement with other reviewers on how best to convey that distinction before merging this PR (maybe simply annotating with the openApi field readOnly is enough, or we could go further and move the readOnly fields into a separate tuple such as vendorInfo like you suggested elsewhere).

Comment on lines 969 to 972
userArn:
type: string
description: The aws user arn used to assume the aws role, this represents the polaris service itself
example: "arn:aws:iam::123456789001:user/polaris-service-user"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the concept of "read-only" or "write-only" fields are actually first-class in OpenAPI: https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.2.md#fixed-fields-20

readOnly:

Relevant only for Schema "properties" definitions. Declares the property as "read only". This means that it MAY be sent as part of a response but SHOULD NOT be sent as part of the request. If the property is marked as readOnly being true and is in the required list, the required will take effect on the response only. A property MUST NOT be marked as both readOnly and writeOnly being true. Default value is false.

writeOnly:

Relevant only for Schema "properties" definitions. Declares the property as "write only". Therefore, it MAY be sent as part of a request but SHOULD NOT be sent as part of the response. If the property is marked as writeOnly being true and is in the required list, the required will take effect on the request only. A property MUST NOT be marked as both readOnly and writeOnly being true. Default value is false.

Maybe we should experiment with some of those additional annotations? Interestingly, writeOnly conveys well other patterns we've had such as inline secrets.

And readOnly would be fields used for communicating things decided by the server backend back to the user.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 6, 2025
@XJDKC
Copy link
Member Author

XJDKC commented May 9, 2025

Hi folks, I've updated the spec to extract the Polaris service identity info into a nested object.

Right now, I've added support for two types of service identities:

  • AWS_IAM_USER: Polaris is configured with long-lived AWS credentials from an IAM user. It uses these credentials to assume the IAM role provided by Polaris users, then accesses the remote catalog using the subscoped temp credentials.
  • AWS_IAM_ROLE: Polaris retrieves temporary credentials via Amazon ECS container credentials or Amazon EC2 instance IAM role credentials, then assumes the polaris user-provided IAM role to get new temporary credentials to access the remote catalog.

The service identity info is part of the SigV4AuthenticationParameters, it's read only and will be surfaced to polaris users in the response of getCatalog / listCatalogs / updateCatalog

Looking ahead, we may introduce additional service identity types (e.g. GCP Service Account) and aim to unify the behavior between storage config and connection config for consistency.

I have drafted a proposal for the credential management in Apache Polaris, would love to hear your thoughts, feedback, and suggestions.
Apache Polaris Creds Management Proposal

cc: @dennishuo @dimas-b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants