RFC: Centralized Caching Service for Trustee #784
Comments
I think this makes sense. One thing I don't see mentioned here is considerations around high availability. We have been thinking about how to make Trustee stateless so that it can be used with some k8s paradigms. I think your proposal should work fine with this. We'll just need to provide some caching backend that does not store things in memory.

I am assuming that you'll define a cache trait and allow people to specify various backends. The simplest one would just be a dictionary. I don't know that we actually need to worry about eviction in the first iteration, but I guess it can't hurt. I would imagine that the trait itself will look a lot like a dictionary. Probably we can just have the keys be strings, with different verifiers responsible for their naming schemes (although maybe some verifier prefix to avoid collisions). I think it's also important to define a method of pre-provisioning the cache via the trait, so that an admin can provide certs in an offline environment.

One question I have is whether there are any other parts of Trustee, besides the verifiers, that this caching backend would be exposed to.
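For illustration, a minimal sketch of what such a trait could look like, assuming string keys and byte-slice values; every name here is hypothetical, not an agreed-upon API:

```rust
use std::collections::HashMap;

// Hypothetical backend trait: string keys (optionally namespaced per
// verifier) plus a pre-provisioning hook so an admin can seed certs for
// offline environments. All names are illustrative only.
pub trait CacheBackend: Send + Sync {
    /// Look up a previously cached entry by its (namespaced) key.
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    /// Insert or overwrite an entry.
    fn set(&mut self, key: String, value: Vec<u8>);
    /// Pre-provision the cache in bulk, e.g. from admin-supplied material.
    fn preload(&mut self, entries: HashMap<String, Vec<u8>>) {
        for (k, v) in entries {
            self.set(k, v);
        }
    }
}

/// The simplest backend: an in-memory dictionary.
pub struct MemoryBackend {
    map: HashMap<String, Vec<u8>>,
}

impl CacheBackend for MemoryBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn set(&mut self, key: String, value: Vec<u8>) {
        self.map.insert(key, value);
    }
}
```

A non-in-memory backend (for stateless deployments) would implement the same trait against a shared store.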
An exciting proposal! A basic question: what, exactly, is the goal of this cache service?
I agree. I feel like HA is important but falls outside the scope of the first iteration, though we definitely want to keep it on our radar. How quickly it may be needed is still up for debate.
Yes, the PoC looked like this:

**main.rs**
```rust
// Imports added here for completeness (`Certificate` comes from whichever
// x509 crate the PoC used; it is not shown in this snippet).
use std::sync::{Arc, LazyLock, RwLock};

#[cfg(not(feature = "ordered"))]
use std::collections::HashMap as Map;
#[cfg(feature = "ordered")]
use std::collections::BTreeMap as Map;
//...
// Here `ExampleUseCase` is an enum which implements Context...
static DATA_CACHE: LazyLock<Arc<RwLock<ContextData<Certificate>>>> = LazyLock::new(|| {
let mut context_data = ContextData::new(Map::new());
context_data.insert(
ContextKey(Box::new(ExampleUseCase::ClassificationOne)),
DatasetCache { cache: Map::new() },
);
context_data.insert(
ContextKey(Box::new(ExampleUseCase::ClassificationTwo)),
DatasetCache { cache: Map::new() },
);
Arc::new(RwLock::new(context_data))
});
```

**context.rs**
```rust
use std::{fmt::Debug, hash::Hash};
use crate::{Map, dataset::DatasetCache};
/// A marker trait used to classify the ordering of the second level cache.
/// This can be any arbitrary data-type, but usually will be represented
/// via an enum which implements this marker trait.
pub trait Context: Debug + Send + Sync {}
#[derive(Debug)]
pub struct ContextKey(pub Box<dyn Context>);
impl PartialEq for ContextKey {
fn eq(&self, other: &Self) -> bool {
format!("{:?}", self.0) == format!("{:?}", other.0)
}
}
impl Eq for ContextKey {}
impl Hash for ContextKey {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
format!("{:?}", self.0).hash(state);
}
}
#[derive(Debug)]
pub struct ContextData<T> {
cache: Map<ContextKey, DatasetCache<T>>,
}
impl<T> ContextData<T> {
pub fn new(cache: Map<ContextKey, DatasetCache<T>>) -> Self {
Self { cache }
}
pub fn insert(&mut self, key: ContextKey, value: DatasetCache<T>) {
self.cache.insert(key, value);
}
pub fn get(&self, key: &ContextKey) -> Option<&DatasetCache<T>> {
self.cache.get(key)
}
pub fn get_mut(&mut self, key: &ContextKey) -> Option<&mut DatasetCache<T>> {
self.cache.get_mut(key)
}
}
```

**dataset.rs**
```rust
use std::{fmt::Debug, hash::Hash};
use crate::Map;
/// A marker trait used to classify generic input data. This trait allows for
/// use-case specific implementations of unique data which should be used as
/// the key in the second level of the cache.
pub trait Dataset: Debug + Send + Sync {}
impl<T: Dataset + ?Sized> Dataset for Box<T> {}
#[derive(Debug)]
pub struct DatasetKey(pub Box<dyn Dataset>);
impl PartialEq for DatasetKey {
fn eq(&self, other: &Self) -> bool {
format!("{:?}", self.0) == format!("{:?}", other.0)
}
}
impl Eq for DatasetKey {}
impl Hash for DatasetKey {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
format!("{:?}", self.0).hash(state);
}
}
// Here `T` will represent any datatype which should be indexed by the `DatasetKey`
#[derive(Debug)]
pub struct DatasetCache<T> {
pub cache: Map<DatasetKey, T>,
}
```

This implementation removes any need for prefixing data types, as the underlying type is dynamic. While this worked, it has the drawbacks listed in the RFC body. It feels a bit like squeezing a square peg into a round hole.
Possibly. I was discussing this with @cclaudio in a Slack huddle. Potentially, there are a number of different use cases to which this might be applied. That being said, I don't want to introduce additional problems by providing an incorrect solution.
That is an excellent question. No official decisions have been made, thus the RFC 🙂. The initial scope of this was to provide a solution for pre-caching certificates used for attestation. However, the scope seems to have grown. I want to fully understand what the community expectation is so we aren't re-inventing the wheel.
@bpradipt Can you review and let us know your thoughts here?
Just a few thoughts:
Can you give an example of latency-sensitive information? For example, we know that SEV-SNP requires Trustee to fetch a specific endorsement key from a remote server (other arches also require this). Would this be an example of latency-sensitive information that can be cached? If so, what else?
I'd probably be more of a proponent of the Single Caches per Use Case option. There's been a concerted effort to bring "multi-tenancy" (for lack of a better word) to Trustee. Trustee being able to service multiple unrelated clients requires some isolation between reference values, policies, and resources. There's some work I'm exploring in the RVPS (which @fitzthum discusses here) to help this w/r/t reference values, which I see this caching service as an extension of.
`namespace:identifier` would work, but I think the namespace should be enforced by Trustee and not configurable by a client. Namespaces could be derived from a specific client's INITDATA, for example (see the sketch after this comment).
I'm a bit confused about what we mean by "offline" Trustee. As I understand it, Trustee should be deployed remotely on a trusted server (i.e. NOT on the same system as the client). Therefore, if there's no network connectivity, clients wouldn't be able to attest either.
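To make the namespace-enforcement idea concrete, a trivial sketch (names hypothetical; deriving the namespace from INITDATA is an assumption, not decided behavior):

```rust
// Hypothetical: Trustee constructs the full key itself, deriving the
// namespace from attested client context (e.g. a value bound in INITDATA)
// rather than accepting a client-chosen namespace.
fn namespaced_key(tenant_namespace: &str, identifier: &str) -> String {
    format!("{tenant_namespace}:{identifier}")
}

// e.g. namespaced_key("tenant-a", "snp-vcek-milan") -> "tenant-a:snp-vcek-milan"
```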
Yes, content reliant on network requests is one of the latency-sensitive scenarios to consider. Here is how I am defining "latency-sensitive": any scenario where the speed of access, processing, or delivery of data is critical to the performance, user experience, or functionality of a system.
I also am leaning in this direction, but this could be accomplished in a couple of ways. In fact, depending on the needs of the community, perhaps it would make sense to make a singleton of something like this:

```rust
struct FirstLevelCache {
    use_case_one: SecondLevelCache,
    use_case_two: SecondLevelCache,
    // ...
    use_case_n: SecondLevelCache,
}
```

Where each of the `SecondLevelCache` fields holds the cache for a single use case.
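For instance, following the `LazyLock` pattern from the PoC above, the singleton could be declared like this (a sketch; it assumes `FirstLevelCache` implements `Default`):

```rust
use std::sync::{Arc, LazyLock, RwLock};

// Sketch only: one process-wide instance, guarded for concurrent access.
static FIRST_LEVEL_CACHE: LazyLock<Arc<RwLock<FirstLevelCache>>> =
    LazyLock::new(|| Arc::new(RwLock::new(FirstLevelCache::default())));
```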
This is an interesting idea! I will look into it a little bit further.
That is correct, in this context. However, if the deployments are within an air-gapped network, they may all be able to communicate with each other while not being able to communicate outside of the local network. That is what I meant by offline support for the cache. Should the cache support communicating with external entities to retrieve fresh content?
Let me add a few points based on practical experience. In completely air-gapped environments (also referred to as disconnected or offline systems), there's no way to pull data dynamically from the internet, and there is no concept of partial online/offline. So, as @fitzthum already mentioned, there should be support for pre-populating the cache, maybe from a signed, portable bundle made available as an OCI artefact in the customer's internal artefact registry. There should also be tooling to identify the certs, based on the server hardware/firmware, that must be downloaded to populate the cache. Further, there should be some guidelines on cache validation and handling of invalidated certs for completely air-gapped environments.
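To illustrate the pre-population idea (purely a sketch; the bundle format, signature verification, and tooling are all undecided), seeding the cache from an unpacked bundle directory might look like this, reusing the hypothetical `CacheBackend` trait sketched earlier in this thread:

```rust
use std::{fs, io, path::Path};

// Hypothetical: seed the cache from a directory of certs unpacked from a
// signed bundle. Signature verification of the bundle is omitted here,
// but would need to happen before this step in any real flow.
fn preload_from_dir(backend: &mut impl CacheBackend, dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let key = entry.file_name().to_string_lossy().into_owned();
        let value = fs::read(entry.path())?;
        backend.set(key, value);
    }
    Ok(())
}
```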
RFC: Centralized Caching Service for Trustee
Status: Draft
Author: Larry Dewey
Created: April 28, 2025
Last Updated: April 28, 2025
Abstract
This RFC proposes the design and implementation of a centralized caching service for the Trustee Community to enhance performance and scalability across various use cases. The service aims to provide a flexible, efficient, and user-friendly caching solution. This document outlines the key considerations, including user experience, storage mechanisms, integration with existing tools, key format specifications, and functionality requirements for online/offline and ordered/unordered caching.
Motivation
While working on a recent proof-of-concept (PoC) to create a generic certificate caching service, it became evident to me that Trustee -- and the community as a whole -- would benefit from the creation of a centralized caching service. Ultimately, we will all benefit from a well-designed caching solution that improves performance, reduces redundancy, and provides a consistent interface for developers and administrators to cache latency-sensitive information. This RFC seeks to gather input from stakeholders to ensure that a robust and scalable solution is designed to meet the community's needs.
Goals
Non-Goals
Proposed Design
1. Administrator / User Experience
The caching service must prioritize an intuitive and seamless experience for both administrators configuring the service and consumers interacting with it.
2. Storage
The storage mechanism is critical to the service's performance and flexibility; a few possible backends are sketched below.
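As a rough sketch of the design space (not a committed list), the backend options might be enumerated along these lines:

```rust
use std::path::PathBuf;

// Purely illustrative: the kinds of backends an administrator might choose
// between. A remote backend would support the stateless/HA deployments
// discussed in the comments.
pub enum StorageBackend {
    /// Fastest, but lost on restart and unsuitable for stateless Trustee.
    InMemory,
    /// Survives restarts on a single node.
    OnDisk { path: PathBuf },
    /// Shared store (e.g. a key-value service) for multi-replica deployments.
    Remote { url: String },
}
```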
3. Existing Crates
Leveraging existing tools can accelerate development and ensure reliability. Existing crates (e.g., `cached` or `moka`) could provide additional features like time-based expiration or async support.
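For instance, if `moka` were adopted, time-based expiration might look like the following sketch (the capacity and TTL values are placeholders, not recommendations):

```rust
use std::time::Duration;
use moka::sync::Cache;

// Sketch: a bounded cache whose entries expire after a fixed time-to-live,
// so stale certs age out without explicit invalidation.
fn build_cert_cache() -> Cache<String, Vec<u8>> {
    Cache::builder()
        .max_capacity(10_000)                         // bound memory usage
        .time_to_live(Duration::from_secs(24 * 3600)) // expire stale entries
        .build()
}
```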
4. Format / Specification
While it is possible to address generic types for implementation, a standardized key format would be significantly better for consistency and simplicity. Keys could take the form `namespace:identifier` or a hash of specific fields. The format should support unique identification while allowing flexibility for use-case-specific fields.
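As a sketch of the hash-based variant (the field names are examples only, not a proposed schema):

```rust
use sha2::{Digest, Sha256};

// Illustrative: derive a stable identifier by hashing the fields that
// uniquely describe the cached item, prefixed with a namespace.
fn hashed_key(namespace: &str, product: &str, chip_id: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(product.as_bytes());
    hasher.update(chip_id.as_bytes());
    let digest = hasher.finalize();
    let hex: String = digest.iter().map(|b| format!("{b:02x}")).collect();
    format!("{namespace}:{hex}")
}
```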
Deciding Against Box<dyn Trait>
During the experimentation of the previously mentioned PoC, using `Box<dyn Trait>` (where `Trait` is a marker trait) was explored but deemed unsuitable due to the following trade-offs:
- `dyn Trait` reduces compile-time type checking, increasing the risk of runtime errors, and also handicaps the compiler, preventing it from making certain optimizations.
- The flexibility of `Box<dyn Trait>` does not outweigh the costs, as a more static approach (e.g., generic types or enums) can achieve similar goals with better performance and safety.
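To make the trade-off concrete, here is a minimal sketch of the static alternative: an enum of known use cases as the first-level key, so lookups stay fully type-checked (names hypothetical):

```rust
use std::collections::HashMap;

// Hypothetical static alternative to `Box<dyn Trait>` keys: the set of use
// cases is a closed enum, so misuse is caught at compile time and the
// compiler is free to optimize.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum UseCase {
    ClassificationOne,
    ClassificationTwo,
}

struct StaticContextData<T> {
    cache: HashMap<UseCase, HashMap<String, T>>,
}

impl<T> StaticContextData<T> {
    fn get(&self, use_case: &UseCase, key: &str) -> Option<&T> {
        self.cache.get(use_case)?.get(key)
    }
}
```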
5. Online vs. Offline Functionality
The caching service should function effectively in both online and offline scenarios.
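A rough sketch of how a lookup could behave across the two modes (an assumed flow, not a committed design):

```rust
use std::collections::HashMap;

// Hypothetical online-first lookup with offline fallback: consult the cache,
// and only reach out to the network when the deployment is online. In an
// air-gapped deployment a miss is surfaced to the caller, who must rely on
// pre-provisioned entries.
fn lookup(
    cache: &mut HashMap<String, Vec<u8>>,
    key: &str,
    online: bool,
) -> Option<Vec<u8>> {
    if let Some(hit) = cache.get(key) {
        return Some(hit.clone());
    }
    if !online {
        return None; // offline: only pre-provisioned entries are available
    }
    let fetched = fetch_from_upstream(key)?;
    cache.insert(key.to_string(), fetched.clone());
    Some(fetched)
}

fn fetch_from_upstream(_key: &str) -> Option<Vec<u8>> {
    // Placeholder for e.g. retrieving an endorsement cert from a vendor
    // key distribution service.
    None
}
```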
Alternatives Considered
Risks and Mitigations
- Mitigation: The community will perform a vetting process for crates and monitor for security updates.
- Mitigation: Implement configurable memory limits and monitor usage in production.
- Mitigation: Use strict synchronization protocols and validate data during online/offline transitions.
Future Work
References