RFC: Centralized Caching Service for Trustee #784

larrydewey opened this issue Apr 28, 2025 · 7 comments

@larrydewey
Contributor

larrydewey commented Apr 28, 2025

RFC: Centralized Caching Service for Trustee

Status: Draft
Author: Larry Dewey
Created: April 28, 2025
Last Updated: April 28, 2025

Abstract

This RFC proposes the design and implementation of a centralized caching service for the Trustee Community to enhance performance and scalability across various use cases. The service aims to provide a flexible, efficient, and user-friendly caching solution. This document outlines the key considerations, including user experience, storage mechanisms, integration with existing tools, key format specifications, and functionality requirements for online/offline and ordered/unordered caching.

Motivation

While working on a recent proof-of-concept (PoC) to create a generic certificate caching service, it became evident to me that Trustee -- and the community as a whole -- would benefit from the creation of a centralized caching service. Ultimately, we all will benefit from a well-designed caching solution that improves performance, reduces redundancy, and provides a consistent interface for developers and administrators to cache latency-sensitive information. This RFC seeks to gather input from stakeholders to ensure that a robust and scalable solution is designed to meet the community's needs.

Goals

  • Design a caching service that is intuitive and efficient for administrators and users.
  • Support flexible storage options (in-memory and on-disk) to accommodate various use cases.
  • Define a standardized key format to simplify cache interactions.
  • Evaluate existing tools and crates to expedite development.
  • Ensure the service supports both online and offline functionality.
  • Provide clear guidelines for ordered and unordered caching behavior.

Non-Goals

  • Implement a fully distributed caching system, at least at this stage.
  • Support real-time cache synchronization across multiple nodes.
  • Define a specific eviction policy beyond a basic least recently used (LRU) algorithm (to be addressed in a future RFC).

Proposed Design

1. Administrator / User Experience

The caching service must prioritize an intuitive and seamless experience for both administrators configuring the service and consumers interacting with it.

  • Single Cache vs. Singleton Caches per Use Case: The service could implement a single, unified cache for all use cases or create isolated singleton caches per use case. A single cache simplifies the consumer’s interaction by providing one location for storing and retrieving data. However, isolated caches offer better separation of concerns, reducing the risk of key collisions and enabling tailored configurations (e.g., different eviction policies or storage mechanisms per cache). Some identified trade-offs include increased complexity for isolated caches versus potential performance overhead in a single cache due to contention or memory reservation.

2. Storage

The storage mechanism is obviously critical to the service’s performance and flexibility.

  • In-Memory vs. On-Disk Caching: The service should support both in-memory caching for low-latency access and on-disk caching for persistence and larger datasets. In-memory caching is ideal for frequently accessed data, while on-disk caching ensures durability across application restarts. An on-disk-only configuration may make sense in low-memory environments, and an in-memory-only configuration may make sense where memory is plentiful but storage is scarce. Making this configurable at compile time would cover these use cases (see the sketch after this list).
  • Isolation of Storage: Disk-backed storage for the caching service should be isolated from user-provided content to prevent security risks and ensure maintainability. This approach avoids conflicts with user data and simplifies backup and recovery processes.
  • Memory Limits: For in-memory caching, the service should allow administrators to specify the amount of memory to reserve (e.g., a percentage of available RAM or a fixed size). A configurable limit on the number of items in the cache (e.g., a maximum number of key-value pairs) should also be supported to prevent unbounded growth.
  • Eviction Policy: The service should default to a Least Recently Used (LRU) eviction policy for simplicity, with the option to extend support for other policies in the future.
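
As a rough sketch of the compile-time selection mentioned above, the storage backend could be chosen by a Cargo feature. The feature and type names below are placeholders, not part of the proposal:

use std::collections::HashMap;
use std::path::PathBuf;

/// Placeholder in-memory backend for memory-rich, storage-poor deployments.
#[derive(Default)]
#[allow(dead_code)]
struct InMemoryBackend {
    map: HashMap<String, Vec<u8>>,
}

/// Placeholder on-disk backend for low-memory deployments; entries would be
/// persisted under a directory isolated from user-provided content.
#[derive(Default)]
#[allow(dead_code)]
struct OnDiskBackend {
    root: PathBuf,
}

// The backend compiled into the service is chosen at build time, so a given
// deployment only pays for the storage strategy it actually needs.
#[cfg(not(feature = "on-disk"))]
type Backend = InMemoryBackend;

#[cfg(feature = "on-disk")]
type Backend = OnDiskBackend;

fn main() {
    let _backend = Backend::default();
}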

3. Existing Crates

Leveraging existing tools can accelerate development and ensure reliability.

  • LRU Crate: The lru crate is a strong candidate for implementing in-memory caching with an LRU eviction policy. It is well-maintained, performant, and aligns with the proposed default eviction strategy. It also supports generic key and value types (a short usage sketch follows this list).
  • Other Crates: The community should evaluate other caching libraries (e.g., cached or moka) for additional features like time-based expiration or async support.
  • External Crate Considerations: Including external crates introduces dependencies, which may raise concerns about maintenance, licensing, or security. The community should review and vet any suggested external crates, ensuring they are actively maintained and compatible with the project’s licensing. If there are specific concerns, please list them.
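
For illustration, a minimal usage sketch assuming a recent version of the lru crate; the key shown is a made-up namespace:identifier string:

use std::num::NonZeroUsize;

use lru::LruCache;

fn main() {
    // Bound the cache to 1024 entries; once full, the least recently used
    // entry is evicted to make room for new insertions.
    let mut cache: LruCache<String, Vec<u8>> =
        LruCache::new(NonZeroUsize::new(1024).unwrap());

    // Illustrative key following the namespace:identifier idea from section 4.
    cache.put("snp-certs:vcek-milan-abc123".to_string(), vec![0u8; 16]);

    // `get` also refreshes the entry's recency, protecting it from eviction.
    if let Some(cert) = cache.get("snp-certs:vcek-milan-abc123") {
        println!("cache hit: {} bytes", cert.len());
    }
}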

4. Format / Specification

While the implementation could accept fully generic key types, a standardized key format would be significantly better for consistency and simplicity.

  • Key Format: The service should define a generic data type for cache keys, such as a string or a structured tuple containing a namespace and identifier. For example, a key could be a string like namespace:identifier or a hash of specific fields. The format should support unique identification while allowing flexibility for use-case-specific fields (a small sketch follows).
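
One possible shape for such a key, shown purely as a sketch with illustrative names:

/// Illustrative cache key: a namespace (owned by Trustee) plus a
/// use-case-specific identifier, rendered as "namespace:identifier".
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    namespace: String,
    identifier: String,
}

impl CacheKey {
    fn new(namespace: impl Into<String>, identifier: impl Into<String>) -> Self {
        Self {
            namespace: namespace.into(),
            identifier: identifier.into(),
        }
    }
}

impl std::fmt::Display for CacheKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}:{}", self.namespace, self.identifier)
    }
}

fn main() {
    let key = CacheKey::new("snp-certs", "vcek-milan-abc123");
    assert_eq!(key.to_string(), "snp-certs:vcek-milan-abc123");
}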

Deciding Against Box<dyn Trait>

During experimentation with the previously mentioned PoC, using Box<dyn Trait> (where Trait is a marker trait) was explored but deemed unsuitable due to the following trade-offs:

  • Performance Overhead: Dynamic dispatch introduces runtime overhead, which could degrade performance in a high-throughput scenario.
  • Type Safety: Using dyn Trait reduces compile-time type checking, increasing the risk of runtime errors, and also handicaps the compiler, preventing it from making certain optimizations.
  • Complexity: The approach requires additional boilerplate to manage trait objects, complicating the codebase and API.
  • Limited Benefits: The flexibility offered by Box<dyn Trait> does not outweigh the costs, as a more static approach (e.g., generic types or enums; sketched below) can achieve similar goals with better performance and safety.
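
For comparison, a minimal sketch of what the static alternative could look like, replacing the Box<dyn Context> first-level key with a plain enum; all names are illustrative:

use std::collections::HashMap;

/// Statically known use cases replace the `Box<dyn Context>` key: no dynamic
/// dispatch, full compile-time type checking, and no trait-object boilerplate.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum UseCase {
    CertificateChain,
    ReferenceValues,
}

/// Two-level cache: use case -> (string key -> value). The value type stays
/// generic so each use case decides what it stores.
struct UseCaseCache<T> {
    cache: HashMap<UseCase, HashMap<String, T>>,
}

impl<T> UseCaseCache<T> {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn insert(&mut self, use_case: UseCase, key: String, value: T) {
        self.cache.entry(use_case).or_default().insert(key, value);
    }

    fn get(&self, use_case: UseCase, key: &str) -> Option<&T> {
        self.cache.get(&use_case)?.get(key)
    }
}

fn main() {
    let mut cache: UseCaseCache<Vec<u8>> = UseCaseCache::new();
    cache.insert(UseCase::CertificateChain, "vcek-milan-abc123".into(), vec![0u8; 16]);
    assert!(cache.get(UseCase::CertificateChain, "vcek-milan-abc123").is_some());
}

This keeps the two-level structure of the PoC while letting the compiler see every use case at compile time.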

5. Online vs. Offline Functionality

The caching service should function effectively in both online and offline scenarios.

  • Online Mode: In online mode, the service prioritizes low-latency access to the cache, leveraging in-memory storage for frequently accessed data. It should also support periodic synchronization with disk storage (if enabled) to ensure data durability. It should also provide a mechanism for retrieving externally stored (trusted) content.
  • Offline Mode: In offline mode, the service should rely on disk-backed storage to retrieve cached data when network connectivity is unavailable. The service must gracefully handle transitions between online and offline states, ensuring data consistency.
  • Configuration: Administrators should configure whether offline support is enabled and specify the storage backend (e.g., local file system, an embedded database like SQLite, or various other formats). A configuration sketch follows this list.
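
A rough illustration of what the administrator-facing configuration could look like; the field and variant names are placeholders, not a committed interface:

use std::path::PathBuf;

/// Illustrative storage backends; the real set would be decided by the community.
#[derive(Debug, Clone)]
enum StorageBackend {
    /// Keep everything in memory only.
    InMemory,
    /// Persist cache entries under a directory isolated from user content.
    LocalFs { root: PathBuf },
    /// Persist cache entries in an embedded SQLite database.
    Sqlite { path: PathBuf },
}

/// Illustrative administrator-facing configuration.
#[derive(Debug, Clone)]
struct CacheConfig {
    /// Run without fetching content from external sources (air-gapped deployments).
    offline: bool,
    /// Upper bound on in-memory usage, in bytes.
    memory_limit_bytes: u64,
    /// Maximum number of key-value pairs held in memory.
    max_entries: usize,
    /// Where durable entries are stored, if persistence is enabled.
    backend: StorageBackend,
}

fn main() {
    let config = CacheConfig {
        offline: true,
        memory_limit_bytes: 256 * 1024 * 1024,
        max_entries: 4096,
        backend: StorageBackend::Sqlite {
            path: PathBuf::from("/var/lib/trustee/cache.db"),
        },
    };
    println!("{config:?}");
}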

Alternatives Considered

  • Distributed Caching: A distributed caching system (e.g., using Redis or Memcached) was considered but deemed out of scope for the initial implementation due to complexity and infrastructure requirements.
  • Custom Eviction Policies: Supporting multiple eviction policies (e.g., LFU, FIFO) was explored but deferred to a future RFC to focus on a simple LRU-based solution.

Risks and Mitigations

  • Risk: External crate dependencies may introduce vulnerabilities.
    Mitigation: The community will perform a vetting process for crates and monitor for security updates.
  • Risk: In-memory cache could consume excessive resources.
    Mitigation: Implement configurable memory limits and monitor usage in production.
  • Risk: Offline mode may lead to data inconsistency.
    Mitigation: Use strict synchronization protocols and validate data during online/offline transitions.

Future Work

  • Support for distributed caching across multiple nodes.
  • Advanced eviction policies (e.g., LFU, time-based expiration).
  • Integration with monitoring tools for cache hit/miss rates and performance metrics.
  • Support for cache sharding to improve scalability.

References

@larrydewey larrydewey changed the title RFC RFC: Caching Service Apr 28, 2025
@larrydewey larrydewey changed the title RFC: Caching Service RFC: Centralized Caching Service for Trustee Apr 28, 2025
@fitzthum
Member

I think this makes sense.

One thing I don't see mentioned here is any consideration of high-availability. We have been thinking about how to make Trustee stateless so that it can be used with some k8s paradigms. I think your proposal should work fine with this. We'll just need to provide some caching backend that does not store things in memory.

I am assuming that you'll define a cache trait and allow people to specify various backends. The simplest one would just be a dictionary. I don't know that we actually need to worry about eviction in the first iteration, but I guess it can't hurt. I would imagine that the trait itself will look a lot like a dictionary. Probably we can just have the keys be strings, with different verifiers responsible for their naming schemes (although maybe some verifier prefix to avoid collisions). I think it's also important to define a method of pre-provisioning the cache via the trait, so that admin can provide certs in an offline environment.
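
A minimal sketch of that kind of trait, with a dictionary backend and a pre-provisioning hook (all names are illustrative, not a settled interface):

use std::collections::HashMap;

/// A dictionary-like cache trait, roughly as described above: string keys,
/// opaque byte values.
trait Cache: Send + Sync {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: String, value: Vec<u8>);

    /// Pre-provision the cache (e.g., certs supplied by an admin for an
    /// offline environment) before the service starts taking requests.
    fn preload(&mut self, entries: Vec<(String, Vec<u8>)>) {
        for (k, v) in entries {
            self.put(k, v);
        }
    }
}

/// The simplest backend: an in-memory dictionary.
#[derive(Default)]
struct MemoryCache {
    map: HashMap<String, Vec<u8>>,
}

impl Cache for MemoryCache {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        self.map.insert(key, value);
    }
}

fn main() {
    let mut cache = MemoryCache::default();
    cache.preload(vec![("snp-certs:vcek-milan-abc123".to_string(), vec![0u8; 16])]);
    assert!(cache.get("snp-certs:vcek-milan-abc123").is_some());
}

A disk-backed or database-backed implementation of the same trait would cover the stateless/high-availability case mentioned above.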

One question I have is whether there are any other parts of Trustee, besides the verifiers, that this caching backend would be exposed to.

@Xynnn007
Member

An exciting proposal! A basic question: which of the following is the goal of this cache service?

  1. Introduce a cache in the deployment of Trustee (specifically where?) to improve performance

  2. Or should the CoCo community maintain an online cache service to provide reference values for RVPS, etc.?

@larrydewey
Contributor Author

larrydewey commented Apr 29, 2025

One thing I don't see mentioned here is any consideration of high-availability. We have been thinking about how to make Trustee stateless so that it can be used with some k8s paradigms. I think your proposal should work fine with this. We'll just need to provide some caching backend that does not store things in memory.

I agree. HA is important, but it falls outside the scope of the first iteration; we definitely want to keep it on our radar, though. How quickly it may be needed is still up for debate.

I am assuming that you'll define a cache trait and allow people to specify various backends. The simplest one would just be a dictionary. I don't know that we actually need to worry about eviction in the first iteration, but I guess it can't hurt. I would imagine that the trait itself will look a lot like a dictionary. Probably we can just have the keys be strings, with different verifiers responsible for their naming schemes (although maybe some verifier prefix to avoid collisions). I think it's also important to define a method of pre-provisioning the cache via the trait, so that admin can provide certs in an offline environment.

Yes, the PoC looked like this:

main.rs

use std::sync::{Arc, LazyLock, RwLock};

#[cfg(not(feature = "ordered"))]
use std::collections::HashMap as Map;

#[cfg(feature = "ordered")]
use std::collections::BTreeMap as Map;

// `ContextData`/`ContextKey` come from context.rs and `DatasetCache` from
// dataset.rs (both shown below); `Certificate` is the use-case-specific
// value type being cached.

//...

// Here `ExampleUseCase` is an enum which implements Context...

static DATA_CACHE: LazyLock<Arc<RwLock<ContextData<Certificate>>>> = LazyLock::new(|| {
    let mut context_data = ContextData::new(Map::new());
    context_data.insert(
        ContextKey(Box::new(ExampleUseCase::ClassificationOne)),
        DatasetCache { cache: Map::new() },
    );

    context_data.insert(
        ContextKey(Box::new(ExampleUseCase::ClassificationTwo)),
        DatasetCache { cache: Map::new() },
    );

    Arc::new(RwLock::new(context_data))
});

context.rs

use std::{fmt::Debug, hash::Hash};

use crate::{Map, dataset::DatasetCache};

/// A marker trait used to classify the ordering of the second level cache.
/// This can be any arbitrary data-type, but usually will be represented
/// via an enum which implements this marker trait.
pub trait Context: Debug + Send + Sync {}

#[derive(Debug)]
pub struct ContextKey(pub Box<dyn Context>);

impl PartialEq for ContextKey {
    fn eq(&self, other: &Self) -> bool {
        format!("{:?}", self.0) == format!("{:?}", other.0)
    }
}

impl Eq for ContextKey {}

impl Hash for ContextKey {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        format!("{:?}", self.0).hash(state);
    }
}

#[derive(Debug)]
pub struct ContextData<T> {
    cache: Map<ContextKey, DatasetCache<T>>,
}

impl<T> ContextData<T> {
    pub fn new(cache: Map<ContextKey, DatasetCache<T>>) -> Self {
        Self { cache }
    }

    pub fn insert(&mut self, key: ContextKey, value: DatasetCache<T>) {
        self.cache.insert(key, value);
    }

    pub fn get(&self, key: &ContextKey) -> Option<&DatasetCache<T>> {
        self.cache.get(key)
    }

    pub fn get_mut(&mut self, key: &ContextKey) -> Option<&mut DatasetCache<T>> {
        self.cache.get_mut(key)
    }
}

dataset.rs

use std::{fmt::Debug, hash::Hash};

use crate::Map;

/// A marker trait used to classify generic input data. This trait allows for
/// use-case specific implementations of unique data which should be used as
/// the key in the second level of the cache.
pub trait Dataset: Debug + Send + Sync {}

impl<T: Dataset + ?Sized> Dataset for Box<T> {}

#[derive(Debug)]
pub struct DatasetKey(pub Box<dyn Dataset>);

impl PartialEq for DatasetKey {
    fn eq(&self, other: &Self) -> bool {
        format!("{:?}", self.0) == format!("{:?}", other.0)
    }
}

impl Eq for DatasetKey {}

impl Hash for DatasetKey {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        format!("{:?}", self.0).hash(state);
    }
}

// Here `T` will represent any datatype which should be indexed by the `DatasetKey`
#[derive(Debug)]
pub struct DatasetCache<T> {
    pub cache: Map<DatasetKey, T>,
}

This implementation removes any need for prefixing data types, as the underlying type is dynamic. While this worked, it has the drawbacks which were listed in the RFC body. It feels a little bit like squeezing a square peg into a round hole.

One question I have is whether there are any other parts of Trustee, besides the verifiers, that this caching backend would be exposed to.

Possibly. I was discussing this with @cclaudio in a Slack huddle. Potentially, there are a number of different use cases to which this might be applied. That being said, I don't want to introduce additional problems by providing an incorrect solution to the problem.

An exciting proposal! A basic question: which of the following is the goal of this cache service?

  1. Introduce a cache in the deployment of Trustee (specifically where?) to improve performance
  2. Or should the CoCo community maintain an online cache service to provide reference values for RVPS, etc.?

That is an excellent question. No official decisions have been made, thus the RFC 🙂. The initial scope of this was to provide a solution for pre-caching certificates used for attestation. However, the scope seems to have grown. I want to fully understand what the community's expectation is so we aren't re-inventing the wheel.

@ryansavino
Member

@bpradipt Can you review and let us know your thoughts here?

@tylerfanelli
Contributor

tylerfanelli commented Apr 30, 2025

Just a few thoughts:

Motivation

While working on a recent proof-of-concept (PoC) to create a generic certificate caching service, it became evident to me that Trustee -- and the community as a whole -- would benefit from the creation of a centralized caching service. Ultimately, we all will benefit from a well-designed caching solution that improves performance, reduces redundancy, and provides a consistent interface for developers and administrators to cache latency-sensitive information. This RFC seeks to gather input from stakeholders to ensure that a robust and scalable solution is designed to meet the community's needs.

Can you give an example of latency-sensitive information? For example, we know that SEV-SNP requires Trustee to fetch a specific endorsement key from a remote server (other arches also require this). Would this be an example of latency-sensitive information that can be cached? If so, what else?

  • Single Cache vs. Singleton Caches per Use Case: The service could implement a single, unified cache for all use cases or create isolated singleton caches per use case. A single cache simplifies the consumer’s interaction by providing one location for storing and retrieving data. However, isolated caches offer better separation of concerns, reducing the risk of key collisions and enabling tailored configurations (e.g., different eviction policies or storage mechanisms per cache). Some identified trade-offs include increased complexity for isolated caches versus potential performance overhead in a single cache due to contention or memory reservation.

I'd probably be more of a proponent of the Single Caches per Use Case option. There's been a concerted effort to bring "multi-tenancy" (for lack of a better word) to Trustee. Trustee being able to service multiple unrelated clients requires some isolation between reference values, policies, and resources.

There's some work I'm exploring in the RVPS (which @fitzthum discusses here) to help this w/r/t reference values, which I see this caching service as an extension of.

4. Format / Specification

While the implementation could accept fully generic key types, a standardized key format would be significantly better for consistency and simplicity.

  • Key Format: The service should define a generic data type for cache keys, such as a string or a structured tuple containing a namespace and identifier. For example, a key could be a string like namespace:identifier or a hash of specific fields. The format should support unique identification while allowing flexibility for use-case-specific fields.

namespace:identifier would work, but I think the namespace should be enforced by Trustee and not configurable by a client. Namespaces could be enforced by a specific client's INITDATA, for example.

5. Online vs. Offline Functionality

The caching service should function effectively in both online and offline scenarios.

  • Online Mode: In online mode, the service prioritizes low-latency access to the cache, leveraging in-memory storage for frequently accessed data. It should also support periodic synchronization with disk storage (if enabled) to ensure data durability. It should also provide a mechanism for retrieving externally stored (trusted) content.
  • Offline Mode: In offline mode, the service should rely on disk-backed storage to retrieve cached data when network connectivity is unavailable. The service must gracefully handle transitions between online and offline states, ensuring data consistency.
  • Configuration: Administrators should configure whether offline support is enabled and specify the storage backend (e.g., local file system, an embedded database like SQLite, or various other formats).

I'm a bit confused on what we mean by "offline" Trustee. As I understand it, Trustee should be deployed remotely on a trusted server (i.e. NOT on the same system as the client). Therefore if there's no network connectivity, clients wouldn't be able to attest either.

@larrydewey
Contributor Author

Can you give an example of latency-sensitive information? For example, we know that SEV-SNP requires Trustee to fetch a specific endorsement key from a remote server (other arches also require this). Would this be an example of latency-sensitive information that can be cached? If so, what else?

Yes, content reliant on network requests is one of the latency-sensitive scenarios to consider. Here is how I am defining "latency-sensitive": latency-sensitive information refers to any scenario where the speed of access, processing, or delivery of data is critical to the performance, user experience, or functionality of a system.

I'd probably be more of a proponent of the Single Caches per Use Case option. There's been a concerted effort to bring "multi-tenancy" (for lack of a better word) to Trustee. Trustee being able to service multiple unrelated clients requires some isolation between reference values, policies, and resources.

There's some work I'm exploring in the RVPS (which @fitzthum discusses here) to help this w/r/t reference values, which I see this caching service as an extension of.

I also am leaning in this direction, but this could be accomplished in a couple of ways. In fact, depending on the needs of the community, perhaps it would make sense to use a singleton like this:

struct FirstLevelCache {
    use_case_one: SecondLevelCache,
    use_case_two: SecondLevelCache,
    ..
    use_case_n: SecondLevelCache
}

Where each of the SecondLevelCache fields is unique. This could provide a centralized access point while also isolating the caches from each other.

namespace:identifier would work, but I think the namespace should be enforced by Trustee and not configurable by a client. Namespaces could be enforced by a specific client's INITDATA, for example.

This is an interesting idea! I will look into it a little bit further.

I'm a bit confused on what we mean by "offline" Trustee. As I understand it, Trustee should be deployed remotely on a trusted server (i.e. NOT on the same system as the client). Therefore if there's no network connectivity, clients wouldn't be able to attest either.

That is correct, for the context of this RFC. However, if the deployments are within an air-gapped network, they may all be able to communicate with each other while not being able to communicate outside of the local network. That is what I meant by offline support for the cache. Should the cache support communicating with external entities for retrieving fresh content?

@bpradipt
Member

bpradipt commented May 2, 2025

Let me add a few points based on practical experience.

In completely air-gapped environments (also referred to as disconnected or offline systems), there's no way to pull data dynamically from the internet, and there is no concept of being partially online/offline. So, as @fitzthum already mentioned, there should be support for pre-populating the cache, perhaps from a signed, portable bundle made available as an OCI artefact in the customers' internal artefact registry. Using an OCI artefact will allow internal processes to create updated artefacts that can be delivered to the disconnected Trustee system via the internal artefact registry.

There should also be tooling to identify, based on the server hardware/firmware, which certs must be downloaded to populate the cache.

Further, there should be some guidelines on cache validation and handling of invalidated certs for completely air-gapped environments.
