Strategy to Improve Transform Performance #1247
Replies: 2 comments
-
@kade-robertson Hi!
Thanks for the write up and insights. I do agree TypeBox could be doing a better job here. As for the caching, it's a good optimization. However, as noted on #1150 (and as you also pointed out), adding schema caching internal to TypeBox is generally avoided. The concern isn't just about complexity (there is some), it’s also due to TypeBox being unable to make assumptions about how users will ultimately create, mutate, or reuse schemas throughout the lifecycle of an application. Keeping a global WeakMap (or Set) to track transform schematics can be a problem. If TypeBox detects a Transform on a schema during the first pass and caches that result, problems can occur if the schema is later modified in a way that changes or invalidates that transform / schema. While such mutations would be rare (and generally considered bad practice), the possibility is still non-zero, and that uncertainty makes automatic caching unsafe. TypeBox can't really assert the presence of a Transform via WeakMap via reference key alone, it needs deep introspection to know for sure. Transform optimization is a difficult problem. Previous attempts to provide internal caching have generally been met with edge cases that have proven difficult to resolve. As of today, the current TypeCheck implementation provides a https://github.com/sinclairzx81/typebox/blob/master/src/compiler/compiler.ts#L93-L96 Moving forward, TypeBox is trying to expose functions that make it feasible to implement high throughput decode external to the library (inclusive of external caching). I feel this is generally the best direction the library can go as it provides implementers more control over validation, transforms and performance (including having a means to select performance trade-offs based on the compute overhead of some of the If possible, I think it would be good to explore a external caching implementations via the current API, then discuss ways to improve that usage. The Let me know your thoughts! |
-
This makes sense to me. I had already finished implementing these before posting here; that's when the idea of upstreaming some of the implementation came up. As far as making this more feasible to implement externally, these were the things I ran into that would aid in maintaining an external Encode/Decode implementation, roughly in order of how straightforward I'd expect them to be to implement.
-
Hello!
In using TypeBox for some particularly large data, I've run into Transform types being a bit of a bottleneck on performance. Especially relative to Check performance when using a TypeCompiler, Decode ends up being too slow in its current implementation to make sense for the use case I'm looking at. Taking a look at the implementation of transforms, I noticed an opportunity for some pretty meaningful improvements.
The primary improvement is to avoid traversing a schema if it contains no transforms. Right now, the transform methods have to "walk" the entire object and schema to apply transforms, even down branches that contain nothing to apply. You can get a pretty significant improvement just by checking whether a schema contains a transform somewhere, and returning early when a branch has none. This looks like a small modification to `Visit`.

I can't share anything substantial from the dataset I had been using when originally experimenting with this, but I made a superficial schema for benchmarking.
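The early-return idea can be sketched as follows. This is a minimal, self-contained illustration, not TypeBox's actual code: `Schema`, `hasTransform`, and `visit` are simplified stand-ins for `TSchema`, `HasTransform`, and the internal `Visit`. The visitor bails out of any branch that cannot contain a transform before walking the value.

```typescript
// Simplified stand-in for TSchema, with an optional transform function
// standing in for a Transform codec's Decode step.
type Schema = {
  transform?: (value: unknown) => unknown
  properties?: Record<string, Schema>
  items?: Schema
}

// Deep check: does any node in this schema tree carry a transform?
function hasTransform(schema: Schema): boolean {
  if (schema.transform !== undefined) return true
  if (schema.items !== undefined && hasTransform(schema.items)) return true
  return Object.values(schema.properties ?? {}).some(hasTransform)
}

function visit(schema: Schema, value: unknown): unknown {
  // Early return: nothing below this branch can change the value,
  // so skip walking it entirely.
  if (!hasTransform(schema)) return value
  if (schema.items !== undefined && Array.isArray(value)) {
    const items = schema.items
    value = value.map(item => visit(items, item))
  }
  if (schema.properties !== undefined && typeof value === 'object' && value !== null) {
    const result: Record<string, unknown> = { ...(value as Record<string, unknown>) }
    for (const [key, child] of Object.entries(schema.properties)) {
      if (key in result) result[key] = visit(child, result[key])
    }
    value = result
  }
  return schema.transform ? schema.transform(value) : value
}
```

Note that when a branch has no transforms, `visit` returns the input value by reference without allocating anything, which is where the savings on "sparse" data come from.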
Generating some input data where each array on this top-level object has 10,000 items, `HasTransform` dropped the average `Decode` and `Encode` runtime from 114ms down to 68ms. On "sparse" data (where the schema has been modified to have only a single transformed field at the top level), it drops further to 26ms. This performance improvement can be even more significant depending on the level of nesting or the quantity of data that could possibly be transformed.

There is also some room for improvement still: `HasTransform` has to traverse the input schema to look for transforms, and this can run many times per schema (i.e. for an array with many items, we are forced to check the same schema for every single item). This should be pretty safe to cache within a Transform step, using a `WeakMap<TSchema, boolean>` to store the result of `HasTransform`. (This could just be implemented by passing the cache down to each function involved; a shared map is simply easier to demonstrate.)
This is faster, but the gain is less significant than before: the "dense" example from above only drops from 68ms to 53ms, and the sparse example stays the same. Personally, I'm using these together as custom Encode/Decode steps, and the change is surprisingly significant (~2 orders of magnitude faster) for a fairly complex and nested schema.

I'd be interested in opening a PR for this change (or something that achieves the same effect), as it would save me from maintaining a separate Encode/Decode implementation (which requires cloning some other files to include internal TypeBox code that doesn't get exported), and which then has to be used directly rather than via customizing Parse or the Encode/Decode methods on a `TypeCompiler` result, since those won't use the "fast" version. I noticed, however, that in #1150 you had mentioned avoiding caching previously due to complexity concerns, so maybe that aspect is something you aren't looking to bring into the main project.