`wasmparser`: Performance regressions since `v0.100.2` #1701

Robbepop · 2024-07-26T16:09:25Z

Due to technical reasons I had to use the (by now) ancient wasmparser 0.100.2 in Wasmi and was using a fork called wasmparser-nostd. The fork had no significant impact on the performance.

Recently I updated Wasmi to use the newest wasmparser v0.214.0 and unfortunately saw massive performance regressions, especially for Wasm validation. Many Wasmi benchmarks regressed by 15-35% and some even by 45% compared to the Wasmi that uses the ancient wasmparser v0.100.2.

The Wasmi upgrade PR is here: wasmi-labs/wasmi#1141

Maybe I did some mistakes when upgrading which explain the performance regressions.

Since Wasmi is very focused on speedy translation times those performance regressions unfortunately are a show stopper for using the new wasmparser versions.

The text was updated successfully, but these errors were encountered:

alexcrichton · 2024-07-26T18:13:55Z

Do you have a link to the module with the largest regression? That'd probably be the easiest to start with in terms of evaluating validation performance on that module and seeing what sticks out in perf

Robbepop · 2024-07-26T18:48:44Z

Do you mean Wasm module?

From what I saw tiny_keccak is a good contender. But there were others.
To benchmark tiny_keccak translation test cases in Wasmi, use:

cargo bench --bench benches translate/tiny_keccak/

The tiny_keccak benchmark can be found here:
https://github.com/wasmi-labs/rust-benchmarks/tree/main/cases/tiny_keccak

Update

I just found one reason why it regressed: I benchmarked without setting the no-hash-maps feature. Generally BTreeMap is faster than HashMap for Wasmi uses and the old wasmparser-nostd fork alsways uses BTreeMaps. With this feature set local benchmarks still indicate regressions but they are less severe, ranging between 10-25% instead of 15-35%. The unchecked (no validation) case now even has slightly improved performance which indicates very strongly to me that the regressions that we see come from the Wasm validation.

alexcrichton · 2024-07-26T20:37:39Z

Ok I've been poking at this a bit and so far what I've found is that #701, the first commit after 0.100.0, led to a precipitous drop in performance. That was known at the time however. There are other smaller regressions here and there but they're looking to be more difficult to pin down.

Profiling the code itself does not show any unexpected hot spots. That means that it's just sort of spread out a lot.

Doing various tweaks seems to help here and there but nothing is a major improvement. For example outlining string-based errors to their own #[cold] function sometimes helps and sometimes doesn't.

One improvement that has shown promise is skipping initialization checking for locals entirely. This was first added in the function-references proposal and later proposals like gc rely on it as well. This yields a ~5% gain (ish) but requires completely removing the check. That means that only way to achieve this would be to add compile-time features to wasmparser.

Overall my guess is that the wasm language has just gotten more complicated over time which has slowed down the validator. I don't know of a great way to handle this other than to add a bunch of Cargo compile-time features for each wasm proposal. Doing that is not something I think would be easy to design but I think would provide the wins necessary here.

Robbepop · 2024-07-26T21:14:03Z

Thank you for the investigation @alexcrichton !

If your assessment is correct I have to admit that it is a bit worrysome to me that the Wasm spec seems to develop into a direction that may not be suitable for all of its original users and stake holders.

I remember that you said that function-references are going to incur a performance overhead and in hindsight I was kinda optimistic that this performance penalty will surely be dealt with over time.

My assumption of how Wasm features are intended to work was that they should (ideally) not significantly influence the performance if they are unused. So only if I actually were to use the function-references proposal in a significant way in my Wasm blob, I'd "suffer" from the performance penalties that are associated to its additional complexity. Though, I totally understand that it is probably already hard enough to keep up with all the Wasm developments.

Conditional compilation wouldn't solve the problem from my point of view since eventually all Wasm runtimes (including Wasmi) want to adopt all stabilized features anyways. The Wasm standard is not modularised at the moment and thus does not allow to be picky.

Robbepop · 2024-07-27T18:34:50Z

I just ran flamegraph on the translate/tiny_keccak/checked/lazy-translation test case:

main:

PR:

From what it looks I can confirm your assessment that multiple parts of the wasmparser validation pipeline seem to have regressed performance. However, especially notable parts (from the looks) are:

Validator::type_section
ModuleParser::process_end
VisitOperator::visit_local_get
VisitOperator::visit_end

One improvement that has shown promise is skipping initialization checking for locals entirely.

I forgot to ask last time: What exactly does this mean? What is initialization checking of locals and where can I find the code?

Looking just at translate_operators we can also see some significant shifts:
main:

PR:

Robbepop · 2024-07-29T09:43:28Z

In my last performance assessment I made a mistake by not also enabling no-hash-maps for the main branch when conducting benchmarks between main and the PR which made the results look better for the PR than they actually are. The corrected results look even worse:

We see a performance regression between 15-45%. Still, unchecked workloads without Wasm validation are not regressed.

alexcrichton · 2024-07-29T15:09:13Z

I would be hesitant to generalize wasm-tools's implementation to the entire spec development and other engines in the sense that we could just perhaps be bitten by our own decisions. This could be wasmparser-specific and not affect other implementations, I'm not sure. When I've benchmarked historically, for example, I believe that v8's validator (via node) has consistently been faster than wasmparser's. Basically I don't think wasmparser was ever at the peak of validation performance.

I also saw similar hot spots as you locally, but I was unable to figure out how best to optimize them and local attempts weren't bearing much fruit.

The initialization checking looks like this. That's an array lookup plus a boolean test of what's in the array. My guess is that everything is a well-predicted branch and in-cache since locals typically go to the same local. The 5% perf win I got (ish) was removing those checks entirely. That's not spec-compliant so we can't land it but it was the best I could do.

Robbepop · 2024-07-29T18:31:50Z

You are right that this might very well be just an implementation detail that can be fixed.

Thanks a lot for the link to the local.get checks, I will have a look at the codebase to see whether it is possible to further apply some optimizations without removing those checks. E.g. I could imagine that we could use in-place bitfields for up to 32 (or 64) locals (common case) instead of Vec<bool> etc.. Maybe there are more things that could be improved. I will experiment with this a bit and come back.

eqrion · 2024-07-29T18:48:39Z

If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for if the local index is less than where the first non-nullable local is and skip any boolean checking after that. Then we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has been set [2].

So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).

[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352
[2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285

Robbepop · 2024-07-29T18:51:15Z

If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for if the local index is less than where the first non-nullable local is and skip any boolean checking after that. Then we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has been set [2].

So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).

[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352 [2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285

Thanks so much for sharing this information! That's invaluable. :)

alexcrichton · 2024-07-29T19:18:52Z

Indeed thanks @eqrion! I suspect that both SpiderMonkey (and possibly v8 too) would be good sources of inspiration for algorithmic changes to the validator in wasmparser.

Around locals specifically another thing that wasmparser does is it doesn't maintain Vec<ValType> for locals and instead has a scheme where the first N locals are stored in Vec<ValType> but afterwards they're stored in Vec<(u32, ValType)> for a more compressed representation like wasm has. This representation is probably too-clever-by-half and probably isn't really benefitting all that much. Simplifying that to just Vec<ValType> would probably help clean up the code a bit and remove a branch here and there.

Another possibility is that there is currently zero unsafe code in the validator which while is a great property I don't personally consider set in stone forever necessarily. For example using the idea from SpiderMonkey checking for local initialization the first index of a non-nullable local could be the number of locals in a function if they're all nullable. That way after that "bounds check" the access into the actual types array could be done in an unchecked fashion to have just a single condition in the validation of local.get. This could also very well be "too clever by half" and introducing unsafe code is definitely something we'd want to do carefully.

Robbepop · 2024-10-06T14:43:02Z

I updated the Wasmi PR to update wasmparser to v0.218.0 and improved/extended translation benchmarks and reran them to get a better understanding of where performance regressions are located and could be fixed.

Despite the performance regressions I consider merging the PR to finally make use of the newer wasmparser versions but I am keen on fixing the performance regressions, especially for the important lazy modes.

Link: wasmi-labs/wasmi#1141 (comment)

alexcrichton · 2024-10-07T14:37:31Z

FWIW I'm poking around in #1845 about slimming down wasmparser optionally at compile-time and one thing I was thinking of was adding a GC-like feature which might help recover some performance here. I'm not 100% sure it'll work though nor if it's worth it, but figured it's worth poking around at least.

Robbepop · 2024-10-07T15:15:59Z

Oh thanks a lot for the information! I like #1845 a lot even if it won't improve performance, reducing compile times is also a huge gain already. :)

Robbepop · 2024-10-17T10:41:40Z

I opened #1870 for minor gains in the code section validation.

However, as the Wasmi benchmarks suggest, most of the performance regressions are in the Wasm validation parts outside the code section. I have to dig deeper to see which parts of the Wasm validator are the main offenders. @alexcrichton any clues?

Robbepop · 2024-10-19T12:20:27Z

@alexcrichton I reran flamegraph on current Wasmi main compared to the PR to update wasmparser to find out the most regressing part of wasmparser since v0.100.0.

It is the wasmparser::Validator::type_section call that regressed by over 130%, thus requiring more than double the time compared to v0.100.0 (wasmparser-nostd` fork).

From what I can see, the following sub-parts take the most time in the benchmarks:

TypeList::intern_canonical_rec_group (2.48%)
<RecGroup as FromReader>::from_reader (2.15%)
<RecGroup as Ord>::cmp (1.06%)
InternRecGroup::check_subtype (0.88%)

All of which seem to be related to the gc Wasm proposal.

In comparison the sub-parts that take the most time for wasmparser v0.100.0 are:

<FuncType as FromReader>::from_reader (1.24%)
Module::add_type (0.82%)

It seems that the (computationally expensive) recgroup canonicalization and internment is always performed, even for type sections that do not have recursive definitions - which is the overwhelmingly common case, especially for non-GC wasm. If this is true, I think we could optimize this by only doing this canonicalization when necessary.

I.e. use an efficient way to register types that also detects (potentially) recursive references within them, and only use the expensive recgroup canonicalization and internment when needed as a fallback. If a detection of potentially recursive references cannot be implemented easily we could also simply use the WasmFeatures::GC (or WasmFeatures::GC_TYPES) flag.

This way users only pay for what they use.

alexcrichton · 2024-10-22T19:29:34Z

Thanks for measuring this @Robbepop! That intuitively makes sense to me and we historically have spent most of our time looking at optimizing the operator validator rather than the general module validator.

Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.

alexcrichton · 2024-10-22T19:33:52Z

(sorry still recovering from jetlag so responding to the rest of your comment now too)

I think we could optimize this by only doing this canonicalization when necessary

Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use WasmFeatures as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.

Robbepop · 2024-10-22T20:33:51Z

Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.

There already is a lots-of-types.wasm benchmark test case in the wasmparser crate. I took a look at it and it consists of tons of empty (func) types, which probably won't take a burden on the type section validation. I would rename this test case to lots-of-empty-fn-types.wasm and add another with lots of unique function types, named lots-of-unique-fn-types.wasm.

Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use WasmFeatures as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.

I too think it is better not to use the WasmFeatures if possible. Another idea I had, was to make internment and canonicalization a lazy process and only ever do whatever is computationally expensive just when it is actually needed for a particular type. In contrast to the rec-type detection idea, this kinda turns the eager validation into a lazy one. Not sure if possible, but I think this would be an even cleaner design.

alexcrichton · 2024-10-22T21:05:12Z

All that sounds reasonable to me yeah. I haven't dug into why this is so much slower than before myself so I can't speak too much as to whether it seems like the best route forward, but I'm happy to defer to you as well.

Robbepop · 2024-10-22T21:10:40Z

So far I have only digged into <RecGroup as FromReader>::from_reader regressions. I tried to lightly modify the code, e.g. with some annotations, inline, #[cold], extracting functions, putting common cases in front but without any major performance wins while the code became much uglier, thus I decided not to open a PR.

Robbepop · 2024-11-13T13:13:40Z

@alexcrichton I have been looking into the type section validation of the wasmparser crate recently and was trying to wrap my head around there. I have to admit, that I am swimming a bit in rough waters to coming up with a general implementation with lazy behavior to optimize the common case. I think it makes sense to start with a simple, optimization for common case and then elaborate further.

I implemented specialized type section validation here: #1906

Robbepop mentioned this issue Oct 17, 2024

wasmparser: Create abstraction for local initialization (simplify future optimizations) #1870

Merged

This was referenced Oct 23, 2024

Add benchmark test case .wasm with lots of unique fn types #1880

Merged

Replace wasmparser-nostd fork with upstream wasmparser wasmi-labs/wasmi#1141

Open

Robbepop linked a pull request Nov 13, 2024 that will close this issue

wasmparser: optimize type section valiadation by specializing over Wasm feature subset #1906

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`wasmparser`: Performance regressions since `v0.100.2` #1701

`wasmparser`: Performance regressions since `v0.100.2` #1701

Robbepop commented Jul 26, 2024 •

edited

Loading

alexcrichton commented Jul 26, 2024

Robbepop commented Jul 26, 2024 •

edited

Loading

alexcrichton commented Jul 26, 2024

Robbepop commented Jul 26, 2024 •

edited

Loading

Robbepop commented Jul 27, 2024 •

edited

Loading

Robbepop commented Jul 29, 2024 •

edited

Loading

alexcrichton commented Jul 29, 2024

Robbepop commented Jul 29, 2024 •

edited

Loading

eqrion commented Jul 29, 2024

Robbepop commented Jul 29, 2024

alexcrichton commented Jul 29, 2024

Robbepop commented Oct 6, 2024

alexcrichton commented Oct 7, 2024

Robbepop commented Oct 7, 2024 •

edited

Loading

Robbepop commented Oct 17, 2024 •

edited

Loading

Robbepop commented Oct 19, 2024 •

edited

Loading

alexcrichton commented Oct 22, 2024

alexcrichton commented Oct 22, 2024

Robbepop commented Oct 22, 2024 •

edited

Loading

alexcrichton commented Oct 22, 2024

Robbepop commented Oct 22, 2024 •

edited

Loading

Robbepop commented Nov 13, 2024 •

edited

Loading

wasmparser: Performance regressions since v0.100.2 #1701

wasmparser: Performance regressions since v0.100.2 #1701

Comments

Robbepop commented Jul 26, 2024 • edited Loading

alexcrichton commented Jul 26, 2024

Robbepop commented Jul 26, 2024 • edited Loading

Update

alexcrichton commented Jul 26, 2024

Robbepop commented Jul 26, 2024 • edited Loading

Robbepop commented Jul 27, 2024 • edited Loading

Robbepop commented Jul 29, 2024 • edited Loading

alexcrichton commented Jul 29, 2024

Robbepop commented Jul 29, 2024 • edited Loading

eqrion commented Jul 29, 2024

Robbepop commented Jul 29, 2024

alexcrichton commented Jul 29, 2024

Robbepop commented Oct 6, 2024

alexcrichton commented Oct 7, 2024

Robbepop commented Oct 7, 2024 • edited Loading

Robbepop commented Oct 17, 2024 • edited Loading

Robbepop commented Oct 19, 2024 • edited Loading

alexcrichton commented Oct 22, 2024

alexcrichton commented Oct 22, 2024

Robbepop commented Oct 22, 2024 • edited Loading

alexcrichton commented Oct 22, 2024

Robbepop commented Oct 22, 2024 • edited Loading

Robbepop commented Nov 13, 2024 • edited Loading

`wasmparser`: Performance regressions since `v0.100.2` #1701

`wasmparser`: Performance regressions since `v0.100.2` #1701

Robbepop commented Jul 26, 2024 •

edited

Loading

Robbepop commented Jul 26, 2024 •

edited

Loading

Robbepop commented Jul 26, 2024 •

edited

Loading

Robbepop commented Jul 27, 2024 •

edited

Loading

Robbepop commented Jul 29, 2024 •

edited

Loading

Robbepop commented Jul 29, 2024 •

edited

Loading

Robbepop commented Oct 7, 2024 •

edited

Loading

Robbepop commented Oct 17, 2024 •

edited

Loading

Robbepop commented Oct 19, 2024 •

edited

Loading

Robbepop commented Oct 22, 2024 •

edited

Loading

Robbepop commented Oct 22, 2024 •

edited

Loading

Robbepop commented Nov 13, 2024 •

edited

Loading