Monitor Base Migration Performance Over Time #12432

Open
rjan90 opened this issue Sep 4, 2024 · 3 comments

rjan90 commented Sep 4, 2024

Summary

This issue tracks the performance of "base migration" (i.e., a migration with no logic other than changing CodeCIDs) over time. The purpose is to monitor the impact of actor growth on base migration times and to identify potential performance issues as the network grows.

Motivation

The number of actors in the state tree increases over time. This growth can affect the performance of state migrations, even in the simplest case of a base migration. By regularly monitoring base migration performance, we can:

  1. Track the impact of actor growth on migration times
  2. Identify trends and potential performance bottlenecks
  3. Inform decisions on optimization efforts for state migrations

User/Customer

  • Node operators
  • Protocol developers

Tracking Process

We will follow our existing processes and documentation for when and how to run these benchmarks:

  1. Timing: Benchmarks are run after each network skeleton has been created, as part of the burndown list for a network upgrade.
  2. Methodology: The benchmarking process follows the steps outlined in our documentation, with two distinct modes:
    a. Offline Mode: Measure the base migration performance in isolation.
    b. Online Mode: Measure the migration while the node is syncing to capture the impact of real-world conditions.

Notes

This issue is related to the ongoing discussions and findings in the following issues:

Example Benchmark

Suggested results comment template that you can copy and paste:

Meta

  • Date: [Insert date]
  • Skeleton Version: [Insert version]
  • Network: [calibration|mainnet]
  • Actor Count: [Insert total actor count]

Hardware specifications of the machine running the benchmark:

Disk read speed should be the main factor in base migration time, but we record the other specs as well.

  • CPU: [Insert CPU model and number of cores/threads]
  • RAM: [Insert total RAM]
  • Disk: [Insert disk type (SSD/NVMe) and capacity]

Benchmark

Offline Mode

  • Migration Times:
    • Without pre-migration/cache:
      [Insert time]
    • With pre-migration/cache:
      [Insert time]
  • Peak Memory Usage: [Insert peak memory usage]
  • Avg CPU Utilization: [Insert average CPU utilization]

Online Mode

  • Migration Time: [Insert time]
  • Peak Memory Usage: [Insert peak memory usage]
  • Avg CPU Utilization: [Insert average CPU utilization]
@rjan90 rjan90 added this to FilOz Sep 4, 2024
@rjan90 rjan90 moved this to ⌨️In Progress in FilOz Sep 4, 2024

rjan90 commented Sep 4, 2024

Date: 2024-09-04

Skeleton Version: 24

Actor Count: 3192139

Hardware specifications of the machine running the benchmark:

  • CPU: AMD EPYC 7F32 8-Core Processor
  • RAM: 512GiB
  • Disk: 2x SAMSUNG MZWLJ3T8HBLS-00007 in RAID0 / ~7TiB

Offline Mode

  • Migration Times:
    • Without pre-migration/cache:
      completed round actual (without cache), took  25.818454017s
    • With pre-migration/cache:
      completed premigration, took  35.893798167s
      completed round actual (with cache), took  24.801791185s
  • Peak Memory Usage: 11.7 GiB
  • Avg CPU Utilization: 19.831%

Online Mode

  • Migration Time:
COMPLETED pre-migration	{"duration": 52.036714995}
COMPLETED migration	{"height": "4238500", "from": "bafy2bzacedv7ul3odsltrpfhyhtjddi5hxdu3zbwlfkuyyppbcm4oakdoxsbc", "to": "bafy2bzacedv7ul3odsltrpfhyhtjddi5hxdu3zbwlfkuyyppbcm4oakdoxsbc", "duration": 34.553512928}
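
Since the actor count differs between runs, it may help to normalize each result before comparing it with later benchmarks. The following is a minimal sketch, not part of Lotus or its tooling: the function names are hypothetical, and it assumes that dividing the actor count by the round time is a meaningful first-order comparison. The figures are copied from the offline-mode results above.

```python
# Figures copied from the offline-mode results in this comment.
ACTOR_COUNT = 3_192_139
WITHOUT_CACHE_S = 25.818454017
WITH_CACHE_S = 24.801791185


def actors_per_second(actor_count: int, seconds: float) -> float:
    """Throughput of one migration round, in actors processed per second."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return actor_count / seconds


def cache_saving(without_cache_s: float, with_cache_s: float) -> float:
    """Fraction of the uncached round time saved by the pre-migration cache."""
    if without_cache_s <= 0:
        raise ValueError("duration must be positive")
    return 1.0 - with_cache_s / without_cache_s


if __name__ == "__main__":
    uncached = actors_per_second(ACTOR_COUNT, WITHOUT_CACHE_S)
    cached = actors_per_second(ACTOR_COUNT, WITH_CACHE_S)
    saving = cache_saving(WITHOUT_CACHE_S, WITH_CACHE_S)
    print(f"without cache: {uncached:,.0f} actors/s")
    print(f"with cache:    {cached:,.0f} actors/s")
    print(f"round time saved by cache: {saving:.1%}")
```

Tracking actors per second alongside the raw time would make it easier to tell whether a slower run reflects state growth or a change in hardware or code.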

@rjan90 rjan90 self-assigned this Sep 4, 2024

rjan90 commented Sep 4, 2024

Date: 2024-09-04

Skeleton Version: 24

Actor Count: 3192329

Hardware specifications of the machine running the benchmark:

  • CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
  • RAM: 128GiB
  • Disk: Samsung_SSD_860_PRO_2TB

Offline Mode

  • Migration Times:
    • Without pre-migration/cache:
      completed round actual (without cache), took  45.59548506s
    • With pre-migration/cache:
      completed premigration, took  44.481500927s
      completed round actual (with cache), took  34.087593577s
  • Peak Memory Usage: 16.04 GiB
  • Avg CPU Utilization: 95.19%

Online Mode

  • Migration Time:
2024-09-04T16:40:58.072+0200	WARN	statemgr	stmgr/forks.go:263	COMPLETED pre-migration	{"duration": 57.717825161}
2024-09-04T17:41:15.025+0200	WARN	statemgr	stmgr/forks.go:211	COMPLETED migration	{"height": "4238600", "from": "bafy2bzacebtz2rk6v5xgjdbfzqczzj5h3r3wetzw7sitmybg5zwyg7nmka3sc", "to": "bafy2bzacebtz2rk6v5xgjdbfzqczzj5h3r3wetzw7sitmybg5zwyg7nmka3sc", "duration": 41.692580379}
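
For anyone collecting these online-mode numbers across runs, the durations can be pulled out of the node log with a small script. This is only a sketch: it assumes the log layout shown above (the text `COMPLETED <event>` followed by a JSON object at the end of the line), and `parse_completed` is a hypothetical helper, not part of Lotus.

```python
import json
from typing import Optional, Tuple


def parse_completed(line: str) -> Optional[Tuple[str, dict]]:
    """Return (event, fields) for a 'COMPLETED ...' log line, or None.

    Assumes the event name is followed by a JSON object that runs to the
    end of the line, as in the log lines quoted in this comment.
    """
    marker = "COMPLETED "
    start = line.find(marker)
    if start == -1:
        return None
    brace = line.find("{", start)
    if brace == -1:
        return None
    event = line[start + len(marker):brace].strip()
    try:
        fields = json.loads(line[brace:])
    except json.JSONDecodeError:
        return None
    return event, fields


# Sample lines copied from this comment (fields are tab-separated;
# the state root CIDs are left out of the sample for brevity).
SAMPLE = [
    '2024-09-04T16:40:58.072+0200\tWARN\tstatemgr\tstmgr/forks.go:263\t'
    'COMPLETED pre-migration\t{"duration": 57.717825161}',
    '2024-09-04T17:41:15.025+0200\tWARN\tstatemgr\tstmgr/forks.go:211\t'
    'COMPLETED migration\t{"height": "4238600", "duration": 41.692580379}',
]

if __name__ == "__main__":
    for raw in SAMPLE:
        parsed = parse_completed(raw)
        if parsed is not None:
            event, fields = parsed
            print(f"{event}: {fields['duration']:.3f}s")
```

Lines that do not contain a `COMPLETED` event, or whose trailing payload is not valid JSON, are skipped rather than raising, so the helper can be run over a whole log file.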


BigLep commented Sep 13, 2024

@rjan90 : I made some updates to the template in the issue description. They were mostly cosmetic, but the key callouts were:

  1. Capturing offline results both with and without the pre-migration cache.
  2. Capturing which network the benchmark is for.

Feel free to adjust if I got any of that wrong.
