apache · logan-keede · Mar 11, 2025
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -23,19 +23,18 @@ This crate contains benchmarks based on popular public data sets and
 open source benchmark suites, to help with performance and scalability
 testing of DataFusion.
 
-
 ## Other engines
 
 The benchmarks measure changes to DataFusion itself, rather than
 its performance against other engines. For competitive benchmarking,
 DataFusion is included in the benchmark setups for several popular
 benchmarks that compare performance with other engines. For example:
 
-* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
-* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
+- [ClickBench][clickbench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+- [H2o.ai `db-benchmark`][h2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
 
-[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
-[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
+[clickbench]: https://github.com/ClickHouse/ClickBench/tree/main
+[h2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
 
 # Running the benchmarks
 
@@ -68,8 +67,10 @@ Create / download a specific dataset (TPCH)
 Data is placed in the `data` subdirectory.
 
 ## Select join algorithm
+
 The benchmark runs with `prefer_hash_join == true` by default, which enforces HASH join algorithm.
 To run TPCH benchmarks with join other than HASH:
+
 ```shell
 PREFER_HASH_JOIN=false ./bench.sh run tpch
 ```
@@ -363,9 +364,9 @@ done dropping runtime in 83.531417ms
 
 ## ClickBench
 
-The ClickBench[1] benchmarks are widely cited in the industry and
+The ClickBench[1][1] benchmarks are widely cited in the industry and
 focus on grouping / aggregation / filtering. This runner uses the
-scripts and queries from [2].
+scripts and queries from [2][2].
 
 [1]: https://github.com/ClickHouse/ClickBench
 [2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
@@ -380,7 +381,7 @@ logs.
 
 Example
 
-dfbench parquet-filter  --path ./data --scale-factor 1.0
+dfbench parquet-filter --path ./data --scale-factor 1.0
 
 generates the synthetic dataset at `./data/logs.parquet`. The size
 of the dataset can be controlled through the `size_factor`
@@ -412,6 +413,7 @@ Iteration 2 returned 1781686 rows in 1947 ms
 ```
 
 ## Sort
+
 Test performance of sorting large datasets
 
 This test sorts a synthetic dataset generated during the
@@ -431,17 +433,21 @@ Sort integration benchmark runs whole table sort queries on TPCH `lineitem` tabl
 See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.
 
 ### Sort TPCH Benchmark Example Runs
+
 1. Run all queries with default setting:
+
 ```bash
  cargo run --release --bin dfbench -- sort-tpch -p '....../datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
 ```
 
 2. Run a specific query:
+
 ```bash
  cargo run --release --bin dfbench -- sort-tpch -p '....../datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
 ```
 
 3. Run all queries with `bench.sh` script:
+
 ```bash
 ./bench.sh run sort_tpch
 ```
@@ -476,116 +482,147 @@ When the memory limit is exceeded, the aggregation intermediate results will be
 
 External aggregation benchmarks run several aggregation queries with different memory limits, on TPCH `lineitem` table. Queries can be found in [`external_aggr.rs`](src/bin/external_aggr.rs).
 
-This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.
+This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.
 
 ### External Aggregation Example Runs
+
 1. Run all queries with predefined memory limits:
+
 ```bash
 # Under 'benchmarks/' directory
 cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
 ```
 
 2. Run a query with specific memory limit:
+
 ```bash
 cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
 ```
 
 3. Run all queries with `bench.sh` script:
+
 ```bash
 ./bench.sh data external_aggr
 ./bench.sh run external_aggr
 ```
 
-
 ## h2o benchmarks for groupby
 
 ### Generate data for h2o benchmarks
+
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
 1. Generate small data (1e7 rows)
+
 ```bash
 ./bench.sh data h2o_small
 ```
 
-
 2. Generate medium data (1e8 rows)
+
 ```bash
 ./bench.sh data h2o_medium
 ```
 
-
 3. Generate large data (1e9 rows)
+
 ```bash
 ./bench.sh data h2o_big
 ```
 
 ### Run h2o benchmarks
+
 There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
+
 1. Run small data benchmark
+
 ```bash
 ./bench.sh run h2o_small
 ```
 
 2. Run medium data benchmark
+
 ```bash
 ./bench.sh run h2o_medium
 ```
 
 3. Run large data benchmark
+
 ```bash
 ./bench.sh run h2o_big
 ```
 
 4. Run a specific query with a specific data path
 
 For example, to run query 1 with the small data generated above:
+
 ```bash
 cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv  --query 1
 ```
 
 ## h2o benchmarks for join
 
 ### Generate data for h2o benchmarks
+
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
 1. Generate small data (4 table files, the largest is 1e7 rows)
+
 ```bash
 ./bench.sh data h2o_small_join
 ```
 
-
 2. Generate medium data (4 table files, the largest is 1e8 rows)
+
 ```bash
 ./bench.sh data h2o_medium_join
 ```
 
 3. Generate large data (4 table files, the largest is 1e9 rows)
+
 ```bash
 ./bench.sh data h2o_big_join
 ```
 
 ### Run h2o benchmarks
+
 There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
+
 1. Run small data benchmark
+
 ```bash
 ./bench.sh run h2o_small_join
 ```
 
 2. Run medium data benchmark
+
 ```bash
 ./bench.sh run h2o_medium_join
 ```
 
 3. Run large data benchmark
+
 ```bash
 ./bench.sh run h2o_big_join
 ```
 
 4. Run a specific query with specific join data paths; the data paths include 4 table files.
 
 For example, to run query 1 with the small data generated above:
+
 ```bash
 cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
 ```
+
+### Collect Benchmarks
+
+Collect benchmark results for the current `main` branch and the 5 previous major releases.
+
+```bash
+./collect_bench.sh [benchmark_name](optional)
+```
+
+Note: `benchmark_name` is optional and can be any benchmark defined in `bench.sh`. It defaults to `all`, just like `bench.sh`.
+
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
diff --git a/benchmarks/collect_bench.sh b/benchmarks/collect_bench.sh
@@ -0,0 +1,51 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# This script is meant for developers of DataFusion -- it is runnable
+# from the standard DataFusion development environment and uses cargo,
+# etc. It orchestrates gathering data and running the benchmark binary to
+# collect benchmarks from the current main and the last 5 major releases.
+
+trap 'git checkout main' EXIT #checkout to main on exit
+ARG1=$1
+
+main(){
+
+git fetch upstream main
+git checkout main
+
+# get current major version 
+output=$(cargo metadata --format-version=1 --no-deps | jq '.packages[] | select(.name == "datafusion") | .version')
+major_version=$(echo "$output" | grep -oE '[0-9]+' | head -n1)
+
+# run for current main
+echo "current major version: $major_version"  
+export RESULTS_DIR="results/main"
+./bench.sh run $ARG1
+
+# run for last 5 major releases
+for i in {1..5}; do
+    echo "running benchmark on  $((major_version-i)).0.0"
+    git fetch upstream $((major_version-i)).0.0
+    git checkout $((major_version-i)).0.0
+    export RESULTS_DIR="results/$((major_version-i)).0.0"
+    ./bench.sh run $ARG1
+done
+}
+
+main
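
Review note: the release loop in `collect_bench.sh` derives the tags to check out by taking the major version and counting down five times. A minimal sketch of that derivation in isolation may make the logic easier to check without touching git or cargo. The helper name `previous_release_tags` and the sample version `46.0.0` are assumptions made for illustration and do not appear in the PR; the sketch also assumes the version string arrives without the surrounding quotes that plain `jq` output carries.

```shell
#!/usr/bin/env bash
# Sketch only: print the release tags that a countdown from the current
# major version would produce. `previous_release_tags` is a hypothetical
# helper, not part of collect_bench.sh.
previous_release_tags() {
    local version="$1"        # e.g. "46.0.0" (sample value, an assumption)
    local count="${2:-5}"     # how many earlier major releases to list
    local major="${version%%.*}"   # strip everything after the first dot
    local i
    for ((i = 1; i <= count; i++)); do
        echo "$((major - i)).0.0"
    done
}

# Lists the five major releases before the sample version, one per line.
previous_release_tags "46.0.0"
```

Keeping the derivation in one function means the tag format (`N.0.0`) is written once, so a change in tagging convention would need only one edit.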