Message-ID: <20250623105739.GR1613200@noisy.programming.kicks-ass.net>
Date: Mon, 23 Jun 2025 12:57:39 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Gary Yang <gary.yang@...tech.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [sched/eevdf] llama-bench performance drop
On Mon, Jun 23, 2025 at 06:27:18PM +0800, Gary Yang wrote:
> Problem: The llama-bench test uses the CPU to run an AI model. It can create
> a lot of threads, so it is a CPU-bound process.
How many threads per CPU? Typically compute workloads stick with 1
thread per CPU.
> It outputs
> three scores. The 1st score is primarily influenced by CPU frequency, the 2nd
> score primarily by memory and the L1/L2 caches, and the 3rd score by both
> CPU frequency and memory.
>
> When we run the llama-bench test on an ARM A720 with kernel 6.1, it outputs three scores:
> root# taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf
> -pg 128,128 -t 8
> | model | size | params | backend | threads | test | t/s |
> | ------------- | --------: | ------: | ------- | ------: | ----------: | -----------: |
>
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp512 | 58.67 ± 3.08 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | tg128 | 9.32 ± 0.22 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp128+tg128 | 15.10 ± 1.08 |
Your taskset has 8 CPUs listed, and the threads column has 8. So 1
thread per CPU. This should be a boring workload. Are they sleeping
frequently to sync up or something?
> build: 14d627f4 (5288)
>
> When we run the llama-bench test on an ARM A720 with kernel 6.6.89, it outputs three scores:
> root# taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf
> -pg 128,128 -t 8
> | model | size | params | backend | threads | test | t/s |
> | --------------|--------: |------: | ------- | ------: | ----------: | -----------: |
>
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp512 | 49.89 ± 3.83 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | tg128 | 2.66 ± 1.98 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp128+tg128 | 1.92 ± 0.45 |
>
> build: 14d627f4 (5288)
>
> We find that the 2nd and 3rd scores are both lower than on kernel 6.1. While
> analyzing this issue, we noted a new feature in kernel 6.6: it introduces the
> EEVDF scheduler in place of the CFS used in kernel 6.1. After reverting the
> EEVDF patches below, the two scores improve, to almost the same values as on
> kernel 6.1.
>
> 9ef5bc6e07a5 Revert "sched/fair: Commit to EEVDF"
> a21eaad7417a Revert "sched/eevdf: Curb wakeup-preemption"
> 2cf7e10af999 Revert "sched/eevdf: Also update slice on placement"
> a19837e0f27b Revert "sched/eevdf: Fix avg_vruntime()"
> eae55a336cf3 Revert "sched/eevdf: Fix min_deadline heap integrity"
> ba3c4b6b5aa9 Revert "sched/eevdf: Fix pick_eevdf()"
> 37561f3cdba5 Revert "sched/eevdf: Fix heap corruption more"
> 9a80e5bf2bb5 Revert "sched/eevdf: Fix vruntime adjustment on reweight"
> df483ee656d5 Revert "sched/eevdf: Always update V if se->on_rq when reweighting"
> 587fe3a23160 Revert "sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr"
> 65f847ba8cc3 Revert "sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()"
>
> Has anyone encountered a similar issue? What would you suggest we do?
Try a newer kernel, like 6.15. 6.6 is ancient and I can't remember what
it looked like.
Then try running your workload using SCHED_BATCH, and/or increase
/debug/sched/base_slice_ns to 15000000 or so.
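[Editor's note: the SCHED_BATCH suggestion can also be applied from inside the
process via sched_setscheduler(); a minimal Python sketch, assuming a Linux host
(the debugfs path below assumes debugfs is mounted at /sys/kernel/debug):]

```python
import os

# Mark this process as SCHED_BATCH: a hint that it is a non-interactive,
# CPU-bound task, which curbs wakeup preemption under CFS/EEVDF.
# Unprivileged processes may switch between SCHED_OTHER and SCHED_BATCH;
# non-real-time policies use static priority 0.
os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
print(os.sched_getscheduler(0) == os.SCHED_BATCH)

# The equivalent from a shell wrapper, plus the base-slice tweak
# (the echo needs root):
#   chrt -b 0 taskset -c 0,5,6,7,8,9,10,11 llama-bench ...
#   echo 15000000 > /sys/kernel/debug/sched/base_slice_ns
```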