Message-ID: <20250623105739.GR1613200@noisy.programming.kicks-ass.net>
Date: Mon, 23 Jun 2025 12:57:39 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Gary Yang <gary.yang@...tech.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [sched/eevdf] llama-bench performance drop
On Mon, Jun 23, 2025 at 06:27:18PM +0800, Gary Yang wrote:
> Problem: The llama-bench test uses the CPU to run an AI model. It can create
> a lot of threads, so it is a CPU-bound process.
How many threads per CPU? Typically compute workloads stick with 1
thread per CPU.
> It outputs
> three scores. The 1st score is primarily influenced by CPU frequency, the 2nd
> score primarily by memory and the L1/L2 caches, and the 3rd score by both
> CPU frequency and memory.
>
> When we run the llama-bench test on an ARM A720 with kernel 6.1, it outputs three scores:
> root# taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf
> -pg 128,128 -t 8
> | model | size | params | backend | threads | test | t/s |
> | ------------- | --------: | ------: | ------- | ------: | ----------: | -----------: |
>
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp512 | 58.67 ± 3.08 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | tg128 | 9.32 ± 0.22 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp128+tg128 | 15.10 ± 1.08 |
Your taskset has 8 CPUs listed, and the threads column has 8. So 1
thread per CPU. This should be a boring workload. Are they sleeping
frequently to sync up or something?
> build: 14d627f4 (5288)
>
> When we run the llama-bench test on an ARM A720 with kernel 6.6.89, it outputs three scores:
> root# taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf
> -pg 128,128 -t 8
> | model | size | params | backend | threads | test | t/s |
> | --------------|--------: |------: | ------- | ------: | ----------: | -----------: |
>
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp512 | 49.89 ± 3.83 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | tg128 | 2.66 ± 1.98 |
> | qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | CPU | 8 | pp128+tg128 | 1.92 ± 0.45 |
>
> build: 14d627f4 (5288)
>
> We find that the 2nd and 3rd scores are both lower than on kernel 6.1. While
> analyzing this issue, we noted a new feature in kernel 6.6: it introduces the
> EEVDF scheduler in place of the CFS used in kernel 6.1. After reverting the
> EEVDF patches below, the two scores improve, to almost the same values as on
> kernel 6.1.
>
> 9ef5bc6e07a5 Revert "sched/fair: Commit to EEVDF"
> a21eaad7417a Revert "sched/eevdf: Curb wakeup-preemption"
> 2cf7e10af999 Revert "sched/eevdf: Also update slice on placement"
> a19837e0f27b Revert "sched/eevdf: Fix avg_vruntime()"
> eae55a336cf3 Revert "sched/eevdf: Fix min_deadline heap integrity"
> ba3c4b6b5aa9 Revert "sched/eevdf: Fix pick_eevdf()"
> 37561f3cdba5 Revert "sched/eevdf: Fix heap corruption more"
> 9a80e5bf2bb5 Revert "sched/eevdf: Fix vruntime adjustment on reweight"
> df483ee656d5 Revert "sched/eevdf: Always update V if se->on_rq when reweighting"
> 587fe3a23160 Revert "sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr"
> 65f847ba8cc3 Revert "sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()"
>
> Has anyone encountered a similar issue? What would you suggest we do?
Try a newer kernel, like 6.15. 6.6 is ancient and I can't remember what
it looked like.
Then try running your workload using SCHED_BATCH, and/or increase
/debug/sched/base_slice_ns to 15000000 or so.
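[Editor's note: the SCHED_BATCH suggestion can also be applied from inside the
process via sched_setscheduler(); a minimal Python sketch, assuming a Linux host
(the debugfs path below assumes debugfs is mounted at /sys/kernel/debug):]

```python
import os

# Mark this process as SCHED_BATCH: a hint that it is a non-interactive,
# CPU-bound task, which curbs wakeup preemption under CFS/EEVDF.
# Unprivileged processes may switch between SCHED_OTHER and SCHED_BATCH;
# non-real-time policies use static priority 0.
os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
print(os.sched_getscheduler(0) == os.SCHED_BATCH)

# The equivalent from a shell wrapper, plus the base-slice tweak
# (the echo needs root):
#   chrt -b 0 taskset -c 0,5,6,7,8,9,10,11 llama-bench ...
#   echo 15000000 > /sys/kernel/debug/sched/base_slice_ns
```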