Message-ID: <425cb94a-96b3-4863-8bbb-78e18d5a4939@arm.com>
Date: Mon, 4 Nov 2024 12:55:00 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Phil Auld <pauld@...hat.com>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de, Christian Loehle <christian.loehle@....com>
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
+cc Christian Loehle <christian.loehle@....com>
On 04/11/2024 10:28, Dietmar Eggemann wrote:
> Hi Phil,
>
> On 01/11/2024 13:47, Phil Auld wrote:
>>
>> Hi Peter,
>>
>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>> noting that lag is fundamentally a temporal measure. It should not be
>>> carried around indefinitely.
>>>
>>> OTOH it should also not be instantly discarded, doing so will allow a
>>> task to game the system by purposefully (micro) sleeping at the end of
>>> its time quantum.
>>>
>>> Since lag is intimately tied to the virtual time base, a wall-time
>>> based decay is also insufficient, notably competition is required for
>>> any of this to make sense.
>>>
>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>> competing until they are eligible.
>>>
>>> Strictly speaking, we only care about keeping them until the 0-lag
>>> point, but that is a difficult proposition, instead carry them around
>>> until they get picked again, and dequeue them at that point.
>>
>> This one is causing a 10-20% performance hit on our filesystem tests.
>>
>> On 6.12-rc5 (so with the latest follow-on fixes) we get:
>>
>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>
>> The test is fio, something like this:
>>
>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
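The DELAY_DEQUEUE/NO_DELAY_DEQUEUE numbers above are presumably from flipping
the scheduler feature at runtime. For anyone reproducing this, a minimal
sketch (assuming debugfs is mounted and a recent kernel where the file lives
under /sys/kernel/debug/sched/features; older kernels expose it as
/sys/kernel/debug/sched_features):

# check the current state (as root); the enabled form shows up as
# DELAY_DEQUEUE, the disabled form as NO_DELAY_DEQUEUE
tr ' ' '\n' < /sys/kernel/debug/sched/features | grep DELAY_DEQUEUE

# run fio once with the feature disabled ...
echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
# ... and once with it re-enabled, keeping everything else identical
echo DELAY_DEQUEUE > /sys/kernel/debug/sched/features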
>
> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> (2024-10-26 Johannes Weiner)) (6.12.0-rc4-based)
>
> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
                 ^^^^^^^
>
> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
Christian Loehle just told me that my cpumask looks odd. Should be
0xaaaa instead.
Retested:
vanilla features: 954MB/s (mean out of 5 runs, σ: 30.83)
NO_DELAY_DEQUEUE: 932MB/s (mean out of 5 runs, σ: 28.10)
Now there are only 8 CPUs (instead of 10) for the 8 (+2) fio tasks. σ went
up, probably because of more wakeup/preemption latency.
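For reference, here is what the two masks actually select; decode_mask is just
an illustrative shell helper, not something taskset provides:

# illustrative helper: print the CPU numbers set in a hex affinity mask
decode_mask() {
    for i in $(seq 0 31); do
        [ $(( ($1 >> i) & 1 )) -eq 1 ] && printf '%d ' "$i"
    done
    echo
}
decode_mask 0xaaaa    # 1 3 5 7 9 11 13 15        -> 8 CPUs
decode_mask 0xaaaaa   # 1 3 5 7 9 11 13 15 17 19  -> 10 CPUs

So the 0xaaaaa run had two extra CPUs in the mask (17 and 19, presumably
E-cores on this box), which is the 8 vs 10 CPU difference above.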
>
> # sudo lshw -class disk -class storage
> *-nvme
> description: NVMe device
> product: GIGABYTE GP-ASM2NE6500GTTD
> vendor: Phison Electronics Corporation
> physical id: 0
> bus info: pci@...0:01:00.0
> logical name: /dev/nvme0
> version: EGFM13.2
> ...
> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> resources: irq:16 memory:70800000-70803fff
>
> # mount | grep ^/dev/nvme0
> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>
> Which disk device are you using?
>
>>
>> In this case it's ext4, but I'm not sure it will be FS-specific.
>>
>> I should have the machine and setup next week to poke further, but I wanted
>> to mention it now just in case anyone has an "aha" moment.
>>
>> It seems to only affect these FS loads. Other perf tests are not showing any
>> issues that I am aware of.
>
> [...]
>
>