Message-ID: <425cb94a-96b3-4863-8bbb-78e18d5a4939@arm.com>
Date: Mon, 4 Nov 2024 12:55:00 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Phil Auld <pauld@...hat.com>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de, Christian Loehle <christian.loehle@....com>
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
+cc Christian Loehle <christian.loehle@....com>
On 04/11/2024 10:28, Dietmar Eggemann wrote:
> Hi Phil,
>
> On 01/11/2024 13:47, Phil Auld wrote:
>>
>> Hi Peter,
>>
>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>> noting that lag is fundamentally a temporal measure. It should not be
>>> carried around indefinitely.
>>>
>>> OTOH it should also not be instantly discarded, doing so will allow a
>>> task to game the system by purposefully (micro) sleeping at the end of
>>> its time quantum.
>>>
>>> Since lag is intimately tied to the virtual time base, a wall-time
>>> based decay is also insufficient, notably competition is required for
>>> any of this to make sense.
>>>
>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>> competing until they are eligible.
>>>
>>> Strictly speaking, we only care about keeping them until the 0-lag
>>> point, but that is a difficult proposition, instead carry them around
>>> until they get picked again, and dequeue them at that point.
>>
>> This one is causing a 10-20% performance hit on our filesystem tests.
>>
>> On 6.12-rc5 (so with the latest follow-on fixes) we get:
>>
>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>
>> The test is fio, something like this:
>>
>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
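The DELAY_DEQUEUE/NO_DELAY_DEQUEUE numbers above are presumably from flipping
the scheduler feature at runtime. For anyone reproducing this, a minimal
sketch (assuming debugfs is mounted and a recent kernel where the file lives
under /sys/kernel/debug/sched/features; older kernels expose it as
/sys/kernel/debug/sched_features):

# check the current state (as root); the enabled form shows up as
# DELAY_DEQUEUE, the disabled form as NO_DELAY_DEQUEUE
tr ' ' '\n' < /sys/kernel/debug/sched/features | grep DELAY_DEQUEUE

# run fio once with the feature disabled ...
echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
# ... and once with it re-enabled, keeping everything else identical
echo DELAY_DEQUEUE > /sys/kernel/debug/sched/features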
>
> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> (2024-10-26 Johannes Weiner)) (6.12.0-rc4-based)
>
> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
                 ^^^^^^^
>
> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
Christian Loehle just told me that my cpumask looks odd. Should be
0xaaaa instead.
Retested:
vanilla features: 954MB/s (mean out of 5 runs, σ: 30.83)
NO_DELAY_DEQUEUE: 932MB/s (mean out of 5 runs, σ: 28.10)
Now there are only 8 CPUs (instead of 10) for the 8 (+2) fio tasks. σ went
up, probably because of more wakeup/preemption latency.
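For reference, here is what the two masks actually select; decode_mask is just
an illustrative shell helper, not something taskset provides:

# illustrative helper: print the CPU numbers set in a hex affinity mask
decode_mask() {
    for i in $(seq 0 31); do
        [ $(( ($1 >> i) & 1 )) -eq 1 ] && printf '%d ' "$i"
    done
    echo
}
decode_mask 0xaaaa    # 1 3 5 7 9 11 13 15        -> 8 CPUs
decode_mask 0xaaaaa   # 1 3 5 7 9 11 13 15 17 19  -> 10 CPUs

So the 0xaaaaa run had two extra CPUs in the mask (17 and 19, presumably
E-cores on this box), which is the 8 vs 10 CPU difference above.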
>
> # sudo lshw -class disk -class storage
> *-nvme
> description: NVMe device
> product: GIGABYTE GP-ASM2NE6500GTTD
> vendor: Phison Electronics Corporation
> physical id: 0
> bus info: pci@...0:01:00.0
> logical name: /dev/nvme0
> version: EGFM13.2
> ...
> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> resources: irq:16 memory:70800000-70803fff
>
> # mount | grep ^/dev/nvme0
> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>
> Which disk device are you using?
>
>>
>> In this case it's ext4, but I'm not sure it will be FS-specific.
>>
>> I should have the machine and setup next week to poke further, but I wanted
>> to mention it now just in case anyone has an "aha" moment.
>>
>> It seems to only affect these FS loads. Other perf tests are not showing any
>> issues that I am aware of.
>
> [...]
>
>