linux-kernel - Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

Message-ID: <ed46d844-e0b0-46fd-a164-9bfad538a7a9@arm.com>
Date: Mon, 4 Nov 2024 10:28:37 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Phil Auld <pauld@...hat.com>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
 efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

Hi Phil,

On 01/11/2024 13:47, Phil Auld wrote:
> 
> Hi Peter,
> 
> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>> noting that lag is fundamentally a temporal measure. It should not be
>> carried around indefinitely.
>>
>> OTOH it should also not be instantly discarded, doing so will allow a
>> task to game the system by purposefully (micro) sleeping at the end of
>> its time quantum.
>>
>> Since lag is intimately tied to the virtual time base, a wall-time
>> based decay is also insufficient, notably competition is required for
>> any of this to make sense.
>>
>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>> competing until they are eligible.
>>
>> Strictly speaking, we only care about keeping them until the 0-lag
>> point, but that is a difficult proposition, instead carry them around
>> until they get picked again, and dequeue them at that point.
> 
> This one is causing a 10-20% performance hit on our filesystem tests.
> 
> On 6.12-rc5 (so with the latest follow ons) we get:
> 
> with DELAY_DEQUEUE the bandwidth is 510 MB/s
> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> 
> The test is fio, something like this:
> 
> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
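For reference, a minimal sketch of how the two quoted bandwidth figures (510 MB/s and 590 MB/s) relate to the reported "10-20%" hit. The helper name `pct_drop` is hypothetical, and the choice of the faster run as the baseline is an assumption; using the slower run as the baseline gives a somewhat larger percentage.

```shell
#!/usr/bin/env bash
# Relative bandwidth loss in percent, measured against the faster run.
# pct_drop is a hypothetical helper, not part of fio or the kernel tooling.
pct_drop() {
    # $1 = slower bandwidth, $2 = faster bandwidth (same unit, e.g. MB/s)
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (fast - slow) / fast * 100 }'
}

# Figures quoted above: DELAY_DEQUEUE 510 MB/s, NO_DELAY_DEQUEUE 590 MB/s
pct_drop 510 590   # prints 13.6
```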

I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
"sched: psi: pass enqueue/dequeue flags to psi callbacks directly",
2024-10-26, Johannes Weiner), which is based on 6.12.0-rc4.

I used 'taskset 0xaaaaa' to avoid SMT and run only on P-cores.
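
As an illustration of what that affinity mask selects, here is a small sketch that expands a taskset-style hex mask into a CPU list. The function name `mask_to_cpus` is hypothetical; how the resulting CPU numbers map onto P-cores and SMT siblings depends on the machine's topology and is not something the script can tell you.

```shell
#!/usr/bin/env bash
# Expand a taskset-style hex affinity mask into the CPU numbers it selects.
# mask_to_cpus is a hypothetical helper written for this illustration.
mask_to_cpus() {
    local mask=$(( $1 ))   # bash arithmetic accepts 0x-prefixed hex
    local cpu=0
    local cpus=()
    while (( mask > 0 )); do
        # Bit N set in the mask means CPU N is allowed
        if (( mask & 1 )); then
            cpus+=("$cpu")
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    local IFS=,
    echo "${cpus[*]}"
}

mask_to_cpus 0xaaaaa   # prints 1,3,5,7,9,11,13,15,17,19
```

The same list could be passed to `taskset -c` instead of the hex mask.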

vanilla features: 990MB/s (mean out of 5 runs, σ:  9.38)
NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
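
For anyone reproducing the two configurations, a minimal sketch of switching the feature at runtime follows. It assumes debugfs is mounted at /sys/kernel/debug, the kernel exposes the scheduler feature file there (scheduler debug support enabled), and the commands run as root. The function names and the `SCHED_FEATURES` override variable are hypothetical conveniences, not part of any kernel tooling.

```shell
#!/usr/bin/env bash
# Sketch: switch a scheduler feature on or off through debugfs.
# Assumes root and a kernel that exposes /sys/kernel/debug/sched/features.
# SCHED_FEATURES can be overridden, e.g. to point at a scratch file for a dry run.
SCHED_FEATURES="${SCHED_FEATURES:-/sys/kernel/debug/sched/features}"

set_sched_feature() {
    # $1 is a feature name such as DELAY_DEQUEUE,
    # or the NO_-prefixed form (NO_DELAY_DEQUEUE) to disable it
    echo "$1" > "$SCHED_FEATURES"
}

show_delay_dequeue() {
    # The file lists all features on one line; disabled ones carry a NO_ prefix
    tr ' ' '\n' < "$SCHED_FEATURES" |
        grep -x -e 'DELAY_DEQUEUE' -e 'NO_DELAY_DEQUEUE'
}

# Usage (as root):
#   set_sched_feature NO_DELAY_DEQUEUE && show_delay_dequeue
#   set_sched_feature DELAY_DEQUEUE    && show_delay_dequeue
```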

# sudo lshw -class disk -class storage
  *-nvme                    
       description: NVMe device
       product: GIGABYTE GP-ASM2NE6500GTTD
       vendor: Phison Electronics Corporation
       physical id: 0
       bus info: pci@...0:01:00.0
       logical name: /dev/nvme0
       version: EGFM13.2
       ...
       capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
       configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
       resources: irq:16 memory:70800000-70803fff

# mount | grep ^/dev/nvme0
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)

Which disk device are you using?

> 
> In this case it's ext4, but I'm not sure it will be FS specific.
> 
> I should have the machine and setup next week to poke further but I wanted
> to mention it now just in case anyone has an "aha" moment.
> 
> It seems to only affect these FS loads. Other perf tests are not showing any
> issues that I am aware of.

[...]