Message-ID: <93e73a42-0afc-4749-89db-73b9f72c8b0b@arm.com>
Date: Tue, 5 Nov 2024 09:53:49 +0000
From: Christian Loehle <christian.loehle@....com>
To: Phil Auld <pauld@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
On 11/4/24 12:50, Phil Auld wrote:
>
> Hi Dietmar,
>
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peter,
>>>
>>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>>> noting that lag is fundamentally a temporal measure. It should not be
>>>> carried around indefinitely.
>>>>
>>>> OTOH it should also not be instantly discarded, doing so will allow a
>>>> task to game the system by purposefully (micro) sleeping at the end of
>>>> its time quantum.
>>>>
>>>> Since lag is intimately tied to the virtual time base, a wall-time
>>>> based decay is also insufficient, notably competition is required for
>>>> any of this to make sense.
>>>>
>>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>>> competing until they are eligible.
>>>>
>>>> Strictly speaking, we only care about keeping them until the 0-lag
>>>> point, but that is a difficult proposition, instead carry them around
>>>> until they get picked again, and dequeue them at that point.
>>>
>>> This one is causing a 10-20% performance hit on our filesystem tests.
>>>
>>> On 6.12-rc5 (so with the latest follow-ons) we get:
>>>
>>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>>
>>> The test is fio, something like this:
>>>
>>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>>
>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>> *-nvme
>> description: NVMe device
>> product: GIGABYTE GP-ASM2NE6500GTTD
>> vendor: Phison Electronics Corporation
>> physical id: 0
>> bus info: pci@...0:01:00.0
>> logical name: /dev/nvme0
>> version: EGFM13.2
>> ...
>> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>> resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device are you using?
>
> Most of the reports are on various NVMe drives (mostly Samsung, I think).
>
>
> One thing I should add is that it's all on LVM:
>
>
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs
>
>
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on
> the drive directly it's a little more variable: on some setups it shows on xfs,
> on others on ext4 and not vice versa; it seems to depend on the drive or hw raid.
> But when it shows, it's 100% reproducible on that setup.
>
> It's always the randwrite numbers. The rest look fine.

Hi Phil,

Thanks for the detailed instructions. Unfortunately, even with your LVM setup I
don't see a regression so far on the platforms I've tried; all the numbers are
about equal for DELAY_DEQUEUE and NO_DELAY_DEQUEUE.

Anyway, I have some follow-ups. First, let me trim the fio command for readability:

fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs

Dropping the options that match fio's defaults (nrfiles, loops, fsync, randrepeat):

fio --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs

And specifying the CPU affinities directly via fio instead of taskset:

fio --cpus_allowed 1-8 --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs

Now I was wondering about the following:

Is it actually the kworker (and not another fio process) being preempted? (I'm
pretty sure it is.)
To test: --cpus_allowed_policy split (each fio process gets its own CPU).
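
Untested, but something like this should be your trimmed command with the split
policy added (with 8 jobs on CPUs 1-8 each job should get one CPU to itself):

fio --cpus_allowed 1-8 --cpus_allowed_policy split --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs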

You wrote:
> I was thinking maybe the preemption was preventing some batching of IO completions or
> initiations. But that was wrong it seems.

So while it doesn't reproduce for me, the only thing being preempted regularly is
the kworker (running iomap_dio_complete_work). I don't quite follow the "that was
wrong it seems" part then. Could you elaborate?
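
In case you want to check the same on your setup, a rough sketch of what I mean
(assumes root; the tracefs path may be /sys/kernel/debug/tracing instead, and the
10s window is arbitrary, just run it while fio is going):

echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
sleep 10
echo 0 > /sys/kernel/tracing/events/sched/sched_switch/enable
# kworkers switched out while still runnable ('R+' means preempted):
grep 'prev_comm=kworker.*prev_state=R' /sys/kernel/tracing/trace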

Could you also post the other benchmark numbers? Do any of them score higher in IOPS?
Does --rw write show the same issue if you set --bs 4k (assuming you use a larger
bs for the sequential writes)?
Can you also set the kworkers handling the completions to SCHED_BATCH, just to
confirm?
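
Untested sketch for that last one (assumes util-linux chrt and root; the pgrep
simply grabs all kworkers, which should be good enough for a quick test, though
kworkers created afterwards won't be covered):

# put every kworker into SCHED_BATCH (priority 0)
for pid in $(pgrep kworker); do chrt --batch --pid 0 "$pid"; done
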
Regards,
Christian