Message-ID: <93e73a42-0afc-4749-89db-73b9f72c8b0b@arm.com>
Date: Tue, 5 Nov 2024 09:53:49 +0000
From: Christian Loehle <christian.loehle@....com>
To: Phil Auld <pauld@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
On 11/4/24 12:50, Phil Auld wrote:
>
> Hi Dietmar,
>
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peter,
>>>
>>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>>> noting that lag is fundamentally a temporal measure. It should not be
>>>> carried around indefinitely.
>>>>
>>>> OTOH it should also not be instantly discarded, doing so will allow a
>>>> task to game the system by purposefully (micro) sleeping at the end of
>>>> its time quantum.
>>>>
>>>> Since lag is intimately tied to the virtual time base, a wall-time
>>>> based decay is also insufficient, notably competition is required for
>>>> any of this to make sense.
>>>>
>>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>>> competing until they are eligible.
>>>>
>>>> Strictly speaking, we only care about keeping them until the 0-lag
>>>> point, but that is a difficult proposition, instead carry them around
>>>> until they get picked again, and dequeue them at that point.
>>>
>>> This one is causing a 10-20% performance hit on our filesystem tests.
>>>
>>> On 6.12-rc5 (so with the latest follow-ons) we get:
>>>
>>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>>
>>> The test is fio, something like this:
>>>
>>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>>
>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>> *-nvme
>> description: NVMe device
>> product: GIGABYTE GP-ASM2NE6500GTTD
>> vendor: Phison Electronics Corporation
>> physical id: 0
>> bus info: pci@...0:01:00.0
>> logical name: /dev/nvme0
>> version: EGFM13.2
>> ...
>> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>> resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device are you using?
>
> Most of the reports are on various NVMe drives (mostly Samsung, I think).
>
>
> One thing I should add is that it's all on LVM:
>
>
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs
>
>
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on
> the drive directly it's a little more variable: on some setups it shows on xfs,
> on others on ext4 and not vice versa; it seems to depend on the drive or hw raid.
> But when it shows, it's 100% reproducible on that setup.
>
> It's always the randwrite numbers. The rest look fine.

Hi Phil,

Thanks for the detailed instructions. Unfortunately, even with your LVM setup I
don't see a regression so far on the platforms I've tried; all the numbers are
about equal for DELAY_DEQUEUE and NO_DELAY_DEQUEUE.

Anyway, I have some follow-ups. First, let me trim the fio command for readability:

fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs

Dropping the options that match fio's defaults (nrfiles, loops, fsync, randrepeat):

fio --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs

And specifying the CPU affinities directly via fio instead of taskset:

fio --cpus_allowed 1-8 --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs

Now I was wondering about the following:

Is it actually the kworker (and not another fio process) being preempted? (I'm
pretty sure it is.)
To test: --cpus_allowed_policy split (each fio process gets its own CPU).
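
Untested, but something like this should be your trimmed command with the split
policy added (with 8 jobs on CPUs 1-8 each job should get one CPU to itself):

fio --cpus_allowed 1-8 --cpus_allowed_policy split --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs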

You wrote:
> I was thinking maybe the preemption was preventing some batching of IO completions or
> initiations. But that was wrong it seems.

So while it doesn't reproduce for me, the only thing being preempted regularly is
the kworker (running iomap_dio_complete_work). I don't quite follow the "that was
wrong it seems" part then. Could you elaborate?
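
In case you want to check the same on your setup, a rough sketch of what I mean
(assumes root; the tracefs path may be /sys/kernel/debug/tracing instead, and the
10s window is arbitrary, just run it while fio is going):

echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
sleep 10
echo 0 > /sys/kernel/tracing/events/sched/sched_switch/enable
# kworkers switched out while still runnable ('R+' means preempted):
grep 'prev_comm=kworker.*prev_state=R' /sys/kernel/tracing/trace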

Could you also post the other benchmark numbers? Do any of them score higher in IOPS?
Does --rw write show the same issue if you set --bs 4k (assuming you use a larger
bs for the sequential writes)?
Can you also set the kworkers handling the completions to SCHED_BATCH, just to
confirm?
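
Untested sketch for that last one (assumes util-linux chrt and root; the pgrep
simply grabs all kworkers, which should be good enough for a quick test, though
kworkers created afterwards won't be covered):

# put every kworker into SCHED_BATCH (priority 0)
for pid in $(pgrep kworker); do chrt --batch --pid 0 "$pid"; done
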
Regards,
Christian