Message-ID: <f340b1c4-f1ed-4c9f-adbb-b10cd3a17a85@arm.com>
Date: Fri, 8 Nov 2024 15:53:26 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Phil Auld <pauld@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
On 04/11/2024 13:50, Phil Auld wrote:
>
> Hi Dietmar,
>
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peter,
[...]
>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>> *-nvme
>> description: NVMe device
>> product: GIGABYTE GP-ASM2NE6500GTTD
>> vendor: Phison Electronics Corporation
>> physical id: 0
>> bus info: pci@...0:01:00.0
>> logical name: /dev/nvme0
>> version: EGFM13.2
>> ...
>> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>> resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device are you using?
>
> Most of the reports are on various NVMe drives (Samsung mostly I think).
>
>
> One thing I should add is that it's all on LVM:
>
>
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs
>
>
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on
> the drive directly it's a little more variable. On some setups it shows on xfs
> but not on ext4, on others on ext4 but not on xfs; it seems to depend on the
> drive or hw raid. But when it shows, it's 100% reproducible on that setup.
>
> It's always the randwrite numbers. The rest look fine.
>
> Also, as yet I'm not personally doing this testing, just looking into it and
> passing on the information I have.
One reason I don't see the difference between DELAY_DEQUEUE and
NO_DELAY_DEQUEUE could be because of the affinity of the related
nvme interrupts:
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 ...
132: 0 0 1523653 0 0 0 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
133: 0 0 0 0 0 1338451 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
134: 0 0 0 0 0 0 0 0 2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
$ cat /proc/irq/132/smp_affinity_list
0-2
$ cat /proc/irq/133/smp_affinity_list
3-5
$ cat /proc/irq/134/smp_affinity_list
6-8
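A minimal sketch of collecting the same affinity information in one go, in case it helps to compare machines. It only reads /proc/irq/<N>/smp_affinity_list; the helper names (parse_cpulist, irq_affinity) are made up for illustration, and the parser assumes plain cpulist syntax ("0-2", "0,4-5") without stride notation.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a cpulist string such as '0-2' or '0,4-5' into a sorted list of CPU ids.

    Assumption: only single CPUs and simple ranges appear (no 'a-b:c/d' strides).
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def irq_affinity(irq, proc_root="/proc"):
    """Return the CPUs an IRQ may be delivered to, read from smp_affinity_list."""
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity_list"
    return parse_cpulist(path.read_text())


if __name__ == "__main__":
    # IRQ numbers taken from the /proc/interrupts excerpt above; they will
    # differ on other machines.
    for irq in (132, 133, 134):
        try:
            print(irq, irq_affinity(irq))
        except (OSError, ValueError) as err:
            print(irq, "unavailable:", err)
```

A mask that expands to a single CPU for each nvme queue would be the case asked about further down.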
So the 8 fio tasks from:
# fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
--runtime 8s --iodepth 32 --direct 1 --ioengine libaio
--numjobs 8 --size 30g --name default --time_based
--group_reporting --cpus_allowed_policy shared
--directory /testfs
don't have to fight with per-CPU kworkers on each CPU.
e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 ->
run iomap_dio_complete_work() in kworker/8:x'
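To make the overlap concrete, here is a rough sketch that works out which of the fio CPUs actually receive nvme interrupts, and therefore host the per-CPU kworkers doing the completion work. The SAMPLE text is the /proc/interrupts excerpt from above with the elided columns dropped, so it only covers the three queues shown; irq_cpus is a hypothetical helper, not an existing tool.

```python
def irq_cpus(interrupts_text, name_filter="nvme"):
    """Map IRQ number -> CPU ids with a non-zero count, for matching lines.

    Expects /proc/interrupts-style text: a header line of CPU<n> columns
    followed by one line per IRQ.
    """
    lines = interrupts_text.strip().splitlines()
    cpu_ids = [int(col[3:]) for col in lines[0].split()]  # 'CPU8' -> 8
    result = {}
    for line in lines[1:]:
        if name_filter not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = fields[1:1 + len(cpu_ids)]
        result[irq] = [cpu for cpu, count in zip(cpu_ids, counts)
                       if count.isdigit() and int(count) > 0]
    return result


# Excerpt from this mail; columns beyond CPU8 were elided in the original.
SAMPLE = """\
 CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8
132: 0 0 1523653 0 0 0 0 0 0 IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
133: 0 0 0 0 0 1338451 0 0 0 IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
134: 0 0 0 0 0 0 0 0 2252297 IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
"""

fio_cpus = set(range(1, 9))  # --cpus_allowed 1,2,3,4,5,6,7,8
per_irq = irq_cpus(SAMPLE)
irq_handling = {cpu for cpus in per_irq.values() for cpu in cpus}
shared = sorted(fio_cpus & irq_handling)   # fio CPUs that also take nvme IRQs
quiet = sorted(fio_cpus - irq_handling)    # fio CPUs without nvme IRQs
print("per IRQ:", per_irq)
print("fio CPUs with nvme interrupts:", shared)
print("fio CPUs without nvme interrupts:", quiet)
```

On a machine with one nvme queue per CPU, every fio CPU would end up in the first list.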
When I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
rq->nr_running > 1' condition in ttwu_runnable(), I only see
the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
So the patch shouldn't make a difference for this scenario?
But maybe your VDO or thinpool setup creates waker/wakee pairs with
wakee->nr_cpus_allowed > 1?
Does your machine have single-CPU smp_affinity masks for these nvme
interrupts?
[...]