Message-ID: <20241108181617.GC43508@pauld.westford.csb>
Date: Fri, 8 Nov 2024 13:16:17 -0500
From: Phil Auld <pauld@...hat.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com,
youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

On Fri, Nov 08, 2024 at 03:53:26PM +0100 Dietmar Eggemann wrote:
> On 04/11/2024 13:50, Phil Auld wrote:
> >
> > Hi Dietmar,
> >
> > On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> >> Hi Phil,
> >>
> >> On 01/11/2024 13:47, Phil Auld wrote:
> >>>
> >>> Hi Peter,
>
> [...]
>
> >> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> >> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> >> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
> >>
> >> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
> >>
> >> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> >> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> >>
> >> # sudo lshw -class disk -class storage
> >> *-nvme
> >> description: NVMe device
> >> product: GIGABYTE GP-ASM2NE6500GTTD
> >> vendor: Phison Electronics Corporation
> >> physical id: 0
> >> bus info: pci@...0:01:00.0
> >> logical name: /dev/nvme0
> >> version: EGFM13.2
> >> ...
> >> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> >> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> >> resources: irq:16 memory:70800000-70803fff
> >>
> >> # mount | grep ^/dev/nvme0
> >> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> >>
> >> Which disk device are you using?
> >
> > Most of the reports are on various NVMe drives (Samsung mostly, I think).
> >
> >
> > One thing I should add is that it's all on LVM:
> >
> >
> > vgcreate vg /dev/nvme0n1 -y
> > lvcreate -n thinMeta -L 3GB vg -y
> > lvcreate -n thinPool -l 99%FREE vg -y
> > lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> > lvcreate -n testLV -V 1300G --thinpool thinPool vg
> > wipefs -a /dev/mapper/vg-testLV
> > mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> > mount /dev/mapper/vg-testLV /testfs
> >
> >
> > With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs
> > on the drive directly it's a little more variable. On some setups it shows
> > on xfs, on others it shows on ext4, and not vice versa; it seems to depend
> > on the drive or hw raid. But when it shows, it's 100% reproducible on that
> > setup.
> >
> > It's always the randwrite numbers. The rest look fine.
> >
> > Also, as yet I'm not personally doing this testing, just looking into it and
> > passing on the information I have.
>
> One reason I don't see the difference between DELAY_DEQUEUE and
> NO_DELAY_DEQUEUE could be because of the affinity of the related
> nvme interrupts:
>
> $ cat /proc/interrupts
>
>          CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7     CPU8 ...
> 132:        0        0  1523653        0        0        0        0        0        0 ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
> 133:        0        0        0        0        0  1338451        0        0        0 ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
> 134:        0        0        0        0        0        0        0        0  2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
>
> $ cat /proc/irq/132/smp_affinity_list
> 0-2
> $ cat /proc/irq/133/smp_affinity_list
> 3-5
> $ cat /proc/irq/134/smp_affinity_list
> 6-8
>
> So the 8 fio tasks from:
>
> # fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
> --runtime 8s --iodepth 32 --direct 1 --ioengine libaio
> --numjobs 8 --size 30g --name default --time_based
> --group_reporting --cpus_allowed_policy shared
> --directory /testfs
>
> don't have to fight with per-CPU kworkers on each CPU.
>
> e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 ->
> run iomap_dio_complete_work() in kworker/8:x'
>
> When I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
> rq->nr_running > 1' condition in ttwu_runnable(), I only see
> the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
>
> So the patch shouldn't make a difference for this scenario?
>

If the kworker is waking up an fio task it could. I don't think
they are bound to a single cpu.

But yes if your trace is only showing the kworker there then it would
not help. Are you actually able to reproduce the difference?

> But maybe your VDO or thinpool setup creates waker/wakee pairs with
> wakee->nr_cpus_allowed > 1?
>

That's certainly possible but I don't know for sure. There are well more
dio kworkers on the box than cpus though, if I recall. I don't know
if they all have single cpu affinities.

> Does your machine have single CPU smp_affinity masks for these nvme
> interrupts?
>

I don't know. I had to give the machine back.

Cheers,
Phil

> [...]
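
For anyone who still has a reproducer box: the open question above (whether
the nvme interrupts have single CPU smp_affinity masks) could be answered
with something like the sketch below. It is only an illustration, not
something that was run as part of this thread. The helper names
parse_cpu_list() and nvme_irqs() are made up for this example; the only
kernel interfaces assumed are /proc/interrupts and
/proc/irq/<N>/smp_affinity_list as shown in the quoted output. The parser
handles plain lists and ranges like "0-2,8" and not the less common strided
form.

```python
import re


def parse_cpu_list(text):
    """Expand a CPU list such as '0-2,8' into a sorted list of CPU ids.

    Hypothetical helper; only handles plain ids and 'lo-hi' ranges.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nvme_irqs(interrupts_text):
    """Map IRQ number -> queue name for the nvme lines of /proc/interrupts.

    Hypothetical helper; assumes the queue name is the last field on the line.
    """
    irqs = {}
    for line in interrupts_text.splitlines():
        m = re.match(r"\s*(\d+):.*\s(nvme\S+)\s*$", line)
        if m:
            irqs[int(m.group(1))] = m.group(2)
    return irqs


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        irqs = nvme_irqs(f.read())
    for irq, name in sorted(irqs.items()):
        with open(f"/proc/irq/{irq}/smp_affinity_list") as f:
            cpus = parse_cpu_list(f.read())
        kind = "single CPU" if len(cpus) == 1 else f"{len(cpus)} CPUs"
        print(f"irq {irq} ({name}): {cpus} -> {kind}")
```

On the i7-13700K output quoted above this would report three CPUs per nvme
queue interrupt (0-2, 3-5, 6-8); a "single CPU" result on the affected
machines would be the interesting data point.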