linux-kernel - Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

Message-ID: <20241108181617.GC43508@pauld.westford.csb>
Date: Fri, 8 Nov 2024 13:16:17 -0500
From: Phil Auld <pauld@...hat.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, wuyun.abel@...edance.com,
	youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

On Fri, Nov 08, 2024 at 03:53:26PM +0100 Dietmar Eggemann wrote:
> On 04/11/2024 13:50, Phil Auld wrote:
> > 
> > Hi Dietmar,
> > 
> > On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> >> Hi Phil,
> >>
> >> On 01/11/2024 13:47, Phil Auld wrote:
> >>>
> >>> Hi Peter,
> 
> [...]
> 
> >> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> >> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> >> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
> >>
> >> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
> >>
> >> vanilla features: 990MB/s (mean out of 5 runs, σ:  9.38)
> >> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> >>
> >> # sudo lshw -class disk -class storage
> >>   *-nvme                    
> >>        description: NVMe device
> >>        product: GIGABYTE GP-ASM2NE6500GTTD
> >>        vendor: Phison Electronics Corporation
> >>        physical id: 0
> >>        bus info: pci@...0:01:00.0
> >>        logical name: /dev/nvme0
> >>        version: EGFM13.2
> >>        ...
> >>        capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> >>        configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> >>        resources: irq:16 memory:70800000-70803fff
> >>
> >> # mount | grep ^/dev/nvme0
> >> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> >>
> >> Which disk device are you using?
> > 
> > Most of the reports are on various NVME drives (samsung mostly I think).
> > 
> > 
> > One thing I should add is that it's all on LVM: 
> > 
> > 
> > vgcreate vg /dev/nvme0n1 -y
> > lvcreate -n thinMeta -L 3GB vg -y
> > lvcreate -n thinPool -l 99%FREE vg -y
> > lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> > lvcreate -n testLV -V 1300G --thinpool thinPool vg
> > wipefs -a /dev/mapper/vg-testLV
> > mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> > mount /dev/mapper/vg-testLV /testfs 
> > 
> > 
> > With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on
> > the drive directly it's a little more variable. On some setups it shows on xfs,
> > on others it shows on ext4, and not vice versa; it seems to depend on the drive
> > or hw raid. But when it shows it's 100% reproducible on that setup.
> > 
> > It's always the randwrite numbers. The rest look fine.
> > 
> > Also, as yet I'm not personally doing this testing, just looking into it and
> > passing on the information I have. 
> 
> One reason I don't see the difference between DELAY_DEQUEUE and
> NO_DELAY_DEQUEUE could be because of the affinity of the related
> nvme interrupts: 
> 
> $ cat /proc/interrupts
> 
>      CPU0 CPU1    CPU2 CPU3 CPU4    CPU5 CPU6 CPU7    CPU8 ...
> 132:   0    0  1523653    0   0        0   0    0       0  ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
> 133:   0    0        0    0   0  1338451   0    0       0  ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
> 134:   0    0        0    0   0        0   0    0  2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
> 
> $ cat /proc/irq/132/smp_affinity_list 
> 0-2
> $ cat /proc/irq/133/smp_affinity_list
> 3-5
> $ cat /proc/irq/134/smp_affinity_list
> 6-8
> 
> So the 8 fio tasks from: 
> 
> # fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
>   --runtime 8s --iodepth 32 --direct 1 --ioengine libaio
>   --numjobs 8 --size 30g --name default --time_based
>   --group_reporting --cpus_allowed_policy shared
>   --directory /testfs
> 
> don't have to fight with per-CPU kworkers on each CPU.
> 
> e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 -> 
>       run iomap_dio_complete_work() in kworker/8:x'
> 
> When I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
> rq->nr_running > 1' condition in ttwu_runnable(), I only see
> the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
> 
> So the patch shouldn't make a difference for this scenario?
>

If the kworker is waking up an fio task it could.  I don't think
they are bound to a single cpu.

But yes if your trace is only showing the kworker there then it would
not help.  Are you actually able to reproduce the difference?


> But maybe your VDO or thinpool setup creates waker/wakee pairs with
> wakee->nr_cpus_allowed > 1? 
>

That's certainly possible but I don't know for sure. There are well more
dio kworkers on the box than cpus though, if I recall. I don't know
if they all have single cpu affinities.
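
For anyone who still has access to an affected box, the kworker affinities
could be checked with something like the following. This is only a rough
sketch and has not been run on the affected setup: the function name
kworker_affinity and its optional root-directory argument are made up for
illustration, and it relies only on the comm and status files under
/proc/<pid>.

```shell
#!/bin/sh
# Print "<pid> <comm> <Cpus_allowed_list>" for every kworker thread.
# The optional first argument (hypothetical, for trying this on a saved
# copy of /proc) overrides the directory that is scanned.
kworker_affinity() {
    root="${1:-/proc}"
    for d in "$root"/[0-9]*; do
        # Skip entries that vanished or are not readable.
        [ -r "$d/comm" ] && [ -r "$d/status" ] || continue
        comm=$(cat "$d/comm" 2>/dev/null)
        case "$comm" in
            kworker/*) ;;          # keep workqueue workers only
            *) continue ;;
        esac
        # Cpus_allowed_list is the human-readable CPU mask of the task.
        cpus=$(awk '/^Cpus_allowed_list:/ { print $2 }' "$d/status" 2>/dev/null)
        printf '%s %s %s\n' "${d##*/}" "$comm" "$cpus"
    done
}
```

Calling kworker_affinity with no argument scans /proc. A worker whose last
column is a single CPU number is a per-CPU kworker; unbound workers show a
wider list.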


> Does your machine have single-CPU smp_affinity masks for these nvme
> interrupts?
>

I don't know. I had to give the machine back. 
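
If someone with one of the affected machines wants to answer that, a sketch
along these lines would list each nvme queue interrupt with its allowed
CPUs, using the same /proc/interrupts and smp_affinity_list files shown in
the quoted text above. It is an untested assumption that the queue name is
the last field on the line (true for the MSI-X lines quoted above), and the
function name nvme_irq_affinity and its root-directory argument are
hypothetical.

```shell
#!/bin/sh
# Print "<irq> <queue name> <smp_affinity_list>" for every nvme queue IRQ.
# The optional first argument (hypothetical, for trying this on a saved
# copy of /proc) overrides the directory that is read.
nvme_irq_affinity() {
    root="${1:-/proc}"
    # Field 1 of each row is "<irq>:", the last field is the action name.
    awk '$NF ~ /^nvme[0-9]+q[0-9]+$/ { sub(":", "", $1); print $1, $NF }' \
        "$root/interrupts" |
    while read -r irq name; do
        aff=$(cat "$root/irq/$irq/smp_affinity_list" 2>/dev/null)
        printf '%s %s %s\n' "$irq" "$name" "$aff"
    done
}
```

Calling nvme_irq_affinity with no argument reads the live /proc. A
single CPU number in the last column means that queue's interrupt is pinned
to one CPU; a range such as 0-2 means it is not.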



Cheers,
Phil


> [...]

--