linux-kernel - Re: [PATCH 17/24] sched/fair: Implement delayed dequeue


Message-ID: <f340b1c4-f1ed-4c9f-adbb-b10cd3a17a85@arm.com>
Date: Fri, 8 Nov 2024 15:53:26 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Phil Auld <pauld@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, rostedt@...dmis.org,
 bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
 linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
 efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

On 04/11/2024 13:50, Phil Auld wrote:
> 
> Hi Dietmar,
> 
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peter,

[...]

>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ:  9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>>   *-nvme                    
>>        description: NVMe device
>>        product: GIGABYTE GP-ASM2NE6500GTTD
>>        vendor: Phison Electronics Corporation
>>        physical id: 0
>>        bus info: pci@...0:01:00.0
>>        logical name: /dev/nvme0
>>        version: EGFM13.2
>>        ...
>>        capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>>        configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>>        resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device are you using?
> 
> Most of the reports are on various NVMe drives (mostly Samsung, I think).
> 
> 
> One thing I should add is that it's all on LVM: 
> 
> 
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs 
> 
> 
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on
> the drive directly it's a little more variable. On some it shows on xfs, on
> some it shows on ext4, and not vice versa; it seems to depend on the drive or
> hw raid. But when it shows, it's 100% reproducible on that setup.
> 
> It's always the randwrite numbers. The rest look fine.
> 
> Also, as yet I'm not personally doing this testing, just looking into it and
> passing on the information I have. 

One reason I don't see a difference between DELAY_DEQUEUE and
NO_DELAY_DEQUEUE could be the affinity of the related nvme
interrupts:

$ cat /proc/interrupts

     CPU0 CPU1    CPU2 CPU3 CPU4    CPU5 CPU6 CPU7    CPU8 ...
132:   0    0  1523653    0   0        0   0    0       0  ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
133:   0    0        0    0   0  1338451   0    0       0  ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
134:   0    0        0    0   0        0   0    0  2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3

$ cat /proc/irq/132/smp_affinity_list
0-2
$ cat /proc/irq/133/smp_affinity_list
3-5
$ cat /proc/irq/134/smp_affinity_list
6-8
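
To repeat this check on another machine without reading the table by eye,
something like the following could pull out which CPU actually services each
nvme queue interrupt. It is only a sketch: nvme_irq_cpus() is a made-up
helper name, the sample text is the /proc/interrupts excerpt above cut down
to the nine CPU columns shown (the elided "..." columns are left out), and
it assumes the queue name is the last field on each line.

```python
import re

# /proc/interrupts excerpt from above, limited to the nine CPU columns shown.
SAMPLE = """\
     CPU0 CPU1    CPU2 CPU3 CPU4    CPU5 CPU6 CPU7    CPU8
132:   0    0  1523653    0   0        0   0    0       0  IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
133:   0    0        0    0   0  1338451   0    0       0  IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
134:   0    0        0    0   0        0   0    0  2252297 IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
"""


def nvme_irq_cpus(interrupts_text):
    """Map nvme queue name -> (irq number, CPUs with a non-zero count).

    Hypothetical helper; assumes the queue name is the last field of the line.
    """
    lines = interrupts_text.splitlines()
    # The header row names the CPU columns: "CPU0 CPU1 ...".
    header = next(l for l in lines if l.split() and l.split()[0].startswith("CPU"))
    cpus = [int(name[3:]) for name in header.split()]

    result = {}
    for line in lines:
        m = re.match(r"\s*(\d+):\s+(.*)", line)
        if not m or "nvme" not in line:
            continue
        fields = m.group(2).split()
        counts = fields[:len(cpus)]
        # Skip lines that do not carry one numeric count per CPU column.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        busy = [cpus[i] for i, c in enumerate(counts) if int(c) > 0]
        result[fields[-1]] = (int(m.group(1)), busy)
    return result
```

On a live system the input would be the contents of /proc/interrupts instead
of SAMPLE. For the excerpt above each queue comes back with exactly one busy
CPU (2, 5 and 8), the last CPU of each 3-CPU affinity range.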

So the 8 fio tasks from: 

# fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k \
  --runtime 8s --iodepth 32 --direct 1 --ioengine libaio \
  --numjobs 8 --size 30g --name default --time_based \
  --group_reporting --cpus_allowed_policy shared \
  --directory /testfs

don't have to fight with per-CPU kworkers on each CPU.

e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 -> 
      run iomap_dio_complete_work() in kworker/8:x'

When I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
rq->nr_running > 1' condition in ttwu_runnable(), I only see the
per-CPU kworker in there, so p->nr_cpus_allowed == 1.

So the patch shouldn't make a difference for this scenario?
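
To spell out the reasoning, here is a toy model of that check. It is not
kernel code: the Task class and both function names are invented for
illustration, and the "can only matter if the task may run elsewhere" rule is
my reading of the argument above, not something taken from the patch itself.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for the few task fields the condition looks at."""
    comm: str
    on_rq_queued: bool      # models task_on_rq_queued(p)
    sched_delayed: bool     # models p->se.sched_delayed
    nr_cpus_allowed: int    # models p->nr_cpus_allowed


def hits_traced_condition(task, rq_nr_running):
    """The condition traced in ttwu_runnable(), as quoted above."""
    return task.on_rq_queued and task.sched_delayed and rq_nr_running > 1


def could_be_affected(task, rq_nr_running):
    """Assumption: a task pinned to a single CPU has nowhere else to go,
    so hitting the condition cannot change where it runs."""
    return hits_traced_condition(task, rq_nr_running) and task.nr_cpus_allowed > 1


# A per-CPU kworker (pinned to one CPU) versus an unpinned task.
kworker = Task("kworker/8:1", True, True, 1)
unpinned = Task("unpinned-task", True, True, 8)
```

With these inputs the kworker hits the traced condition but
could_be_affected() is False for it, while the unpinned task would pass both
checks.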

But maybe your VDO or thinpool setup creates waker/wakee pairs with
wakee->nr_cpus_allowed > 1? 

Does your machine have single-CPU smp_affinity masks for these nvme
interrupts?
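
A quick way to answer that for a given IRQ number could look like the sketch
below. The function names are made up, and the parser assumes the plain
"a-b,c" list form seen in the smp_affinity_list output above; it does not
handle any other cpulist syntax.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a CPU list such as "0-2" or "0,4-6" into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def is_single_cpu(text):
    """True if the affinity list names exactly one CPU."""
    return len(parse_cpulist(text)) == 1


def irq_affinity(irq, proc_root="/proc"):
    """Read and expand /proc/irq/<irq>/smp_affinity_list for one IRQ."""
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity_list"
    return parse_cpulist(path.read_text())
```

For the three lists shown above ("0-2", "3-5", "6-8") is_single_cpu() is
False each time; a mask like "8" would return True.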

[...]