linux-kernel - Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

Message-ID: <20241105155543.GB33795@pauld.westford.csb>
Date: Tue, 5 Nov 2024 10:55:43 -0500
From: Phil Auld <pauld@...hat.com>
To: Christian Loehle <christian.loehle@....com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
	Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, wuyun.abel@...edance.com,
	youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue


Hi Christian,

On Tue, Nov 05, 2024 at 09:53:49AM +0000 Christian Loehle wrote:
> On 11/4/24 12:50, Phil Auld wrote:
> > 
> > Hi Dietmar,
> > 
> > On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> >> Hi Phil,
> >>
> >> On 01/11/2024 13:47, Phil Auld wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
> >>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> >>>> noting that lag is fundamentally a temporal measure. It should not be
> >>>> carried around indefinitely.
> >>>>
> >>>> OTOH it should also not be instantly discarded, doing so will allow a
> >>>> task to game the system by purposefully (micro) sleeping at the end of
> >>>> its time quantum.
> >>>>
> >>>> Since lag is intimately tied to the virtual time base, a wall-time
> >>>> based decay is also insufficient, notably competition is required for
> >>>> any of this to make sense.
> >>>>
> >>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> >>>> competing until they are eligible.
> >>>>
> >>>> Strictly speaking, we only care about keeping them until the 0-lag
> >>>> point, but that is a difficult proposition, instead carry them around
> >>>> until they get picked again, and dequeue them at that point.
> >>>
> >>> This one is causing a 10-20% performance hit on our filesystem tests.
> >>>
> >>> On 6.12-rc5 (so with the latest follow ons) we get:
> >>>
> >>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
> >>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> >>>
> >>> The test is fio, something like this:
> >>>
> >>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
> >>
> >> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> >> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> >> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
> >>
> >> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
> >>
> >> vanilla features: 990MB/s (mean out of 5 runs, σ:  9.38)
> >> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> >>
> >> # sudo lshw -class disk -class storage
> >>   *-nvme                    
> >>        description: NVMe device
> >>        product: GIGABYTE GP-ASM2NE6500GTTD
> >>        vendor: Phison Electronics Corporation
> >>        physical id: 0
> >>        bus info: pci@...0:01:00.0
> >>        logical name: /dev/nvme0
> >>        version: EGFM13.2
> >>        ...
> >>        capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> >>        configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> >>        resources: irq:16 memory:70800000-70803fff 
> >>
> >> # mount | grep ^/dev/nvme0
> >> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> >>
> >> Which disk device are you using?
> > 
> > Most of the reports are on various NVMe drives (Samsung mostly, I think).
> > 
> > 
> > One thing I should add is that it's all on LVM: 
> > 
> > 
> > vgcreate vg /dev/nvme0n1 -y
> > lvcreate -n thinMeta -L 3GB vg -y
> > lvcreate -n thinPool -l 99%FREE vg -y
> > lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> > lvcreate -n testLV -V 1300G --thinpool thinPool vg
> > wipefs -a /dev/mapper/vg-testLV
> > mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> > mount /dev/mapper/vg-testLV /testfs 
> > 
> > 
> > With VDO or thinpool (as above) it shows on both ext4 and xfs. With the fs on the
> > drive directly it's a little more variable. On some it shows on xfs, on some it shows
> > on ext4, and not vice versa; it seems to depend on the drive or hw raid. But when
> > it shows, it's 100% reproducible on that setup.
> > 
> > It's always the randwrite numbers. The rest look fine.
> 
> Hi Phil,
> 
> Thanks for the detailed instructions. Unfortunately, even with your LVM setup, on
> the platforms I've tried I don't see a regression so far; all the numbers are
> about equal for DELAY_DEQUEUE and NO_DELAY_DEQUEUE.
>

Yeah, that's odd.

Fwiw:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                AuthenticAMD
  BIOS Vendor ID:         Advanced Micro Devices, Inc.
  Model name:             AMD EPYC 7313P 16-Core Processor
    BIOS Model name:      AMD EPYC 7313P 16-Core Processor                Unknown CPU @ 3.0GHz
    BIOS CPU family:      107
    CPU family:           25
    ...

16 SMT2 cores (siblings are 16-31)


#lsblk -N
NAME    TYPE MODEL                      SERIAL              REV TRAN   RQ-SIZE  MQ
nvme3n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605178 GDC5602Q nvme      1023  32
nvme2n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605128 GDC5602Q nvme      1023  32
nvme0n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605125 GDC5602Q nvme      1023  32
nvme1n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605127 GDC5602Q nvme      1023  32

Where nvme0n1 is the one I'm actually using.

I'm on 6.12.0-0.rc5.44.eln143.x86_64  which is v6.12-rc5 with RHEL .config.  This
should have little to no franken-kernel bits but now that I have the machine
I'll build from upstream (with the RHEL .config still) to make sure.

We did see it on all the RCs so far. 


> Anyway I have some follow-ups, first let me trim the fio command for readability:
> fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
> 
> Dropping the defaults nrfiles, loops, fsync, randrepeat:
> fio --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
> 
> Adding the CPU affinities directly:
> fio --cpus_allowed 1-8 --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
>

Fair enough.  It should work the same with taskset, I suppose, except for the bit below. I was
given this by our performance team. They have a framework that produces nice html
pages with red and green results and graphs and whatnot.  Right now it's in the form of
a script that pulls the KB/s number out of the json output, which is nice and keeps me
from going cross-eyed looking at the full fio run output.
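
For anyone who wants to do the same without the framework, here is a minimal sketch of
that kind of extraction. It is not the performance team's script. It assumes fio was run
with --output-format=json and that each entry under "jobs" carries a "write" section whose
"bw" field is in KiB/s; the function name write_bw_mb_per_s is made up for illustration.

```python
import json


def write_bw_mb_per_s(fio_json_text):
    """Return the aggregate write bandwidth in MB/s from fio JSON output text.

    Assumption: fio reports jobs[i]["write"]["bw"] in KiB/s.
    """
    data = json.loads(fio_json_text)
    # With --group_reporting there is normally a single entry per group,
    # but summing keeps the helper usable without that flag as well.
    total_kib_per_s = sum(job["write"]["bw"] for job in data["jobs"])
    return total_kib_per_s * 1024 / 1e6
```

Feeding it the contents of the file written by fio's --output option gives one number per
run, which is easier to compare across DELAY_DEQUEUE and NO_DELAY_DEQUEUE.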

> Now I was wondering about the following:
> Is it actually the kworker (not another fio) being preempted? (I'm pretty sure it is)
> To test: --cpus_allowed_policy split (each fio process gets its own CPU).

With --cpus_allowed and --cpus_allowed_policy split the results with DELAY_DEQUEUE are
better (540 MB/s), but with NO_DELAY_DEQUEUE they are also better (640 MB/s). It was
510 MB/s and 590 MB/s before.
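
To put those pairs on the same scale, the relative drop can be worked out as below. This
is just arithmetic on the numbers quoted above; the helper name relative_drop is made up
for illustration.

```python
def relative_drop(baseline, measured):
    """Percentage by which `measured` falls short of `baseline`."""
    return (baseline - measured) / baseline * 100.0


# (NO_DELAY_DEQUEUE, DELAY_DEQUEUE) randwrite bandwidth in MB/s, from this thread.
runs = {
    "taskset": (590, 510),
    "cpus_allowed_policy split": (640, 540),
}

for name, (no_delay, delay) in runs.items():
    print(f"{name}: {relative_drop(no_delay, delay):.1f}% lower with DELAY_DEQUEUE")
```

So splitting the CPUs lifts both configurations, but the gap between them stays at
roughly the same relative size (about 14% and 16%).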

> 
> You wrote:
> >I was thinking maybe the preemption was preventing some batching of IO completions or
> >initiations. But that was wrong it seems.
> 
> So while it doesn't reproduce for me, the only thing being preempted regularly is
> the kworker (running iomap_dio_complete_work). I don't quite follow the "that was
> wrong it seems" part then. Could you elaborate?
>

I was thinking that the fio batch test, along with disabling WAKEUP_PREEMPTION,
was telling me that it wasn't an over-preemption issue, but I could be wrong
about that too...


> Could you also post the other benchmark numbers? Does any of them score higher in IOPS?
> Is --rw write the same issue if you set --bs 4k (assuming you set a larger bs for seqwrite)?
>

I don't have numbers for all of the other flavors but I ran --rw write --bs 4k:

DELAY_DEQUEUE     ~590MB/s
NO_DELAY_DEQUEUE  ~840MB/s

Those results are not good for DELAY_DEQUEUE either.

> Can you set the kworkers handling completions to SCHED_BATCH too? Just to confirm.

I think I did the wrong kworkers the first time. So I'll try again to figure out which
kworkers to twiddle (or I'll just do all 227 of them...).
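
A rough sketch of the "do all of them" approach, in case it is useful to others trying to
reproduce this. It assumes kworker threads can be recognized by a /proc/<pid>/comm value
starting with "kworker/" and that the caller has the privileges to change their policy
(normally root). The function names kworker_pids and set_sched_batch are hypothetical.

```python
import os


def kworker_pids(proc_root="/proc"):
    """Return sorted PIDs whose comm starts with "kworker/"."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited between listdir() and open()
        if comm.startswith("kworker/"):
            pids.append(int(entry))
    return sorted(pids)


def set_sched_batch(pids):
    """Switch each PID to SCHED_BATCH (static priority must be 0)."""
    for pid in pids:
        os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
```

Calling set_sched_batch(kworker_pids()) as root would move every kworker that exists at
that moment; kworkers spawned later are not covered, so the list may need refreshing
while the test runs.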



Thanks,
Phil


> 
> Regards,
> Christian
> 

--