Message-ID: <20241104125009.GA749675@pauld.westford.csb>
Date: Mon, 4 Nov 2024 07:50:09 -0500
From: Phil Auld <pauld@...hat.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, wuyun.abel@...edance.com,
	youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue


Hi Dietmar,

On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> Hi Phil,
> 
> On 01/11/2024 13:47, Phil Auld wrote:
> > 
> > Hi Peter,
> > 
> > On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
> >> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> >> noting that lag is fundamentally a temporal measure. It should not be
> >> carried around indefinitely.
> >>
> >> OTOH it should also not be instantly discarded, doing so will allow a
> >> task to game the system by purposefully (micro) sleeping at the end of
> >> its time quantum.
> >>
> >> Since lag is intimately tied to the virtual time base, a wall-time
> >> based decay is also insufficient, notably competition is required for
> >> any of this to make sense.
> >>
> >> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> >> competing until they are eligible.
> >>
> >> Strictly speaking, we only care about keeping them until the 0-lag
> >> point, but that is a difficult proposition, instead carry them around
> >> until they get picked again, and dequeue them at that point.
> > 
> > This one is causing a 10-20% performance hit on our filesystem tests.
> > 
> > On 6.12-rc5 (so with the latest follow-ons) we get:
> > 
> > with DELAY_DEQUEUE the bandwidth is 510 MB/s
> > with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> > 
> > The test is fio, something like this:
> > 
> > taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
> 
> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> (2024-10-26 Johannes Weiner)) (6.12.0-rc4-based)
> 
> Using 'taskset 0xaaaaa' to avoid SMT and run only on the P-cores.
> 
> vanilla features: 990MB/s (mean out of 5 runs, σ:  9.38)
> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> 
> # sudo lshw -class disk -class storage
>   *-nvme                    
>        description: NVMe device
>        product: GIGABYTE GP-ASM2NE6500GTTD
>        vendor: Phison Electronics Corporation
>        physical id: 0
>        bus info: pci@...0:01:00.0
>        logical name: /dev/nvme0
>        version: EGFM13.2
>        ...
>        capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>        configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>        resources: irq:16 memory:70800000-70803fff
> 
> # mount | grep ^/dev/nvme0
> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> 
> Which disk device are you using?

Most of the reports are on various NVMe drives (mostly Samsung, I think).


One thing I should add is that it's all on LVM: 


vgcreate vg /dev/nvme0n1 -y
lvcreate -n thinMeta -L 3GB vg -y
lvcreate -n thinPool -l 99%FREE vg -y
lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
lvcreate -n testLV -V 1300G --thinpool thinPool vg
wipefs -a /dev/mapper/vg-testLV
mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
mount /dev/mapper/vg-testLV /testfs 
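
In case it helps with reproducing: DELAY_DEQUEUE can be flipped at runtime
through the sched features debugfs file between fio runs. A rough sketch
(assuming a recent kernel with debugfs mounted at /sys/kernel/debug; older
kernels expose this as /sys/kernel/debug/sched_features instead):

# check the current state of the flag
cat /sys/kernel/debug/sched/features | tr ' ' '\n' | grep DELAY_DEQUEUE

# run the fio job quoted above with the feature enabled (the default),
# then disable it and repeat the same run against /testfs
echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features

# re-enable it afterwards
echo DELAY_DEQUEUE > /sys/kernel/debug/sched/features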


With VDO or a thin pool (as above) it shows up on both ext4 and xfs. With the
fs on the drive directly it's a little more variable: on some setups it shows up
on xfs, on others on ext4 and not vice versa; it seems to depend on the drive or
hw raid. But once it shows up, it's 100% reproducible on that setup.

It's always the randwrite numbers. The rest look fine.

Also, I'm not doing this testing personally yet; I'm just looking into it and
passing on the information I have.


Thanks for taking a look. 

Cheers,
Phil

> 
> > 
> > In this case it's ext4, but I'm not sure it will be FS-specific.
> > 
> > I should have the machine and setup next week to poke at it further, but I wanted
> > to mention it now just in case anyone has an "aha" moment.
> > 
> > It seems to only affect these FS loads. Other perf tests are not showing any
> > issues that I am aware of.
> 
> [...]
> 

-- 

