linux-kernel - Re: [linus:master] [block] e70c301fae: stress-ng.aiol.ops_per

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z4ipFFdAppraxrmA@xsang-OptiPlex-9020>
Date: Thu, 16 Jan 2025 14:37:08 +0800
From: Oliver Sang <oliver.sang@...el.com>
To: Niklas Cassel <cassel@...nel.org>
CC: Christoph Hellwig <hch@....de>, <oe-lkp@...ts.linux.dev>, <lkp@...el.com>,
	<linux-kernel@...r.kernel.org>, Jens Axboe <axboe@...nel.dk>,
	<linux-block@...r.kernel.org>, <virtualization@...ts.linux.dev>,
	<linux-nvme@...ts.infradead.org>, Damien Le Moal <dlemoal@...nel.org>,
	<linux-btrfs@...r.kernel.org>, <linux-aio@...ck.org>, <oliver.sang@...el.com>
Subject: Re: [linus:master] [block]  e70c301fae: stress-ng.aiol.ops_per_sec
 49.6% regression

hi, Niklas,

On Wed, Jan 15, 2025 at 12:42:33PM +0100, Niklas Cassel wrote:
> Hello Oliver,
> 
> On Fri, Jan 10, 2025 at 02:53:08PM +0800, Oliver Sang wrote:
> > On Wed, Jan 08, 2025 at 11:39:28AM +0100, Niklas Cassel wrote:
> > > > > Oliver, which I/O scheduler are you using?
> > > > > $ cat /sys/block/sdb/queue/scheduler 
> > > > > none mq-deadline kyber [bfq]
> > > > 
> > > > while our test running:
> > > > 
> > > > # cat /sys/block/sdb/queue/scheduler
> > > > none [mq-deadline] kyber bfq
> > > 
> > > The stddev numbers you showed is all over the place, so are we certain
> > > if this is a regression caused by commit e70c301faece ("block:
> > > don't reorder requests in blk_add_rq_to_plug") ?
> > > 
> > > Do you know if the stddev has such big variation for this test even before
> > > the commit?
> > 
> > in order to address your concern, we rebuild kernels for e70c301fae and its
> > parent a3396b9999, also for v6.12-rc4. the config is still same as shared
> > in our original report:
> > https://download.01.org/0day-ci/archive/20241212/202412122112.ca47bcec-lkp@intel.com/config-6.12.0-rc4-00120-ge70c301faece
> 
> Thank you for putting in the work to do some extra tests.
> 
> (Doing performance regression testing is really important IMO,
> as without it you are essentially in the blind.
> Thank you guys for taking on the role of this important work!)
> 
> 
> Looking at the extended number of iterations that you've in this email,
> it is quite clear that e70c301faece, at least with the workload provided
> by stress-ng + mq-deadline, introduced a regression:
> 
>        v6.12-rc4 a3396b99990d8b4e5797e7b16fd e70c301faece15b618e54b613b1
> ---------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \
>     187.64 ±  5%      -0.6%     186.48 ±  7%     -47.6%      98.29 ± 17%  stress-ng.aiol.ops_per_sec
> 
> 
> 
> 
> Looking at your results from stress-ng + none scheduler:
> 
>          %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \
>     114.62 ± 19%      -1.9%     112.49 ± 17%     -32.4%      77.47 ± 21%  stress-ng.aiol.ops_per_sec
> 
> 
> Which shows a change, but -32% rather than -47%, also seems to suggest a
> regression for the stress-ng workload.
> 
> 
> 
> 
> Looking closer at the raw number for stress-ng + none scheduler, in your
> other email, it seems clear that the raw values from the stress-ng workload
> can vary quite a lot. In the long run, I wonder if we perhaps can find a
> workload that has less variation. E.g. fio test for IOPS and fio test for
> throughout. But perhaps such workloads are already part of lkp-tests?

yes, we have fio tests [1].
as in [2], we get it from https://github.com/axboe/fio
not sure if it's just the fio you mentioned?

our framework is basically automatic. bot merged repo/branches it monitors
into so-called hourly kernel, then if found performance difference with base,
bisect will be triggered to capture which commit causes the change.

due to resource constraint, we cannot allot all testsuites (we have around 80)
to all platforms, and there are other various reasons which could cause us to
miss some performance differences.

if you have interests, could you help check those fio-basic-*.yaml files under
[3]? if you can spot out the correct case, we could do more tests to check
e70c301fae and its parent. thanks!

[1] https://github.com/intel/lkp-tests/tree/master/programs/fio
[2] https://github.com/intel/lkp-tests/blob/master/programs/fio/pkg/PKGBUILD
[3] https://github.com/intel/lkp-tests/tree/master/jobs

> 
> 
> Kind regards,
> Niklas