Message-ID: <20160810230840.GS16044@dastard>
Date: Thu, 11 Aug 2016 09:08:40 +1000
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: kernel test robot <xiaolong.ye@...el.com>,
Christoph Hellwig <hch@....de>,
Bob Peterson <rpeterso@...hat.com>,
LKML <linux-kernel@...r.kernel.org>, LKP <lkp@...org>
Subject: Re: [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 10, 2016 at 11:24:16AM -0700, Linus Torvalds wrote:
> On Tue, Aug 9, 2016 at 7:33 AM, kernel test robot <xiaolong.ye@...el.com> wrote:
> >
> > FYI, we noticed a -13.6% regression of aim7.jobs-per-min due to commit:
> > 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> >
> > in testcase: aim7
> > on test machine: 48 threads Ivytown Ivy Bridge-EP with 64G memory
> > with following parameters:
> >
> > disk: 1BRD_48G
> > fs: xfs
> > test: disk_wrt
> > load: 3000
> > cpufreq_governor: performance
>
> Christoph, Dave, was this expected?
No. I would have expected the performance to go the other way -
there is less overhead in the write() path now than there was
previously, and all my numbers go the other way (5-10%
improvements) in throughput.
> From looking at the numbers, it looks like much more IO going on (and
> thus less CPU load)...
I read the numbers the other way, and to me they don't indicate
anything about the IO load.
> > 37.23 ± 0% +15.6% 43.04 ± 0% aim7.time.elapsed_time
> > 37.23 ± 0% +15.6% 43.04 ± 0% aim7.time.elapsed_time.max
> > 6424 ± 1% +31.3% 8432 ± 1% aim7.time.involuntary_context_switches
> > 4003 ± 0% +28.1% 5129 ± 1% proc-vmstat.nr_active_file
> > 979.25 ± 0% +63.7% 1602 ± 1% proc-vmstat.pgactivate
> > 4699 ± 3% +162.6% 12340 ± 73% proc-vmstat.pgpgout
> > 50.23 ± 19% -27.3% 36.50 ± 17% sched_debug.cpu.cpu_load[1].avg
> > 466.50 ± 29% -51.8% 225.00 ± 73% sched_debug.cpu.cpu_load[1].max
> > 77.78 ± 33% -50.6% 38.40 ± 57% sched_debug.cpu.cpu_load[1].stddev
> > 300.50 ± 33% -52.9% 141.50 ± 48% sched_debug.cpu.cpu_load[2].max
> > 1836 ± 10% +65.5% 3039 ± 8% slabinfo.scsi_data_buffer.active_objs
> > 1836 ± 10% +65.5% 3039 ± 8% slabinfo.scsi_data_buffer.num_objs
> > 431.75 ± 10% +65.6% 715.00 ± 8% slabinfo.xfs_efd_item.active_objs
> > 431.75 ± 10% +65.6% 715.00 ± 8% slabinfo.xfs_efd_item.num_objs
>
> but what do I know. Those profiles from the robot are pretty hard to
> make sense of.
Yup, I can't infer anything from most of the stats present. The only
thing that stood out is that there's clearly a significant reduction
in context switches:
429058 ± 0% -20.0% 343371 ± 0% aim7.time.voluntary_context_switches
....
972882 ± 0% -17.4% 803990 ± 0% perf-stat.context-switches
and a significant increase in system CPU time:
376.31 ± 0% +28.5% 483.48 ± 0% aim7.time.system_time
....
1.452e+12 ± 6% +29.5% 1.879e+12 ± 4% perf-stat.instructions
42168 ± 16% +27.5% 53751 ± 6% perf-stat.instructions-per-iTLB-miss
It looks to me like the extra system time comes from running more
loops through the same code footprint, not from executing a bigger
or different footprint of code.
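A quick back-of-envelope check of that reading, using the robot's own
numbers (assumption: the total work per run is fixed by the load
parameter, so whole-run deltas track per-job deltas):

```shell
# perf-stat.instructions rose from 1.452e12 to 1.879e12 while elapsed
# time rose only 15.6% -- i.e. we retire ~29% more instructions to do
# the same amount of work.
awk 'BEGIN {
    old = 1.452e12; new = 1.879e12
    printf "instructions delta: %+.1f%%\n", (new - old) / old * 100
}'
# -> instructions delta: +29.4%
```

That matches the +28.5% in aim7.time.system_time almost exactly, which
is what you'd expect if the extra cycles are spent spinning rather
than doing new work.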
That, to me, says there's a change in lock contention behaviour in
the workload (which we know aim7 is good at exposing). i.e. the
iomap change shifted contention from a sleeping lock to a spinning
lock, or maybe we now trigger optimistic spinning behaviour on a
lock we previously didn't spin on at all.
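If the test machine's kernel has CONFIG_LOCK_STAT=y, the kernel's lock
statistics would confirm which lock the contention moved to. A sketch
(assumption: root access on the test machine, lockstat-enabled kernel):

```shell
# Clear any accumulated stats, enable collection, run the workload,
# then stop collection and look at the top contended locks.
echo 0 > /proc/lock_stat              # clear existing statistics
echo 1 > /proc/sys/kernel/lock_stat   # enable collection
# ... run the aim7 disk_wrt workload here ...
echo 0 > /proc/sys/kernel/lock_stat   # stop collection
head -n 40 /proc/lock_stat            # most contended locks, wait times
```

The contentions and waittime columns would show directly whether a
spinlock has become hot where a sleeping lock used to be.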
We really need instruction level perf profiles to understand
this - I don't have a machine with this many cpu cores available
locally, so I'm not sure I'm going to be able to make any progress
tracking it down in the short term. Maybe the lkp team has more
in-depth cpu usage profiles they can share?
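For reference, something along these lines is what I mean by
instruction-level profiles (assumption: perf is available on the test
box, and the symbol names here are placeholders):

```shell
# System-wide, call-graph, precise-cycles profile while aim7 runs.
perf record -a -g -e cycles:pp -- sleep 30
perf report --sort symbol       # find the hot (spinning?) symbols
perf annotate <hot-symbol>      # drill down to the hot instructions
```

If the theory above is right, the annotated output should show the
cycles piling up on a cmpxchg/pause loop inside a lock slow path.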
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com