linux-kernel - Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 15 Aug 2016 15:00:16 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Tejun Heo <tj@...nel.org>, Wu Fengguang <fengguang.wu@...el.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Christoph Hellwig <hch@....de>,
	"Huang, Ying" <ying.huang@...el.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Bob Peterson <rpeterso@...hat.com>, LKP <lkp@...org>
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

On Sun, Aug 14, 2016 at 07:53:40PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner <david@...morbit.com> wrote:
> >>
> >> Maybe your symbol table came from a old kernel, and functions moved
> >> around enough that the profile attributions ended up bogus.
> >
> > No, I don't think so. I don't install symbol tables on my test VMs,
> > I let /proc/kallsyms do that work for me. From an strace of 'perf
> > top -U -g":
> 
> Ok. But something is definitely wrong with your profile.
> 
> What does it say if you annotate that _raw_spin_unlock_irqrestore() function?
....

raw_spin_unlock_irqrestore  /proc/kcore
       ¿
       ¿
       ¿
       ¿    Disassembly of section load0:
       ¿
       ¿    ffffffff81e628b0 <load0>:
       ¿      nop
       ¿      push   %rbp
       ¿      mov    %rsp,%rbp
       ¿      movb   $0x0,(%rdi)
       ¿      nop
       ¿      mov    %rsi,%rdi
       ¿      push   %rdi
       ¿      popfq
 99.35 ¿      nop
       ¿      decl   %gs:0x7e1a9bc7(%rip)
  0.65 ¿    ¿ je     25
       ¿      pop    %rbp
       ¿    ¿ retq
       ¿25:   callq  0xffffffff81002000
       ¿      pop    %rbp
       ¿    ¿ retq


> I guarantee you that no, it's not spending 41% of time in
> spin_unlock_irqrestore. That just isn't a valid profile. There's
> something seriously wrong somewhere.
> 
> The fact that you then get the same profile when you turn _off_
> CONFIG_DEBUG_SPINLOCK only proves there is something going on that is
> pure garbage.
> 
> I suspect that what you did was to edit the .config file, remove
> DEBUG_SPINLOCK, and then do "make oldconfig" again.

Yes.

> And it got turned on again,

No. I'm not that stupid - I checked:

$ grep SPINLOCK .config
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_DEBUG_SPINLOCK is not set
$

> because you have one of the lock debugging
> options on that force spinlock debuggin on again:

>  - DEBUG_WW_MUTEX_SLOWPATH
>  - DEBUG_LOCK_ALLOC
>  - PROVE_LOCKING

None of which are set:

$ grep 'DEBUG\|PROVE' .config |grep -v '#'
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_RODATA=y
CONFIG_SLUB_DEBUG=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PM_DEBUG=y
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_DEBUG_DEVRES=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_XFS_DEBUG=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
CONFIG_CIFS_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_LIST=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_DEBUG_BOOT_PARAMS=y
$

> [ Light goes on ]
> 
> Oh, no, I can see another possibility: you're not doing proper CPU
> profiles, you're doing some timer-irq profile, and the reason you get
> 41% on spin_unlock_irq_restore() is that that is where the interrupts
> are enabled again.
> 
> Timer-interrupt based profiles are not useful either.

I've just been using whatever perf defaults to. Defaults are
supposed to be useful - if they aren't then perf needs to be fixed.
perf top reports this by default:

Samples: 118K of event 'cpu-clock', Event count (approx.): 793748915
Overhead  Shared O  Symbol                                                                                                                                             ¿
  34.48%  [kernel]  [k] _raw_spin_unlock_irqrestore                                                                                                                    ¿
   7.89%  [kernel]  [k] copy_user_generic_string                                                                                                                       ¿
   5.08%  [kernel]  [k] _raw_spin_unlock_irq

...

> Make sure you actually use "perf record -e cycles:pp" or something
> that uses PEBS to get real profiles using CPU performance counters.

WTF is PEBS? I'm not a CPU nerd, and I certainly don't expect to
have to learn all the intricacies of hardware performance counters
just to profile the kernel in a correct and sane manner. That's what
the *perf defaults* are supposed to do.


Anyway: `perf top -U -e cycles:pp`:

Samples: 301K of event 'cpu-clock:ppH', Event count (approx.): 69364814
Overhead  Shared O  Symbol                                                                                                                                             ¿
  30.89%  [kernel]  [k] _raw_spin_unlock_irqrestore                                                                                                                    ¿
   7.04%  [kernel]  [k] _raw_spin_unlock_irq                                                                                                                           ¿
   4.08%  [kernel]  [k] copy_user_generic_string                                                                                                                       ¿
   2.44%  [kernel]  [k] get_page_from_freelist                                                                                                                         ¿
   1.81%  [kernel]  [k] _raw_spin_lock

No change.

$ sudo perf record -e cycles:pp -a --all-kernel -- xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey

# Samples: 2M of event 'cpu-clock:khppH'
# Event count (approx.): 588517250000
#
# Overhead  Command          Shared Object      Symbol                                    
# ........  ...............  .................  ..........................................
#
    83.09%  swapper          [kernel.kallsyms]  [k] native_safe_halt
     1.42%  xfs_io           [kernel.kallsyms]  [k] copy_user_generic_string
     1.26%  kswapd3          [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.24%  kswapd1          [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.09%  kswapd2          [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.98%  kswapd0          [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.80%  xfs_io           [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.77%  kworker/u34:2    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.73%  xfs_io           [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     0.51%  xfs_io           [kernel.kallsyms]  [k] get_page_from_freelist
     0.39%  xfs_io           [kernel.kallsyms]  [k] __block_commit_write.isra.29
     0.16%  xfs_io           [kernel.kallsyms]  [k] _raw_spin_lock
     0.14%  xfs_io           [kernel.kallsyms]  [k] up_write
     0.14%  kworker/u34:2    [kernel.kallsyms]  [k] clear_page_dirty_for_io
     0.14%  kworker/u34:2    [kernel.kallsyms]  [k] xfs_do_writepage
....

It's exactly the same profile, just reported as a percentage of 16
CPUs rather than normalised to a single CPU.  From my ignorant
viewpoing, I'd say that's expected because perf is still using
"cpu-clock" event configuration.

The hardware event counters are undocumented in the perf man pages,
perf-list doesn't output a single "cpu" or "cycles" event counter,
or even what hardware event counters are available. Hence I've got
no idea if it's broken, why "cycles" (or "cpu-cycles") doesn't
apparently record "cycle" triggered events, or even what perf is
supposed to tell me is it's recording cycle triggered events.

-Dave.
-- 
Dave Chinner
david@...morbit.com