Message-ID: <20160816220250.GI16044@dastard>
Date: Wed, 17 Aug 2016 08:02:50 +1000
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Bob Peterson <rpeterso@...hat.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
"Huang, Ying" <ying.huang@...el.com>,
Christoph Hellwig <hch@....de>,
Wu Fengguang <fengguang.wu@...el.com>, LKP <lkp@...org>,
Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 06:51:42PM -0700, Linus Torvalds wrote:
> Anyway, including the direct reclaim call paths gets
> __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
> 0.26%. But perhaps more importantly, looking at what __remove_mapping
> actually *does* (apart from the spinlock) gives us:
>
> - inside remove_mapping itself (0.11% on its own - flat cost, no
> child accounting)
>
> 48.50 │ lock cmpxchg %edx,0x1c(%rbx)
>
> so that's about 0.05%
>
> - 0.40% __delete_from_page_cache (0.22%
> radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup)
>
> - 0.06% workingset_eviction()
>
> so I'm not actually seeing anything *new* expensive in there. The
> __delete_from_page_cache() overhead may have changed a bit with the
> tagged tree changes, but this doesn't look like memcg.
>
> But we clearly have very different situations.
>
> What does your profile show when you actually dig into
> __remove_mapping() itself? Looking at your flat profile, I'm assuming
> you get
-   22.26%     0.93%  [kernel]   [k] __remove_mapping
   - 3.86% __remove_mapping
      - 18.35% _raw_spin_lock_irqsave
           __pv_queued_spin_lock_slowpath
        1.32% __delete_from_page_cache
      - 0.92% _raw_spin_unlock_irqrestore
           __raw_callee_save___pv_queued_spin_unlock
And the instruction level profile:
.....
│ xor %ecx,%ecx
│ mov %rax,%r15
0.39 │ mov $0x2,%eax
│ lock cmpxchg %ecx,0x1c(%rbx)
32.56 │ cmp $0x2,%eax
│ jne 12e
│ mov 0x20(%rbx),%rax
│ lea -0x1(%rax),%rdx
0.39 │ test $0x1,%al
│ cmove %rbx,%rdx
│ mov (%rdx),%rax
0.39 │ test $0x10,%al
│ jne 127
│ mov (%rbx),%rcx
│ shr $0xf,%rcx
│ and $0x1,%ecx
│ jne 14a
│ mov 0x68(%r14),%rax
36.03 │ xor %esi,%esi
│ test %r13b,%r13b
│ mov 0x50(%rax),%rdx
1.16 │ jne e8
0.96 │ a9: mov %rbx,%rdi
.....
Indicates most time is spent on the cmpxchg for the page ref, followed
by grabbing the ->freepage op out of the a_ops vector:
freepage = mapping->a_ops->freepage;
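The cmpxchg is the refcount freeze __remove_mapping() does while holding
the mapping lock: atomically replace the expected page refcount of 2 with
0 so nobody can re-take a reference while the page is torn down. As a
minimal userspace sketch of that pattern (GCC builtins and made-up names
standing in for the kernel's page_ref_freeze(); not the kernel code):

/*
 * freeze.c - userspace illustration only, not __remove_mapping().
 * Build and run:  cc -o freeze freeze.c && ./freeze
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	int refcount;
};

/* Like page_ref_freeze(page, count): succeed only if the refcount is
 * exactly 'expected', atomically replacing it with 0. */
static bool ref_freeze(struct fake_page *page, int expected)
{
	return __atomic_compare_exchange_n(&page->refcount, &expected, 0,
					   false, __ATOMIC_ACQ_REL,
					   __ATOMIC_RELAXED);
}

int main(void)
{
	struct fake_page page = { .refcount = 2 };

	if (ref_freeze(&page, 2))
		printf("froze refcount, safe to remove from page cache\n");
	else
		printf("lost the race, leave the page alone\n");
	return 0;
}

The sketch is only there to show which operation the hot
"lock cmpxchg %ecx,0x1c(%rbx)" above corresponds to.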
> I come back to wondering whether maybe you're hitting some PV-lock problem.
>
> I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
> __pv_queued_spin_lock_slowpath() is.
It's the same code AFAICT, except the pv version jumps straight to
the "queue" case.
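Structurally it looks like the sketch below: one slowpath body, where the
PV build branches past the pending-bit optimisation and goes straight to
the MCS queueing code. This is a standalone toy with stub names (only
pv_enabled() mirrors an actual kernel helper), not the kernel source;
build with -DPV_BUILD to take the PV branch:

#include <stdbool.h>
#include <stdio.h>

/* In the kernel this is a build-time decision (CONFIG_PARAVIRT_SPINLOCKS);
 * here it is a compile-time flag so the sketch stands alone. */
static bool pv_enabled(void)
{
#ifdef PV_BUILD
	return true;
#else
	return false;
#endif
}

/* Stub for the cheap "set the pending bit and spin briefly" path. */
static bool try_pending_path(void)
{
	printf("native: pending-bit path\n");
	return false;		/* pretend it failed so we fall through */
}

static void slowpath(void)
{
	if (pv_enabled())
		goto queue;	/* PV variant: straight to queueing */

	if (try_pending_path())
		return;		/* got the lock without queueing */
queue:
	/* Both variants: grab a per-CPU MCS node, join the queue, and
	 * spin (or, for PV, park the vCPU) until the lock is handed over. */
	printf("queueing on the MCS list\n");
}

int main(void)
{
	slowpath();
	return 0;
}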
> So I'd love to see you try the non-PV case, but I also think it might
> be interesting to see what the instruction profile for
> __pv_queued_spin_lock_slowpath() itself is. They share a lot of code
> (there's some interesting #include games going on to make
> queued_spin_lock_slowpath() actually *be*
> __pv_queued_spin_lock_slowpath() with some magic hooks), but there
> might be issues.
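The "#include games" in question: kernel/locking/qspinlock.c compiles its
own body a second time with the PV hooks defined and the function renamed
to __pv_queued_spin_lock_slowpath. A self-contained toy version of that
trick (file and function names here are made up for illustration):

/*
 * twice.c - toy version of "compile the same slowpath twice".
 * Build and run:  cc -o twice twice.c && ./twice
 * (The self-#include below relies on the file being named twice.c.)
 */
#include <stdio.h>

#ifndef GEN_PV_VARIANT

/* First pass: native variant, the hook compiles away. */
#define pv_hook()	do { } while (0)
#define slowpath	native_slowpath

#else

/* Second pass: PV variant, the hook does real work and the function
 * gets a new name, as __pv_queued_spin_lock_slowpath does. */
#undef pv_hook
#undef slowpath
#define pv_hook()	printf("  pv hook called\n")
#define slowpath	pv_slowpath

#endif

/* One shared body, compiled twice with different macro definitions. */
void slowpath(void)
{
	printf("%s: shared slowpath body\n", __func__);
	pv_hook();
}

#ifndef GEN_PV_VARIANT
/* Generate the PV variant by including this file a second time. */
#define GEN_PV_VARIANT
#include "twice.c"

int main(void)
{
	native_slowpath();
	pv_slowpath();
	return 0;
}
#endif

And the instruction level profile for __pv_queued_spin_lock_slowpath: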
0.03 │ data16 data16 data16 xchg %ax,%ax
│ push %rbp
0.00 │ mov %rsp,%rbp
0.01 │ push %r15
│ push %r14
│ push %r13
0.01 │ push %r12
│ mov $0x18740,%r12
│ push %rbx
│ mov %rdi,%rbx
│ sub $0x10,%rsp
│ add %gs:0x7ef0d0e0(%rip),%r12
│ movslq 0xc(%r12),%rax
0.02 │ mov %gs:0x7ef0d0db(%rip),%r15d
│ add $0x1,%r15d
│ shl $0x12,%r15d
│ lea 0x1(%rax),%edx
0.01 │ mov %edx,0xc(%r12)
│ mov %eax,%edx
│ shl $0x4,%rax
│ add %rax,%r12
│ shl $0x10,%edx
│ movq $0x0,(%r12)
0.02 │ or %edx,%r15d
│ mov %gs:0x7ef0d0ad(%rip),%eax
0.00 │ movl $0x0,0x8(%r12)
0.01 │ mov %eax,0x40(%r12)
│ movb $0x0,0x44(%r12)
│ mov (%rdi),%eax
0.88 │ test %ax,%ax
│ jne 8f
0.02 │ mov $0x1,%edx
│ lock cmpxchg %dl,(%rdi)
0.38 │ test %al,%al
│ je 14a
0.02 │ 8f: mov %r15d,%eax
│ shr $0x10,%eax
│ xchg %ax,0x2(%rbx)
2.07 │ shl $0x10,%eax
│ test %eax,%eax
│ jne 171
│ movq $0x0,-0x30(%rbp)
0.02 │ ac: movzbl 0x44(%r12),%eax
0.97 │ mov $0x1,%r13d
│ mov $0x100,%r14d
│ cmp $0x2,%al
│ sete %al
│ movzbl %al,%eax
│ mov %rax,-0x38(%rbp)
0.00 │ ca: movb $0x0,0x44(%r12)
0.00 │ mov $0x8000,%edx
│ movb $0x1,0x1(%rbx)
│ jmp e6
0.04 │ db: pause
8.04 │ sub $0x1,%edx
│ je 229
│ e6: movzbl (%rbx),%eax
7.54 │ test %al,%al
│ jne db
0.10 │ mov %r14d,%eax
0.06 │ lock cmpxchg %r13w,(%rbx)
2.93 │ cmp $0x100,%ax
│ jne db
│ fc: mov (%rbx),%edx
0.37 │ mov $0x1,%ecx
│ or $0x1,%edx
│ jmp 114
0.01 │108: mov %edx,%eax
│ lock cmpxchg %ecx,(%rbx)
0.26 │ cmp %edx,%eax
│ je 14a
│ mov %eax,%edx
│114: mov %edx,%eax
0.00 │ xor %ax,%ax
│ cmp %r15d,%eax
│ je 108
0.01 │ cmpq $0x0,-0x30(%rbp)
│ movb $0x1,(%rbx)
│ je 251
│12c: mov -0x30(%rbp),%rsi
0.01 │ mov $0x1,%eax
│ mov $0x2,%edx
│ movl $0x1,0x8(%rsi)
0.11 │ lock cmpxchg %dl,0x44(%rsi)
2.34 │ cmp $0x1,%al
│ je 160
│14a: decl %gs:0x7ef1b5bb(%rip)
0.02 │ add $0x10,%rsp
│ pop %rbx
│ pop %r12
0.00 │ pop %r13
│ pop %r14
│ pop %r15
│ pop %rbp
│ retq
│160: mov -0x30(%rbp),%rsi
│ movb $0x3,(%rbx)
│ mov %rbx,%rdi
│ callq 0xffffffff810fcf90
│ jmp 14a
│171: lea 0x44(%r12),%r14
│ mov %rax,%r13
│ shr $0x12,%eax
│ shr $0xc,%r13
│ sub $0x1,%eax
│ and $0x30,%r13d
│ cltq
│ add $0x18740,%r13
│ add -0x7d8164c0(,%rax,8),%r13
0.03 │ mov %r12,0x0(%r13)
0.38 │19c: mov $0x8000,%eax
│ jmp 1b7
0.04 │1a3: test %al,%al
│ jne 1b0
│ movzbl 0x44(%r13),%edx
1.66 │ test %dl,%dl
│ jne 1f1
1.75 │1b0: pause
64.57 │ sub $0x1
0.04 │ mov 0x8(%r12),%eax
0.03 │ test %eax,%eax
│ jne 1d4
│1c9: pause
│ mov 0x8(%r12),%eax
│ test %eax,%eax
│ je 1c9
│1d4: mov (%r12),%rax
│ test %rax,%rax
│ mov %rax,-0x30(%rbp)
0.05 │ je ac
│ mov -0x30(%rbp),%rax
│ prefet (%rax)
0.25 │ jmpq ac
│1f1: mov $0x1,%eax
│ xchg %al,0x44(%r12)
│ mov 0x8(%r12),%eax
│ test %eax,%eax
│ jne 213
│ mov %r14,%rdi
│ mov $0x1,%esi
│ callq 0xffffffff8109f7a0
│ xchg %ax,%ax
│213: mov $0x1,%eax
│ xor %edi,%edi
│ lock cmpxchg %dil,(%r14)
│ mov 0x8(%r12),%eax
│ jmpq 19c
│229: cmpq $0x0,-0x38(%rbp)
│ movb $0x0,0x1(%rbx)
│ je 276
│234: movb $0x1,0x44(%r12)
│ mov $0x3,%esi
│ mov %rbx,%rdi
│ callq 0xffffffff8109f7a0
│ xchg %ax,%ax
│ movzbl (%rbx),%eax
│ jmpq ca
│251: mov (%r12),%rax
0.14 │ test %rax,%rax
│ mov %rax,-0x30(%rbp)
│ jne 12c
│262: pause
0.31 │ mov (%r12),%rax
│ test %rax,%rax
│ je 262
│ mov %rax,-0x30(%rbp)
│ jmpq 12c
│276: mov %r12,%rsi
│ mov %rbx,%rdi
│ callq 0xffffffff810fcf90
│ mov %rax,-0x38(%rbp)
│ mov $0x3,%eax
│ xchg %al,(%rbx)
│ test %al,%al
│ jne 234
│ mov -0x38(%rbp),%rax
│ movb $0x1,(%rbx)
│ movq $0x0,(%rax)
│ jmpq fc
> For example, if you run a virtual 16-core system on a physical machine
> that then doesn't consistently give 16 cores to the virtual machine,
> you'll get no end of hiccups.
I learnt that lesson 6-7 years ago when I first started doing
baseline benchmarking to compare bare metal to virtualised IO
performance.
-Dave.
--
Dave Chinner
david@...morbit.com