Message-ID: <CAGudoHFxme+cPsm2BVsOjoy6UZzgEZZebkvDhp7=jkevSTyb-A@mail.gmail.com>
Date: Thu, 15 Aug 2024 17:07:29 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Andi Kleen <ak@...ux.intel.com>
Cc: Jeff Layton <jlayton@...nel.org>, Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>, Josef Bacik <josef@...icpanda.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen <ak@...ux.intel.com> wrote:
>
> > Before I get to the vfs layer, there is a significant loss in the
> > memory allocator because of memcg -- it takes several irq off/on trips
> > for every alloc (needed to grab struct file *). I have a plan what to
> > do with it (handle stuff with local cmpxchg (note no lock prefix)),
> > which I'm trying to get around to. Apart from that you may note the
> > allocator fast path performs a 16-byte cmpxchg, which is again dog
> > slow and executes twice (once for the file obj, another time for the
> > namei buffer). Someone(tm) should patch it up and I have some vague
> > ideas, but 0 idea when I can take a serious stab.
>
> I just LBR sampled it on my skylake and it doesn't look
> particularly slow. You see the whole massive block including CMPXCHG16
> gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
> a missing cache line.
>
> kmem_cache_free:
> ffffffff9944ce20 nop %edi, %edx
> ffffffff9944ce24 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce29 pushq %rbp
> ffffffff9944ce2a mov %rdi, %rdx
> ffffffff9944ce2d mov %rsp, %rbp
> ffffffff9944ce30 pushq %r15
> ffffffff9944ce32 pushq %r14
> ffffffff9944ce34 pushq %r13
> ffffffff9944ce36 pushq %r12
> ffffffff9944ce38 mov $0x80000000, %r12d
> ffffffff9944ce3e pushq %rbx
> ffffffff9944ce3f mov %rsi, %rbx
> ffffffff9944ce42 and $0xfffffffffffffff0, %rsp
> ffffffff9944ce46 sub $0x10, %rsp
> ffffffff9944ce4a movq %gs:0x28, %rax
> ffffffff9944ce53 movq %rax, 0x8(%rsp)
> ffffffff9944ce58 xor %eax, %eax
> ffffffff9944ce5a add %rsi, %r12
> ffffffff9944ce5d jb 0xffffffff9944d1ea
> ffffffff9944ce63 mov $0xffffffff80000000, %rax
> ffffffff9944ce6a xor %r13d, %r13d
> ffffffff9944ce6d subq 0x17b068c(%rip), %rax
> ffffffff9944ce74 add %r12, %rax
> ffffffff9944ce77 shr $0xc, %rax
> ffffffff9944ce7b shl $0x6, %rax
> ffffffff9944ce7f addq 0x17b066a(%rip), %rax
> ffffffff9944ce86 movq 0x8(%rax), %rcx
> ffffffff9944ce8a test $0x1, %cl
> ffffffff9944ce8d jnz 0xffffffff9944d15c
> ffffffff9944ce93 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce98 movq (%rax), %rcx
> ffffffff9944ce9b and $0x8, %ch
> ffffffff9944ce9e jz 0xffffffff9944cfea
> ffffffff9944cea4 test %rax, %rax
> ffffffff9944cea7 jz 0xffffffff9944cfea
> ffffffff9944cead movq 0x8(%rax), %r14
> ffffffff9944ceb1 test %r14, %r14
> ffffffff9944ceb4 jz 0xffffffff9944cfac
> ffffffff9944ceba cmp %r14, %rdx
> ffffffff9944cebd jnz 0xffffffff9944d165
> ffffffff9944cec3 test %r14, %r14
> ffffffff9944cec6 jz 0xffffffff9944cfac
> ffffffff9944cecc movq 0x8(%rbp), %r15
> ffffffff9944ced0 nopl %eax, (%rax,%rax,1)
> ffffffff9944ced5 movq 0x1fe5134(%rip), %rax
> ffffffff9944cedc test %r13, %r13
> ffffffff9944cedf jnz 0xffffffff9944ceef
> ffffffff9944cee1 mov $0xffffffff80000000, %rax
> ffffffff9944cee8 subq 0x17b0611(%rip), %rax
> ffffffff9944ceef add %rax, %r12
> ffffffff9944cef2 shr $0xc, %r12
> ffffffff9944cef6 shl $0x6, %r12
> ffffffff9944cefa addq 0x17b05ef(%rip), %r12
> ffffffff9944cf01 movq 0x8(%r12), %rax
> ffffffff9944cf06 mov %r12, %r13
> ffffffff9944cf09 test $0x1, %al
> ffffffff9944cf0b jnz 0xffffffff9944d1b1
> ffffffff9944cf11 nopl %eax, (%rax,%rax,1)
> ffffffff9944cf16 movq (%r13), %rax
> ffffffff9944cf1a movq %rbx, (%rsp)
> ffffffff9944cf1e test $0x8, %ah
> ffffffff9944cf21 mov $0x0, %eax
> ffffffff9944cf26 cmovz %rax, %r13
> ffffffff9944cf2a data16 nop
> ffffffff9944cf2c movq 0x38(%r13), %r8
> ffffffff9944cf30 cmp $0x3, %r8
> ffffffff9944cf34 jnbe 0xffffffff9944d1ca
> ffffffff9944cf3a nopl %eax, (%rax,%rax,1)
> ffffffff9944cf3f movq 0x23d6f72(%rip), %rax
> ffffffff9944cf46 mov %rbx, %rdx
> ffffffff9944cf49 sub %rax, %rdx
> ffffffff9944cf4c cmp $0x1fffff, %rdx
> ffffffff9944cf53 jbe 0xffffffff9944d03a
> ffffffff9944cf59 movq (%r14), %rax
> ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax
> ffffffff9944cf64 movq 0x8(%rax), %rdx
> ffffffff9944cf68 cmpq %r13, 0x10(%rax)
> ffffffff9944cf6c jnz 0xffffffff9944d192
> ffffffff9944cf72 movl 0x28(%r14), %ecx
> ffffffff9944cf76 movq (%rax), %rax
> ffffffff9944cf79 add %rbx, %rcx
> ffffffff9944cf7c cmp %rbx, %rax
> ffffffff9944cf7f jz 0xffffffff9944d1ba
> ffffffff9944cf85 movq 0xb8(%r14), %rsi
> ffffffff9944cf8c mov %rcx, %rdi
> ffffffff9944cf8f bswap %rdi
> ffffffff9944cf92 xor %rax, %rsi
> ffffffff9944cf95 xor %rdi, %rsi
> ffffffff9944cf98 movq %rsi, (%rcx)
> ffffffff9944cf9b leaq 0x2000(%rdx), %rcx
> ffffffff9944cfa2 movq (%r14), %rsi
> ffffffff9944cfa5 cmpxchg16b %gs:(%rsi)
> ffffffff9944cfaa jnz 0xffffffff9944cf59
> ffffffff9944cfac movq 0x8(%rsp), %rax
> ffffffff9944cfb1 subq %gs:0x28, %rax
> ffffffff9944cfba jnz 0xffffffff9944d1fc
> ffffffff9944cfc0 leaq -0x28(%rbp), %rsp
> ffffffff9944cfc4 popq %rbx
> ffffffff9944cfc5 popq %r12
> ffffffff9944cfc7 popq %r13
> ffffffff9944cfc9 popq %r14
> ffffffff9944cfcb popq %r15
> ffffffff9944cfcd popq %rbp
> ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <-------------
Sorry for the late reply, my test box was temporarily unavailable and
then I forgot about this e-mail :)
I don't have a good scientific test(tm) and I don't think coming up
with one is warranted at the moment.
But to illustrate, I slapped together a will-it-scale test case which
executes either an 8-byte or a 16-byte cmpxchg in a loop, with no lock
prefix on either.
On Sapphire Rapids I see well over twice the throughput for the 8-byte variant:
# ./cmpxchg8_processes
warmup
min:481465497 max:481465497 total:481465497
min:464439645 max:464439645 total:464439645
min:461884735 max:461884735 total:461884735
min:460850043 max:460850043 total:460850043
min:461066452 max:461066452 total:461066452
min:463984473 max:463984473 total:463984473
measurement
min:461317703 max:461317703 total:461317703
min:458608942 max:458608942 total:458608942
min:460846336 max:460846336 total:460846336
[snip]
# ./cmpxchg16b_processes
warmup
min:205207128 max:205207128 total:205207128
min:205010535 max:205010535 total:205010535
min:204877781 max:204877781 total:204877781
min:204163814 max:204163814 total:204163814
min:204392000 max:204392000 total:204392000
min:204094222 max:204094222 total:204094222
measurement
min:204243282 max:204243282 total:204243282
min:204136589 max:204136589 total:204136589
min:203504119 max:203504119 total:203504119
So I would say trying this out in a real allocator is worth looking at.
Of course the 16-byte variant is not used just for kicks, so going to
8 bytes is more involved than merely replacing the instruction.
The current code follows the standard approach to the ABA problem:
apart from replacing the pointer, the cmpxchg validates that the state
is what you expected by checking a counter in the same instruction.
I note that in the kernel we can do better, but I don't have all the
kinks worked out yet. The core idea builds on the fact that we can
cheaply detect a pending alloc on the same CPU: should a conflicting
free be executing from an interrupt, it can instead add the returning
buffer to a different list, and the ABA problem disappears. Should the
alloc fast path fail to find a free buffer, it can disable interrupts
and take a look at the fallback list.
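A very rough userspace sketch of that idea (all names hypothetical;
interrupt context is modeled as a flag argument, and a real version
would need the pending flag and lists handled with per-CPU-safe
primitives rather than plain loads and stores):

```c
#include <stddef.h>

struct obj { struct obj *next; };

struct percpu_cache {
	struct obj *freelist;	/* touched only with no alloc pending */
	struct obj *fallback;	/* side list for conflicting irq frees */
	int alloc_pending;	/* set for the duration of an alloc */
};

static void cache_free(struct percpu_cache *c, struct obj *o, int in_irq)
{
	if (in_irq && c->alloc_pending) {
		/* conflicting free from an interrupt: divert it, so the
		 * pending alloc never races with a freelist push */
		o->next = c->fallback;
		c->fallback = o;
		return;
	}
	o->next = c->freelist;
	c->freelist = o;
}

static struct obj *cache_alloc(struct percpu_cache *c)
{
	struct obj *o;

	c->alloc_pending = 1;	/* visible to irqs on this cpu */
	o = c->freelist;
	if (o) {
		c->freelist = o->next;
	} else {
		/* slow path: here the real thing would disable
		 * interrupts before looking at the fallback list */
		o = c->fallback;
		if (o)
			c->fallback = o->next;
	}
	c->alloc_pending = 0;
	return o;
}
```

With frees diverted while an alloc is pending, the freelist head cannot
change under the alloc on the same CPU, which is what makes the
generation counter (and thus the 16-byte cmpxchg) unnecessary.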
--
Mateusz Guzik <mjguzik gmail.com>