[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZB0j1WZOI/ZvdtEo@casper.infradead.org>
Date: Fri, 24 Mar 2023 04:15:17 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Theodore Ts'o <tytso@....edu>
Cc: Andreas Dilger <adilger.kernel@...ger.ca>,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 19/31] ext4: Convert ext4_journalled_zero_new_buffers()
to use a folio
On Tue, Mar 14, 2023 at 06:46:19PM -0400, Theodore Ts'o wrote:
> On Thu, Jan 26, 2023 at 08:24:03PM +0000, Matthew Wilcox (Oracle) wrote:
> > Remove a call to compound_head().
>
> Same question as with other commits, plus one more; why is it notable
> that calls to compound_head() are being reduced. I've looked at the
> implementation, and it doesn't look all _that_ heavyweight....
It gets a lot more heavyweight when you turn on
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, which more and more distro
kernels are doing, because it's such a win for VM host kernels.
eg SuSE do it here:
https://github.com/SUSE/kernel-source/blob/master/config/x86_64/default
and UEK does it here:
https://github.com/oracle/linux-uek/blob/uek7/ga/uek-rpm/ol9/config-x86_64
Debian also has it enabled.
It didn't use to be so expensive, but now it's something like 50-60
bytes of text per invocation on x86 [1]. And the compiler doesn't get
to remember the result of calling compound_head() because we might have
changed page->compound_head between invocations. It doesn't even know
that compound_head() is idempotent.
Anyway, each of these patches can be justified as "This patch shrinks
the kernel by 0.0001%". Of course my real motivation for doing this
is to reduce the number of callers of the page APIs so we can start to
remove them and lessen the cognitive complexity of having both page &
folio APIs that parallel each other. And I would like ext4 to support
large folios sometime soon, and it's a step towards that goal too.
But that's a lot to write out in each changelog.
[1] For example, the disassembly of unlock_page() with the UEK
config:
c0: f3 0f 1e fa endbr64
c4: e8 00 00 00 00 call c9 <unlock_page+0x9>
c5: R_X86_64_PLT32 __fentry__-0x4
c9: 55 push %rbp
ca: 48 8b 47 08 mov 0x8(%rdi),%rax
ce: 48 89 e5 mov %rsp,%rbp
d1: a8 01 test $0x1,%al
d3: 75 2f jne 104 <unlock_page+0x44>
d5: eb 0b jmp e2 <unlock_page+0x22>
d7: e8 00 00 00 00 call dc <unlock_page+0x1c>
d8: R_X86_64_PLT32 folio_unlock-0x4
dc: 5d pop %rbp
dd: e9 00 00 00 00 jmp e2 <unlock_page+0x22>
de: R_X86_64_PLT32 __x86_return_thunk-0x4
e2: f7 c7 ff 0f 00 00 test $0xfff,%edi
e8: 75 ed jne d7 <unlock_page+0x17>
ea: 48 8b 07 mov (%rdi),%rax
ed: a9 00 00 01 00 test $0x10000,%eax
f2: 74 e3 je d7 <unlock_page+0x17>
f4: 48 8b 47 48 mov 0x48(%rdi),%rax
f8: 48 8d 50 ff lea -0x1(%rax),%rdx
fc: a8 01 test $0x1,%al
fe: 48 0f 45 fa cmovne %rdx,%rdi
102: eb d3 jmp d7 <unlock_page+0x17>
104: 48 8d 78 ff lea -0x1(%rax),%rdi
108: e8 00 00 00 00 call 10d <unlock_page+0x4d>
109: R_X86_64_PLT32 folio_unlock-0x4
10d: 5d pop %rbp
10e: e9 00 00 00 00 jmp 113 <unlock_page+0x53>
10f: R_X86_64_PLT32 __x86_return_thunk-0x4
113: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
11a: 00 00 00 00
11e: 66 90 xchg %ax,%ax
Everything between 0xd3 and 0x104 is "maybe it's a fake head". That's 41
bytes as a minimum per callsite, and typically it's much more becase we
also need the test for PageTail and the lea for the actual compound_head.
Powered by blists - more mailing lists