Message-ID: <DC94NN6SM15D.3DQVRLO2E282W@google.com>
Date: Fri, 22 Aug 2025 17:20:28 +0000
From: Brendan Jackman <jackmanb@...gle.com>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, <peterz@...radead.org>, <bp@...en8.de>,
<dave.hansen@...ux.intel.com>, <mingo@...hat.com>, <tglx@...utronix.de>,
<akpm@...ux-foundation.org>, <david@...hat.com>, <derkling@...gle.com>,
<junaids@...gle.com>, <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<reijiw@...gle.com>, <rientjes@...gle.com>, <rppt@...nel.org>,
<vbabka@...e.cz>, <x86@...nel.org>, <yosry.ahmed@...ux.dev>,
Matthew Wilcox <willy@...radead.org>, Liam Howlett <liam.howlett@...cle.com>,
"Kirill A. Shutemov" <kas@...nel.org>, Harry Yoo <harry.yoo@...cle.com>, Jann Horn <jannh@...gle.com>,
Pedro Falcato <pfalcato@...e.de>, Andy Lutomirski <luto@...nel.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, Kees Cook <kees@...nel.org>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)
On Fri Aug 22, 2025 at 4:56 PM UTC, Uladzislau Rezki wrote:
>> >> 2. The ephmap implementation is extremely stupid. It only works for the
>> >> simple shmem use case. I don't think this is really important though;
>> >> whatever we end up with needs to be very simple, and it's not even clear
>> >> that we actually want a whole new subsystem anyway (e.g. maybe it's better
>> >> to just adapt kmap_local_page() itself).
>> >
>> > Right, just testing stuff out, fair enough. Obviously not an upstreamable
>> > thing, but it's sort of a test case, right?
>>
>> Yeah exactly.
>>
>> Maybe worth adding here that I explored just using vmalloc's allocator
>> for this. My experience was that, despite looking quite nicely optimised
>> with regard to avoiding synchronisation, the simple fact of traversing its
>> data structures is too slow for this use case (at least, it did poorly on
>> my super-sensitive FIO benchmark setup).
>>
> Could you please elaborate here? Which test case, and what is the problem
> for it?

What I'm trying to do here is allocate some virtual space, map some
memory into it, read it through that mapping, then tear it down again.
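
Concretely, the per-read pattern looks roughly like this (untested sketch;
vmap() is just standing in for "vmalloc's allocator" here, and I'm eliding
the ASI-specific parts, like which address space the mapping lives in):

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static ssize_t read_via_ephemeral_map(struct page *page, void *buf,
				      size_t len)
{
	void *va;

	/*
	 * Allocate virtual space and map the page. This is the expensive
	 * part: alloc_vmap_area() has to traverse vmalloc's free-space
	 * tree under the hood.
	 */
	va = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!va)
		return -ENOMEM;

	/* Read through the ephemeral mapping. */
	memcpy(buf, va, len);

	/* Tear the mapping down again; TLB flush costs land here too. */
	vunmap(va);

	return len;
}
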
The test case was an FIO benchmark reading 4k blocks from tmpfs, which I
think is a pretty tight loop. Maybe this is the kind of thing where the
syscall overhead is significant enough to make it an unrealistic workload,
I'm not too sure. But it was a nice way to get a maximal measure of the
ASI perf hit on filesystem access.
I didn't make careful notes, but I vaguely remember seeing something like
a 10% hit to this workload, which I attributed to the vmalloc calls based
on profiling with perf.
I didn't interpret this as "vmalloc is bad" but rather as "this is an
abuse of vmalloc". Having to allocate anything at all for this use case is
quite unfortunate, really.
Anyway, the good news is I don't think we actually need a general-purpose
allocator here. I think we can just have something very simple,
stack-based and completely CPU-local. I just tried vmalloc() at the
beginning because it was the hammer I happened to be holding at the time!
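
To give an idea of the shape I have in mind (very rough and untested; all
the ephmap_* names are made up, ephmap_base would be a VA region reserved
at init time, and as with kmap_local_page() the caller would have to stay
pinned to the CPU between map and unmap):

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

#define EPHMAP_SLOTS_PER_CPU	16

/* Hypothetical: base of a region reserved at init time,
 * EPHMAP_SLOTS_PER_CPU pages per CPU. */
static unsigned long ephmap_base;

static DEFINE_PER_CPU(int, ephmap_top);

static unsigned long ephmap_slot_addr(int idx)
{
	return ephmap_base +
	       (smp_processor_id() * EPHMAP_SLOTS_PER_CPU + idx) * PAGE_SIZE;
}

static void *ephmap_map_local(struct page *page)
{
	/*
	 * Stack discipline plus CPU-locality: "allocating" virtual space
	 * is just a per-CPU increment. No locks, no tree traversal.
	 */
	int idx = this_cpu_inc_return(ephmap_top) - 1;
	unsigned long va;

	BUG_ON(idx >= EPHMAP_SLOTS_PER_CPU);
	va = ephmap_slot_addr(idx);
	set_pte_at(&init_mm, va, virt_to_kpte(va),
		   mk_pte(page, PAGE_KERNEL));
	return (void *)va;
}

static void ephmap_unmap_local(void *addr)
{
	int idx = this_cpu_dec_return(ephmap_top);
	unsigned long va = (unsigned long)addr;

	WARN_ON(va != ephmap_slot_addr(idx));
	pte_clear(&init_mm, va, virt_to_kpte(va));
	/*
	 * The mapping was only ever used on this CPU, so conceptually
	 * only a local flush is needed (flush_tlb_kernel_range() is
	 * stronger than necessary, but it's the portable interface).
	 */
	flush_tlb_kernel_range(va, va + PAGE_SIZE);
}

The point being that map/unmap are O(1) with no shared state, which is
exactly what that tight FIO read loop wants.
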
> You can fragment the main KVA space, where we use an rb-tree to manage
> free blocks. But the question is: how important are your use case and
> workload for you?