Message-ID: <e1cc19.5287.196ae733594.Coremail.00107082@163.com>
Date: Thu, 8 May 2025 13:51:48 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Kent Overstreet" <kent.overstreet@...ux.dev>
Cc: "Suren Baghdasaryan" <surenb@...gle.com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading
allocinfo
At 2025-05-08 12:07:40, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>On Thu, May 08, 2025 at 11:35:11AM +0800, David Wang wrote:
>>
>>
>> At 2025-05-08 11:31:12, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>> >On Thu, May 08, 2025 at 11:06:35AM +0800, David Wang wrote:
>> >> Thanks for the feedback~
>> >> I agree that memory allocation normally does not take a major part of a profiling report;
>> >> even when profiling a fio test, kmem_cache_alloc only takes ~1% of perf samples.
>> >>
>> >> I don't know why I have this "the less memory allocation, the better" mindset, maybe
>> >> I was worrying about memory fragmentation, or something else I learned from some "textbook".
>> >> To be honest, I have never had real experience with those worries....
>> >
>> >It's a common bias. "Memory allocations" take up a lot of conceptual
>> >space in our heads, and generally for good reason - i.e. handling memory
>> >allocation errors is often a major concern, and you do always want to be
>> >aware of memory layout.
>> >
>> >But this can turn into an aversion that's entirely disproportionate -
>> >e.g. using linked lists and fixed-size arrays in ways that are
>> >entirely inappropriate, instead of vectors and other better data
>> >structures; good data structures always require allocations.
>> >
>> >Profile, profile, profile, and remember your basic CS (big O notation) -
>> >90% of the time, simple code with good big O running time is all you
>> >need.
>>
>> copy that~!
>
>Another thing to note is that memory layout - avoiding pointer chasing -
>is hugely important, but it'll almost never show up as allocator calls.
>
>To give you some examples, mempools and biosets used to be separately
>allocated. This was mainly to make error paths in outer object
>constructors/destructors easier and safer: instead of keeping track of
>what's initialized and what's not, if you've got a pointer to a
>mempool/bioset you call *_free() on it.
>
>(People hadn't yet clued in that you can just kzalloc() the entire outer
>object, and then if the inner object is still zeroed it wasn't initialized).
>
>But that means you're adding a pointer chase to every mempool_alloc()
>call, and since bioset itself has mempools, allocating bios had _two_
>unnecessary pointer derefs. That's death for performance when you're
>running cache cold, but since everyone benchmarks cache-hot...
>
>(I was the one who fixed that).
>
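If I'm reading the first one right, the embedded version would be something
like this (just my own sketch with a made-up "struct outer" to check my
understanding, not the actual bio/mempool code):

    #include <linux/mempool.h>

    struct outer {
            mempool_t pool;         /* embedded: no pointer chase on mempool_alloc() */
            /* ... */
    };

    static int outer_init(struct outer *o)
    {
            /* o came from kzalloc(), so a half-done init leaves pool zeroed */
            return mempool_init_kmalloc_pool(&o->pool, 16, 128);
    }

    static void outer_exit(struct outer *o)
    {
            /* tear down only what was actually set up */
            if (mempool_initialized(&o->pool))
                    mempool_exit(&o->pool);
    }
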
>Another big one was generic_file_buffered_read(). Main buffered read
>path, everyone wants it to be as fast as possible.
>
>But the core is (was) a loop that walks the pagecache radix tree to get
>the page, then copies 4k of data out to userspace (there goes l1), then
>repeats all that pointer chasing for the next 4k. Pre large folios, it
>was horrific.
>
>Solution - vectorize it. Look up all the pages we're copying from all at
>once, stuff them in a (dynamically allocated! for each read!) vector,
>and then do the copying out to userspace all at once. Massive
>performance gain.
>
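For the second one, if I followed correctly, the batched version would be
roughly shaped like this (a very rough sketch on my side; lookup_pages_for_read()
is a hypothetical helper, not the real mm/filemap.c code):

    static ssize_t buffered_read_batched(struct address_space *mapping,
                                         pgoff_t index, struct iov_iter *iter)
    {
            struct page *pages[16];
            ssize_t copied = 0;
            int i, nr;

            /* one walk of the pagecache for the whole batch ... */
            nr = lookup_pages_for_read(mapping, index, pages, ARRAY_SIZE(pages));

            /* ... then copy everything out to userspace back to back */
            for (i = 0; i < nr; i++)
                    copied += copy_page_to_iter(pages[i], 0, PAGE_SIZE, iter);

            return copied;
    }
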
>Of course, to do that I first had to clean up a tangled 250+ line
>monstrosity of half baked, poorly thought out "optimizations" (the worst
>spaghetti of gotos you'd ever seen) and turn it into something
>manageable...
>
>So - keep things simple, don't overthink the little stuff, so you can
>spot and tackle the big algorithmic wins :)
I will keep this in mind~! :)
And thanks for the enlightening notes~!!
Though I could not quite follow the first one in detail, I think I got the point:
avoid unnecessary pointer chasing, and keep the chains as short (and balanced) as possible~
The second one, about copying 4K at a time, seems quite similar to seq_file, at least the "4K" part, literally.
seq_file read() defaults to allocating a 4K buffer, reads data until EOF or the 4K buffer is full, and starts over
again for the next read().
One solution could be to change seq_file so it does not stop until the user buffer is full on each read,
kind of similar to your second note, in a sequential style, I think.
If userspace reads with a 128K buffer and seq_file fills it 4K at a time, it would only need ~3 read calls for allocinfo.
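To make the 4K point concrete, this is roughly what a reader sees today (just an
illustrative userspace snippet, sizes approximate):

    int fd = open("/proc/allocinfo", O_RDONLY);
    char buf[128 * 1024];
    ssize_t n;

    /* each read() currently returns at most ~4K, even with a 128K buffer,
     * so reading allocinfo takes many more syscalls than the ~3 above */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;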
(I did post a patch for seq_file to fill the user buffer, but start/stop still happens at the 4K boundary, so it does not
help with the iterator rewinding when reading /proc/allocinfo yet:
https://lore.kernel.org/lkml/20241220140819.9887-1-00107082@163.com/ )
The solution in this patch is to keep the iterator alive and valid across the read boundary, which also avoids the cost
of starting over each time.
David