Message-ID: <c45fdd75-44be-82a6-8e47-42bbc5ee4795@huawei.com>
Date: Thu, 18 Apr 2024 17:39:04 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander H Duyck <alexander.duyck@...il.com>, <davem@...emloft.net>,
<kuba@...nel.org>, <pabeni@...hat.com>
CC: <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Andrew Morton
<akpm@...ux-foundation.org>, <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 09/15] mm: page_frag: reuse MSB of 'size'
field for pfmemalloc
On 2024/4/17 23:11, Alexander H Duyck wrote:
> On Wed, 2024-04-17 at 21:19 +0800, Yunsheng Lin wrote:
>> On 2024/4/17 0:22, Alexander H Duyck wrote:
>>> On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
>>>> The '(PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)' case is for the
>>>> system with page size less than 32KB, which is 0x8000 bytes
>>>> requiring 16 bits of space. Change 'size' to 'size_mask' to avoid
>>>> using the MSB, and change the 'pfmemalloc' field to reuse that
>>>> MSB, so that we remove the original space needed by 'pfmemalloc'.
>>>>
>>>> For another case, the MSB of 'offset' is reused for 'pfmemalloc'.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
>>>> ---
>>>> include/linux/page_frag_cache.h | 13 ++++++++-----
>>>> mm/page_frag_cache.c | 5 +++--
>>>> 2 files changed, 11 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
>>>> index fe5faa80b6c3..40a7d6da9ef0 100644
>>>> --- a/include/linux/page_frag_cache.h
>>>> +++ b/include/linux/page_frag_cache.h
>>>> @@ -12,15 +12,16 @@ struct page_frag_cache {
>>>> void *va;
>>>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> __u16 offset;
>>>> - __u16 size;
>>>> + __u16 size_mask:15;
>>>> + __u16 pfmemalloc:1;
>>>> #else
>>>> - __u32 offset;
>>>> + __u32 offset:31;
>>>> + __u32 pfmemalloc:1;
>>>> #endif
>>>
>>> This seems like a really bad idea. Using a bit-field like this seems
>>> like a waste as it means that all the accesses now have to add
>>> additional operations to access either offset or size. It isn't as if
>>> this is an oversized struct, or one that we are allocating a ton of. As
>>> such I am not sure why we need to optimize for size like this.
>>
>> For the old 'struct page_frag' use case, there is one 'struct page_frag'
>> for every socket and task_struct, there may be tens of thousands of
>> them even on a 32-bit system, which might mean a lot of memory even for
>> a single byte saving, see patch 13.
>>
>
> Yeah, I finally had time to finish getting through the patch set last
> night. Sorry for the quick-fire reviews, but lately I haven't had much
> time to work on upstream work, and as you mentioned, last time I only
> got to 3 patches, so I was trying to speed through.
>
> I get that you are trying to reduce the size but in the next patch you
> actually end up overshooting that on x86_64 systems. I am assuming that
> is to try to account for the 32b use case? On 64b I am pretty sure you
> don't get any gain since a long followed by two u16s and an int will
> still be 16B. What we probably need to watch out for is the
> optimization for size versus having to add instructions to extract and
> insert the data back into the struct.
>
> Anyway as far as this layout I am not sure it is the best way to go.
> You are combining pfmemalloc with either size *OR* offset, and then
Does it really matter whether 'pfmemalloc' is combined with 'size' or
with 'offset', given that we are using a bitfield for 'pfmemalloc'?
> combining the pagecnt_bias with the va. I'm wondering if it wouldn't
> make more sense to look at putting together the structure something
> like:
>
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> typedef u16 page_frag_bias_t;
> #else
> typedef u32 page_frag_bias_t;
> #endif
>
> struct page_frag_cache {
> /* page address and offset */
> void *va;
Generally I agree with combining the virtual address with the
offset, for the reason you mentioned below.
> page_frag_bias_t pagecnt_bias;
> u8 pfmemalloc;
> u8 page_frag_order;
> }
The issue I see with 'page_frag_order' is that we might need to do
'PAGE_SIZE << page_frag_order' to get the actual size, and we might also
need to do 'size - 1' to get the size_mask if we want to mask out the
offset from 'va'.

For page_frag_order, we need to do:
size = PAGE_SIZE << page_frag_order
size_mask = size - 1

For size_mask, it seems we only need to do:
size = size_mask + 1

And as PAGE_FRAG_CACHE_MAX_SIZE = 32KB, the mask fits into 15 bits
if we use size_mask instead of size.
Does it make sense to use the layout below, so that we only need to use
a bitfield for the PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE case on 32-bit
systems? 'struct page_frag' is using a similar '(BITS_PER_LONG > 32)'
checking trick.
struct page_frag_cache {
	/* page address and offset */
	void *va;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
	u16 pagecnt_bias;
	u16 size_mask:15;
	u16 pfmemalloc:1;
#else
	u32 pagecnt_bias;
	u16 size_mask;
	u16 pfmemalloc;
#endif
};
>
> The basic idea would be that we would be able to replace the size mask
> with just a shift value representing the page order of the page being
> fragmented. With that we can reduce the size to just a single byte. In
> addition we could probably leave it there regardless of build as the
> order should be initialized to 0 when this is allocated, so it would be
> correct even in the case where it isn't used (and there isn't much we
> can do about the hole anyway).
>
> In addition by combining the virtual address with the offset we can
> just use the combined result for what we need. The only item that has
> to be worked out is how to deal with the end of a page in the count up
> case. However the combination seems the most logical one since they are
> meant to be combined ultimately anyway. It does put limits on when we
> can align things as we don't want to align ourselves into the next
I guess it means we need to mask the offset out of 'va' before doing the
aligning operation and the 'offset + fragsz > size' checking, right?
> page, but I think it makes more sense then the other limits that had to
> be put on allocations and such in order to allow us to squeeze
> pagecnt_bias into the virtual address.
>
> Anyway I pulled in your patches and plan to do a bit of testing, after
> I figure out what the nvme disk ID regression is I am seeing. My main
> concern can be summed up as the NIC driver use case
> (netdev/napi_alloc_frag callers) versus the socket/vhost use case. The
> main thing in the case of the NIC driver callers is that we have a need
> for isolation and guarantees that we won't lose cache line alignment. I
> think those are the callers you are missing in your benchmarks, but
> arguably that might be something you cannot test as I don't know what
> NICs you have access to and if you have any that are using those calls.
I guess we just need to replace the API used by socket/vhost with the one
used by the netdev/napi_alloc_frag callers in mm/page_frag_test.c in patch 1,
which was introduced to test the performance of the page_frag implementation, see:
https://lore.kernel.org/all/20240415131941.51153-2-linyunsheng@huawei.com/
> .
>