netdev - Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <29e8ac53-f7da-4896-8121-2abc25ec2c95@gmail.com>
Date: Sat, 13 Jul 2024 13:20:34 +0800
From: Yunsheng Lin <yunshenglin0825@...il.com>
To: Alexander Duyck <alexander.duyck@...il.com>,
 Yunsheng Lin <linyunsheng@...wei.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for
 'size' and 'pfmemalloc'

On 7/13/2024 12:55 AM, Alexander Duyck wrote:

...

>>
>> So it is about ensuring the additional room due to alignment requirement
>> being placed at the end of a fragment, in order to avoid false sharing
>> between the last fragment and the current fragment as much as possible,
>> right?
>>
>> I am generally agreed with above if we can also ensure skb coaleasing by
>> doing offset count-up instead of offset countdown.
>>
>> If there is conflict between them, I am assuming that enabling skb frag
>> coaleasing is prefered over avoiding false sharing, right?
> 
> The question I would have is where do we have opportunities for this
> to result in coalescing? If we are using this to allocate skb->data
> then there isn't such a chance as the tailroom gets in the way.
> 
> If this is for a device allocating for an Rx buffer we won't get that
> chance as we have to preallocate some fixed size not knowing the
> buffer that is coming, and it needs to usually be DMA aligned in order
> to avoid causing partial cacheline reads/writes. The only way these
> would combine well is if you are doing aligned fixed blocks and are
> the only consumer of the page frag cache. It is essentially just
> optimizing for jumbo frames in that case.

And hw-gro or sw-gro.

> 
> If this is for some software interface why wouldn't it request the
> coalesced size and do one allocation rather than trying to figure out
> how to perform a bunch of smaller allocations and then trying to merge
> them together after the fact.

I am not sure I understand what 'some software interface' is referring
to, I hope you are not suggesting the below optimizations utilizing of
skb_can_coalesce() checking is unnecessary:(

https://elixir.bootlin.com/linux/v6.10-rc7/C/ident/skb_can_coalesce

Most of the usecases do that for the reason below as mentioned in the
Documentation/mm/page_frags.rst as my understanding:
"There is also a use case that needs minimum memory in order for forward 
progress, but more performant if more memory is available."

> 
>>>
>>>> The above is why I added the below paragraph in the doc to make the semantic
>>>> more explicit:
>>>> "Depending on different aligning requirement, the page_frag API caller may call
>>>> page_frag_alloc*_align*() to ensure the returned virtual address or offset of
>>>> the page is aligned according to the 'align/alignment' parameter. Note the size
>>>> of the allocated fragment is not aligned, the caller needs to provide an aligned
>>>> fragsz if there is an alignment requirement for the size of the fragment."
>>>>
>>>> And existing callers of page_frag aligned API does seems to follow the above
>>>> rule last time I checked.
>>>>
>>>> Or did I miss something obvious here?
>>>
>>> No you didn't miss anything. It is just that there is now more
>>> potential for error than there was before.
>>
>> I guess the 'error' is referred to the 'false sharing' mentioned above,
>> right? If it is indeed an error, are we not supposed to fix it instead
>> of allowing such implicit implication? Allowing implicit implication
>> seems to make the 'error' harder to reproduce and debug.
> 
> The concern with the code as it stands is that if I am not mistaken
> remaining isn't actually aligned. You aligned it, then added fragsz.
> That becomes the start of the next frame so if fragsz isn't aligned
> the next requester will be getting an unaligned buffer, or one that is
> only aligned to the previous caller's alignment.

As mentioned below:
https://lore.kernel.org/all/3da33d4c-a70e-23a4-8e00-23fe96dd0c1a@huawei.com/

what alignment semantics are we providing here:
1. Ensure alignment for both offset and fragsz.
2. Ensure alignment for offset only.
3. Ensure alignment for fragsz only.

As my understanding, the original code before this patchset is only 
ensuring alignment for offset too. So there may be 'false sharing'
both before this patchset and after this patchset. It would be better
not to argue about which implementation having more/less potential
to avoid the 'false sharing', it is an undefined behavior, the argument
would be endless depending on usecase and personal preference.

As I said before, I would love to retain the old undefined behavior
when there is a reasonable way to support the new usecases.

> 
>>>
>>>>> need to align the remaining, then add fragsz, and then I guess you
>>>>> could store remaining and then subtract fragsz from your final virtual
>>>>> address to get back to where the starting offset is actually located.
>>>>
>>>> remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
>>>> remaining += fragsz;
>>>> nc->remaining = remaining;
>>>> return encoded_page_address(nc->encoded_va) + (size + remaining) - fragsz;
>>>>
>>>> If yes, I am not sure what is the point of doing the above yet, it
>>>> just seem to make thing more complicated and harder to understand.
>>>
>>> That isn't right. I am not sure why you are adding size + remaining or
>>> what those are supposed to represent.
>>
>> As the above assumes 'remaining' is a negative value as you suggested,
>> (size + remaining) is supposed to represent the offset of next fragment
>> to ensure we have count-up offset for enabling skb frag coaleasing, and
>> '- fragsz' is used to get the offset of current fragment.
>>
>>>
>>> The issue was that the "remaining" ends up being an unaligned value as
>>> you were starting by aligning it and then adding fragsz. So by
>>> subtracting fragsz you can get back to the aliglined start. What this
>>> patch was doing before was adding the raw unaligned nc->remaining at
>>> the end of the function.
>>>
>>>>>
>>>>> Basically your "remaining" value isn't a safe number to use for an
>>>>> offset since it isn't aligned to your starting value at any point.
>>>>
>>>> Does using 'aligned_remaining' local variable to make it more obvious
>>>> seem reasonable to you?
>>>
>>> No, as the value you are storing above isn't guaranteed to be aligned.
>>> If you stored it before adding fragsz then it would be aligned.
>>
>> I have a feeling that what you are proposing may be conflict with enabling
>> skb frag coaleasing, as doing offset count-up seems to need some room to
>> be reserved at the begin of a allocated fragment due to alignment requirement,
>> and that may mean we need to do both fragsz and offset aligning.
>>
>> Perhaps the 'remaining' changing in this patch does seems to make things
>> harder to discuss. Anyway, it would be more helpful if there is some pseudo
>> code to show the steps of how the above can be done in your mind.
> 
> Basically what you would really need do for all this is:
>    remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
>    nc->remaining = remaining + fragsz;
>    return encoded_page_address(nc->encoded_va) + size + remaining;

I am assuming 'size + remaining' is supposed to represent the offset of 
current allocted fragment?
Are you sure the above is what you wanted?

suppose remaining = -32768 & size = 32768 initially:
Step 1: __page_frag_alloc_align() is called with fragsz=7 and
         align_mask=~0u, the remaining after this is -32761, the true
         fragsz is 7, the offset is 0

Step 2: __page_frag_alloc_align() is called with fragsz=7 and
         align_mask=-16, the offset after this is -32745, the true
         fragsz is 7, the offset = 16

The semantics of the above implementation seems to be the same as
the semantics of implementation of patch 3 in v10：
https://lore.kernel.org/all/20240709132741.47751-4-linyunsheng@huawei.com/
     aligned_remaining = nc->remaining & align_mask;
     remaining = aligned_remaining - fragsz;
     nc->remaining = remaining;
     return encoded_page_address(encoded_va) + size - aligned_remaining;

The main difference seems to be about using a negative value for 
nc->remaining or not, if yes, I am not sure what is other gain of
using a negative value for nc->remaining besides the LEA instruction
opt trick.

As using a unsigned value and a 'aligned_remaining' local variable
does seems to make thing easier to understand and avoid adding three 
checkings as mentioned below.

> 
> The issue is you cannot be using page_frag_cache_current_va as that
> pointer will always be garbage as it isn't aligned to anything. It
> isn't like the code that I had that was working backwards as the
> offset cannot be used as soon as you compute it since you add fragsz
> to it.

The above was a mistake in v9, the intend is to do:
nc->remaining &= align_mask;

That was why there were three 'align > PAGE_SIZE' checking in v10 to
avoid doing 'nc->remaining &= align_mask' prematurely if caller passes
a large align_mask value.

So 'aligned_remaining' local variable in v10 is used to avoid doing 
'nc->remaining &= align_mask', thus three 'align > PAGE_SIZE' checking
is not needed too.

> 
> For any countup you have to align the current offset/remaining value,
> then that value is used for computing the page address, and you store
> the offset/remaining plus fragsz adjustment. At which point your
> offset/remaining pointer value isn't good for the current buffer
> anymore.
>