Message-ID: <f4ff5a42-9371-bc54-8523-b11d8511c39a@huawei.com>
Date: Wed, 3 Jul 2024 20:33:27 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander H Duyck <alexander.duyck@...il.com>, <davem@...emloft.net>,
<kuba@...nel.org>, <pabeni@...hat.com>
CC: <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Andrew Morton
<akpm@...ux-foundation.org>, <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for
'size' and 'pfmemalloc'
On 2024/7/2 22:55, Alexander H Duyck wrote:
...
>
>>>
>>>> +#define PAGE_FRAG_CACHE_ORDER_MASK GENMASK(7, 0)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT BIT(8)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT 8
>>>> +
>>>> +static inline struct encoded_va *encode_aligned_va(void *va,
>>>> + unsigned int order,
>>>> + bool pfmemalloc)
>>>> +{
>>>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> - __u16 offset;
>>>> - __u16 size;
>>>> + return (struct encoded_va *)((unsigned long)va | order |
>>>> + pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> #else
>>>> - __u32 offset;
>>>> + return (struct encoded_va *)((unsigned long)va |
>>>> + pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> +#endif
>>>> +}
>>>> +
>
> This is missing any and all protection of the example you cited. If I
> want to pass order as a 32b value I can, and I can corrupt the virtual
> address. Same thing with pfmemalloc. Let's only hope you don't hit an
> architecture where a bool is a u8 in size, as otherwise that shift is
> going to wipe out the value, and if it is a u32, which is usually the
> case, let's hope they stick to the values of 0 and 1.
I explicitly checked that the protection is not really needed, and
left it out due to performance considerations:
1. For the 'pfmemalloc' part, a bool does always stick to the values
   of 0 and 1 (see the small demo below), as documented in:
https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053
2. For the 'order' part, its range can only be within 0~3.
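For point 1, a quick userspace illustration (not kernel code, purely
to show the language guarantee): a bool is normalized to 0 or 1 before
the shift, and integer promotion means the shift cannot wipe out the
value even on an architecture where bool is only a u8 in size:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8

int main(void)
{
	/* even when set from a random non-zero value, a bool reads
	 * back as exactly 1
	 */
	bool pfmemalloc = 0x5a;

	assert(pfmemalloc == 1);

	/* pfmemalloc is promoted to int before the shift, so this
	 * prints 0x100, i.e. only bit 8 can ever be set
	 */
	printf("%#lx\n",
	       (unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);

	return 0;
}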
>
>>>> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
>>>> +{
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> + return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
>>>> +#else
>>>> + return 0;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
>>>> +{
>>>> + return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
>>>> +}
>>>> +
>>>
>>> My advice is that if you just make encoded_va an unsigned long this
>>> just becomes some FIELD_GET and bit operations.
>>
>> As above.
>>
>
> The code you mentioned had one extra block of bits that was in it and
> had strict protections on what went into and out of those bits. You
> don't have any of those protections.
As above, the protection masking/checking is explicitly avoided due
to performance considerations, for the reasons given above for
encoded_va. But I guess it doesn't hurt to add a VM_BUG_ON() check to
catch possible future mistakes.
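Something like the below untested sketch is what I have in mind; the
two VM_BUG_ON()s are the only change against the patch:

static inline struct encoded_va *encode_aligned_va(void *va,
						   unsigned int order,
						   bool pfmemalloc)
{
	/* catch callers passing a non page-aligned va or an order
	 * outside the bits reserved for it
	 */
	VM_BUG_ON((unsigned long)va & ~PAGE_MASK);
	VM_BUG_ON(order & ~PAGE_FRAG_CACHE_ORDER_MASK);

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	return (struct encoded_va *)((unsigned long)va | order |
			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
#else
	return (struct encoded_va *)((unsigned long)va |
			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
#endif
}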
>
> I suggest you just use a long and don't bother storing this as a
> pointer.
>
...
>>>> -
>>>> + remaining = nc->remaining & align_mask;
>>>> + remaining -= fragsz;
>>>> + if (unlikely(remaining < 0)) {
>>>
>>> Now this is just getting confusing. You essentially just added an
>>> additional addition step and went back to the countdown approach I was
>>> using before except for the fact that you are starting at 0 whereas I
>>> was actually moving down through the page.
>>
>> Does the 'additional addition step' mean the additional step to
>> calculate the offset using the new 'remaining' field? I guess that is
>> the disadvantage of changing 'offset' to 'remaining', but it also has
>> some advantages:
>>
>> 1. By replacing 'offset' with 'remaining', which is the remaining
>>    size of the cache in a 'page_frag_cache' instance, we are able
>>    to do a single 'fragsz > remaining' check for the case of the
>>    cache not being big enough. That should be the fast path if we
>>    ensure the size is zero when 'va' == NULL by memset'ing
>>    'struct page_frag_cache' in page_frag_cache_init() and
>>    page_frag_cache_drain().
>> 2. It seems more convenient to implement the commit/probe API too
>>    when using 'remaining' instead of 'offset', as those APIs need
>>    the remaining size of the page_frag_cache anyway.
>>
>> So it is really a trade-off between using 'offset' and 'remaining',
>> much like the similar argument about the trade-off between allocating
>> fragments the 'countdown' way and the 'countup' way.
>>
>> About the confusing part: nc->remaining does mean how much cache is
>> left in the 'nc', and it does naturally go from
>> PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE down to zero, if that is what you
>> meant by 'countdown'; but in my understanding it is different from
>> the 'offset countdown' used before this patchset.
>>
>>>
>>> What I would suggest doing, since "remaining" is a negative offset
>>> anyway, would be to look at just storing it as a signed negative
>>> number. At least with that you can keep to your original approach and
>>> would only have to change your check to be "remaining + fragsz <= 0".
>>
>> Did you mean using s16/s32 for 'remaining', and setting nc->remaining
>> like below?
>> nc->remaining = -PAGE_SIZE or
>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
>
> Yes. Basically if we used it as a signed value then you could just work
> your way up and do your aligned math still.
For the aligned math, I am not sure how 'align_mask' can be applied to
a signed value yet. It seems that after masking, nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of towards zero
as it does for an unsigned value, for example:
nc->remaining = -4094;
nc->remaining &= -64;
It seems we get -4096 for the above; is that what we are expecting?
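A quick userspace check of the above (illustrative only; -64 stands in
for the align_mask of a 64-byte alignment):

#include <stdio.h>

int main(void)
{
	/* 2 bytes already handed out of a 4096-byte page */
	int remaining = -4094;

	/* signed: masking rounds away from zero, i.e. back over the
	 * 2 bytes already allocated; prints -4096
	 */
	printf("%d\n", remaining & -64);

	/* unsigned: masking rounds towards zero, only skipping free
	 * bytes to reach the alignment; prints 4032
	 */
	printf("%u\n", 4094u & (unsigned int)-64);

	return 0;
}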
>
>> struct page_frag_cache {
>> struct encoded_va *encoded_va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>> u16 pagecnt_bias;
>> s16 remaining;
>> #else
>> u32 pagecnt_bias;
>> s32 remaining;
>> #endif
>> };
>>
>> If I understand the above correctly, it seems we really need a better
>> name than 'remaining' to reflect that.
>
> It would effectively work like a countdown. Instead of T - X in this
> case it is size - X.
>
>>> With that you can still do your math but it becomes an addition instead
>>> of a subtraction.
>>
>> And I am not really sure what the gain is of using an addition
>> instead of a subtraction here.
>>
>
> The "remaining" as it currently stands is doing the same thing so odds
> are it isn't too big a deal. It is just that the original code was
> already somewhat confusing and this is just making it that much more
> complex.
>
>>>> + page = virt_to_page(encoded_va);
>>>> if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>> goto refill;
>>>>
>>>> - if (unlikely(nc->pfmemalloc)) {
>>>> - free_unref_page(page, compound_order(page));
>>>> + if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> + VM_BUG_ON(compound_order(page) !=
>>>> + encoded_page_order(encoded_va));
>>>> + free_unref_page(page, encoded_page_order(encoded_va));
>>>> goto refill;
>>>> }
>>>>
>>>> /* OK, page count is 0, we can safely set it */
>>>> set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>>
>>>> - /* reset page count bias and offset to start of new frag */
>>>> + /* reset page count bias and remaining of new frag */
>>>> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> - offset = 0;
>>>> - if (unlikely(fragsz > PAGE_SIZE)) {
>>>> + nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
>>>> + remaining -= fragsz;
>>>> + if (unlikely(remaining < 0)) {
>>>> /*
>>>> * The caller is trying to allocate a fragment
>>>> * with fragsz > PAGE_SIZE but the cache isn't big
>>>
>>> I find it really amusing that you went to all the trouble of flipping
>>> the logic just to flip it back to being a countdown setup. If you were
>>> going to bother with all that then why not just make the remaining
>>> negative instead? You could save yourself a ton of trouble that way and
>>> all you would need to do is flip a few signs.
>>
>> I am not sure I understand the 'ton of trouble' part here. If
>> 'flipping a few signs' does save a ton of trouble, I would like to
>> avoid that trouble, but I do not really understand the gain here yet,
>> as mentioned above.
>
> It isn't about flipping the signs. It is more the fact that the logic
> has now become even more complex than it originally was. With my
> work-backwards approach the advantage was that the offset could be
> updated and then we just recorded everything and reported it up. Now
> we have to
I suppose the above is referring to the 'offset countdown' way used
before this patchset, right?
> keep a temporary "remaining" value, generate our virtual address and
> store that to a temp variable, we can record the new "remaining" value,
> and then we can report the virtual address. If we get the ordering of
Yes, I noticed it when coding too; that is why the current virtual
address is generated in page_frag_cache_current_va() based on
nc->remaining instead of the local variable 'remaining', before the
local variable 'remaining' is assigned to nc->remaining. But I am not
sure I understand how using a signed negative number for 'remaining'
would change the above steps. If it does not, I still fail to see the
gain of using a signed negative number for 'nc->remaining'.
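For reference, the order currently is roughly the below, simplified
from the patch with the refill path omitted:

	/* 1. compute the new remaining into a local variable */
	remaining = nc->remaining & align_mask;
	remaining -= fragsz;
	if (unlikely(remaining < 0))
		goto refill;

	/* 2. derive va while nc->remaining still holds the old value,
	 * so no temporary variable for va is needed
	 */
	va = page_frag_cache_current_va(nc);

	/* 3. only then write the local value back */
	nc->remaining = remaining;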
> steps 2 and 3 in this wrong, it will cause issues as we will be
> pointing to the wrong values. It is something I can see someone easily
> messing up.
Yes, agreed. It would be good to be more specific about how to avoid
the above problem using a signed negative number for 'remaining', as
I am not sure how that can be done yet.
>