Message-ID: <96b04ebb7f46d73482d5f71213bd800c8195f00d.camel@gmail.com>
Date: Wed, 10 Jul 2024 08:28:57 -0700
From: Alexander H Duyck <alexander.duyck@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, Andrew Morton
<akpm@...ux-foundation.org>, linux-mm@...ck.org
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space
for 'size' and 'pfmemalloc'
On Wed, 2024-07-03 at 20:33 +0800, Yunsheng Lin wrote:
> On 2024/7/2 22:55, Alexander H Duyck wrote:
>
> ...
>
> >
> > > >
> > > > > +#define PAGE_FRAG_CACHE_ORDER_MASK GENMASK(7, 0)
> > > > > +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT BIT(8)
> > > > > +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT 8
> > > > > +
> > > > > +static inline struct encoded_va *encode_aligned_va(void *va,
> > > > > + unsigned int order,
> > > > > + bool pfmemalloc)
> > > > > +{
> > > > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > > > > - __u16 offset;
> > > > > - __u16 size;
> > > > > + return (struct encoded_va *)((unsigned long)va | order |
> > > > > + pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
> > > > > #else
> > > > > - __u32 offset;
> > > > > + return (struct encoded_va *)((unsigned long)va |
> > > > > + pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
> > > > > +#endif
> > > > > +}
> > > > > +
> >
> > This is missing any and all protection of the example you cited. If I
> > want to pass order as a 32b value I can, and I can corrupt the virtual
> > address. Same thing with pfmemalloc. Let's only hope you don't hit an
> > architecture where a bool is a u8 in size, as otherwise that shift is
> > going to wipe out the value; and if it is a u32, which is usually the
> > case, let's hope they stick to the values of 0 and 1.
>
> I explicitly checked that the protection is not really needed, for
> performance reasons:
> 1. For the 'pfmemalloc' part, it does always stick to the values of 0
> and 1, as below:
> https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053
> 
> 2. For the 'order' part, its value can only be in the range 0~3.
>
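For illustration, here is a user-space sketch (not the patch itself) of what a masked version of the encode helper could look like; the mask and shift values mirror the defines quoted above, but the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the encoding under discussion:
 * the page order lives in bits 7:0 and the pfmemalloc flag in bit 8
 * of an otherwise page-aligned virtual address. */
#define ORDER_MASK       0xffUL       /* GENMASK(7, 0) */
#define PFMEMALLOC_BIT   (1UL << 8)   /* BIT(8) */
#define PFMEMALLOC_SHIFT 8

static unsigned long encode_va(unsigned long va, unsigned int order,
			       bool pfmemalloc)
{
	/* Masking 'order' guards against a caller passing a wide value;
	 * '!!' normalizes any nonzero truth value to exactly 1 before
	 * the shift, so an odd bool representation cannot clobber the
	 * address bits above bit 8. */
	return va | (order & ORDER_MASK) |
	       ((unsigned long)!!pfmemalloc << PFMEMALLOC_SHIFT);
}

static unsigned long decoded_order(unsigned long encoded)
{
	return encoded & ORDER_MASK;
}

static bool decoded_pfmemalloc(unsigned long encoded)
{
	return encoded & PFMEMALLOC_BIT;
}
```

The extra mask and `!!` are exactly the protections being asked for; whether their cost matters on the fast path is the open question in this thread.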
> >
> > > > > +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
> > > > > +{
> > > > > +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > > > > + return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
> > > > > +#else
> > > > > + return 0;
> > > > > +#endif
> > > > > +}
> > > > > +
> > > > > +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
> > > > > +{
> > > > > + return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
> > > > > +}
> > > > > +
> > > >
> > > > My advice is that if you just make encoded_va an unsigned long this
> > > > just becomes some FIELD_GET and bit operations.
> > >
> > > As above.
> > >
> >
> > The code you mentioned had one extra block of bits that was in it and
> > had strict protections on what went into and out of those bits. You
> > don't have any of those protections.
>
> As above, the protection masking/checking is explicitly avoided for
> performance reasons, as explained above for encoded_va.
> 
> But I guess it doesn't hurt to have a VM_BUG_ON() check to catch
> possible future mistakes.
>
> >
> > I suggest you just use a long and don't bother storing this as a
> > pointer.
> >
>
> ...
>
> > > > > -
> > > > > + remaining = nc->remaining & align_mask;
> > > > > + remaining -= fragsz;
> > > > > + if (unlikely(remaining < 0)) {
> > > >
> > > > Now this is just getting confusing. You essentially just added an
> > > > additional addition step and went back to the countdown approach I was
> > > > using before except for the fact that you are starting at 0 whereas I
> > > > was actually moving down through the page.
> > >
> > > Does the 'additional addition step' mean the additional step to calculate
> > > the offset using the new 'remaining' field? I guess that is the disadvantage
> > > of changing 'offset' to 'remaining', but it also has some advantages:
> > >
> > > 1. By replacing 'offset' with 'remaining', which is the remaining
> > >    size for the cache in a 'page_frag_cache' instance, we are able
> > >    to do a single 'fragsz > remaining' check for the case of the
> > >    cache not being big enough, which should be the fast path if we
> > >    ensure size is zero when 'va' == NULL by memset'ing
> > >    'struct page_frag_cache' in page_frag_cache_init() and
> > >    page_frag_cache_drain().
> > > 2. It seems more convenient to implement the commit/probe APIs too
> > >    when using 'remaining' instead of 'offset', as those APIs need
> > >    the remaining size of the page_frag_cache anyway.
> > >
> > > So it is really a trade-off between using 'offset' and 'remaining',
> > > similar to the argument about the trade-off between allocating
> > > fragments in a 'countdown' versus 'countup' way.
> > >
> > > About the confusing part: nc->remaining does mean how much cache
> > > is left in 'nc', and it does naturally count down from
> > > PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE to zero, if that is what you
> > > meant by 'countdown'; but as I understand it, that is different
> > > from the 'offset countdown' before this patchset.
> > >
> > > >
> > > > What I would suggest doing since "remaining" is a negative offset
> > > > anyway would be to look at just storing it as a signed negative number.
> > > > At least with that you can keep to your original approach and would
> > > > only have to change your check to be for "remaining + fragsz <= 0".
> > >
> > > Did you mean using s16/s32 for 'remaining' and setting nc->remaining
> > > like below?
> > > nc->remaining = -PAGE_SIZE or
> > > nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
> >
> > Yes. Basically if we used it as a signed value then you could just work
> > your way up and do your aligned math still.
>
> For the aligned math, I am not sure how 'align_mask' can be applied to
> a signed value yet. It seems that after masking, nc->remaining leans
> towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of towards zero,
> as it would for an unsigned value, for example:
> 
> nc->remaining = -4094;
> nc->remaining &= -64;
> 
> It seems we get -4096 for the above; is that what we are expecting?
No, you have to do an addition and then apply the mask, as you were
doing before with __ALIGN_KERNEL.
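To make the add-then-mask point concrete, here is a small user-space sketch (the `ALIGN_K` macro mimics the kernel's `__ALIGN_KERNEL()`; `align_remaining` is a hypothetical helper, not code from the patch). It assumes the usual two's-complement behaviour of `&` on signed values, which the kernel relies on as well:

```c
#include <assert.h>

/* User-space stand-in for the kernel's __ALIGN_KERNEL(x, a):
 * round x up to the next multiple of a (a must be a power of 2). */
#define ALIGN_K(x, a) (((x) + (a) - 1) & ~((a) - 1))

static long align_remaining(long remaining, long align)
{
	/* A plain 'remaining &= -align' on a negative value rounds
	 * away from zero (e.g. -4094 & -64 == -4096).  Adding
	 * (align - 1) first makes the mask round toward zero, which
	 * is the direction a count-up allocator needs. */
	return ALIGN_K(remaining, align);
}
```

So `-4094` aligned to 64 becomes `-4032` (toward zero) rather than the `-4096` that the bare mask in the example above produces.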
As it stands, I realized your alignment approach in this patch is
broken. You should be aligning the remaining at the start of this
function and then storing it before you call
page_frag_cache_current_va. Instead you do it after, so the start of
your block may not be aligned to the requested mask if you have
multiple callers sharing this function or if you are passing an
unaligned size in the request.
> >
> > > struct page_frag_cache {
> > > struct encoded_va *encoded_va;
> > >
> > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
> > > u16 pagecnt_bias;
> > > s16 remaining;
> > > #else
> > > u32 pagecnt_bias;
> > > s32 remaining;
> > > #endif
> > > };
> > >
> > > If I understand above correctly, it seems we really need a better name
> > > than 'remaining' to reflect that.
> >
> > It would effectively work like a countdown. Instead of T - X in this
> > case it is size - X.
> >
> > > > With that you can still do your math but it becomes an addition instead
> > > > of a subtraction.
> > >
> > > And I am not really sure what the gain is in using an addition
> > > instead of a subtraction here.
> > >
> >
> > The "remaining" as it currently stands is doing the same thing, so
> > odds are it isn't too big a deal. It is just that the original code
> > was already somewhat confusing, and this is making it that much more
> > complex.
> >
> > > > > + page = virt_to_page(encoded_va);
> > > > > if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
> > > > > goto refill;
> > > > >
> > > > > - if (unlikely(nc->pfmemalloc)) {
> > > > > - free_unref_page(page, compound_order(page));
> > > > > + if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
> > > > > + VM_BUG_ON(compound_order(page) !=
> > > > > + encoded_page_order(encoded_va));
> > > > > + free_unref_page(page, encoded_page_order(encoded_va));
> > > > > goto refill;
> > > > > }
> > > > >
> > > > > /* OK, page count is 0, we can safely set it */
> > > > > set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
> > > > >
> > > > > - /* reset page count bias and offset to start of new frag */
> > > > > + /* reset page count bias and remaining of new frag */
> > > > > nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> > > > > - offset = 0;
> > > > > - if (unlikely(fragsz > PAGE_SIZE)) {
> > > > > + nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
> > > > > + remaining -= fragsz;
> > > > > + if (unlikely(remaining < 0)) {
> > > > > /*
> > > > > * The caller is trying to allocate a fragment
> > > > > * with fragsz > PAGE_SIZE but the cache isn't big
> > > >
> > > > I find it really amusing that you went to all the trouble of flipping
> > > > the logic just to flip it back to being a countdown setup. If you were
> > > > going to bother with all that then why not just make the remaining
> > > > negative instead? You could save yourself a ton of trouble that way and
> > > > all you would need to do is flip a few signs.
> > >
> > > I am not sure I understand the 'ton of trouble' part here. If
> > > 'flipping a few signs' does save a ton of trouble, I would like to
> > > do that, but I do not really understand the gain yet, as mentioned
> > > above.
> >
> > It isn't about flipping the signs. It is more the fact that the logic
> > has now become even more complex than it originally was. With my
> > work-backwards approach the advantage was that the offset could be
> > updated and then we just recorded everything and reported it up. Now we have to
>
> I am supposing the above is referring to 'offset countdown' way
> before this patchset, right?
>
> > keep a temporary "remaining" value, generate our virtual address and
> > store that to a temp variable, we can record the new "remaining" value,
> > and then we can report the virtual address. If we get the ordering on
>
> Yes, I noticed that when coding too; that is why the current virtual
> address is generated in page_frag_cache_current_va() based on
> nc->remaining instead of the local variable 'remaining', before the
> local variable 'remaining' is assigned to nc->remaining. But I am not
> sure I understand how using a signed negative number for 'remaining'
> would change the above steps. If it does not, I still fail to see the
> gain of using a signed negative number for 'nc->remaining'.
>
> > the steps 2 and 3 in this it will cause issues as we will be pointing
> > to the wrong values. It is something I can see someone easily messing
> > up.
>
> Yes, agreed. It would be good to be more specific about how to avoid
> the above problem using a signed negative number for 'remaining', as
> I am not sure how it can be done yet.
>
My advice would be to go back to patch 3 and figure out how to do this
re-ordering without changing the alignment behaviour. The old code
essentially aligned both the offset and fragsz by combining the two and
then doing the alignment. Since you are doing a count-up setup you will
need to align the remaining, then add fragsz, and then I guess you
could store remaining and subtract fragsz from your final virtual
address to get back to where the starting offset is actually located.
Basically your "remaining" value isn't a safe number to use as an
offset since it isn't aligned to your starting value at any point.
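The ordering described above can be sketched in user space as follows. This is a hypothetical model, not the patch's code: `frag_cache`, `frag_alloc`, and the field layout are invented for illustration, with 'remaining' kept as a signed value running from -size up toward zero, and `ALIGN_K` standing in for __ALIGN_KERNEL():

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a (power-of-2 a); for a negative
 * count-up value this rounds toward zero. */
#define ALIGN_K(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct frag_cache {
	char *va;       /* base of the cached page(s) */
	long size;      /* total cache size */
	long remaining; /* runs from -size up toward 0 */
};

static void *frag_alloc(struct frag_cache *c, long fragsz, long align)
{
	/* Step 1: align the remaining FIRST, so the fragment start is
	 * aligned regardless of what the previous caller requested. */
	long aligned = ALIGN_K(c->remaining, align);

	/* Step 2: advance by fragsz and check for room in one go. */
	if (aligned + fragsz > 0)
		return NULL;	/* not enough cache left */

	c->remaining = aligned + fragsz;

	/* Step 3: the end of the cache is va + size; stepping back
	 * fragsz from the new remaining lands on the aligned start of
	 * this fragment. */
	return c->va + c->size + c->remaining - fragsz;
}
```

With this ordering the stored 'remaining' is always derived from an aligned start, so no caller can inherit a misaligned offset from the previous request.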
As far as the negative value, it is more about making it easier to keep
track of what is actually going on. Basically we can use regular
pointer math, and as such I suspect the compiler is having to emit
extra instructions to flip your value negative before it can combine
the values via something like the LEA (load effective address)
instruction.