lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZNpsEN5CPqlVpGHp@x1n>
Date:   Mon, 14 Aug 2023 14:01:52 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Yu Zhao <yuzhao@...gle.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Yang Shi <shy828301@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Hugh Dickins <hughd@...gle.com>,
        David Hildenbrand <david@...hat.com>,
        Ryan Roberts <ryan.roberts@....com>
Subject: Re: [PATCH RFC] mm: Properly document tail pages for compound pages

On Thu, Aug 10, 2023 at 10:43:44PM +0100, Matthew Wilcox wrote:
> On Thu, Aug 10, 2023 at 04:49:44PM -0400, Peter Xu wrote:
> > Tail page struct reuse is over-comlicated.  Not only because we have
> > implicit uses of tail page fields (mapcounts, or private for thp swap
> > support, etc., that we _may_ still use in the page structs, but not obvious
> > the relationship between that and the folio definitions), but also because
> > we have 32/64 bits layouts for struct page so it's unclear what we can use
> > and what we cannot when trying to find a new spot in folio struct.
> 
> I do not like this patch.
> 
> > We also have tricks like page->mapping, where we can reuse only the tail
> > page 1/2 but nothing more than tail page 2.  It is all mostly hidden, until
> > someone starts to read into a VM_BUG_ON_PAGE() of __split_huge_page_tail().
> 
> We can change those BUG_ON if we want to reuse mapping in more tail pages.
> Ask!
> 
> > Let's document it clearly on what we can use and what we can't, with 100%
> > explanations on each of them.  Hopefully this will make:
> 
> The explanations are still very page centric.  I do not like the style
> of them, nor how you explain them.
> 
> > One pitfall is I'll need to split part of the tail page 1 definition into
> > 32/64 bits differently, that introduced some duplications on the fields.
> > But hopefully that's worthwhile as it makes everything crystal clear.  Not
> > to mention that "pitfall" also brings a benefit that we can actually define
> > fields in different order for 32/64 bits when we want.
> 
> No.  This is going to ruin kernel-doc.

My fault. I'll make sure kernel-doc will keep looking sane.

It'll mean that my below trick on separating for 32/64 bits with duplicated
fields will stop working, but I found some other way to represent this.

> 
> > +	/*
> > +	 * Some of the tail page fields (out of 8 WORDs for either 32/64
> 
> There's your first mistake; struct page is not necessarily 8 WORDs.
> You've got 7 words for sure, then on 32-bit you have 8 because atomic_t
> is word-sized.  memcg_data might be the 9th word, virtual could be
> the tenth, two awful kmsan intrusions could bring it to twelve, and
> last_cpupid could bring it to thirteen.
> 
> Sure, it's 8 words on x86-64 with CONFIG_MEMCG enabled.  But that's
> just your system.

True.  I messed up the mapcount/refcount position on 64 bits, sorry.

I hope atomic_t can always be assumed as 4 bytes though, and should hardly
change trivially, so I can still rely upon.

> 
> > +	 * bits archs) may not be reused by the folio object because
> > +	 * they're already been used by the page struct:
> > +	 *
> > +	 * |-------+---------------|
> > +	 * | Index | Field         |
> > +	 * |-------+---------------|
> > +	 * |     0 | flag          |
> > +	 * |     1 | compound_head |
> > +	 * |     2 | N/A [0]       |
> > +	 * |     3 | mapping [1]   |
> > +	 * |     4 | N/A [0]       |
> > +	 * |     5 | private [2]   |
> > +	 * |     6 | mapcount      |
> > +	 * |     7 | N/A [0]       |
> 
> This is wrong.  You mustn't reuse refcount.  refcount must remain 0 on
> all tail pages.  And you can't use anything after refcount, because it's
> all optional on various configurations.

I got the answer in the other thread.  I still don't know a full list of
things that we may require that, the only two things I got now are: (1)
fast-gup over any-sized folio on speculative boosts of refs, and (2) pfn
walkers like has_unmovable_pages().

I'll make an example here describing this, but if anyone can have a full
list of why it must keep zero here I think it'll be great to document all
of them, and this seems to be one of the best places to do it.

> 
> > +	 * |-------+---------------|
> > +	 *
> > +	 * [0] "N/A" marks fields that are available to leverage for the
> > +	 *     large folio.
> 
> N/A is a bad way to say this.  "Free" or "Available" would be better.

Indeed.  I'll choose FREE.

> 
> > +	 * [1] "mapping" field is only used for sanity check, see
> > +	 *     TAIL_MAPPING.  Still valid to use for tail pages 1/2.
> > +	 *     (for that, see __split_huge_page_tail()).
> 
> No, definitely wrong to document this.

I can document it in a "less detailed" way.  I tend to introduce one macro
TAIL_MAPPING_REUSED_MAX=2 and use it in __split_huge_page_tail().

> 
> > +	 * [2] "private" field is used when THP_SWAP is on (disabled on 32
> > +	 *     bits, or on hugetlb folios) .
> 
> Ugh, this needs to be fixed, not documented.  If you really must
> document it, at least say that this needs to be fixed.

Sure.  Any further description on the problem and "a proper fix"?  I will
document that as "may need fixing" for now, but I'd try to reference more
on it if I can understand better on the problem.

> 
> > +	 */
> >  	union {
> >  		struct {
> > +	/* WORD 0-1: not valid to reuse */
> 
> ... so now you're repeating all the information you provided above?

I'll try to remove many of them, but keep some so we read the fields
easier, knowing the offset of the WORD.  I'm still open to remove all of
them if you think that's preferred.

> 
> >  			unsigned long _flags_1;
> >  			unsigned long _head_1;
> > -	/* public: */
> 
> ... did you check kernel-doc after removing this?
> 
> > +	/* WORD 2 */
> >  			unsigned char _folio_dtor;
> >  			unsigned char _folio_order;
> > +			unsigned char _holes_1[2];
> 
> No.  If you need to search for holes, use pahole.

Actually using _holes_* is definitely not right, because holes are
literally free spaces that can be reused.  Here IMHO I should describe it
as two free bytes just like below, and that's one major goal of such a
patch, to be clear on what can be still reused in tail pages.

> 
> > +#ifdef CONFIG_64BIT
> >  			atomic_t _entire_mapcount;
> > +	/* WORD 3 */
> >  			atomic_t _nr_pages_mapped;
> >  			atomic_t _pincount;
> > -#ifdef CONFIG_64BIT
> > +	/* WORD 4 */
> >  			unsigned int _folio_nr_pages;
> > +			unsigned int _reserved_1_1;
> > +	/* WORD 5-6: not valid to reuse */
> > +			unsigned long _used_1_2[2];

(These "_used*" things will go away; I took another trick to represent only
 "free" slots not "used", side effect of making kernel-doc happy)

> > +	/* WORD 7 */
> > +			unsigned long _reserved_1_2;
> > +#else
> > +	/* WORD 3 */
> > +			atomic_t _entire_mapcount;
> > +	/* WORD 4 */
> > +			atomic_t _nr_pages_mapped;
> > +	/* WORD 5: only valid for 32bits */
> > +			atomic_t _pincount;
> > +	/* WORD 6: not valid to reuse */
> > +			unsigned long _used_1_2;
> > +	/* WORD 7 */
> > +			unsigned long _reserved_1;
> >  #endif
> > -	/* private: the union with struct page is transitional */
> >  		};
> > +	/* private: the union with struct page is transitional */
> 
> You don't understand why I did it like this.  Again, you have to build
> the kernel-doc and you'll see why the private: and public: markers are
> where they are.
> 
> There was even a thread on it, a year or two ago, where I outlined the
> various tradeoffs between complexity of the output and the complexity
> of the input.

Links would be appreciated, at the meantime I'll send a rfcv2.  I hope that
addresses most of the issues above while still provide most of what I
wanted to provide; not sure about this last one though before I get more
information.  Let me know if the whole idea just won't work, then I can
always stop.

-- 
Peter Xu

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ