linux-kernel - Re: Linux regressions report for mainline [2023-02-11]


Message-ID: <Y+hl552juPj8BNux@casper.infradead.org>
Date:   Sun, 12 Feb 2023 04:07:03 +0000
From:   Matthew Wilcox <willy@...radead.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     "Regzbot (on behalf of Thorsten Leemhuis)" 
        <regressions@...mhuis.info>, Vlastimil Babka <vbabka@...e.cz>,
        David Chen <david.chen@...anix.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux regressions mailing list <regressions@...ts.linux.dev>
Subject: Re: Linux regressions report for mainline [2023-02-11]

On Sat, Feb 11, 2023 at 02:31:53PM -0800, Linus Torvalds wrote:
> On Sat, Feb 11, 2023 at 1:39 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > Or even just reverting the original commit e320d3012d25
> > ("mm/page_alloc.c: fix freeing non-compound pages") and say that the
> > (very rare) memory leak is much less dangerous than that hacky fix
> > (that was buggy).
> >
> > Because it's a bit dodgy how commit e320d3012d25 ends up hooking into
> > __free_pages(),
> 
> Actually, that's not the only dodgy thing about it.
> 
> It assumes that any multi-order page allocator user doesn't use the
> page counts and only ever has a single "alloc" and a "free".
> 
> And apparently that assumption is correct, or we'd have seen a lot of problems.
> 
> But it *also* assumes that the speculative page alloc/free was for one
> single page, and while that used to be true, the whole higher-order
> folio code means that it's not necessarily true any more.
> 
> Or rather, I guess it *is* true in practice, but if you ever want to
> enable 16kB folios on some filesystem, that commit e320d3012d25 is
> just plain unfixably buggy.
> 
> Are we there yet? Clearly not, considering bugs like this. But it all
> does make me go "Hmm, maybe we'd be better off with the outright
> revert and accept the unlikely memory leak for now".
> 
> Willy?

OK, you've somehow got hold of the wrong end of this problem and that's
why you think it's larger than it is.

Compound pages are not the problem.  They carry their size with them, and
when the refcount drops to zero, we free the whole allocation as one unit.

The problem is high-order allocations that don't set __GFP_COMP.
They don't record the size of the allocation.  And so we had this problem
where if there's a speculative refcount on the first page while the owner
calls __free_pages(), the tail pages weren't freed.  And the first page
isn't a head page, it looks like an order-0 page, so when the speculative
owner calls put_page(), we still don't free the tail pages.
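
The sequence described above can be replayed in a small userspace toy model.
This is only a sketch under stated assumptions: struct toy_page and every
toy_* function are hypothetical stand-ins invented for illustration, not
kernel code, and the "old" free path models only the behaviour described in
this message (free the allocation only when the caller drops the last
reference).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page frame; NOT the kernel's struct page. */
struct toy_page {
    int refcount;   /* only meaningful on the first page of an allocation */
    bool is_free;   /* true once handed back to the imaginary allocator */
};

/* Models a high-order allocation without __GFP_COMP:
 * the order is not recorded anywhere in the pages. */
static void toy_alloc_pages(struct toy_page *p, unsigned int order)
{
    for (size_t i = 0; i < ((size_t)1 << order); i++) {
        p[i].refcount = 0;
        p[i].is_free = false;
    }
    p[0].refcount = 1;
}

static void toy_free_range(struct toy_page *p, unsigned int order)
{
    for (size_t i = 0; i < ((size_t)1 << order); i++)
        p[i].is_free = true;
}

/* Assumed pre-fix behaviour: frees only if this call drops the last ref. */
static void toy_free_pages_old(struct toy_page *p, unsigned int order)
{
    if (--p->refcount == 0)
        toy_free_range(p, order);
}

/* The speculative holder's put: the first page looks like an order-0 page. */
static void toy_put_page(struct toy_page *p)
{
    if (--p->refcount == 0)
        toy_free_range(p, 0);
}

/* Replays the race on an order-2 allocation; returns pages never freed. */
static size_t toy_replay_leak(void)
{
    struct toy_page pages[4];
    size_t leaked = 0;

    toy_alloc_pages(pages, 2);
    pages[0].refcount++;            /* speculative reference on first page */
    toy_free_pages_old(pages, 2);   /* owner frees: 2 -> 1, nothing freed */
    toy_put_page(&pages[0]);        /* 1 -> 0, but only pages[0] is freed */

    for (size_t i = 0; i < 4; i++)
        if (!pages[i].is_free)
            leaked++;
    return leaked;
}
```

In this model toy_replay_leak() reports the three tail pages of the order-2
allocation as leaked, because nothing that still holds a reference knows the
allocation was larger than one page.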

Somewhat complicating this is that some places which allocate a
compound page free it by calling __free_pages().  It's not wrong,
but we can't free the tail pages at this time because they'll be
freed by put_page().  So that's why we're testing PageHead() --
it's asking "Is this a compound page?"
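
A rough userspace sketch of that check follows.  Everything here is an
assumption made for illustration: struct toy_page, its head and order fields,
and the toy_* functions are hypothetical names, with head standing in for
PageHead().  The branch structure follows the description above (free the
tail pages immediately only when the first page is not a head page); the real
kernel code may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page frame; NOT the kernel's struct page. */
struct toy_page {
    int refcount;
    bool head;            /* models PageHead(): set for compound allocations */
    unsigned int order;   /* recorded only when the allocation is compound */
    bool is_free;
};

static void toy_free_range(struct toy_page *p, unsigned int order)
{
    for (size_t i = 0; i < ((size_t)1 << order); i++)
        p[i].is_free = true;
}

static void toy_alloc(struct toy_page *p, unsigned int order, bool compound)
{
    for (size_t i = 0; i < ((size_t)1 << order); i++)
        p[i] = (struct toy_page){ 0 };
    p[0].refcount = 1;
    p[0].head = compound;
    p[0].order = compound ? order : 0;
}

/* Owner's free with the check: if someone else still holds a reference and
 * this is not a compound page, release the tail pages right away. */
static void toy_free_pages(struct toy_page *p, unsigned int order)
{
    if (--p->refcount == 0)
        toy_free_range(p, order);
    else if (!p->head)
        while (order-- > 0)
            toy_free_range(p + ((size_t)1 << order), order);
}

/* Last put: a compound page carries its size, a plain page looks order-0. */
static void toy_put_page(struct toy_page *p)
{
    if (--p->refcount == 0)
        toy_free_range(p, p->order);
}

/* Order-2 allocation with one speculative reference; returns how many of
 * the four pages are free after the owner's free (and optional final put). */
static size_t toy_scenario(bool compound, bool do_put)
{
    struct toy_page pages[4];
    size_t freed = 0;

    toy_alloc(pages, 2, compound);
    pages[0].refcount++;           /* speculative reference */
    toy_free_pages(pages, 2);      /* owner's free, refcount 2 -> 1 */
    if (do_put)
        toy_put_page(&pages[0]);   /* speculative holder lets go */

    for (size_t i = 0; i < 4; i++)
        freed += pages[i].is_free ? 1 : 0;
    return freed;
}
```

In this model the non-compound case frees the three tail pages at the owner's
free and the first page at the final put, while the compound case frees
nothing until the final put and then releases all four pages as one unit.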

What I was vaguely afraid of was that some code would do something like:

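p = alloc_pages(gfp, 2);	/* order-2 allocation, gfp without __GFP_COMP */
get_page(p);			/* extra reference on the first page */
__free_pages(p, 2);		/* not the last reference */
/* ... do something with p[1] ... */
__free_pages(p, 2);		/* drops the last reference */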

but it seems like nobody does that, or we'd've seen complaints by now.

If we could get rid of all non-compound allocations, I'd be happy, but I
haven't even looked to see how hard that would be.  Slab, page cache and
anon memory all use compound pages, including hugetlb.  I think crypto
is the main remaining user of non-compound high-order allocations.