linux-kernel - Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

Message-ID: <20230919064914.GA124289@cmpxchg.org>
Date:   Tue, 19 Sep 2023 02:49:14 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Zi Yan <ziy@...dia.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> On 09/18/23 10:52, Johannes Weiner wrote:
> > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > 
> > > > With the patch below applied, a slightly different workload triggers the
> > > > following warnings.  It seems related, and appears to go away when
> > > > reverting the series.
> > > > 
> > > > [  331.595382] ------------[ cut here ]------------
> > > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > 
> > > Initially I thought this demonstrates the possible race I was suggesting in
> > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > are to stay, they need to handle this case. Maybe the same can happen with
> > > HIGHATOMIC blocks?

Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
show any CMA pages.

5 is actually MIGRATE_ISOLATE, not MIGRATE_CMA - see the double use of
3 for PCPTYPES and HIGHATOMIC, which puts CMA at 4.
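
For reference, a minimal sketch of how the numbering works out. It
mirrors the layout of enum migratetype in include/linux/mmzone.h as I
understand it for this kernel; the two CONFIG defines are assumptions
about the test kernel, and migratetype_name() is a hypothetical helper
for illustration only.

```c
#include <assert.h>

/* Assumption: both options are enabled in the kernel under test. */
#define CONFIG_CMA 1
#define CONFIG_MEMORY_ISOLATION 1

enum migratetype {
	MIGRATE_UNMOVABLE,                     /* 0 */
	MIGRATE_MOVABLE,                       /* 1 */
	MIGRATE_RECLAIMABLE,                   /* 2 */
	MIGRATE_PCPTYPES,                      /* 3: number of types on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES, /* 3 again: shares the value */
#ifdef CONFIG_CMA
	MIGRATE_CMA,                           /* 4 */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,                       /* 5 */
#endif
	MIGRATE_TYPES
};

/* Compile-time check of the values discussed in the thread. */
static_assert(MIGRATE_CMA == 4, "CMA should be 4 with this config");
static_assert(MIGRATE_ISOLATE == 5, "ISOLATE should be 5 with this config");

/* Hypothetical helper: map a numeric page type from a warning to a name. */
static const char *migratetype_name(int mt)
{
	switch (mt) {
	case MIGRATE_UNMOVABLE:   return "MIGRATE_UNMOVABLE";
	case MIGRATE_MOVABLE:     return "MIGRATE_MOVABLE";
	case MIGRATE_RECLAIMABLE: return "MIGRATE_RECLAIMABLE";
	case MIGRATE_HIGHATOMIC:  return "MIGRATE_HIGHATOMIC";
	case MIGRATE_CMA:         return "MIGRATE_CMA";
	case MIGRATE_ISOLATE:     return "MIGRATE_ISOLATE";
	default:                  return "unknown";
	}
}
```

Without CONFIG_CMA the same warning would print 4 for an isolated
block, so the number is only meaningful together with the config.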

> > This means we have an order-10 page where one half is MOVABLE and the
> > other is CMA.

This means the scenario is different:

We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
that the first pageblock is indeed MOVABLE. During the expand, the
second pageblock turns out to be of type MIGRATE_ISOLATE.
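
To make that concrete, here is a simplified userspace model of the
split, not the kernel code itself. It assumes 512-page pageblocks
(order 9) and one order-10 page covering two of them; expand_model(),
get_pageblock_mt() and the pageblock_mt[] table are made-up names for
illustration. The check only approximates the new freelist warning.

```c
#include <assert.h>
#include <stdio.h>

#define PAGEBLOCK_ORDER 9  /* assumption: 2M pageblocks with 4K pages */
#define NR_PAGEBLOCKS   2  /* one order-10 page spans two pageblocks */

enum { MT_MOVABLE = 1, MT_ISOLATE = 5 };

/* Head pageblock is MOVABLE, tail pageblock was left ISOLATE. */
static int pageblock_mt[NR_PAGEBLOCKS] = { MT_MOVABLE, MT_ISOLATE };

static int get_pageblock_mt(unsigned long pfn)
{
	return pageblock_mt[pfn >> PAGEBLOCK_ORDER];
}

/*
 * Split a free page of order 'high' down to order 'low'. At each step
 * the upper half would go back on the freelist of 'migratetype'; count
 * the halves whose pageblock type disagrees with that freelist.
 */
static int expand_model(unsigned long pfn, int low, int high, int migratetype)
{
	unsigned long size = 1UL << high;
	int mismatches = 0;

	while (high > low) {
		int mt;

		high--;
		size >>= 1;
		mt = get_pageblock_mt(pfn + size);
		if (mt != migratetype) {
			printf("page type is %d, passed migratetype is %d (nr=%lu)\n",
			       mt, migratetype, size);
			mismatches++;
		}
	}
	return mismatches;
}
```

Calling expand_model(0, 9, 10, MT_MOVABLE) reports a single mismatch
for the 512-page upper half, which lines up with the nr=512 in Mike's
warning. Splitting an order-9 page that sits entirely in the head
pageblock reports none.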

The page allocator wouldn't have merged those types. It triggers a bit
too fast to be a race condition.

It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
while the head is on the list, and then stranded there.

Could this be an issue in the page_isolation code? Maybe a range
rounding error?

Zi Yan, does this ring a bell for you?

I don't quite see how my patches could have caused this. But AFAICS we
also didn't have warnings for this scenario before the series, so it
could be an old bug.

> > Mike, could you describe the workload that is triggering this?
> 
> This 'slightly different workload' is actually a slightly different
> environment.  Sorry for mis-speaking!  The slight difference is that this
> environment does not use the 'alloc hugetlb gigantic pages from CMA'
> (hugetlb_cma) feature that triggered the previous issue.
> 
> This is still on a 16G VM.  Kernel command line here is:
> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> hugetlb_free_vmemmap=on"
> 
> The workload is just running this script:
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
> 
> > 
> > Does this reproduce instantly and reliably?
> > 
> 
> It is not 'instant' but will reproduce fairly reliably within a minute
> or so.
> 
> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
> will eventually be freed via __free_pages(folio, 9).

No luck reproducing this yet, but I have a question. In that crash
stack trace, the expand() is called via this:

 [  331.645847]  get_page_from_freelist+0x3ed/0x1040
 [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
 [  331.647977]  __alloc_pages+0xec/0x240
 [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
 [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
 [  331.650938]  alloc_pool_huge_folio+0xad/0x110
 [  331.651909]  set_max_huge_pages+0x17d/0x390

I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
alloc_fresh_hugetlb_folio(), which has this:

        if (hstate_is_gigantic(h))
                folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
        else
                folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
                                nid, nmask, node_alloc_noretry);

where gigantic is defined as the order exceeding MAX_ORDER, which
should be the case for 1G pages on x86.

So the crashing stack must be from a 2M allocation, no? I'm confused
how that could happen with the above test case.
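
The order arithmetic behind that question, as a small sketch. It
assumes 4K base pages (PAGE_SHIFT of 12) and MAX_ORDER of 10, which I
believe matches x86-64 on this kernel; size_to_order() and
is_gigantic() are illustrative stand-ins, not the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12  /* assumption: 4K base pages */
#define MAX_ORDER  10  /* assumption: largest buddy order on this kernel */

/* Smallest order whose page size covers 'bytes'. */
static unsigned int size_to_order(unsigned long long bytes)
{
	unsigned int order = 0;

	while ((1ULL << (PAGE_SHIFT + order)) < bytes)
		order++;
	return order;
}

/* "Gigantic" in the sense used above: the order exceeds MAX_ORDER. */
static bool is_gigantic(unsigned int order)
{
	return order > MAX_ORDER;
}
```

Under these assumptions a 1G page is order 18 and gigantic, while a 2M
page is order 9 and goes through the buddy allocator, which is the
path shown in the trace.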