linux-kernel - Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page

Message-ID: <20201120111033.GN3200@dhcp22.suse.cz>
Date:   Fri, 20 Nov 2020 12:10:33 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Muchun Song <songmuchun@...edance.com>
Cc:     Jonathan Corbet <corbet@....net>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Thomas Gleixner <tglx@...utronix.de>, mingo@...hat.com,
        bp@...en8.de, x86@...nel.org, hpa@...or.com,
        dave.hansen@...ux.intel.com, luto@...nel.org,
        Peter Zijlstra <peterz@...radead.org>, viro@...iv.linux.org.uk,
        Andrew Morton <akpm@...ux-foundation.org>, paulmck@...nel.org,
        mchehab+huawei@...nel.org, pawan.kumar.gupta@...ux.intel.com,
        Randy Dunlap <rdunlap@...radead.org>, oneukum@...e.com,
        anshuman.khandual@....com, jroedel@...e.de,
        Mina Almasry <almasrymina@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Oscar Salvador <osalvador@...e.de>,
        "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>,
        Xiongchun duan <duanxiongchun@...edance.com>,
        linux-doc@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        Linux Memory Management List <linux-mm@...ck.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap
 pages associated with each hugetlb page

On Fri 20-11-20 17:37:09, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@...e.com> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > [...]
> > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > @@ -117,6 +117,8 @@
> > > > >  #define RESERVE_VMEMMAP_NR           2U
> > > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > >  #define TAIL_PAGE_REUSE                      -1
> > > > > +#define GFP_VMEMMAP_PAGE             \
> > > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > >
> > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > destructive side effects. I believe it would be more reasonable to
> > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > sucks as well but it isn't destructive at least.
> > >
> > > I found the description of __GFP_MEMALLOC in the kernel documentation.
> > >
> > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > the caller guarantees the allocation will allow more memory to be freed
> > > very shortly.
> > >
> > > Our situation is in line with the description above. We will shortly free a
> > > HugeTLB page to the buddy allocator, which is much larger than what we allocated.
> >
> > Yes, that is part of the description. But read it in its entirety.
> >  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> >  * the caller guarantees the allocation will allow more memory to be freed
> >  * very shortly e.g. process exiting or swapping. Users either should
> >  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> >  * Users of this flag have to be extremely careful to not deplete the reserve
> >  * completely and implement a throttling mechanism which controls the
> >  * consumption of the reserve based on the amount of freed memory.
> >  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> >  * before using this flag.
> >
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
> 
> We want to free the HugeTLB page to the buddy allocator, but before that,
> we need to allocate some pages as vmemmap pages, so here we cannot
> handle allocation failures.

Why can't you simply refuse to shrink the pool size?

> I think that we should replace
> __GFP_RETRY_MAYFAIL with __GFP_NOFAIL.
> 
> GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
> 
> This meets our needs here. Thanks.

Please reread my concern about the disruptive behavior, or explain
why it is desirable.
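
For illustration only, and not part of the posted patch: the "refuse to
shrink" alternative can be sketched as a small userspace model. Every
name below (model_allocator, model_pool, model_alloc_pages,
model_shrink_one) is hypothetical and does not exist in the kernel, and
the page counts passed in are assumptions chosen by the caller. The
point is only the control flow: the vmemmap allocation is allowed to
fail, and on failure the pool is left untouched and an error is
returned instead of dipping into reserves or retrying forever.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_allocator {
	size_t free_pages;		/* pages available to ordinary allocations */
};

struct model_pool {
	size_t nr_huge_pages;		/* huge pages currently in the pool */
	size_t vmemmap_per_hpage;	/* vmemmap pages needed before freeing one */
	size_t base_per_hpage;		/* base pages returned when one is freed */
};

/* Failable, all-or-nothing allocation that never touches any reserve. */
static bool model_alloc_pages(struct model_allocator *a, size_t nr)
{
	if (a->free_pages < nr)
		return false;
	a->free_pages -= nr;
	return true;
}

/*
 * Try to shrink the pool by one huge page.
 * Returns 0 on success, -EINVAL if the pool is empty, and -ENOMEM if the
 * vmemmap pages cannot be allocated, in which case nothing is modified.
 */
static int model_shrink_one(struct model_pool *p, struct model_allocator *a)
{
	if (p->nr_huge_pages == 0)
		return -EINVAL;

	/* Restore the vmemmap first; refuse to shrink if that fails. */
	if (!model_alloc_pages(a, p->vmemmap_per_hpage))
		return -ENOMEM;

	/* Only now hand the huge page back as base pages. */
	p->nr_huge_pages--;
	a->free_pages += p->base_per_hpage;
	return 0;
}
```

With, say, 5 free pages and 6 vmemmap pages required per huge page, the
call returns -ENOMEM and both the pool and the free page count stay as
they were; the caller can report the failure and retry later.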

-- 
Michal Hocko
SUSE Labs