Date:   Wed, 18 Nov 2020 00:29:07 +0800
From:   Muchun Song <songmuchun@...edance.com>
To:     "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
Cc:     "corbet@....net" <corbet@....net>,
        "mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>,
        "hpa@...or.com" <hpa@...or.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "luto@...nel.org" <luto@...nel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "paulmck@...nel.org" <paulmck@...nel.org>,
        "mchehab+huawei@...nel.org" <mchehab+huawei@...nel.org>,
        "pawan.kumar.gupta@...ux.intel.com" 
        <pawan.kumar.gupta@...ux.intel.com>,
        "rdunlap@...radead.org" <rdunlap@...radead.org>,
        "oneukum@...e.com" <oneukum@...e.com>,
        "anshuman.khandual@....com" <anshuman.khandual@....com>,
        "jroedel@...e.de" <jroedel@...e.de>,
        "almasrymina@...gle.com" <almasrymina@...gle.com>,
        "rientjes@...gle.com" <rientjes@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "osalvador@...e.de" <osalvador@...e.de>,
        "mhocko@...e.com" <mhocko@...e.com>,
        "duanxiongchun@...edance.com" <duanxiongchun@...edance.com>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of
 hugetlb page

On Tue, Nov 17, 2020 at 7:08 PM Song Bao Hua (Barry Song)
<song.bao.hua@...ilicon.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Muchun Song [mailto:songmuchun@...edance.com]
> > Sent: Tuesday, November 17, 2020 11:50 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>
> > Cc: corbet@....net; mike.kravetz@...cle.com; tglx@...utronix.de;
> > mingo@...hat.com; bp@...en8.de; x86@...nel.org; hpa@...or.com;
> > dave.hansen@...ux.intel.com; luto@...nel.org; peterz@...radead.org;
> > viro@...iv.linux.org.uk; akpm@...ux-foundation.org; paulmck@...nel.org;
> > mchehab+huawei@...nel.org; pawan.kumar.gupta@...ux.intel.com;
> > rdunlap@...radead.org; oneukum@...e.com; anshuman.khandual@....com;
> > jroedel@...e.de; almasrymina@...gle.com; rientjes@...gle.com;
> > willy@...radead.org; osalvador@...e.de; mhocko@...e.com;
> > duanxiongchun@...edance.com; linux-doc@...r.kernel.org;
> > linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > linux-fsdevel@...r.kernel.org
> > Subject: Re: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of
> > hugetlb page
> >
> > On Tue, Nov 17, 2020 at 6:16 PM Song Bao Hua (Barry Song)
> > <song.bao.hua@...ilicon.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: owner-linux-mm@...ck.org [mailto:owner-linux-mm@...ck.org] On
> > > > Behalf Of Muchun Song
> > > > Sent: Saturday, November 14, 2020 12:00 AM
> > > > To: corbet@....net; mike.kravetz@...cle.com; tglx@...utronix.de;
> > > > mingo@...hat.com; bp@...en8.de; x86@...nel.org; hpa@...or.com;
> > > > dave.hansen@...ux.intel.com; luto@...nel.org; peterz@...radead.org;
> > > > viro@...iv.linux.org.uk; akpm@...ux-foundation.org; paulmck@...nel.org;
> > > > mchehab+huawei@...nel.org; pawan.kumar.gupta@...ux.intel.com;
> > > > rdunlap@...radead.org; oneukum@...e.com; anshuman.khandual@....com;
> > > > jroedel@...e.de; almasrymina@...gle.com; rientjes@...gle.com;
> > > > willy@...radead.org; osalvador@...e.de; mhocko@...e.com
> > > > Cc: duanxiongchun@...edance.com; linux-doc@...r.kernel.org;
> > > > linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > > linux-fsdevel@...r.kernel.org; Muchun Song <songmuchun@...edance.com>
> > > > Subject: [PATCH v4 00/21] Free some vmemmap pages of hugetlb page
> > > >
> > > > Hi all,
> > > >
> > > > This patch series will free some vmemmap pages(struct page structures)
> > > > associated with each hugetlbpage when preallocated to save memory.
> > > >
> > > > Nowadays we track the status of physical page frames using struct page
> > > > structures arranged in one or more arrays. There is a one-to-one
> > > > mapping between each physical page frame and its corresponding struct
> > > > page structure.
> > > >
> > > > The HugeTLB support is built on top of multiple page size support that
> > > > is provided by most modern architectures. For example, x86 CPUs normally
> > > > support 4K and 2M (1G if architecturally supported) page sizes. Every
> > > > HugeTLB has more than one struct page structure. The 2M HugeTLB has 512
> > > > struct page structures and the 1G HugeTLB has 262144 struct page
> > > > structures (4096 pages). But the core of HugeTLB only uses the first 4
> > > > struct page structures (this number comes from HUGETLB_CGROUP_MIN_ORDER)
> > > > to store metadata associated with each HugeTLB. The rest of the struct
> > > > page structures are usually only read for the compound_head field, which
> > > > has the same value in all of them. If we can free some of this struct page
> > > > memory to the buddy system, we can save a lot of memory.
> > > >
> > > > When the system boots up, every 2M HugeTLB has 512 struct page structures,
> > > > whose total size is 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
> > > >
> > > >    hugetlbpage                  struct pages(8 pages)          page frame(8 pages)
> > > >   +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > >   |           |                     |     0     | -------------> |     0     |
> > > >   |           |                     |     1     | -------------> |     1     |
> > > >   |           |                     |     2     | -------------> |     2     |
> > > >   |           |                     |     3     | -------------> |     3     |
> > > >   |           |                     |     4     | -------------> |     4     |
> > > >   |     2M    |                     |     5     | -------------> |     5     |
> > > >   |           |                     |     6     | -------------> |     6     |
> > > >   |           |                     |     7     | -------------> |     7     |
> > > >   |           |                     +-----------+                +-----------+
> > > >   |           |
> > > >   |           |
> > > >   +-----------+
> > > >
> > > >
> > > > When a hugetlbpage is preallocated, we can change the mapping from the
> > > > one above to the one below.
> > > >
> > > >    hugetlbpage                  struct pages(8 pages)          page frame(8 pages)
> > > >   +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > >   |           |                     |     0     | -------------> |     0     |
> > > >   |           |                     |     1     | -------------> |     1     |
> > > >   |           |                     |     2     | -------------> +-----------+
> > > >   |           |                     |     3     | -----------------^ ^ ^ ^ ^
> > > >   |           |                     |     4     | -------------------+ | | |
> > > >   |     2M    |                     |     5     | ---------------------+ | |
> > > >   |           |                     |     6     | -----------------------+ |
> > > >   |           |                     |     7     | -------------------------+
> > > >   |           |                     +-----------+
> > > >   |           |
> > > >   |           |
> > > >   +-----------+
> > > >
> > > > For tail pages, the value of compound_head is the same. So we can reuse
> > > > first page of tail page structs. We map the virtual addresses of the
> > > > remaining 6 pages of tail page structs to the first tail page struct,
> > > > and then free these 6 pages. Therefore, we need to reserve at least 2
> > > > pages as vmemmap areas.
> > > >
> > > > When a hugetlbpage is freed to the buddy system, we should allocate six
> > > > pages for vmemmap pages and restore the previous mapping relationship.
> > > >
> > > > If we use the 1G hugetlbpage, we can save 4086 pages (there are 4096 pages
> > > > for struct page structures; we reserve 2 pages for vmemmap and 8 pages for
> > > > page tables, so we can save 4086 pages). This is a very substantial gain.
> > > > On our server, we run some SPDK/QEMU applications which use 1024GB of
> > > > hugetlbpage. With this feature enabled, we can save ~16GB (1G hugepage) or
> > > > ~11GB (2MB hugepage).
> > >
> > > Hi Muchun,
> > >
> > > Do we really save 11GB for 2MB hugepage?
> > > How much do we save if we only get one 2MB hugetlb from one 128MB mem_section?
> > > It seems we need to get at least one page for the PTEs since we are splitting
> > > the PMD of vmemmap into PTEs?
> >
> > There are 524288 (1024GB/2MB) 2MB HugeTLB pages. We can save 6 pages for
> > each 2MB HugeTLB page, so we can save 3145728 pages. But we need to split
> > the PMD page table for every 128MB mem_section, and every section needs one
> > page as a PTE page table. So we need 8192 (1024GB/128MB) pages as PTE page
> > tables. Finally, we can save 3137536 (3145728-8192) pages, which is 11.97GB.
>
> The worst case I can see is this:
> if we get 100 hugetlb pages of 2MB each, but each of them comes from a different
> mem_section, we won't save 11.97GB. We only save 5/8 * 16GB = 10GB.
>
> Anyway, it seems 11GB is between 10GB and 11.97GB,
> so it sounds sensible :-)
>
> Ideally, we should be able to free PageTail if we change struct page in some way.
> Then we would save much more for 2MB hugetlb. But it seems that is not easy.
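
For reference, the figures quoted above can be checked with a short Python
sketch. It is only an illustration: the function and constant names are made
up for this example, and it assumes sizeof(struct page) is 64 bytes, a 4K base
page and a 128MB mem_section, as in the discussion.

```python
PAGE_SIZE = 4096          # bytes, base page size
STRUCT_PAGE_SIZE = 64     # bytes, assumed sizeof(struct page)
MIB = 1 << 20
GIB = 1 << 30


def vmemmap_pages(region_bytes):
    """Number of 4K pages holding the struct pages that describe a region."""
    return region_bytes // PAGE_SIZE * STRUCT_PAGE_SIZE // PAGE_SIZE


def saved_pages_2m(total_bytes, freed_per_hugepage=6, section_bytes=128 * MIB):
    """Net pages saved when the whole region is 2MB HugeTLB pages.

    Each 128MB mem_section costs one extra page for its PTE page table.
    """
    hugepages = total_bytes // (2 * MIB)
    pte_pages = total_bytes // section_bytes
    return hugepages * freed_per_hugepage - pte_pages


def pages_to_gib(pages):
    return pages * PAGE_SIZE / GIB


def worst_case_gib(total_bytes, freed_per_hugepage=6):
    """Barry's worst case: every huge page pays for its own PTE page,
    so only (freed - 1) of its 8 vmemmap pages are really saved."""
    net_fraction = (freed_per_hugepage - 1) / vmemmap_pages(2 * MIB)
    return net_fraction * pages_to_gib(vmemmap_pages(total_bytes))


if __name__ == "__main__":
    total = 1024 * GIB
    best = saved_pages_2m(total)
    print(best, round(pages_to_gib(best), 2))   # 3137536 11.97
    print(worst_case_gib(total))                # 10.0
```

Passing freed_per_hugepage=7 to the same functions gives a rough idea of what
freeing one more vmemmap page per 2MB HugeTLB page would be worth.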

Now for the 2MB HugeTLB page, we only free 6 vmemmap pages.
But your words woke me up. Maybe we really can free 7 vmemmap
pages. In that case, 8 of the 512 struct page structures would
appear to have the PG_head flag set. We could handle this by
adjusting compound_head() slightly so that it returns the real head
struct page when its argument is a tail struct page that has the
PG_head flag set. I will start an investigation and a test.
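
To make the idea concrete, here is a toy user-space model in Python. It is
not kernel code and nothing in it has been tested against the kernel. All
names are hypothetical, and the way it tells a real head from an aliased one
(by looking at the next struct page, which is always a tail) is only an
assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64                               # assumed sizeof(struct page)
PER_VMEMMAP_PAGE = PAGE_SIZE // STRUCT_PAGE_SIZE    # 64 struct pages per 4K page
NR_STRUCT_PAGES = 512                               # struct pages of one 2MB HugeTLB page


@dataclass
class StructPage:
    pg_head: bool = False
    head_index: Optional[int] = None    # stands in for compound_head on tail pages


def build_backing_page():
    """Contents of the single vmemmap page kept after freeing the other 7."""
    slots = [StructPage(pg_head=True)]                       # slot 0: the real head
    slots += [StructPage(head_index=0) for _ in range(PER_VMEMMAP_PAGE - 1)]
    return slots


def read_struct_page(backing, index):
    """All 8 vmemmap pages map to the same physical page, so the index wraps."""
    return backing[index % PER_VMEMMAP_PAGE]


def compound_head(backing, index):
    """Return the index of the real head struct page for struct page `index`."""
    page = read_struct_page(backing, index)
    if page.pg_head:
        # Either the real head or an alias of it. The following struct page is
        # a tail in both cases, and it records where the real head is.
        nxt = read_struct_page(backing, index + 1)
        return nxt.head_index if nxt.head_index is not None else index
    return page.head_index


if __name__ == "__main__":
    backing = build_backing_page()
    fake_heads = [i for i in range(NR_STRUCT_PAGES)
                  if read_struct_page(backing, i).pg_head]
    print(fake_heads)   # [0, 64, 128, 192, 256, 320, 384, 448]
    print({compound_head(backing, i) for i in range(NR_STRUCT_PAGES)})   # {0}
```

In this model, struct pages 64, 128, ... read back with PG_head set because
they alias struct page 0, which is where the 8 out of 512 comes from.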

Thanks.

>
> Thanks
> Barry



-- 
Yours,
Muchun
