Message-ID: <20200903162527.GF60440@carbon.dhcp.thefacebook.com>
Date:   Thu, 3 Sep 2020 09:25:27 -0700
From:   Roman Gushchin <guro@...com>
To:     Michal Hocko <mhocko@...e.com>
CC:     Zi Yan <ziy@...dia.com>, <linux-mm@...ck.org>,
        Rik van Riel <riel@...riel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        David Nellans <dnellans@...dia.com>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > From: Zi Yan <ziy@...dia.com>
> > 
> > Hi all,
> > 
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> > 
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
> 
> Please be more specific about usecases. This better have some strong
> ones because THP code is complex enough already to add on top solely
> based on a generic TLB pressure easing.

Hello, Michal!

We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
performance wins on some workloads.

Historically we allocated gigantic pages at boot time, but recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more management
than we would like to do. 1 GB THP seems to be a better alternative, so I definitely
see it as a very useful feature.

Given the cost of an allocation, I'm slightly skeptical about an automatic
heuristics-based approach, but if an application can explicitly mark target areas
with madvise(), I don't see why it wouldn't work.
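
For illustration, a minimal userspace sketch of explicit marking, assuming
the existing MADV_HUGEPAGE hint would serve as the opt-in (this thread does not
settle which interface 1 GB THP would use). It is Linux-only and written in
Python for brevity; gib_aligned_span() and hint_huge() are hypothetical helper
names. The alignment helper is there because a PUD entry can only map a
1 GB-aligned virtual range.

```python
import mmap

GIB = 1 << 30  # size of one PUD-level (1 GB) page on x86_64


def gib_aligned_span(addr, length, align=GIB):
    """Return (start, end) of the largest align-aligned range inside
    [addr, addr + length), or None if no full aligned unit fits."""
    start = -(-addr // align) * align       # round the start address up
    end = (addr + length) // align * align  # round the end address down
    return (start, end) if end > start else None


def hint_huge(length):
    """Map `length` bytes of private anonymous memory and ask the kernel
    to back it with huge pages. Returns (mapping, hinted)."""
    m = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux-only constant
    if advice is None or not hasattr(m, "madvise"):
        return m, False
    try:
        m.madvise(advice)  # a hint only; the kernel may still use 4 KB pages
        return m, True
    except OSError:
        # e.g. EINVAL when the kernel is built without THP support
        return m, False
```

The hint is advisory: a True result only means the call was accepted, not
that any huge page was actually allocated.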

In our case we'd like to have a reliable way to get 1 GB THPs at some point
(usually at the start of an application), and transparently destroy them on
the application exit.
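
One way an application could check whether it actually got huge pages is to
read its own smaps. The sketch below sums the existing AnonHugePages field;
which field the patchset would use to report PUD-sized pages is not stated in
the cover letter, so treat the field choice as an assumption. The function
names are hypothetical.

```python
def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages values (in kB) across all mappings in the
    text of a /proc/<pid>/smaps file."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            # Line format: "AnonHugePages:      2048 kB"
            total += int(line.split()[1])
    return total


def self_anon_huge_kb(path="/proc/self/smaps"):
    """Read this process's smaps; return None where procfs is unavailable."""
    try:
        with open(path) as f:
            return anon_huge_kb(f.read())
    except OSError:
        return None
```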

Once the patchset is in relatively good shape, I'll be happy to test
it in our environment and share the results.

Thanks!

> 
> > Design
> > =======
> > 
> > The 1GB THP implementation looks similar to the existing THP code, except for some
> > new designs for the additional page table level.
> > 
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> >    instead of one PTE page table page, 1GB THP requires 513 page table pages
> >    (one PMD page table page and 512 PTE page table pages) to be deposited
> >    at page allocation time, so that we can split the page later. Currently,
> >    the page table deposit is using ->lru, thus only one page can be deposited.
> >    A new pagechain data structure is added to enable multi-page deposit.
> > 
> > 2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
> >    and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
> >    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> >    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> >    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> >    page[N*512 + 3].compound_mapcount.
> > 
> > 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> >    to use something less intrusive. So all 1GB THPs are allocated from reserved
> >    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> >    THP is cleared as the resulting pages can be freed via normal page free path.
> >    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> 
> Do those pages get instantiated during the page fault or only via
> khugepaged? This is an important design detail because then we have to
> think carefully about how much automatic we want this to be. Memory
> overhead can be quite large with 2MB THPs already. Also what about the
> allocation overhead? Do you have any numbers?
> 
> Maybe all these details are described in the patchset but the cover
> letter should contain all that information. It doesn't make much sense
> to dig into details in a patchset this large without having an idea how
> feasible this is.
> 
> Thanks.
>  
> > Patch Organization
> > =======
> > 
> > Patch 01 adds the new pagechain data structure.
> > 
> > Patches 02 to 13 add 1GB THP support in various places.
> > 
> > Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
> > 
> > Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
> > 
> > Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.
> > 
> > 
> > Any suggestions and comments are welcome.
> > 
> > 
> > Zi Yan (16):
> >   mm: add pagechain container for storing multiple pages.
> >   mm: thp: 1GB anonymous page implementation.
> >   mm: proc: add 1GB THP kpageflag.
> >   mm: thp: 1GB THP copy on write implementation.
> >   mm: thp: handling 1GB THP reference bit.
> >   mm: thp: add 1GB THP split_huge_pud_page() function.
> >   mm: stats: make smap stats understand PUD THPs.
> >   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> >   mm: thp: 1GB THP support in try_to_unmap().
> >   mm: thp: split 1GB THPs at page reclaim.
> >   mm: thp: 1GB THP follow_p*d_page() support.
> >   mm: support 1GB THP pagemap support.
> >   mm: thp: add a knob to enable/disable 1GB THPs.
> >   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
> >   hugetlb: cma: move cma reserve function to cma.c.
> >   mm: thp: use cma reservation for pud thp allocation.
> > 
> >  .../admin-guide/kernel-parameters.txt         |   2 +-
> >  arch/arm64/mm/hugetlbpage.c                   |   2 +-
> >  arch/powerpc/mm/hugetlbpage.c                 |   2 +-
> >  arch/x86/include/asm/pgalloc.h                |  68 ++
> >  arch/x86/include/asm/pgtable.h                |  26 +
> >  arch/x86/kernel/setup.c                       |   8 +-
> >  arch/x86/mm/pgtable.c                         |  38 +
> >  drivers/base/node.c                           |   3 +
> >  fs/proc/meminfo.c                             |   2 +
> >  fs/proc/page.c                                |   2 +
> >  fs/proc/task_mmu.c                            | 122 ++-
> >  include/linux/cma.h                           |  18 +
> >  include/linux/huge_mm.h                       |  84 +-
> >  include/linux/hugetlb.h                       |  12 -
> >  include/linux/memcontrol.h                    |   5 +
> >  include/linux/mm.h                            |  29 +-
> >  include/linux/mm_types.h                      |   1 +
> >  include/linux/mmu_notifier.h                  |  13 +
> >  include/linux/mmzone.h                        |   1 +
> >  include/linux/page-flags.h                    |  47 +
> >  include/linux/pagechain.h                     |  73 ++
> >  include/linux/pgtable.h                       |  34 +
> >  include/linux/rmap.h                          |  10 +-
> >  include/linux/swap.h                          |   2 +
> >  include/linux/vm_event_item.h                 |   7 +
> >  include/uapi/linux/kernel-page-flags.h        |   2 +
> >  kernel/events/uprobes.c                       |   4 +-
> >  kernel/fork.c                                 |   5 +
> >  mm/cma.c                                      | 119 +++
> >  mm/gup.c                                      |  60 +-
> >  mm/huge_memory.c                              | 939 +++++++++++++++++-
> >  mm/hugetlb.c                                  | 114 +--
> >  mm/internal.h                                 |   2 +
> >  mm/khugepaged.c                               |   6 +-
> >  mm/ksm.c                                      |   4 +-
> >  mm/memcontrol.c                               |  13 +
> >  mm/memory.c                                   |  51 +-
> >  mm/mempolicy.c                                |  21 +-
> >  mm/migrate.c                                  |  12 +-
> >  mm/page_alloc.c                               |  57 +-
> >  mm/page_vma_mapped.c                          | 129 ++-
> >  mm/pgtable-generic.c                          |  56 ++
> >  mm/rmap.c                                     | 289 ++++--
> >  mm/swap.c                                     |  31 +
> >  mm/swap_slots.c                               |   2 +
> >  mm/swapfile.c                                 |   8 +-
> >  mm/userfaultfd.c                              |   2 +-
> >  mm/util.c                                     |  16 +-
> >  mm/vmscan.c                                   |  58 +-
> >  mm/vmstat.c                                   |   8 +
> >  50 files changed, 2270 insertions(+), 349 deletions(-)
> >  create mode 100644 include/linux/pagechain.h
> > 
> > --
> > 2.28.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
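
The page-table figures in the quoted Design section above can be
sanity-checked with a little arithmetic. This is only a sketch assuming the
usual x86_64 layout (4 KB base pages, 512 entries per page-table page); the
function names are hypothetical and not taken from the patchset.

```python
BASE_PAGE = 4 << 10                        # 4 KB base page on x86_64
ENTRIES_PER_TABLE = 512                    # entries in one page-table page
PMD_SIZE = BASE_PAGE * ENTRIES_PER_TABLE   # 2 MB
PUD_SIZE = PMD_SIZE * ENTRIES_PER_TABLE    # 1 GB


def deposit_pages():
    """Page-table pages needed to later split one PUD mapping fully:
    one PMD table plus one PTE table per PMD entry."""
    return 1 + ENTRIES_PER_TABLE


def is_pmd_head(index):
    """True if base page `index` within the 1 GB THP is 512-aligned,
    i.e. what the cover letter calls a PMDPageInPUD page."""
    in_range = 0 <= index < PUD_SIZE // BASE_PAGE
    return in_range and index % ENTRIES_PER_TABLE == 0


def sub_mapcount_slot(n):
    """Index of the base page whose compound_mapcount field holds the
    PMD mapcount for the n-th 2 MB chunk, per page[N*512 + 3]."""
    if not 0 <= n < ENTRIES_PER_TABLE:
        raise ValueError("n must be in [0, 512)")
    return n * ENTRIES_PER_TABLE + 3
```

With these constants, deposit_pages() gives the 513 page table pages
mentioned in item 1 of the design.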
