Message-ID: <52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com>
Date:   Mon, 5 Oct 2020 20:48:03 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Zi Yan <ziy@...dia.com>
Cc:     Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Matthew Wilcox <willy@...radead.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Yang Shi <shy828301@...il.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        John Hubbard <jhubbard@...dia.com>,
        David Nellans <dnellans@...dia.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation and hugetlbfs does not have this
> mixture because of its special allocation pools not because of the code

With hugetlbfs, you have a guarantee that all pages within your VMA have
the same page size. This is an important property. With THP you have the
guarantee that any page can be operated on as if it were mapped at
base-page granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply
split up all THPs and prohibit new ones from being formed. All works
well (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which page size to use in the case
of hugetlbfs.
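
To make the "user space controls the page size" point concrete, here is a
minimal sketch of how an anonymous hugetlb mapping selects its page size on
Linux: log2 of the page size is encoded into the mmap() flags. The helper
names hugetlb_flags() and map_hugetlb() are made up for illustration; the
MAP_HUGETLB and MAP_HUGE_* flags are the existing mmap() interface. The
call fails unless huge pages of that size are available in the pool, so
the sketch returns NULL instead of asserting success.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks in case the libc headers predate these definitions. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/*
 * Hypothetical helper: build mmap() flags for an anonymous hugetlb mapping
 * whose page size is 2^page_shift bytes (21 for 2MB, 30 for 1GB on x86_64).
 */
static int hugetlb_flags(unsigned int page_shift)
{
    return MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
           (int)(page_shift << MAP_HUGE_SHIFT);
}

/*
 * Hypothetical helper: map len bytes backed only by huge pages of the
 * requested size. Returns NULL if the kernel cannot satisfy the request,
 * e.g. because no pages of that size are available.
 */
static void *map_hugetlb(size_t len, unsigned int page_shift)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   hugetlb_flags(page_shift), -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

For example, map_hugetlb(1UL << 30, 30) asks for exactly one 1GB page.
Every page in the resulting VMA has the requested size, which is the
guarantee described above.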

> itself. If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

> 
> I just do not get why hugetlbfs is so special that it can have pagesize
> fine control when normal pages cannot get. The “it should be invisible
> to userspace” argument suddenly does not hold for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be sufficient.
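
For reference, one existing form of that yes/no trigger is the per-range
madvise() hint. The sketch below wraps it in a helper; set_thp_hint() is a
made-up name, while MADV_HUGEPAGE and MADV_NOHUGEPAGE are the existing
advice values. Note that neither of them says anything about which THP
size the kernel should use.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: set the THP yes/no hint for [addr, addr + len).
 * Returns 0 on success or a negative errno value, e.g. -EINVAL when the
 * kernel was built without THP support.
 */
static int set_thp_hint(void *addr, size_t len, int enable)
{
    int advice = enable ? MADV_HUGEPAGE : MADV_NOHUGEPAGE;

    if (madvise(addr, len, advice) != 0)
        return -errno;
    return 0;
}
```

A caller would typically apply this right after mmap() and before first
touching the memory, so that page faults can already take the hint into
account.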


The name "Transparent" implies that they *should* be transparent to user
space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This can
be observed fairly easily when using 4k-based memory ballooning in
virtualized environments. If we stick to the current THP size (e.g.,
2MB), we are mostly fine. Breaking up 1G THP into 2MB THP when required
is completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated. Somewhat
acceptable / controllable. Touch 4K, get 1G populated is not desirable.
And I think we mostly agree that we should operate only on
fully-populated ranges when replacing them with a 1G THP.
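
To put numbers on the two cases above, a small arithmetic sketch (the
helper name populate_amplification() is made up for illustration): it
computes how many base pages get populated when a single base page is
touched inside a huge page of the given size.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of base pages populated when one base page
 * is touched inside a huge page (sizes in bytes, powers of two).
 */
static size_t populate_amplification(size_t huge_size, size_t base_size)
{
    return huge_size / base_size;
}
```

With 4K base pages this gives 512 for a 2M THP and 262144 for a 1G THP.
Splitting a 1G THP into 2M THPs yields 512 pieces.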


But then, there is no observable difference between 1G THP and 2M THP
from a user space point of view except performance.

So we are debating about "Should the kernel tell us that we can use 1G
THP for a VMA". What if we were suddenly to support 2G THP (look at
how arm64 supports all kinds of huge page sizes for hugetlbfs)? Do we
really need *another* trigger?

What Michal proposed (IIUC) is rather user space telling the kernel
"this large memory range here is *really* important for performance,
please try to optimize the memory layout, give me the best you've got".

MADV_HUGEPAGE_1GB is just ugly.


-- 
Thanks,

David / dhildenb
