linux-kernel - Re: [PATCH v18 4/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b6f204af-ff17-3088-c717-d299b33e6fcb@oracle.com>
Date:   Fri, 12 Mar 2021 09:50:37 -0800
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     "Paul E. McKenney" <paulmck@...nel.org>,
        Muchun Song <songmuchun@...edance.com>, corbet@....net,
        tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, x86@...nel.org,
        hpa@...or.com, dave.hansen@...ux.intel.com, luto@...nel.org,
        peterz@...radead.org, viro@...iv.linux.org.uk,
        akpm@...ux-foundation.org, mchehab+huawei@...nel.org,
        pawan.kumar.gupta@...ux.intel.com, rdunlap@...radead.org,
        oneukum@...e.com, anshuman.khandual@....com, jroedel@...e.de,
        almasrymina@...gle.com, rientjes@...gle.com, willy@...radead.org,
        osalvador@...e.de, song.bao.hua@...ilicon.com, david@...hat.com,
        naoya.horiguchi@....com, joao.m.martins@...cle.com,
        duanxiongchun@...edance.com, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, Chen Huang <chenhuang5@...wei.com>,
        Bodeddula Balasubramaniam <bodeddub@...zon.com>
Subject: Re: [PATCH v18 4/9] mm: hugetlb: alloc the vmemmap pages associated
 with each HugeTLB page

On 3/12/21 12:15 AM, Michal Hocko wrote:
> On Thu 11-03-21 14:53:08, Mike Kravetz wrote:
>> On 3/11/21 9:59 AM, Mike Kravetz wrote:
>>> On 3/11/21 4:17 AM, Michal Hocko wrote:
>>>>> Yeah per cpu preempt counting shouldn't be noticeable but I have to
>>>>> confess I haven't benchmarked it.
>>>>
>>>> But all this seems moot now http://lkml.kernel.org/r/YEoA08n60+jzsnAl@hirez.programming.kicks-ass.net
>>>>
>>>
>>> The proper fix for free_huge_page independent of this series would
>>> involve:
>>>
>>> - Make hugetlb_lock and subpool lock irq safe
>>> - Hand off freeing to a workque if the freeing could sleep
>>>
>>> Today, the only time we can sleep in free_huge_page is for gigantic
>>> pages allocated via cma.  I 'think' the concern about undesirable
>>> user visible side effects in this case is minimal as freeing/allocating
>>> 1G pages is not something that is going to happen at a high frequency.
>>> My thinking could be wrong?
>>>
>>> Of more concern, is the introduction of this series.  If this feature
>>> is enabled, then ALL free_huge_page requests must be sent to a workqueue.
>>> Any ideas on how to address this?
>>>
>>
>> Thinking about this more ...
>>
>> A call to free_huge_page has two distinct outcomes
>> 1) Page is freed back to the original allocator: buddy or cma
>> 2) Page is put on hugetlb free list
>>
>> We can only possibly sleep in the first case 1.  In addition, freeing a
>> page back to the original allocator involves these steps:
>> 1) Removing page from hugetlb lists
>> 2) Updating hugetlb counts: nr_hugepages, surplus
>> 3) Updating page fields
>> 4) Allocate vmemmap pages if needed as in this series
>> 5) Calling free routine of original allocator
>>
>> If hugetlb_lock is irq safe, we can perform the first 3 steps under that
>> lock without issue.  We would then use a workqueue to perform the last
>> two steps.  Since we are updating hugetlb user visible data under the
>> lock, there should be no delays.  Of course, giving those pages back to
>> the original allocator could still be delayed, and a user may notice
>> that.  Not sure if that would be acceptable?
> 
> Well, having many in-flight huge pages can certainly be visible. Say you
> are freeing hundreds of huge pages and your echo n > nr_hugepages will
> return just for you to find out that the memory hasn't been freed and
> therefore cannot be reused for another use - recently there was somebody
> mentioning their usecase to free up huge pages to prevent OOM for
> example. I do expect more people doing something like that.
> 
> Now, nr_hugepages can be handled by blocking on the same WQ until all
> pre-existing items are processed. Maybe we will need to have a more
> generic API to achieve the same for in kernel users but let's wait for
> those requests.
> 
>> I think Muchun had a
>> similar setup just for vmemmmap allocation in an early version of this
>> series.
>>
>> This would also require changes to where accounting is done in
>> dissolve_free_huge_page and update_and_free_page as mentioned elsewhere.
> 
> Normalizing dissolve_free_huge_page is definitely a good idea. It is
> really tricky how it sticks out and does half of the job of
> update_and_free_page.
> 
> That being said, if it is possible to have a fully consistent h state
> before handing over to WQ for sleeping operation then we should be all
> fine. I am slightly worried about potential tricky situations where the
> sleeping operation fails because that would require that page to be
> added back to the pool again. As said above we would need some sort of
> sync with in-flight operations before returning to the userspace.

Those sysfs interfaces to allocate/free huge pages will need to be
reworked.  One thing that is totally unacceptable with hugetlb_lock
being irq safe, are the calls to cond_resched_lock(&hugetlb_lock).
We will need to significantly reduce lock hold time in these situations.
I have some ideas on how this might work, but it is going to require
some a good deal of code restructuring and will take some time.
-- 
Mike Kravetz