linux-kernel - Re: [Question] Should direct reclaim time be bounded?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <c801da70-1aa5-666a-615e-852100d6145e@oracle.com>
Date:   Thu, 4 Jul 2019 08:11:44 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Mel Gorman <mgorman@...e.de>, Vlastimil Babka <vbabka@...e.cz>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Johannes Weiner <hannes@...xchg.org>
Subject: Re: [Question] Should direct reclaim time be bounded?

On 7/4/19 4:09 AM, Michal Hocko wrote:
> On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
>> On 7/3/19 2:43 AM, Mel Gorman wrote:
>>> Indeed. I'm getting knocked offline shortly so I didn't give this the
>>> time it deserves but it appears that part of this problem is
>>> hugetlb-specific when one node is full and can enter into this continual
>>> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
>>> nr_scanned to be zero.
>>
>> Yes, I am not aware of any other large order allocations consistently made
>> with __GFP_RETRY_MAYFAIL.  But, I did not look too closely.  Michal believes
>> that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.
> 
> Yes. The argument is that this is controlable by an admin and failures
> should be prevented as much as possible. I didn't get to understand
> should_continue_reclaim part of the problem but I have a strong feeling
> that __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
> happens if it is simply removed and we rely only on the retry mechanism
> from the page allocator instead? Does the success rate is reduced
> considerably?

It certainly will be reduced.  I 'think' it will be hard to predict how
much it will be reduced as this will depend on the state of memory usage
and fragmentation at the time of the attempt.

I can try to measure this, but I will be a few days due to U.S. holiday.
-- 
Mike Kravetz