Date:   Thu, 9 Sep 2021 15:45:45 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Michal Hocko <mhocko@...e.com>,
        Mike Kravetz <mike.kravetz@...cle.com>
Cc:     Hillf Danton <hdanton@...a.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality

On 9/9/21 13:54, Michal Hocko wrote:
> On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
>> On 9/7/21 1:50 AM, Hillf Danton wrote:
>> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>> > 
>> > And/or clamp reclaim retries for costly orders
>> > 
>> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>> > 
>> > to pull down the chance for stall as low as possible.
>> 
>> Thanks, and sorry for not replying quickly.  I only get back to this as
>> time allows.
>> 
>> We could clamp the number of compaction and reclaim retries in
>> __alloc_pages_slowpath as suggested.  However, I noticed that a single
>> reclaim call could take a bunch of time.  As a result, I instrumented
>> shrink_node to see what might be happening.  Here is some information
>> from a long stall.  Note that I only dump stats when jiffies > 100000.
>> 
>> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>> [ 8136.887643]              compaction_suitable results:
>> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
> 
> Can you get a more detailed breakdown of where the time is spent? Also,
> how come the number of reclaimed pages is so excessive compared to the
> reclaim target? There is something fishy going on here.

I would say it's simply should_continue_reclaim() behaving similarly to
should_compact_retry(). We get compaction_suitable() returning
COMPACT_SKIPPED because the reclaimed pages have been immediately stolen:
compaction indicates there aren't enough free base pages to begin with to
form a high-order page. Since the stolen pages reappear on the inactive
LRU, it seems worth continuing reclaim to free enough base pages for
compaction to no longer be skipped, because "inactive_lru_pages >
pages_for_compaction" remains true.

So, both should_continue_reclaim() and should_compact_retry() are unable to
recognize that the reclaimed pages are being stolen and to limit the retries
in that case. The scenario seems to be uncommon; otherwise we'd be getting
more reports of it.
