Date:   Thu, 9 Sep 2021 13:54:31 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     Hillf Danton <hdanton@...a.com>, Vlastimil Babka <vbabka@...e.cz>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality

On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
> On 9/7/21 1:50 AM, Hillf Danton wrote:
> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
> >> On 9/2/21 20:17, Mike Kravetz wrote:
> >>>
> >>> Here is some very high level information from a long stall that was
> >>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
> >>>
> >>> [55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
> >>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
> >>>
> >>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
> >>> should_reclaim_retry returned true 25 times and should_compact_retry returned
> >>> true 609647 times.
> >>> Almost all of that time (47325161 jiffies) was spent in
> >>> __alloc_pages_direct_reclaim, and only 607 jiffies were spent in
> >>> __alloc_pages_direct_compact.
> >>>
> >>> Looks like both
> >>> reclaim retries > MAX_RECLAIM_RETRIES
> >>> and
> >>> compaction retries > MAX_COMPACT_RETRIES
> >>>
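For context, the retry structure being described looks roughly like
this (heavily abbreviated from __alloc_pages_slowpath() in
mm/page_alloc.c; OOM handling and bail-out paths are omitted).  The
loop keeps going as long as either predicate returns true, which is
how both counters can keep climbing:

	retry:
		/* almost all of the 47329325 jiffies were spent here */
		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
						    ac, &did_some_progress);
		if (page)
			goto got_pg;

		/* only 607 jiffies were spent here */
		page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
						    ac, compact_priority,
						    &compact_result);
		if (page)
			goto got_pg;

		/* returned true 25 times in the trace above */
		if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
					 did_some_progress > 0,
					 &no_progress_loops))
			goto retry;

		/* returned true 609647 times in the trace above */
		if (should_compact_retry(ac, order, alloc_flags, compact_result,
					 &compact_priority, &compaction_retries))
			goto retry;
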
> >> Yeah AFAICS that's only possible with the scenario I suspected. I guess
> >> we should put a limit on compact retries (maybe some multiple of
> >> MAX_COMPACT_RETRIES) even if should_compact_retry() thinks that reclaim
> >> could help while clearly it doesn't (i.e. because somebody else is
> >> stealing the pages, as in your test case).
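
A sketch of that hard cap (hypothetical; the helper name and the 4x
multiplier are invented here, and MAX_COMPACT_RETRIES is currently 16)
could look like:

	/* Hypothetical hard cap on compaction retries, per the idea above. */
	static bool compact_retries_capped(int compaction_retries)
	{
		/* 4x is an arbitrary stand-in for "some multiple" */
		return compaction_retries > 4 * MAX_COMPACT_RETRIES;
	}

__alloc_pages_slowpath() would then give up and take its failure path
once this returns true, regardless of what should_compact_retry() says.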
> > 
> > And/or clamp reclaim retries for costly orders
> > 
> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> > 
> > to make the chance of a stall as low as possible.
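
Expressed as a (hypothetical, non-existent) helper, with
MAX_RECLAIM_RETRIES being 16 and PAGE_ALLOC_COSTLY_ORDER being 3 today:

	/*
	 * Hypothetical: shrink the reclaim retry budget for costly orders,
	 * so e.g. an order-9 allocation gets 16 - 9 = 7 no-progress retries
	 * instead of the full 16.
	 */
	static unsigned int clamped_reclaim_retries(unsigned int order)
	{
		unsigned int retries = MAX_RECLAIM_RETRIES;

		if (order > PAGE_ALLOC_COSTLY_ORDER && order < retries)
			retries -= order;

		return retries;
	}

should_reclaim_retry() would then compare no_progress_loops against
this value instead of the flat MAX_RECLAIM_RETRIES.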
> 
> Thanks, and sorry for not replying quickly.  I only get back to this as
> time allows.
> 
> We could clamp the number of compaction and reclaim retries in
> __alloc_pages_slowpath as suggested.  However, I noticed that a single
> reclaim call can itself take a long time.  As a result, I instrumented
> shrink_node to see what might be happening.  Here is some information
> from a long stall.  Note that I only dump stats when the time spent in
> shrink_node exceeds 100000 jiffies.
> 
> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
> [ 8136.887643]              compaction_suitable results:
> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
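
(The debug patch itself is not shown; roughly, instrumentation of the
following shape in shrink_node() would produce a dump like the one
above.  The 100000-jiffies threshold is as described; the variable
names and exact placement are guesses.)

	unsigned long start = jiffies;
	unsigned long tries = 0, skipped = 0;
	...
		/* once per iteration of the reclaim retry loop */
		tries++;
		/* tally the compaction_suitable() results seen along the way */
		if (suitable == COMPACT_SKIPPED)
			skipped++;
	...
	if (jiffies - start > 100000)
		pr_warn("shrink_node: %lu total jiffies, %lu tries\n"
			"             %lu reclaimed, %lu nr_to_reclaim\n"
			"             compaction_suitable COMPACT_SKIPPED: %lu\n",
			jiffies - start, tries,
			sc->nr_reclaimed, sc->nr_to_reclaim, skipped);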

Can you get a more detailed breakdown of where the time is spent? Also,
how come the number of reclaimed pages is so excessive compared to the
reclaim target? There is something fishy going on here.
-- 
Michal Hocko
SUSE Labs
