linux-kernel - Re: Sleeping BUG in khugepaged for i586

Message-ID: <alpine.DEB.2.10.1706111621330.36347@chino.kir.corp.google.com>
Date:   Sun, 11 Jun 2017 16:28:11 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Larry Finger <Larry.Finger@...inger.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org
Subject: Re: Sleeping BUG in khugepaged for i586

On Sat, 10 Jun 2017, Michal Hocko wrote:

> > > I would just pull the cond_resched out of __collapse_huge_page_copy
> > > right after pte_unmap. But I am not really sure why this cond_resched is
> > > really needed because the changelog of the patch which adds it is quite
> > > terse on details.
> > 
> > I'm not sure what could possibly be added to the changelog.  We have 
> > encountered need_resched warnings during the iteration.
> 
> Well, the part the changelog is not really clear about is whether the
> HPAGE_PMD_NR loop itself is the source of the stall. This would be
> quite surprising because doing 512 iterations taking up to 20+s sounds
> way too much.

I have no idea where you come up with 20+ seconds.

These are not soft lockups, these are need_resched warnings.  We monitor 
how long need_resched has been set and when a thread takes an excessive 
amount of time to reschedule after it has been set.  A loop of 512 pages 
with ptl contention and doing {clear,copy}_user_highpage() shows that 
need_resched can sit without scheduling for an excessive amount of time.
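
To illustrate the measurement described above, here is a minimal userspace
sketch, not kernel code. It assumes a simulated clock that counts loop
iterations, and every sketch_* name is hypothetical; only the 512-page loop
and the idea of timing how long need_resched stays set come from the
discussion. It compares rescheduling once after the loop with rescheduling
on every iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NR_PAGES  512   /* stands in for HPAGE_PMD_NR (assumption) */
#define SKETCH_PAGE_SIZE 4096  /* assumed 4K base pages */

/* Hypothetical monitor: records how long a resched request stays pending. */
struct sketch_monitor {
	bool need_resched;          /* a reschedule has been requested */
	unsigned long set_at;       /* simulated time the request was made */
	unsigned long worst_delay;  /* longest time a request sat unserviced */
};

/* Simulated scheduler tick asking the running task to yield. */
static void sketch_set_need_resched(struct sketch_monitor *m, unsigned long now)
{
	if (!m->need_resched) {
		m->need_resched = true;
		m->set_at = now;
	}
}

/* Userspace analogue of cond_resched(): yield only if a request is pending. */
static bool sketch_cond_resched(struct sketch_monitor *m, unsigned long now)
{
	if (!m->need_resched)
		return false;
	if (now - m->set_at > m->worst_delay)
		m->worst_delay = now - m->set_at;
	m->need_resched = false;
	return true;
}

/*
 * Copy SKETCH_NR_PAGES pages. The resched request arrives at iteration
 * fire_at. Returns the worst delay, measured in loop iterations, between
 * the request and the point where the loop actually yields.
 */
static unsigned long sketch_copy_loop(unsigned char *dst,
				      const unsigned char *src,
				      unsigned long fire_at,
				      bool resched_in_loop)
{
	struct sketch_monitor m = { false, 0, 0 };
	unsigned long i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i < SKETCH_NR_PAGES; i++) {
		if (i == fire_at)
			sketch_set_need_resched(&m, i);

		memcpy(dst + i * SKETCH_PAGE_SIZE,
		       src + i * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);

		if (resched_in_loop)
			sketch_cond_resched(&m, i + 1);
	}

	/* Without an in-loop check, this is the first chance to yield. */
	sketch_cond_resched(&m, i);
	return m.worst_delay;
}
```

With a request arriving at iteration 0, the variant without an in-loop check
leaves it pending for all 512 iterations, while the in-loop variant services
it after one. In the real loop each iteration also pays for ptl contention
and the page copy, which this sketch does not model.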

> So is it possible that we are missing a cond_resched
> somewhere up the __collapse_huge_page_copy call path?

No.