Message-ID: <alpine.LSU.2.11.2007251343370.3804@eggly.anvils>
Date:   Sat, 25 Jul 2020 14:19:46 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
cc:     Oleg Nesterov <oleg@...hat.com>, Hugh Dickins <hughd@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Linux-MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Michal Hocko <mhocko@...e.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page

On Sat, 25 Jul 2020, Linus Torvalds wrote:
> On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@...hat.com> wrote:
> >
> > Heh. I too thought about this. And just in case, your patch looks correct
> > to me. But I can't really comment this behavioural change. Perhaps it
> > should come in a separate patch?
> 
> We could do that. At the same time, I think both parts change how the
> waitqueue works that it might as well just be one "fix page_bit_wait
> waitqueue usage".
> 
> But let's wait to see what Hugh's numbers say.

Oh no, no no: sorry for getting your hopes up there, I won't come up
with any numbers more significant than "0 out of 10" machines crashed.
I know it would be *really* useful if I could come up with performance
comparisons, or steer someone else to do so: but I'm sorry, I cannot.

Currently it's actually 1 out of 10 machines crashed, for the same
driverland issue seen last time, maybe it's a bad machine; and another
1 out of the 10 machines went AWOL for unknown reasons, but probably
something outside the kernel got confused by the stress.  No reason
to suspect your changes at all (but some unanalyzed "failure"s, of
dubious significance, accumulating like last time).

I'm optimistic: nothing has happened to warn us off your changes.

And on Fri, 24 Jul 2020, Linus Torvalds had written:
> So the loads you are running are known to have sensitivity to this
> particular area, and are why you've done your patches to the page wait
> bit code?

Yes. It's a series of nineteen ~hour-long tests, of which about five
exhibited wake_up_page_bit problems in the past, and one has remained
intermittently troublesome that way.  Intermittently: usually it does
get through, so getting through yesterday and today won't even tell
us that your changes fixed it - that we shall learn over time later.

Hugh
