linux-kernel - Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Sun, 29 Jan 2017 16:50:03 -0600
From:   Trevor Cordes <trevor@...nopolis.ca>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        linux-kernel@...r.kernel.org, Joonsoo Kim <iamjoonsoo.kim@....com>,
        Minchan Kim <minchan@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know
> > if I'm doing something wrong, see below.)  
> 
> My fault. I should have noted that you should use since-4.9 branch.

OK, I have good news.  I compiled your mhocko git tree (properly this
tim!) using since-4.9 branch (last commit
ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
3am's, over 60 hours, and I made sure all the usual oom culprits ran,
and I ran extras (finds on the whole tree, extra rdiff-backups) to try
to tax it.  Based on my previous criteria I would say your since-4.9 as
of the above commit solves my bug, at least over a 3 day test span
(which it never survives when the bug is present)!

I tested WITHOUT any cgroup/mem boot options.  I do still have my
mem=6G limiter on, though (I've never tested with it off, until I solve
the bug with it on, since I've had it on for many months for other
reasons).

On 2017-01-27 Michal Hocko wrote:
> OK, that matches the theory that these OOMs are caused by the
> incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> the active list aging for lowmem requests when memcg is enabled")

b4536f0c829c isn't in the since-4.9 I tested above though?  So
something else you did must have fixed it (also)?  I don't think I've
run any tests yet with b4536f0c829c in them?  I think the vanillas I
was doing a couple of weeks ago were before b4536f0c829c, but I can't
be sure.

What do I test next?  Does the since-4.9 stuff get pushed into vanilla
(4.9 hopefully?) so it can find its way into Fedora's stuck F24
kernel?

I want to also note that the RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
interest as more people start me-too'ing.  The situation is almost
always the same: large rsync's or similar tree-scan accesses cause oom
on PAE boxes.  However, I wanted to note that many people there reported
that cgroup_disable=memory doesn't fix anything for them, whereas that
always makes the problem go away on my boxes.  Strange.

Thanks Michal and Mel, I really appreciate it!