[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170129165003.31bd8384@pog.tecnopolis.ca>
Date: Sun, 29 Jan 2017 16:50:03 -0600
From: Trevor Cordes <trevor@...nopolis.ca>
To: Michal Hocko <mhocko@...nel.org>
Cc: Mel Gorman <mgorman@...hsingularity.net>,
linux-kernel@...r.kernel.org, Joonsoo Kim <iamjoonsoo.kim@....com>,
Minchan Kim <minchan@...nel.org>,
Rik van Riel <riel@...riel.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)
On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know
> > if I'm doing something wrong, see below.)
>
> My fault. I should have noted that you should use since-4.9 branch.
OK, I have good news. I compiled your mhocko git tree (properly this
tim!) using since-4.9 branch (last commit
ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
3am's, over 60 hours, and I made sure all the usual oom culprits ran,
and I ran extras (finds on the whole tree, extra rdiff-backups) to try
to tax it. Based on my previous criteria I would say your since-4.9 as
of the above commit solves my bug, at least over a 3 day test span
(which it never survives when the bug is present)!
I tested WITHOUT any cgroup/mem boot options. I do still have my
mem=6G limiter on, though (I've never tested with it off, until I solve
the bug with it on, since I've had it on for many months for other
reasons).
On 2017-01-27 Michal Hocko wrote:
> OK, that matches the theory that these OOMs are caused by the
> incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> the active list aging for lowmem requests when memcg is enabled")
b4536f0c829c isn't in the since-4.9 I tested above though? So
something else you did must have fixed it (also)? I don't think I've
run any tests yet with b4536f0c829c in them? I think the vanillas I
was doing a couple of weeks ago were before b4536f0c829c, but I can't
be sure.
What do I test next? Does the since-4.9 stuff get pushed into vanilla
(4.9 hopefully?) so it can find its way into Fedora's stuck F24
kernel?
I want to also note that the RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
interest as more people start me-too'ing. The situation is almost
always the same: large rsync's or similar tree-scan accesses cause oom
on PAE boxes. However, I wanted to note that many people there reported
that cgroup_disable=memory doesn't fix anything for them, whereas that
always makes the problem go away on my boxes. Strange.
Thanks Michal and Mel, I really appreciate it!
Powered by blists - more mailing lists