linux-kernel - Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

Message-ID: <20170120003544.7e6e34d1@pog.tecnopolis.ca>
Date:   Fri, 20 Jan 2017 00:35:44 -0600
From:   Trevor Cordes <trevor@...nopolis.ca>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        linux-kernel@...r.kernel.org, Joonsoo Kim <iamjoonsoo.kim@....com>,
        Minchan Kim <minchan@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

On 2017-01-19 Michal Hocko wrote:
> On Thu 19-01-17 03:48:50, Trevor Cordes wrote:
> > On 2017-01-17 Michal Hocko wrote:  
> > > On Tue 17-01-17 14:21:14, Mel Gorman wrote:  
> > > > > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > > > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > > > [...]    
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 532a2a750952..46aac487b89a 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct
> > > > > > zonelist *zonelist, struct scan_control *sc) continue;
> > > > > >  
> > > > > >  			if (sc->priority != DEF_PRIORITY &&
> > > > > > +			    !buffer_heads_over_limit &&
> > > > > >  			    !pgdat_reclaimable(zone->zone_pgdat))
> > > > > >  				continue;	/* Let
> > > > > > kswapd poll it */    
> > > > > 
> > > > > I think we should rather remove pgdat_reclaimable here. This
> > > > > sounds like a wrong layer to decide whether we want to reclaim
> > > > > and how much.   
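
For readers following the hunk above, here is a minimal user-space
sketch (not kernel code) of how the skip test changes with the extra
term. The function names and the SKETCH_DEF_PRIORITY macro are
hypothetical stand-ins; only the shape of the condition is taken from
the quoted diff, and 12 is assumed to match mainline's DEF_PRIORITY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's DEF_PRIORITY (12 in mainline). */
#define SKETCH_DEF_PRIORITY 12

/* Skip test as it reads without the patch: past the first pass, a node
 * that is not considered reclaimable is left for kswapd. */
static bool skip_zone_before(int priority, bool node_reclaimable)
{
	return priority != SKETCH_DEF_PRIORITY && !node_reclaimable;
}

/* Skip test with the added !buffer_heads_over_limit term: when buffer
 * heads are over the limit, the zone is never skipped here. */
static bool skip_zone_after(int priority, bool bh_over_limit,
			    bool node_reclaimable)
{
	return priority != SKETCH_DEF_PRIORITY &&
	       !bh_over_limit &&
	       !node_reclaimable;
}
```

Under these assumptions the only case that changes is "not the default
priority, node not reclaimable, buffer heads over the limit": it was
skipped before and is scanned with the patch. Removing
pgdat_reclaimable() from this check, as suggested above, would make
the skip never trigger at all.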
> > > > 
> > > > I had considered that but it'd also be important to add the
> > > > other 32-bit patches you have posted to see the impact. Because
> > > > of the ratio of LRU pages to slab pages, it may not have an
> > > > impact but it'd need to be eliminated.    
> > > 
> > > > OK, Trevor you can pull from
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > > > fixes/highmem-node-fixes branch. This contains the current mmotm
> > > > tree + the latest highmem fixes. I also do not expect this would
> > > > help much in your case but as Mel has said we should rule that out
> > > > at least.
> > 
> > > Hi!  The version from the git tree above oom'd after < 24 hours
> > > (3:02am) so it doesn't solve the bug.  If you need an oom message
> > > dump let me know.
> 
> Yes please.

The first oom from that night is attached.  Note, the oom wasn't as dire
with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
detector and reboot script was able to do its thing cleanly before the
system became unusable.

I'll await further instructions and test right away.  Maybe I'll try a
few tuning ideas until then.  Thanks!

> > Let me know what to try next, guys, and I'll test it out.
> >   
> > > > Before prototyping such a thing, I'd like to hear the outcome of
> > > > this heavy hack and then add your 32-bit patches onto the list.
> > > > If the problem is still there then I'd next look at taking slab
> > > > pages into account in pgdat_reclaimable() instead of an
> > > > outright removal that has a much wider impact. If that doesn't
> > > > work then I'll prototype a heavy-handed forced slab reclaim
> > > > when lower zones are almost all slab pages.  
> > 
> > I don't think I've tried the "heavy hack" patch yet?  It's not in
> > the mhocko tree I just tried?  Should I try the heavy hack on top
> > of mhocko git or on vanilla or what?
> > 
> > I also want to mention that these PAE boxes suffer from another
> > problem/bug that I've worked around for almost a year now.  For some
> > reason it keeps gnawing at me that it might be related.  The disk
> > I/O goes to pot on this/these PAE boxes after a certain amount of
> > disk writes (like some unknown number of GB, around 10-ish maybe).
> > Like writes go from 500MB/s to 10MB/s!! Reboot and it's magically
> > 500MB/s again.  I detail this here:
> > https://muug.ca/pipermail/roundtable/2016-June/004669.html
> > My fix was to mem=XG where X is <8 (like 4 or 6) to force the PAE
> > kernel to be more sane about highmem choices.  I never filed a bug
> > because I read a ton of stuff saying Linus hates PAE, don't use over
> > 4G, blah blah.  But the other fix is to:
> > set /proc/sys/vm/highmem_is_dirtyable to 1  
> 
> Yes this sounds like a dirty memory throttling and there were some
> changes in that area. I do not remember when exactly.
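
A small sketch of how the current setting could be read back before
changing it, assuming a kernel built with CONFIG_HIGHMEM (the file
does not exist otherwise). read_int_file() is a hypothetical helper
name, not an existing API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 and stores the value on success, -1 if the file cannot be
 * opened or does not start with an integer.
 */
static int read_int_file(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling read_int_file("/proc/sys/vm/highmem_is_dirtyable", &v) on a
PAE box should leave 0 or 1 in v; a return of -1 most likely means the
kernel has no highmem support.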

I think my PAE-slow-IO bug started way back in Fedora 22 (4.0?).  It's
hard to know exactly when, because I didn't discover the bug for maybe a
year: I didn't realize IO was the problem right away.  Too late to
bisect that one, I guess.  It's probably not related, so we can ignore
my tangent!

> > I'm not bringing this up to get attention to a new bug, I bring
> > this up because it smells like it might be related.  If something
> > slowly eats away at the box's vm to the point that I/O gets
> > horribly slow, perhaps it's related to the slab and high/lomem
> > issue we have here?  And if related, it may help to solve the oom
> > bug.  If I'm way off base here, just ignore my tangent!  
> 
> From your OOM reports so far it doesn't really seem related because you
> never had large number of pages under the writeback when OOM.
> 
> The situation with the PAE kernel is unfortunate but it is really hard
> to do anything about that considering that the kernel and most of its
> allocations have to live in a small and scarce lowmem memory. Moreover
> the more memory you have the more you have to allocate from that
> memory.

You're right that the IO-slow bug was definitely worse the more RAM a
system had!  (The mem=4G really helps alleviate this bug and is good
enough for me.)

> This is why not only Linus hates 32b systems on large memory
> systems.

Completely off-topic: it would be great if rather than pretending PAE
should work with large RAM (which seems more broken every day), the
kernel guys put out an officially stated policy of a maximum RAM you
can use, and try to have the kernel behave for <= that size, and then
people could use more RAM but clearly "at your own risk, don't bug us
about problems!".  Other than a few posts about Linus hating it,
there's nothing official I can find about it in documentation, etc.  It
gives the (mis)impression that it's perfectly fine to run PAE on a
zillion GB modern system.  Then we later learn the hard way :-)

Download attachment "oom3" of type "application/octet-stream" (22921 bytes)