linux-kernel - mm, vmscan: commit makes PAE kernel crash nightly (bisected)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170111103243.GA27795@pog.tecnopolis.ca>
Date:   Wed, 11 Jan 2017 04:32:43 -0600
From:   Trevor Cordes <trevor@...nopolis.ca>
To:     linux-kernel@...r.kernel.org
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Michal Hocko <mhocko@...nel.org>,
        Minchan Kim <minchan@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

Hi!  I have biected a nightly oom-killer flood and crash/hang on one of 
the boxes I admin.  It doesn't crash on Fedora 23/24 4.7.10 kernel but 
does on any 4.8 Fedora kernel.  I did a vanilla bisect and the bug is 
here:

commit b2e18757f2c9d1cdd746a882e9878852fdec9501
Author: Mel Gorman <mgorman@...hsingularity.net>
Date:   Thu Jul 28 15:45:37 2016 -0700

    mm, vmscan: begin reclaiming pages on a per-node basis

I bisected between:
# bad: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7

I have not tried newer than 4.8.13 Fedora kernel, but if someone thinks 
this bug is already fixed in HEAD I could try that next.  It took 3 weeks 
to bisect because the crash only seems to happen in the middle of the 
night, and not every, but most, nights.

It does not occur on most of my other boxes, just this one.  The box is a 
bit unique in that it's running 32-bit PAE on a 64-bit capable CPU, and I 
have the memory tuned down to mem=6G in the kernel command line (I think 
it has 16GB actual).  I tuned the RAM down because around 8GB the PAE 
kernel has massive IO speed issues.

It is a relatively new Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz on an 
Intel S1200BTL board.  I will eventually change it to 64-bit Fedora which 
I'm sure will solve this bug, but since there's no easy upgrade path, 
that's on the backburner on this production box.

I'm sure this will be another "PAE sucks, don't use it" issue, but like I 
said, I'm currently stuck with it, and in theory the kernel shouldn't 
crash like this (I'm guessing/hoping).

I think I pinned the trigger down to either (or both) big dir scans (like 
"find /bigdir-foo") running at around 3am.  It's either a remote box doing 
indexing via smbd and/or rsync or rdiff-backup also doing big dir scans.  
But when I do "find /" manually I can't trigger the bug.  Very weird.

The commit notes make it sound like the author thought perhaps there could 
be a problem in some scenarios?  I guess I found the scenario.

The only discussion I found on the net regarding this commit is
https://lkml.org/lkml/2016/8/29/154
And perhaps it's somewhat relevant, it's a bit over my head.

I'm available for testing, etc, and can usually rule out a bad kernel 
within 24-hours by just waiting for 3am to roll around.  I also have 
copious logs I can provide and screenshots of the crashes.

The box is extremely lightly loaded, and RAM use is almost always under 
1GB, and swap is 0-20k used most of the time with GB's free.  Everything 
looks great until all of a sudden oom-killer starts running and goes 
through 10-260 iterations before the system just dies.  I wrote a script 
to watch for oom-killer and issue "reboot" immediately, but 80% of the 
time the box will hang before the reboot actually manages to shutdown.

Any information/help I can provide, please just holler.  Thanks!