linux-kernel - Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20111202033148.GA7046@dastard>
Date:	Fri, 2 Dec 2011 14:31:48 +1100
From:	Dave Chinner <david@...morbit.com>
To:	David Rientjes <rientjes@...gle.com>
Cc:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog

On Fri, Dec 02, 2011 at 12:59:21PM +1100, Dave Chinner wrote:
> On Thu, Dec 01, 2011 at 02:35:31PM -0800, David Rientjes wrote:
> > On Thu, 1 Dec 2011, Dave Chinner wrote:
> > 
> > > > /*
> > > >  * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
> > > >  * pid.
> > > >  */
> > > > #define OOM_SCORE_ADJ_MIN       (-1000)
> > > > 
> > > >  
> > > > IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.
> > > 
> > > It's not me or the test suite that setting this, so it's something
> > > the kernel must be doing automagically.
> > > 
> > 
> > The kernel does not set oom_score_adj to ever disable oom killing for a 
> > thread.  The only time the kernel touches oom_score_adj is when setting it 
> > to "1000" in ksm and swap to actually prefer a memory allocator for oom 
> > killing.
> > 
> > It's also possible to change this value via the deprecated 
> > /proc/pid/oom_adj interface until it is removed next year.  Check your 
> > dmesg for warnings about using the deprecated oom_adj interface or change 
> > the printk_once() in oom_adjust_write() to a normal printk() to catch it.
> 
> No warnings at all, as I've already said. If it is userspace,
> whatever is doing it is using the oom_score_adj interface correctly.

.....

> <sigh>
> 
> The reports all cycle around this loop:
> 
> 	linux-mm says userspace/distro problem
> 	distro says openssh problem
> 	openssh says kernel problem
> 
> And there doesn't appear to be any resolution in any of the reports,
> just circular finger pointing and frustrated users.
> 
> I can't find anything in the distro startup or udev scripts that
> modify the oom parameters, and the openssh guys say they only
> pass on the value inhereted from ssh's parent process, so it clearly
> not obvious where the bug lies at this point. It's been around for
> some time, though...
> 
> More digging to do...

A working sshd startup and login:

Dec  2 13:16:32 test-2 sshd[2119]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 13:16:32 test-2 sshd[2119]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:16:32 test-2 sshd[2119]: Server listening on 0.0.0.0 port 22.
Dec  2 13:16:32 test-2 sshd[2119]: socket: Address family not supported by protocol
Dec  2 13:16:36 test-2 sshd[2119]: debug1: Forked child 2576.
Dec  2 13:16:36 test-2 sshd[2576]: Set /proc/self/oom_score_adj to 0

The child process sets itself back to 0 correctly. Now, a non-working
startup and login:

Dec  2 13:19:56 test-2 sshd[2126]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 13:19:56 test-2 sshd[2126]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:19:56 test-2 sshd[2126]: Server listening on 0.0.0.0 port 22.
Dec  2 13:19:56 test-2 sshd[2126]: socket: Address family not supported by protocol
Dec  2 13:19:57 test-2 sshd[2126]: Received signal 15; terminating.
Dec  2 13:19:57 test-2 sshd[2317]: Set /proc/self/oom_score_adj from -1000 to -1000
Dec  2 13:19:57 test-2 sshd[2317]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:19:57 test-2 sshd[2317]: Server listening on 0.0.0.0 port 22.
Dec  2 13:19:57 test-2 sshd[2317]: socket: Address family not supported by protocol
Dec  2 13:20:01 test-2 sshd[2317]: debug1: Forked child 2322.
Dec  2 13:20:01 test-2 sshd[2322]: Set /proc/self/oom_score_adj to -1000

Somewhere in the statup process, a sshd process is getting a SIGTERM
and dying. It is then restarted immediately form a context that has
a /proc/self/oom_score_adj value of -1000, which is where the
problem lies. So, how does this occur? Looks like a distro problem -
I added 'echo "sshd restart" >> /var/log/auth.log' to the
/etc/init.d/ssh restart command, and this pops out now:

Dec  2 14:00:08 test-2 sshd[2037]: debug3: oom_adjust_setup
Dec  2 14:00:08 test-2 sshd[2037]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 14:00:08 test-2 sshd[2037]: debug2: fd 3 setting O_NONBLOCK
Dec  2 14:00:08 test-2 sshd[2037]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 14:00:08 test-2 sshd[2037]: Server listening on 0.0.0.0 port 22.
Dec  2 14:00:08 test-2 sshd[2037]: socket: Address family not supported by protocol
sshd restart
Dec  2 14:00:11 test-2 sshd[2037]: Received signal 15; terminating.
Dec  2 14:00:11 test-2 sshd[2330]: debug3: oom_adjust_setup
Dec  2 14:00:11 test-2 sshd[2330]: Set /proc/self/oom_score_adj from -1000 to -1000

So, something in a startup script is causing an sshd restart from a
context where the oom_score_adj = -1000. There's only two
possibilities here - the dhcp client bringing the network interface
up or a udev event after the sshd has been started.

Bingo:

$ sudo ifup --verbose eth0
Configuring interface eth0=eth0 (inet)
run-parts --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/bridge
run-parts: executing /etc/network/if-pre-up.d/uml-utilities

dhclient -v -pf /var/run/dhclient.eth0.pid -lf
/var/lib/dhcp/dhclient.eth0.leases eth0
Internet Systems Consortium DHCP Client 4.1.1-P1
Copyright 2004-2010 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/00:e4:b6:63:63:6e
Sending on   LPF/eth0/00:e4:b6:63:63:6e
Sending on   Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
DHCPOFFER from 192.168.1.254
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPACK from 192.168.1.254
bound to 192.168.1.61 -- renewal in 17198 seconds.
run-parts --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/mountnfs
run-parts: executing /etc/network/if-up.d/ntpdate
run-parts: executing /etc/network/if-up.d/openssh-server
run-parts: executing /etc/network/if-up.d/uml-utilities
$

and /etc/network/if-up.d/openssh-server does a restart on the
ssh server.

And this is set in /etc/network/interfaces:

allow-hotplug eth0

which means udev can execute the ifup command whenteh device
appears, asynchronously to the startup scripts that are running.

So, it's a distro bug - sshd should never be started from from udev
context because of this inherited oom_score_adj thing.
Interestingly, the ifup ssh restart script says this:

# We'd like to use 'reload' here, but it has some problems; see #502444.
if [ -x /usr/sbin/invoke-rc.d ]; then
        invoke-rc.d ssh restart >/dev/null 2>&1 || true
else
        /etc/init.d/ssh restart >/dev/null 2>&1 || true
fi

Bug 502444 describes the exact startup race condition that I've just
found. It does a ssh server restart because reload causes the sshd
server to fail to start if a start is currently in progress.  So,
rather than solving the start vs reload race condition, it got a
bandaid (use restart to restart sshd from the reload context) and
left it as a landmine.....

<sigh>

Cheers,

Dave.

-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/