linux-kernel - Kernel falls apart under light memory pressure (i.e. linking vmlinux)

Date:	Wed, 11 May 2011 18:42:38 -0400
From:	Andrew Lutomirski <luto@....edu>
To:	linux-kernel@...r.kernel.org
Subject: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

For the last few days (since moving my disk to a new laptop), my
system has been hanging, usually unrecoverably, under light memory
pressure.  When this happens, I usually see soft lockups and no OOM
kill.  Mouse and keyboard input stop working.  Sometimes I can switch
VTs; sometimes I can't.  If I just wait it out, sometimes the system
comes back after a couple of minutes but usually even ten minutes or
so isn't enough.  If I force an OOM kill (Alt-SysRq-F), my system
sometimes recovers.  I've attached the dmesg from when that happened
(in that case the freeze was triggered by linking a kernel and the OOM
killer killed ld.)

I can trigger it about half of the time by building a kernel (it
usually dies while linking or doing the .tmp_* stuff) and 100% of the
time by running the attached script with parameters "1500 1400 1".
The script creates a 1500M file on a ramfs, sets up dm-crypt over
loopback on that file, formats it as ext4, and mounts it, then starts
writing a 1400M file over and over on the ext4 partition.
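
The attached test_mempressure.sh is not reproduced in this message.  As
a rough illustration of the steps described above, here is a dry-run
sketch that only builds and prints the command sequence (it executes
nothing).  The function name, mount points, loop device, and mapping
name are all hypothetical, the exact commands and flags in the real
script may differ, and the meaning of the third parameter ("1") is not
described here, so the sketch ignores it.

```python
from typing import List


def plan_commands(backing_mb: int, write_mb: int,
                  ram_dir: str = "/mnt/ramtest",      # hypothetical mount point
                  fs_dir: str = "/mnt/crypttest",     # hypothetical mount point
                  loop_dev: str = "/dev/loop0",       # hypothetical loop device
                  dm_name: str = "memtest_crypt"      # hypothetical mapping name
                  ) -> List[List[str]]:
    """Return the argv lists for the setup described in the text.

    Nothing is executed; running these for real would need root.
    """
    if write_mb > backing_mb:
        raise ValueError("the file written must fit inside the backing file")

    image = f"{ram_dir}/backing.img"
    mapped = f"/dev/mapper/{dm_name}"
    return [
        # 1. ramfs, so the backing file lives entirely in RAM
        ["mount", "-t", "ramfs", "ramfs", ram_dir],
        # 2. backing file of backing_mb megabytes
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={backing_mb}"],
        # 3. loopback device over the backing file
        ["losetup", loop_dev, image],
        # 4. dm-crypt mapping over the loop device, throwaway random key
        ["cryptsetup", "--key-file", "/dev/urandom", "create", dm_name, loop_dev],
        # 5. ext4 on the encrypted device, then mount it
        ["mkfs.ext4", mapped],
        ["mount", mapped, fs_dir],
        # 6. the step that is repeated over and over: write write_mb megabytes
        ["dd", "if=/dev/zero", f"of={fs_dir}/fill", "bs=1M", f"count={write_mb}"],
    ]


if __name__ == "__main__":
    for argv in plan_commands(1500, 1400):
        print(" ".join(argv))
```

With "1500 1400", the ramfs file alone pins roughly 1.5 GB of the
machine's 2 GB, so every write to the ext4 file has to pass through
dm-crypt and the loop device with very little free memory left.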

I cannot trigger the problem by running the same script on a different
machine (with 8 GB RAM) with parameters 6000 5500 1.  I can't trigger
it on this machine from initramfs (same kernel image) or from
systemd's emergency shell.  I can trigger it some of the time from
systemd's rescue shell (which has a little bit more stuff running).
The problem seems about equally prevalent with AHCI or compatibility
mode and with aesni-intel enabled and disabled.  (aesni-intel causes
cryptd to get pulled in, so I thought that might be the issue.)

I can sometimes (but not always) trigger this by enabling swap and
running dirty_ram 2048 (attached).  (One time it took the system down
completely.  I have ~8 GB of swap, all of which was empty when I ran
the program.)
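
The attached dirty_ram.cc is likewise not reproduced here.  A minimal
sketch of the general idea, assuming the argument is a size in
megabytes and that the program simply allocates that much anonymous
memory and writes to every page, might look like the following.  This
is a Python stand-in for illustration, not the attached C++ source; the
function name and the 16 MiB default are hypothetical.

```python
import mmap
import sys


def dirty_ram(megabytes: int) -> int:
    """Map `megabytes` MiB of anonymous memory and write to every page.

    Returns the number of pages touched.  The mapping is released
    before returning.
    """
    if megabytes < 1:
        raise ValueError("size must be at least 1 MiB")

    size = megabytes * 1024 * 1024
    buf = mmap.mmap(-1, size)  # fileno -1 requests an anonymous mapping
    touched = 0
    try:
        # One write per page is enough to force the kernel to back the
        # page with real memory (or swap) instead of leaving it untouched.
        for offset in range(0, size, mmap.PAGESIZE):
            buf[offset] = 1
            touched += 1
    finally:
        buf.close()
    return touched


if __name__ == "__main__":
    # Assumed usage: dirty_ram.py <megabytes>; small default if omitted.
    mb = int(sys.argv[1]) if len(sys.argv) > 1 else 16
    print(f"touched {dirty_ram(mb)} pages of {mmap.PAGESIZE} bytes")
```

Running it with 2048 on a 2 GB machine asks for more memory than is
physically present, so the kernel has to swap or reclaim to satisfy it.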

I see this problem on 2.6.38.{5,6}, 2.6.39-<something from today>, and
Fedora 15's kernel, so I doubt it's an oddity of my kernel config.

I also had this problem while running Fedora 15's installer to upgrade
from Fedora 14 to 15, which rules out a lot of weird userspace issues.

This box is a Lenovo X220 Sandy Bridge laptop with 2G of RAM (the old
box had more) and runs ext4 on LVM on dm-crypt on an SSD.  I see the
problem with and without a swap partition.  I've also tried unloading
most drivers and the test still fails.  Memtest passes.

If I had to guess, I'd say that the VM gets confused when it's forced
to write data out to my LVM-over-dm-crypt partition and either starts
OOM-killing things when it's not out of memory or deadlocks because it
runs out of available RAM and can't service new dm-crypt and block
requests.

Please help fix/debug this.  It's making my shiny new laptop almost useless.

--Andy

View attachment "successful-oom-kill.txt" of type "text/plain" (88205 bytes)

Download attachment "test_mempressure.sh" of type "application/x-sh" (1993 bytes)

View attachment "OOM-with-lots-of-swap.txt" of type "text/plain" (34676 bytes)

View attachment "dirty_ram.cc" of type "text/plain" (583 bytes)