linux-kernel - Re: Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage) with check/repair/sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Fri, 15 Jul 2016 09:11:02 +0200
From:	Matthias Dahl <ml_linux-kernel@...ary-island.eu>
To:	Michal Hocko <mhocko@...nel.org>
Cc:	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
	linux-raid@...r.kernel.org, linux-mm@...ck.org,
	dm-devel@...hat.com, linux-kernel@...r.kernel.org,
	Mike Snitzer <snitzer@...hat.com>
Subject: Re: Page Allocation Failures/OOM with dm-crypt on software RAID10
 (Intel Rapid Storage) with check/repair/sync

Hello...

I am rather persistent (stubborn?) when it comes to tracking down bugs,
if somehow possible... and it seems it paid off... somewhat. ;-)

So I did quite a lot more further tests and came up with something very
interesting: As long as the RAID is in sync (as-in: sync_action=idle),
I can not for the life of me trigger this issue -- the used memory
still explodes to most of the RAM but it oscillates back and forth.

I did very stupid things to stress the machine while dd was running as
usual on the dm-crypt device. I opened a second dd instance with the
same parameters on the dm-crypt device. I wrote a simple program that
allocated random amounts of memory (up to 10 GiB), memset them and after
a random amount of time released it again -- in a continuous loop. I
put heavy network stress on the machine... whatever I could think of.

No matter what, the issue did not trigger. And I repeated said tests
quite a few times over extended time periods (usually an hour or so).
Everything worked beautifully with nice speeds and no noticeable system
slow-downs/lag.

As soon as I issued a "check" to sync_action of the RAID device, it was
just a matter of a second until the OOM killer kicked in and all hell
broke loose again. And basically all of my tests where done while the
RAID device was syncing -- due to a very unfortunate series of events.

I tried to repeat that same test with an external (USB3) connected disk
with a Linux s/w RAID10 over two partitions... but unfortunately that
behaves rather differently. I assume it is because it is connected
through USB and not SATA. While doing those tests on my RAID10 with the
4 internal SATA3 disks, you can see w/ free that the "used memory" does
explode to most of the RAM and then oscillates back and forth. With the
same test on the external disk through, that does not happen at all. The
used memory stays pretty much constant and only the buffers vary... but
most of the memory is still free in that case.

I hope my persistence on the matter is not annoying and finally leads us
somewhere where the real issue hides.

Any suggestions, opinions and ideas are greatly appreciated as I have
pretty much exhausted mine at this time.

Last but not least: I switched my testing to a OpenSuSE Tumbleweed Live
system (x86_64 w/ kernel 4.6.3) as Rawhide w/ 4.7.0rcX behaves rather
strangely and unstable at times.

Thanks,
Matthias

-- 
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
  services: custom software [desktop, mobile, web], server administration