Date: Sat, 13 May 2017 02:06:31 +0200
From: MasterPrenium <masterprenium.lkml@...il.com>
To: Shaohua Li <shli@...nel.org>
Cc: linux-kernel@...r.kernel.org, xen-users@...ts.xen.org, linux-raid@...r.kernel.org, "MasterPrenium@...il.com" <MasterPrenium@...il.com>, xen-devel@...ts.xenproject.org
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hi guys,

My issue still remains with new kernels, at least with the latest revision of the 4.10.x branch.

But I found something that may be interesting for the investigation. I have attached another .config file for kernel building; with this configuration I'm not able to reproduce the kernel panic. I get no crash at all with exactly the same procedure.

Tested on the 4.9.16 and 4.10.13 kernels:
- config_Crash.txt: results in a crash within less than 2 minutes of running fio
- config_NoCrash.txt: even after hours of fio, rebuilding arrays, etc., no crash at all, and no warning or anything else in dmesg.

Note: config_NoCrash comes from another server on which I had set up a similar system and which was not crashing. I tested this kernel on my crashing system, and there is no crash anymore...

I can't believe that a different config can solve a kernel BUG...

If someone has any idea...

Bests,

On 09/01/2017 at 23:44, Shaohua Li wrote:
> On Sun, Jan 08, 2017 at 02:31:15PM +0100, MasterPrenium wrote:
>> Hello,
>>
>> Replies below + :
>> - I don't know if this can help but after the crash, when the system
>> reboots, the Raid 5 stack is re-synchronizing
>> [ 37.028239] md10: Warning: Device sdc1 is misaligned
>> [ 37.028541] created bitmap (15 pages) for device md10
>> [ 37.030433] md10: bitmap initialized from disk: read 1 pages, set 59 of
>> 29807 bits
>>
>> - Sometimes the kernel completely crash (lost serial + network connection),
>> sometimes only got the "BUG" dump, but still have network access (but a
>> reboot is impossible, need to reset the system).
>>
>> - You can find blktrace here (while running fio), I hope it's complete since
>> the end of the file is when the kernel crashed : https://goo.gl/X9jZ50
> Looks most are normal full stripe writes.
>
>>> I'm trying to reproduce, but no success. So
>>> ext4->btrfs->raid5, crash
>>> btrfs->raid5, no crash
>>> right? does subvolume matter? When you create the raid5 array, does adding
>>> '--assume-clean' option change the behavior? I'd like to narrow down the issue.
>>> If you can capture the blktrace to the raid5 array, it would be great to hint
>>> us what kind of IO it is.
>> Yes Correct.
>> The subvolume doesn't matter.
>> --assume-clean doesn't change the behaviour.
> so it's not a resync issue.
>
>> Don't forget that the system needs to be running on xen to crash, without
>> (on native kernel) it doesn't crash (or at least, I was not able to make it
>> crash).
>>>> Regarding your patch, I can't find it. Is it the one sent by Konstantin
>>>> Khlebnikov ?
>>> Right.
>> It doesn't help :(. Maybe the crash is happening a little bit later.
> ok, the patch is unlikely helpful, since the IO size isn't very big.
>
> Don't have good idea yet. My best guess so far is virtual machine introduces
> extra delay, which might trigger some race conditions which aren't seen in
> native. I'll check if I could find something locally.
>
> Thanks,
> Shaohua

View attachment "Config_Crash.txt" of type "text/plain" (110513 bytes)
View attachment "Config_NoCrash.txt" of type "text/plain" (121929 bytes)
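
One way to narrow down which option separates the crashing config from the non-crashing one is to list only the options whose values differ. The following is a minimal sketch in Python, assuming both attachments use the usual kernel .config syntax (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`). The function names `parse_config` and `diff_configs` are illustrative, and treating an option that is absent from a file as "n" is a simplifying assumption. The kernel source tree also ships `scripts/diffconfig`, which serves the same purpose.

```python
import re
import sys

# Patterns for the two line shapes assumed in a kernel .config file.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return a dict mapping option name to its value ("n" if not set)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(crash, no_crash):
    """Return sorted (option, crash_value, no_crash_value) for differing options.

    Assumption: an option missing from one file is treated as "n".
    """
    diffs = []
    for key in sorted(set(crash) | set(no_crash)):
        a = crash.get(key, "n")
        b = no_crash.get(key, "n")
        if a != b:
            diffs.append((key, a, b))
    return diffs


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python diff_kconfig.py config_Crash.txt config_NoCrash.txt
    with open(sys.argv[1]) as f:
        crash_cfg = parse_config(f.read())
    with open(sys.argv[2]) as f:
        no_crash_cfg = parse_config(f.read())
    for name, a, b in diff_configs(crash_cfg, no_crash_cfg):
        print(f"{name}: {a} -> {b}")
```

With two configs of this size the list may still be long, so a reasonable next step would be to switch the differing options in groups and rerun the fio test after each change, halving the candidate set each time.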