Message-Id: <5A404D4C-BB61-45AB-9A7A-B380FE222137@e18.physik.tu-muenchen.de>
Date:	Tue, 24 Apr 2007 14:27:22 +0200
From:	Roland Kuhn <rkuhn@....physik.tu-muenchen.de>
To:	Jens Axboe <jens.axboe@...cle.com>
Cc:	Thiemo.Nagel@...tum.de,
	linuxkernel Org <linux-kernel@...r.kernel.org>
Subject: Re: [OOPS] 2.6.21-rc6-git5 in cfq_dispatch_insert

Hi Jens!

[I made a typo in the Cc: list so that lkml is only included as of  
now. Actually I copied the typo from you ;-) ]

On 24 Apr 2007, at 11:40, Jens Axboe wrote:

> On Tue, Apr 24 2007, Jens Axboe wrote:
>> On Tue, Apr 24 2007, Roland Kuhn wrote:
>>> Hi Jens!
>>>
>>> On 24 Apr 2007, at 11:18, Jens Axboe wrote:
>>>
>>>> On Tue, Apr 24 2007, Roland Kuhn wrote:
>>>>> Hi Jens!
>>>>>
>>>>> We're using a custom built fileserver (dual core Athlon64, using
>>>>> x86_64 arch) with 22 disks in a RAID6 and while resyncing /dev/md2
>>>>> (9.1GB ext3) after a hardware incident (cable pulled on one disk)
>>>>> the machine would reliably oops while serving some large files over
>>>>> NFSv3. The oops message scrolled partly off the screen, but the IP
>>>>> was in cfq_dispatch_insert, so I tried your debug patch from
>>>>> yesterday with 2.6.21-rc7. I used netconsole for capturing the
>>>>> output (which works nicely, thanks Matt!) and as usual the condition
>>>>> triggered after about half a minute, this time with the following
>>>>> printout instead of crashing (still works fine):
>>>>>
>>>>> cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to lkml@...r.kernel.org
>>>>> cfq: busy=1,drv=1,timer=0
>>>>> cfq rr_list:
>>>>> cfq busy_list:
>>>>>  4272: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
>>>>> cfq idle_list:
>>>>> cfq cur_rr:
>>>>> cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to lkml@...r.kernel.org
>>>>> cfq: busy=1,drv=1,timer=0
>>>>> cfq rr_list:
>>>>> cfq busy_list:
>>>>>  4276: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
>>>>> cfq idle_list:
>>>>> cfq cur_rr:
>>>>>
>>>>> There was no backtrace, so the only thing I can tell is that for
>>>>> the previous crashes some nfs threads were always involved, only
>>>>> once did it happen inside an interrupt handler (with the "aieee"
>>>>> kind of message).
>>>>>
>>>>> If you want me to try something else, don't hesitate to ask!
>>>>
>>>> Nifty, great that you can reproduce so quickly. I'll try a 3-drive
>>>> raid6 here and see if read activity along with a resync will trigger
>>>> anything. If that doesn't work for me, I'll provide you with a more
>>>> extensive debug patch (if you don't mind).
>>>>
>>> Sure. You might want to include NFS file access in your tests,
>>> since we've not triggered this when accessing the disks locally.
>>> BTW:
>>
>> How are you exporting the directory (what export options) - how is
>> it mounted by the client(s)? What chunksize is your raid6 using?
>
> And what is the nature of the files on the raid (huge, small, ?) and
> what are the client(s) doing? Just approximately, I know these things
> can be hard/difficult/impossible to specify.
>
The files are 100-400MB in size and the client is merging them into a  
new file in the same directory using the ROOT library, which does in  
essence alternating sequences of

_llseek(somewhere)
read(n bytes)
_llseek(somewhere+n)
read(m bytes)
...

and then

_llseek(somewhere)
rt_sigaction(ignore INT)
write(n bytes)
rt_sigaction(INT->DFL)
time()
_llseek(somewhere+n)
...

where n is of the order of 30kB. The input files are treated
sequentially, not randomly.
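
For anyone trying to reproduce this without ROOT, the access pattern above can be approximated with a short script. This is only a sketch under assumptions, not what ROOT actually does internally: the function name merge_files, the batch parameter (how many reads happen before the writes start) and the exact 30 KiB chunk size are made up for illustration. Only the order of the calls (seek, read, then seek, ignore INT, write, restore INT, time) is taken from the trace above.

```python
import os
import signal
import time

CHUNK = 30 * 1024  # assumption: "n" is about 30kB, as described above


def merge_files(inputs, output, chunk=CHUNK, batch=4):
    """Append each input file, in order, to output with explicit seek/read
    and seek/write pairs. Returns the number of bytes written.

    Hypothetical helper: it mimics the syscall pattern only. Must be called
    from the main thread, because it changes the SIGINT handler.
    """
    out_fd = os.open(output, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    out_pos = 0
    try:
        for path in inputs:  # inputs are processed sequentially
            in_fd = os.open(path, os.O_RDONLY)
            in_pos = 0
            try:
                while True:
                    # Read phase: a run of lseek + read pairs.
                    pending = []
                    for _ in range(batch):
                        os.lseek(in_fd, in_pos, os.SEEK_SET)
                        data = os.read(in_fd, chunk)
                        if not data:
                            break
                        pending.append(data)
                        in_pos += len(data)
                    if not pending:
                        break
                    # Write phase: lseek, ignore INT, write, restore INT, time().
                    for data in pending:
                        os.lseek(out_fd, out_pos, os.SEEK_SET)
                        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
                        try:
                            done = 0
                            while done < len(data):  # handle short writes
                                done += os.write(out_fd, data[done:])
                        finally:
                            # Restore the earlier handler (fall back to SIG_DFL).
                            restore = previous if previous is not None else signal.SIG_DFL
                            signal.signal(signal.SIGINT, restore)
                        time.time()
                        out_pos += len(data)
            finally:
                os.close(in_fd)
    finally:
        os.close(out_fd)
    return out_pos
```

Run on an NFSv3 client against files of 100-400MB in the exported directory, something like merge_files(["a.root", "b.root"], "merged.root") should generate a roughly comparable mix of small seeks, reads and writes (the file names here are placeholders). Whether that is enough to trigger the cfq condition is untested.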

BTW: the machine just stopped dead, no sign whatsoever on console or
netconsole, so I rebooted with elevator=deadline (need to get some
work done besides ;-) )

Ciao,
                     Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str., 85748 Garching
Telefon 089/289-12575; Telefax 089/289-12570
--
CERN office: 892-1-D23 phone: +41 22 7676540 mobile: +41 76 487 4482
--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both.  - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M 
+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++>++++ h---- y+++
------END GEEK CODE BLOCK------


