linux-kernel - Re: next-20081119: general protection fault: get_next_timer

Message-ID: <alpine.LFD.2.00.0811242018370.3235@localhost.localdomain>
Date:	Mon, 24 Nov 2008 20:31:08 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	James Bottomley <James.Bottomley@...senPartnership.com>
cc:	Alexander Beregalov <a.beregalov@...il.com>,
	LKML <linux-kernel@...r.kernel.org>, linux-next@...r.kernel.org,
	Ingo Molnar <mingo@...e.hu>, linux-scsi@...r.kernel.org,
	David Miller <davem@...emloft.net>,
	Jens Axboe <jens.axboe@...cle.com>,
	Mike Anderson <andmike@...ux.vnet.ibm.com>
Subject: Re: next-20081119: general protection fault:
 get_next_timer_interrupt()

On Mon, 24 Nov 2008, James Bottomley wrote:
> On Mon, 2008-11-24 at 18:43 +0100, Thomas Gleixner wrote:
> > > scsi0 : LSI SAS based MegaRAID driver
> > > Driver 'sd' needs updating - please use bus_type methods
> > > scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG HE160HJ  0-24 PQ: 0 ANSI: 5
> > > ------------[ cut here ]------------
> > > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
> > > ODEBUG: free active object type: timer_list
> > 
> > That's the cause for your boot crash. The scsi/blk code is freeing a
> > page which contains an active timer, so the timer code references gone
> > memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when
> > it's freed.
> > 
> > James, or other scsi experts please.
> 
> Well, not sure.  Most likely candidate is the new block timer code.
> What seems to be happening is that the queue is being released with
> either an outstanding request (refcounting problem) or ticking timer
> with no work (block timer problem).  The way scanning works is that we
> create a request queue for each device we probe and then delete it again
> if nothing appears after the bus settle time.   The argument against
> this is that it should show up on every scanned bus.  However, these are
> getting rarer; I was just about to write that I hadn't seen it when I
> remembered that all my SCSI testing systems are currently running
> hotplug reporting busses (i.e. don't do scanning).  However,
> fortunately, I've also booted voyager recently which does use parallel
> SCSI and doesn't see this either, so it could also be megaraid_sas
> specific.

Yeah, it could be block as well. Jens, Mike?

One note about not seeing it: we have had such bugs before, where the
page was freed but not touched, so the timer survived without tripping
up the system. Alexander noticed because of DEBUG_PAGEALLOC; you can
also see it by enabling debugobjects, which will give you the nice
backtrace.

CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y

and add "debug_objects" to the kernel command line.
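
A quick way to confirm that a given kernel config really has the three
options above set is a small shell check. This is only a sketch: the
function name check_debugobjects_config is hypothetical, and the path to
the config file is whatever your build or distro uses (the .config in
the build tree, for example). It only matches the exact "=y" form that
Kconfig writes.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): report whether a kernel
# config file enables the debugobjects options listed above.
# Usage: check_debugobjects_config /path/to/.config
check_debugobjects_config() {
    config_file=$1
    missing=0
    for opt in CONFIG_DEBUG_OBJECTS \
               CONFIG_DEBUG_OBJECTS_FREE \
               CONFIG_DEBUG_OBJECTS_TIMERS; do
        # Kconfig writes enabled boolean options as NAME=y (lower case).
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "ok:      ${opt}=y"
        else
            echo "missing: ${opt}"
            missing=1
        fi
    done
    # Return 0 only if every option was found.
    return "$missing"
}
```

Run it as "check_debugobjects_config .config" from the build tree; a
non-zero return means at least one option is not set to "y".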
 
> Could you turn on SCSI logging so we can see the sequences.  Probably
> since this is boot time, just enable all logging:
> 
> echo 0xffffffff > /sys/module/scsi_mod/parameters/scsi_logging_level
> 
> (kernel must be compiled with CONFIG_SCSI_LOGGING=y)
> 
> James
> 
> 