linux-kernel - RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

Message-ID: <MWHPR03MB266957B2E9075ACDB829FD21BF580@MWHPR03MB2669.namprd03.prod.outlook.com>
Date:   Tue, 14 Feb 2017 15:54:26 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     "hch@....de" <hch@....de>
CC:     Jens Axboe <axboe@...nel.dk>,
        Bart Van Assche <Bart.VanAssche@...disk.com>,
        "hare@...e.com" <hare@...e.com>, "hare@...e.de" <hare@...e.de>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "jth@...nel.org" <jth@...nel.org>,
        Nick Meier <Nick.Meier@...rosoft.com>,
        "Alex Ng (LIS)" <alexng@...rosoft.com>,
        Long Li <longli@...rosoft.com>,
        "Adrian Suhov (Cloudbase Solutions SRL)" <v-adsuho@...rosoft.com>,
        "Chris Valean (Cloudbase Solutions SRL)" <v-chvale@...rosoft.com>
Subject: RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
 when scheduling workqueue elements")

> From: hch@....de [mailto:hch@....de]
> Sent: Tuesday, February 14, 2017 22:51
> To: Dexuan Cui <decui@...rosoft.com>
> Cc: hch@....de; Jens Axboe <axboe@...nel.dk>; Bart Van Assche
> <Bart.VanAssche@...disk.com>; hare@...e.com; hare@...e.de; Martin K.
> Petersen <martin.petersen@...cle.com>; linux-kernel@...r.kernel.org;
> linux-block@...r.kernel.org; jth@...nel.org; Nick Meier
> <Nick.Meier@...rosoft.com>; Alex Ng (LIS) <alexng@...rosoft.com>; Long Li
> <longli@...rosoft.com>; Adrian Suhov (Cloudbase Solutions SRL) <v-
> adsuho@...rosoft.com>; Chris Valean (Cloudbase Solutions SRL) <v-
> chvale@...rosoft.com>
> Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
> when scheduling workqueue elements")
> 
> On Tue, Feb 14, 2017 at 02:46:41PM +0000, Dexuan Cui wrote:
> > > From: hch@....de [mailto:hch@....de]
> > > Sent: Tuesday, February 14, 2017 22:29
> > > To: Dexuan Cui <decui@...rosoft.com>
> > > Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold
> > > event lock when scheduling workqueue elements")
> > >
> > > Ok, thanks for testing.  Can you try the patch below?  It fixes a
> > > clear problem which was partially papered over before the commit
> > > you bisected to, although it can't explain why blk-mq still works.
> >
> > Still bad luck. :-(
> >
> > BTW, I'm using the first "bad" commit (scsi: allocate scsi_cmnd
> > structures as part of struct request) + the 2 patches you provided today.
> >
> > I suppose I don't need to test the 2 patches on the latest linux-next repo.
> 
> I'd love a test on that repo actually.  We had a few other for sense
> handling since then I think.

I tested today's linux-next (next-20170214) + the 2 patches just now and got
a weird result: sometimes the VM still hung with a new calltrace (BUG:
spinlock bad magic), but sometimes the VM did boot up despite the new
calltrace!

Attached is the log of a "good" boot.

It looks like we have a memory corruption issue somewhere...

Actually, I saw the "BUG: spinlock bad magic" message once before, but I
couldn't repro it later, so I didn't mention it to you.

The good news is that I can now repro the "spinlock bad magic" message
every time.
I tried to dig into this by enabling the options under Kernel hacking ->
Memory Debugging, but didn't find anything abnormal.
Is it possible that the SCSI layer passes a wrong memory address?

Thanks,
-- Dexuan

Download attachment "dmesg.log" of type "application/octet-stream" (31337 bytes)