linux-kernel - Regression for 3.0-rc1 - bisected to commit ea6949b66d084a197dd7f243b72e216a71d9f2ca

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <4DE97E12.1030900@lwfinger.net>
Date:	Fri, 03 Jun 2011 19:36:34 -0500
From:	Larry Finger <Larry.Finger@...inger.net>
To:	LKML <linux-kernel@...r.kernel.org>, Tejun Heo <tj@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
CC:	Amit Shah <amit.shah@...hat.com>, Jens Axboe <jaxboe@...ionio.com>
Subject: Regression for 3.0-rc1 - bisected to commit ea6949b66d084a197dd7f243b72e216a71d9f2ca

With kernel 3.0-rc1 and an openSUSE 11.4 distro, my HP dv2815nr notebook fails 
to boot. Bisection leads to the following:

finger@...rylap:~/linux-2.6> git bisect bad
ea6949b66d084a197dd7f243b72e216a71d9f2ca is the first bad commit
commit ea6949b66d084a197dd7f243b72e216a71d9f2ca
Author: Tejun Heo <tj@...nel.org>
Date:   Thu Apr 21 20:54:44 2011 +0200

     cdrom: always check_disk_change() on open

     cdrom_open() called check_disk_change() after the rest of open path
     succeeded which leads to the following bizarre behavior.

     * After media change, if the device opened without O_NONBLOCK,
       open_for_data() naturally fails with -ENOMEDIA and
       check_disk_change() is never called.  The media is known to be gone
       and the open failure makes it obvious to the userland but device
       invalidation never happens.

     * But if the device is opened with O_NONBLOCK, all the checks are
       bypassed and cdrom_open() doesn't notice that the media is not there
       and check_disk_change() is called and invalidation happens.

     There's nothing to be gained by avoiding calling check_disk_change()
     on open failure.  Common cases end up calling check_disk_change()
     anyway.  All we get is inconsistent behavior.

     Fix it by moving check_disk_change() invocation to the top of
     cdrom_open() so that it always gets called regardless of how the rest
     of open proceeds.

     Note for stable: 2.6.38 and later only

     Cc: stable@...nel.org
     Signed-off-by: Tejun Heo <tj@...nel.org>
     Reported-by: Amit Shah <amit.shah@...hat.com>
     Tested-by: Amit Shah <amit.shah@...hat.com>
     Signed-off-by: Jens Axboe <jaxboe@...ionio.com>

:040000 040000 6af2e041bcda269a2378f5c8bf10f11fb1903d32 
8dbfeb49db2617d65b4074094c3ff0e2a8159a23 M      drivers

Although I was concerned when I discovered that bisection led to the very first 
patch in a branch, I found that 3.0-rc1 booted when this patch was reverted.

To help debug it, I added some printk's before and after the relocated call to 
check_disk_change(). After those showed that user space was infinite looping, I 
added a dump_stack() with the following result:

Jun  2 23:11:55 (none) kernel: [   59.311058] Entering cdrom_open
Jun  2 23:11:55 (none) kernel: [   59.313351] Pid: 5776, comm: udisks-part-id 
Not tainted 3.0.0-rc1-Linus+ #47
Jun  2 23:11:55 (none) kernel: [   59.313354] Call Trace:
Jun  2 23:11:55 (none) kernel: [   59.313365]  [<ffffffffa009ba6f>] 
cdrom_open+0x4f/0x610 [cdrom]
Jun  2 23:11:55 (none) kernel: [   59.313372]  [<ffffffff81351ae1>] ? 
__mutex_unlock_slowpath+0xc1/0x150
Jun  2 23:11:55 (none) kernel: [   59.313377]  [<ffffffff81081f1d>] ? 
trace_hardirqs_on_caller+0x13d/0x180
Jun  2 23:11:55 (none) kernel: [   59.313381]  [<ffffffff81081f6d>] ? 
trace_hardirqs_on+0xd/0x10
Jun  2 23:11:55 (none) kernel: [   59.313387]  [<ffffffffa00da5d7>] 
idecd_open+0xc7/0xe0 [ide_cd_mod]
Jun  2 23:11:55 (none) kernel: [   59.313393]  [<ffffffff811515b7>] 
__blkdev_get+0x97/0x3e0
Jun  2 23:11:55 (none) kernel: [   59.313397]  [<ffffffff81151c80>] ? 
blkdev_get+0x380/0x380
Jun  2 23:11:55 (none) kernel: [   59.313401]  [<ffffffff8115194e>] 
blkdev_get+0x4e/0x380
Jun  2 23:11:55 (none) kernel: [   59.313406]  [<ffffffff81353206>] ? 
_raw_spin_unlock+0x26/0x30
Jun  2 23:11:55 (none) kernel: [   59.313410]  [<ffffffff81151c80>] ? 
blkdev_get+0x380/0x380
Jun  2 23:11:55 (none) kernel: [   59.313414]  [<ffffffff81151ce0>] 
blkdev_open+0x60/0x80
Jun  2 23:11:55 (none) kernel: [   59.313419]  [<ffffffff8111b7af>] 
__dentry_open+0x12f/0x2f0
Jun  2 23:11:55 (none) kernel: [   59.313423]  [<ffffffff81353206>] ? 
_raw_spin_unlock+0x26/0x30
Jun  2 23:11:55 (none) kernel: [   59.313427]  [<ffffffff8111c8f1>] 
nameidata_to_filp+0x71/0x80
Jun  2 23:11:55 (none) kernel: [   59.313432]  [<ffffffff8112b04a>] 
do_last+0xda/0x910
Jun  2 23:11:55 (none) kernel: [   59.313437]  [<ffffffff8112c26b>] 
path_openat+0xcb/0x400
Jun  2 23:11:55 (none) kernel: [   59.313441]  [<ffffffff810f7567>] ? 
might_fault+0x57/0xb0
Jun  2 23:11:55 (none) kernel: [   59.313445]  [<ffffffff810f7567>] ? 
might_fault+0x57/0xb0
Jun  2 23:11:55 (none) kernel: [   59.313449]  [<ffffffff8112c5e4>] 
do_filp_open+0x44/0xa0
Jun  2 23:11:55 (none) kernel: [   59.313454]  [<ffffffff81353206>] ? 
_raw_spin_unlock+0x26/0x30
Jun  2 23:11:55 (none) kernel: [   59.313458]  [<ffffffff81139104>] ? 
alloc_fd+0xf4/0x150
Jun  2 23:11:55 (none) kernel: [   59.313462]  [<ffffffff8111c9fc>] 
do_sys_open+0xfc/0x1d0
Jun  2 23:11:55 (none) kernel: [   59.313467]  [<ffffffff8111caeb>] 
sys_open+0x1b/0x20
Jun  2 23:11:55 (none) kernel: [   59.313472]  [<ffffffff81353c7b>] 
system_call_fastpath+0x16/0x1b
Jun  2 23:11:55 (none) kernel: [   59.328377] Back from check_disk_change

 From the above, it seems to me that program "udisks-part-id" is looping on the 
attempt to mount a cdrom.

 From the commit message, it seems to me that this change is correct, and that 
the user-space program has the bug. I am posting this to provide a trail for 
anyone else that hits the problem, and to solicit any help in developing a 
proper fix. Until such a fix is available, I can always revert ea6949b.

Bug 698104 on http://bugzilla.novell.com has been filed for this issue.

Thanks,

Larry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/