[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACT4Y+Y6TpKD4pY9eGF-FFi6MHMNtBjnq-JeWLr+wVhDBP7Bfw@mail.gmail.com>
Date:	Wed, 24 Feb 2016 14:35:18 +0100
From:	Dmitry Vyukov <dvyukov@...gle.com>
To:	Al Viro <viro@...iv.linux.org.uk>
Cc:	Ian Kent <raven@...maw.net>,
	Mickaël Salaün <mic@...ikod.net>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	syzkaller <syzkaller@...glegroups.com>,
	Kostya Serebryany <kcc@...gle.com>,
	Alexander Potapenko <glider@...gle.com>,
	Sasha Levin <sasha.levin@...cle.com>
Subject: Re: fs: NULL deref in atime_needs_update
On Wed, Feb 24, 2016 at 11:15 AM, Dmitry Vyukov <dvyukov@...gle.com> wrote:
> On Wed, Feb 24, 2016 at 11:03 AM, Dmitry Vyukov <dvyukov@...gle.com> wrote:
>> On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@...iv.linux.org.uk> wrote:
>>> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
>>>> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote:
>>>> > Hi,
>>>> >
>>>> > Actually I found the same bug (without fuzzing) and I can reproduce it
>>>> > in a deterministic way (e.g. by creating a LSM that return 1 for the
>>>> > security_file_open hook). At least, from v4.2.8 I can easily trigger
>>>> > traces like this :
>>>>
>>>> Reading through this thread I wonder if this is a new problem.
>>>>
>>>> Is there a previous kernel it can't be reproduced on?
>>>> Perhaps a bisect will shed some light on what's happening.
>>>
>>> There are several things in the mix.  What Mickaël has found is that a bunch
>>> of places where _positive_ number returned instead of expected 0 or -E... can
>>> get propagated all way back to caller of do_last(), confusing the hell out
>>> of it.  That's not what Dmitry has triggered, though.  WARN_ON() in the
>>> end of do_last() would've triggered, and IMO this one, along with mitigation
>>> (map that "error value" to -EINVAL) should go into mainline and all -stable.
>>> Bogus ->open() returning a positive number had always been bad news; in the
>>> best case it would be returned to userland, leading to "open(2) has failed
>>> and returned a positive number".  Hell knows if we ever had such instances
>>> (or have them right now), but I wouldn't bet on their absense.  Rare
>>> failure exits returning bogus values in an ->open() instance in some driver
>>> can easily stay unnoticed for a long time.  Starting from at least 3.6
>>> (circa the atomic_open support) it got more unpleasant than simple "confuse
>>> the hell out of userland".  ->open() isn't the only vector for injection of
>>> such crap - ->permission() would also serve, same for several LSM turds, etc.
>>>
>>> Again, that's a separate problem.  What Dmitry seems to be catching is getting
>>> crap values fed to should_follow_link() as inode.  I see one bug in that
>>> area that does need fixing (fix present in the last patch I've posted, with
>>> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
>>> be gone from the final variant, since this can trigger without driver bugs,
>>> etc.)  In RCU mode after we'd checked that dentry is a symlink one, we need
>>> to verify that it hadn't been changed since before we'd fetched the inode.
>>> It might have been e.g. a regular file, which got unlinked with symlink
>>> created in its place.  Then we'd go into get_link() with non-symlink inode
>>> and oops on attempt to call its ->i_op->get_link().  That race is real, very
>>> hard to hit (you need both the unlink(2) and symlink(2) to happen between
>>> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
>>> can't lose the timeslice there) and would've manifested differently.
>>>
>>> But that leaves other two kinds of bugs getting triggered: inode of some
>>> non-symlink is possible, but what we saw included NULL inode when we'd
>>> reached finish_open: in do_last().  Should be flat-out impossible - we either
>>> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
>>> get NULL inode from pinned d_is_symlink() dentry after having grabbed
>>> a reference and left RCU mode.  Neither should be possible without either
>>> something weird happening to lookup_fast() (and we would've seen oopsen in
>>> link_path_walk() if that could happen, BTW) or screwed dentry refcounting
>>> somewhere, combined with a race that managed to turn...
>>>
>>> Oh, shit.  No screwed refcounting is needed.  Look:
>>>         BUG_ON(nd->flags & LOOKUP_RCU);
>>>         inode = d_backing_inode(path.dentry);
>>>         seq = 0;        /* out of RCU mode, so the value doesn't matter */
>>>         if (unlikely(d_is_negative(path.dentry))) {
>>>                 path_to_nameidata(&path, nd);
>>>                 return -ENOENT;
>>>         }
>>> Suppose we come here with negative path.dentry.  We are holding a reference,
>>> all right, and for a _postive_ dentry that would've been enough to keep
>>> it positive.  Not so for a negative one, though - symlink(2) on another
>>> CPU doint d_instantiate() just before the d_is_negative() check and we
>>> are fucked - inode is stale and we sail through all the checks, all the
>>> way into should_follow_link().
>>>
>>> We also have the same kind of crap in walk_component() -
>>>                 err = lookup_slow(nd, &path);
>>>                 if (err < 0)
>>>                         return err;
>>>                 inode = d_backing_inode(path.dentry);
>>>                 seq = 0;        /* we are already out of RCU mode */
>>>                 err = -ENOENT;
>>>                 if (d_is_negative(path.dentry))
>>>                         goto out_path_put;
>>> There it's much harder to hit, though - we need it not just d_instantiate()
>>> overlapping those lines; we need the racing syscall to get from locking
>>> the parent to d_instantiate() between the point where lookup_slow() has
>>> unlocked the parent and d_is_negative().  And lookup_slow() couldn't have
>>> gone into mountpoint crossing, so it's really hard to hit - you pretty
>>> much have to get preempted just after fetching inode.
>>>
>>> OK, the next delta to try, and there definitely are several commits in
>>> that pile.  It still does not explain the traces with GPF at 0x50, though -
>>> for that we need not just a NULL getting to should_follow_link() but
>>> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
>>> struct inode).  That something *can't* be a valid struct inode or had been
>>> one in recent past - ->i_sb is assigned in new_inode(), value is always
>>> non-NULL and never modified all the way until RCU-delayed freeing of struct
>>> inode.  So that has to be something entirely different...  Anyway, the
>>> patch so far follows:
Fired after some time:
[ 3491.147607] ------------[ cut here ]------------
[ 3491.147986] WARNING: CPU: 1 PID: 17000 at fs/namei.c:1728
path_openat+0x14e5/0x1560()
[ 3491.148576] Modules linked in:
[ 3491.148818] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G
W       4.5.0-rc5+ #72
[ 3491.149530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.149530]  0000000000000000 ffff88002d6ebcb8 ffffffff8194ad79
0000000000000000
[ 3491.152103]  ffffffff8334501c ffff88002d6ebcf0 ffffffff81172291
0000000000000000
[ 3491.152103]  ffff88002d6ebd98 ffff88002d6ebde0 0000000000048000
ffff88002d6ebefc
[ 3491.155310] Call Trace:
[ 3491.155310]  [<ffffffff8194ad79>] dump_stack+0x99/0xd0
[ 3491.155310]  [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0
[ 3491.157483]  [<ffffffff81172385>] warn_slowpath_null+0x15/0x20
[ 3491.157483]  [<ffffffff81318495>] path_openat+0x14e5/0x1560
[ 3491.157483]  [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.157483]  [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.157483]  [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.157483]  [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.157483]  [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.157483]  [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.161521] ---[ end trace c65af9be7536efae ]---
[ 3491.161838] BUG: unable to handle kernel NULL pointer dereference
at 000000000000000c
[ 3491.162339] IP: [<ffffffff81327429>] atime_needs_update+0x9/0xc0
[ 3491.162577] PGD 2d626067 PUD 2c4cc067 PMD 0
[ 3491.163036] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 3491.163039] Modules linked in:
[ 3491.163039] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G
W       4.5.0-rc5+ #72
[ 3491.163039] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.164386] task: ffff88002c9196c0 ti: ffff88002d6e8000 task.ti:
ffff88002d6e8000
[ 3491.164386] RIP: 0010:[<ffffffff81327429>]  [<ffffffff81327429>]
atime_needs_update+0x9/0xc0
[ 3491.164386] RSP: 0018:ffff88002d6ebcb0  EFLAGS: 00010282
[ 3491.165591] RAX: 0000000000000030 RBX: ffff88002d6ebde0 RCX: 0000000000000002
[ 3491.165591] RDX: ffff88002d6ebe38 RSI: 0000000000000000 RDI: ffff88002d6ebe38
[ 3491.165591] RBP: ffff88002d6ebcc0 R08: ffff88002d6ebe38 R09: 0000000000000001
[ 3491.167202] R10: 0000000000000001 R11: 0000000000001816 R12: 0000000000000000
[ 3491.167368] R13: ffff88002d6ebe38 R14: ffff880072508980 R15: ffff88002d6ebefc
[ 3491.167368] FS:  00007f02454e5700(0000) GS:ffff88003ed00000(0000)
knlGS:0000000000000000
[ 3491.167368] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3491.167368] CR2: 000000000000000c CR3: 000000002cb5f000 CR4: 00000000000006e0
[ 3491.167368] Stack:
[ 3491.169740]  0000000000000000 ffff88002d6ebde0 ffff88002d6ebd00
ffffffff81314922
[ 3491.169740]  0000000000000001 0000000000000000 ffff88002d6ebd98
ffff88002d6ebde0
[ 3491.169740]  0000000000048000 ffff88002d6ebefc ffff88002d6ebdd0
ffffffff8131729d
[ 3491.169740] Call Trace:
[ 3491.169740]  [<ffffffff81314922>] trailing_symlink+0x62/0x260
[ 3491.169740]  [<ffffffff8131729d>] path_openat+0x2ed/0x1560
[ 3491.169740]  [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.169740]  [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.169740]  [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.169740]  [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.169740]  [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.169740]  [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.169740] Code: ff ff ff 48 85 c0 48 89 c3 74 08 48 89 c7 e8 ef
dc ff ff 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
48 83 ec 08 <f6> 46 0c 02 48 8b 1f 75 6b 48 8b 7e 28 48 8b 47 50 a9 01
04 00
[ 3491.169740] RIP  [<ffffffff81327429>] atime_needs_update+0x9/0xc0
[ 3491.169740]  RSP <ffff88002d6ebcb0>
[ 3491.169740] CR2: 000000000000000c
[ 3491.195051] ---[ end trace c65af9be7536efaf ]---
[ 3491.195051] BUG: sleeping function called from invalid context at
include/linux/sched.h:2795
[ 3491.195051] in_atomic(): 1, irqs_disabled(): 1, pid: 17000, name:
syz-executor
[ 3491.195051] INFO: lockdep is turned off.
[ 3491.195051] irq event stamp: 622
[ 3491.195051] hardirqs last  enabled at (621): [<ffffffff811d9926>]
vprintk_emit+0x2d6/0x5f0
[ 3491.195051] hardirqs last disabled at (622): [<ffffffff82b14d49>]
error_entry+0x69/0xc0
[ 3491.195051] softirqs last  enabled at (618): [<ffffffff81178172>]
__do_softirq+0x222/0x4a0
[ 3491.195051] softirqs last disabled at (595): [<ffffffff81178767>]
irq_exit+0xa7/0xc0
[ 3491.195051] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G      D
W       4.5.0-rc5+ #72
[ 3491.195051] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.195051]  0000000000000000 ffff88002d6eb9a8 ffffffff8194ad79
ffff88002c9196c0
[ 3491.195051]  0000000000004268 ffff88002d6eb9d0 ffffffff811a0659
ffffffff8329b7fb
[ 3491.195051]  0000000000000aeb 0000000000000000 ffff88002d6eb9f8
ffffffff811a0764
[ 3491.195051] Call Trace:
[ 3491.195051]  [<ffffffff8194ad79>] dump_stack+0x99/0xd0
[ 3491.195051]  [<ffffffff811a0659>] ___might_sleep+0x179/0x240
[ 3491.195051]  [<ffffffff811a0764>] __might_sleep+0x44/0x80
[ 3491.195051]  [<ffffffff811842cf>] exit_signals+0x1f/0x130
[ 3491.195051]  [<ffffffff811756ef>] do_exit+0xbf/0xd10
[ 3491.195051]  [<ffffffff811da524>] ? kmsg_dump+0x104/0x180
[ 3491.195051]  [<ffffffff8108549f>] oops_end+0x9f/0xe0
[ 3491.195051]  [<ffffffff810ce6e8>] no_context+0x108/0x390
[ 3491.195051]  [<ffffffff810cea8d>] __bad_area_nosemaphore+0x11d/0x220
[ 3491.195051]  [<ffffffff810ceb9e>] bad_area_nosemaphore+0xe/0x10
[ 3491.195051]  [<ffffffff810cf2c4>] __do_page_fault+0x84/0x470
[ 3491.195051]  [<ffffffff810cf764>] trace_do_page_fault+0x74/0x2c0
[ 3491.195051]  [<ffffffff810c9ba4>] do_async_page_fault+0x14/0x90
[ 3491.195051]  [<ffffffff82b14b78>] async_page_fault+0x28/0x30
[ 3491.195051]  [<ffffffff81327429>] ? atime_needs_update+0x9/0xc0
[ 3491.195051]  [<ffffffff81314922>] trailing_symlink+0x62/0x260
[ 3491.195051]  [<ffffffff8131729d>] path_openat+0x2ed/0x1560
[ 3491.195051]  [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.195051]  [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.195051]  [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.195051]  [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.195051]  [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.195051]  [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.284831] note: syz-executor[17000] exited with preempt_count 1
The warning is this one:
static inline int should_follow_link(struct nameidata *nd, struct path *link,
                                     int follow,
                                     struct inode *inode, unsigned seq)
{
....
        WARN_ON(!inode);                // now, _that_ should not happen.
        return pick_link(nd, link, inode, seq);
}
Powered by blists - more mailing lists
 
