linux-kernel - Re: Top 10 kernel oopses for the week ending January 5th, 2008

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080108055917.GZ27894@ZenIV.linux.org.uk>
Date:	Tue, 8 Jan 2008 05:59:17 +0000
From:	Al Viro <viro@...IV.linux.org.uk>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Kevin Winchester <kjwinchester@...il.com>,
	"J. Bruce Fields" <bfields@...ldses.org>,
	Arjan van de Ven <arjan@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	NetDev <netdev@...r.kernel.org>, gregkh@...e.de, neilb@...e.de
Subject: Re: Top 10 kernel oopses for the week ending January 5th, 2008

On Mon, Jan 07, 2008 at 07:26:12PM -0800, Linus Torvalds wrote:
 
> I usually just compile a small program like
> 
> 	const char array[]="\xnn\xnn\xnn...";
> 
> 	int main(int argc, char **argv)
> 	{
> 		printf("%p\n", array);
> 		*(int *)0=0;
> 	}
Heh.  I prefer
char main[] = {.....};
for the same thing, with gdb a.out and no running at all.

FWIW, I'm going to go through Arjan's collection and post blow-by-blow
analysis of some of those suckers.  Tonight, probably...

Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618

RIP: 0010:[<ffffffff803b49a1>]  [<ffffffff803b49a1>] kref_put+0x31/0x80
RSP: 0000:ffff81007ffe5df0  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000034333545 RCX: ffffffff80606270
RDX: 0000000000000040 RSI: ffffffff803b38b0 RDI: 0000000034333545
RBP: ffff81007ffe5e00 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff8094c430 R11: 0000000000000000 R12: ffffffff803b38b0
R13: ffff81011ede44d8 R14: ffffffff804d7d50 R15: ffff81011ff210f0
FS:  0000000002024870(0000) GS:ffff81011ff0dd00(0000)
...
Call Trace:
 [<ffffffff803b37e9>] kobject_put+0x19/0x20
 [<ffffffff803b389b>] kobject_del+0x2b/0x40
 [<ffffffff804d7d50>] delayed_delete+0x0/0xb0
 [<ffffffff804d7db9>] delayed_delete+0x69/0xb0
 [<ffffffff80249775>] run_workqueue+0x175/0x210
 [<ffffffff8024a411>] worker_thread+0x71/0xb0
 [<ffffffff8024d9e0>] autoremove_wake_function+0x0/0x40
 [<ffffffff8024a3a0>] worker_thread+0x0/0xb0
 [<ffffffff8024d5fd>] kthread+0x4d/0x80
 [<ffffffff8020c4b8>] child_rip+0xa/0x12
 [<ffffffff8020bbcf>] restore_args+0x0/0x30
 [<ffffffff8024d5b0>] kthread+0x0/0x80
 [<ffffffff8020c4ae>] child_rip+0x0/0x12

Code: f0 ff 0b 0f 94 c0 31 d2 84 c0 74 0b 48 89 df 41 ff d4 ba 01

What do we have here?  It barfs in kref_put() called from kobject_put().
It's -rc6-mm1 and I don't have -mm at hand.  Let's see if we can make
any sense out of it from the mainline - it might be a good idea for the
first pass, unless there are some clear indications to the contrary.

kref_put() is fairly low-level (and deep in call chain, here).  And it's
pretty small:
int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
        WARN_ON(release == NULL);
        WARN_ON(release == (void (*)(struct kref *))kfree);

        if (atomic_dec_and_test(&kref->refcount)) {
                release(kref);
                return 1;
        }
        return 0;
}

Poking around on the site (I'm not familiar enough with it, so bear with
me) gives a link to posting and an important detail missed in that page:

<1>Unable to handle kernel paging request at 0000000034333545 RIP:
[<ffffffff803b49a1>] kref_put+0x31/0x80

Now, that's interesting - we barf on dereferencing a pointer that (a) has
upper 32 bits zero and (b) doesn't have a bit 7 set in any byte.  Smells
like ASCII data misinterpreted as a pointer.  Conversion to ASCII gives
"E534", which sure as hell does look like a fragment of some string.

OK, so where does that happen?  In this case we have only one candidate,
really - atomic_dec_and_test(&kref->refcount).  Before it we only compare
argument with constants, after it we pass argument to another function.

Now, looking at the registers (see above) we notice that this address had
come from rbx.  Let's try objdump -d lib/kref.o and see what we've got there
for kref_put():

  4a:   55                      push   %rbp
  4b:   48 85 f6                test   %rsi,%rsi
  4e:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  55:   ba 36 00 00 00          mov    $0x36,%edx
  5a:   48 89 e5                mov    %rsp,%rbp
  5d:   41 54                   push   %r12
  5f:   49 89 f4                mov    %rsi,%r12
  62:   53                      push   %rbx
  63:   48 89 fb                mov    %rdi,%rbx
  66:   74 15                   je     7d <kref_put+0x33>
  68:   48 81 fe 00 00 00 00    cmp    $0x0,%rsi
  6f:   75 26                   jne    97 <kref_put+0x4d>
  71:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  78:   ba 37 00 00 00          mov    $0x37,%edx
  7d:   48 c7 c6 00 00 00 00    mov    $0x0,%rsi
  84:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  8b:   31 c0                   xor    %eax,%eax
  8d:   e8 00 00 00 00          callq  92 <kref_put+0x48>
  92:   e8 00 00 00 00          callq  97 <kref_put+0x4d>
  97:   f0 ff 0b                lock decl (%rbx)
  9a:   0f 94 c0                sete   %al
  9d:   31 d2                   xor    %edx,%edx
  9f:   84 c0                   test   %al,%al
  a1:   74 0b                   je     ae <kref_put+0x64>
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
  a9:   ba 01 00 00 00          mov    $0x1,%edx
  ae:   5b                      pop    %rbx
  af:   41 5c                   pop    %r12
  b1:   c9                      leaveq 
  b2:   89 d0                   mov    %edx,%eax
  b4:   c3                      retq   

It's not necessary identical to what that kernel had; still, not bad for
a starting point.  We are even lucky enough to find 'f0 ff' immediately
in there:
  97:   f0 ff 0b                lock decl (%rbx)
Next instructions also match - actually, better than I expected.  So.
We even have register allocation matching the reported kernel and it
all makes sense - this is the instruction where it had puked, the address
had, indeed, come from rbx and it's locked decrement of *rbx followed
by some testing and conditional jump.  Just what one would expect
from atomic_dec_and_test().

IOW, we have &kref->refcount equal to 0x0000000034333545.  What's the
value of kref itself?  We could look into definition of struct kref,
or just notice that release(kref) should be right after that, so
we could see what gets passed to it.
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
is clearly it.  So what we are passing is rbx itself, so that offset got to
be zero.

All right, so we have kref_put() getting 0x0000000034333545 instead of a
pointer in the first argument.  It's called from kobject_put() and unless
-mm has changes in there, there's no need to guess where in kobject_put()
that had been:
void kobject_put(struct kobject * kobj)
{
        if (kobj)
                kref_put(&kobj->kref, kobject_release);
}

Aha.  Now, we need to work back from &kobj->kref to kobj.  Again, assuming
that -mm doesn't change struct kobject in a way that would affect the
offset, we have
struct kobject {
        const char              * k_name;
        struct kref             kref;
        struct list_head        entry;
...
Looks like it's not too likely, unless somebody had been deliberately
rearranging fields (to fit cachelines better, etc.).  We'll need to
verify that, of course, but for now it's a good starting assumption.
Very well, we have one pointer in front of it.  It's an amd64, so
it's an 8 byte field and no alignment padding is to be expected.
IOW, kobj is 0x0000000034333545 - 8, i.e. 0x000000003433353D.  What's _that_
in ASCII?  "=534".  OK, that makes even more sense for a part of some
string...

Let's check; time to take a look at that patch, after all
 struct kobject {
-       const char              * k_name;
+       const char              *name;
        struct kref             kref;

OK, not a problem.  While we are at it, kobject_put() is unchanged by
the patch.  Now, back to trace.  kobject_put() is called from kobject_del().
Now, that one does differ from mainline:

void kobject_del(struct kobject * kobj)
{
	if (!kobj)
		return;

	sysfs_remove_dir(kobj);
	kobj->state_in_sysfs = 0;
	kobj_kset_leave(kobj);
	kobject_put(kobj->parent);
	kobj->parent = NULL;
}

Humm...  So we have kobj->parent containing crap.  What about the caller?
It's from drivers/md/md.c:
static void delayed_delete(struct work_struct *ws)
{
        mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
        kobject_del(&rdev->kobj);
}
and it's only used in
static void unbind_rdev_from_array(mdk_rdev_t * rdev)
{
        char b[BDEVNAME_SIZE];
        if (!rdev->mddev) {
                MD_BUG();
                return;
        }
        bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
        list_del_init(&rdev->same_set);
        printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
        rdev->mddev = NULL;
        sysfs_remove_link(&rdev->kobj, "block");

        /* We need to delay this, otherwise we can deadlock when
         * writing to 'remove' to "dev/state"
         */
        INIT_WORK(&rdev->del_work, delayed_delete);
        schedule_work(&rdev->del_work);
}

Well, that takes care of the rest of trace - we have used INIT_WORK to set
rdev->del_work up, scheduled it for execution and eventually the callback
had been called (asynchronously to us).

So what do we have so far?
	unbind_rdev_from_array(rdev);
had been called and rdev->kobj.parent turned to contain a crap value
(0x000000003433353D) instead of a pointer.  That's about all we can
get out of trace.

Now, let's see what could have triggered it.  Curious... Looking through
diff shows an interesting bit:
-       rdev->kobj.parent = &mddev->kobj;
-       if ((err = kobject_add(&rdev->kobj)))
+       if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
                goto fail;
in bind_rdev_to_array().  At the first sight the changes in kobject_add() 
seem to match that.  And nothing else in md.c seems to be setting it to
anything non-NULL.  Very well, so it's one of the following:

	* unbind_rdev_from_array() called on rdev that didn't pass through
bind_rdev_to_array().
	* unbind_rdev_from_array() called on rdev that bailed out from
bind_rdev_to_array() before that point.
	* mddev value in the above is crap.  That's bloody unlikely, BTW -
kobject_add() would increment the refcount of rdev->kobj.parent (or we would
be in far more trouble, since it would not match kobject_del() and _that_
would hurt a lot more than just md.c).  So &mddev->kobj would better be
something saner when it went through that point.
	* *rdev got corrupted in between.

Actually, looking at the callers of unbind_rdev_from_array()...  We always
follow it with export_rdev().  Which does (presumably) final kobject_put()
on &rdev->kobj, freeing rdev itself.

What guarantees that it doesn't happen before we get to callback?  AFAICS,
nothing whatsoever...

And if it does happen, we'll get rdev happily freed (by rdev_free(), as
->release() of &rdev->kobj) by the time we get to delayed_delete().  Which
explains what's going on just fine.

BTW, -mm changes in kobject.c explain WTF it doesn't trigger in mainline -
there we managed to get away with that since kobject_add() bumped refcount
of kobject by one and kobject_del() decremented it.  That masked the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/