linux-ext4 - Re: ext4 crash in 4.4.10

Message-ID: <20160706102224.GH14067@quack2.suse.cz>
Date:	Wed, 6 Jul 2016 12:22:24 +0200
From:	Jan Kara <jack@...e.cz>
To:	Nikolay Borisov <kernel@...p.com>
Cc:	Jan Kara <jack@...e.cz>, linux-ext4 <linux-ext4@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>, Jan Kara <jack@...e.com>,
	SiteGround Operations <operations@...eground.com>
Subject: Re: ext4 crash in 4.4.10

On Mon 04-07-16 11:49:27, Nikolay Borisov wrote:
> Hello again Jan, 
> 
> On 06/03/2016 12:19 PM, Jan Kara wrote:
> > Hi,
> > 
> > On Fri 03-06-16 11:28:31, Nikolay Borisov wrote:
> >> Recently the following crash was brought to my attention:
> >>
> [SNIP]
> > 
> > Hum, this looks most likely like a memory corruption. The value
> > ffffffffd9c01f11 doesn't look like a valid pointer to any dynamically
> > allocated data (it is not aligned to a multiple of 4, it does not point to
> > data segment ffff88..........). It is close to a pointer to kernel code
> > (modules start at ffffffffa.......) so if it really points to some kernel
> > code it may be interesting to find out where. I have no clue how such a
> > number could get to ei->i_dquot[0]. Usually what I do in such cases is
> > search kernel memory to see whether something unusual points to that place,
> > whether previous struct members got corrupted as well, or whether
> > that value is also somewhere else in memory. But it's a search for a
> > needle in a haystack.
> > 
> > 								Honza
> 
> So I got this exact same crash on a different machine, 
> with the exact same value. This rules out random corruption: 
> 
> [2455521.848677] BUG: unable to handle kernel paging request at ffffffffd9c01fb1
> [2455521.849025] IP: [<ffffffff81204b62>] dquot_free_inode+0xa2/0x230
> [2455521.849315] PGD 1c0b067 PUD 1c0d067 PMD 0 
> [2455521.849720] Oops: 0000 [#1] SMP 
> [2455521.850062] Modules linked in: <OMITTED >
> [2455521.856549]  ipv6 [last unloaded: nf_conntrack_ftp]
> [2455521.856904] CPU: 8 PID: 2955 Comm: rm Tainted: G           O    4.4.10-clouder1 #73
> [2455521.857286] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015
> [2455521.857517] task: ffff883506658000 ti: ffff881d50198000 task.ti: ffff881d50198000
> [2455521.857898] RIP: 0010:[<ffffffff81204b62>]  [<ffffffff81204b62>] dquot_free_inode+0xa2/0x230
> [2455521.858353] RSP: 0018:ffff881d5019bc48  EFLAGS: 00010286
> [2455521.858581] RAX: ffffffffd9c01f11 RBX: ffff881d5019bc48 RCX: 000000000000fb20
> [2455521.858962] RDX: ffff881d5019bc58 RSI: ffff880996894680 RDI: ffffffff81c09540
> [2455521.859343] RBP: ffff881d5019bcc8 R08: 0000000000000001 R09: ffff881d5019bc58
> [2455521.859724] R10: ffff881d5019bca0 R11: 0000000100000000 R12: ffff880996894680
> [2455521.860105] R13: 0000000000000000 R14: 0000000000000008 R15: ffff881d5019be68
> [2455521.860486] FS:  00007f6ad2fe9700(0000) GS:ffff881fffb00000(0000) knlGS:0000000000000000
> [2455521.860868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2455521.861096] CR2: ffffffffd9c01fb1 CR3: 0000000151007000 CR4: 00000000001406e0
> [2455521.861476] Stack:
> [2455521.861696]  ffff881fa0388c00 ffff880996894368 0000000000000000 0000000000000000
> [2455521.862335]  0000000000000000 ffffffff8123949c ffff881d5019bd28 ffffffff812351c8
> [2455521.862972]  ffff881d5019bcb8 ffff883fb9a4d800 ffff881ff093a810 ffff883fb9a4d800
> [2455521.863611] Call Trace:
> [2455521.863838]  [<ffffffff8123949c>] ? ext4_evict_inode+0x26c/0x4c0
> [2455521.864069]  [<ffffffff812351c8>] ? ext4_mark_iloc_dirty+0x518/0x770
> [2455521.864304]  [<ffffffff812312e3>] ext4_free_inode+0x83/0x5a0
> [2455521.864534]  [<ffffffff8123949c>] ? ext4_evict_inode+0x26c/0x4c0
> [2455521.864765]  [<ffffffff8123673b>] ? ext4_mark_inode_dirty+0x7b/0x260
> [2455521.864999]  [<ffffffff812396e5>] ext4_evict_inode+0x4b5/0x4c0
> [2455521.865233]  [<ffffffff811ba616>] evict+0xc6/0x1c0
> [2455521.865466]  [<ffffffff811ba9dc>] iput+0x1ec/0x260
> [2455521.865696]  [<ffffffff811ab128>] ? vfs_unlink+0x128/0x130
> [2455521.865928]  [<ffffffff811ae766>] do_unlinkat+0x186/0x2c0
> [2455521.866158]  [<ffffffff811ae8e2>] SyS_unlinkat+0x22/0x40
> [2455521.866390]  [<ffffffff81635c57>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [2455521.866620] Code: 80 41 be 08 00 00 00 65 ff 0d cf 60 e0 7e e8 f6 0d 43 00 48 8d 53 10 4c 89 e6 4c 8d 55 d8 66 c7 02 00 00 48 8b 06 48 85 c0 74 61 <48> 8b 88 a0 00 00 00 4c 8d 80 a0 00 00 00 83 e1 08 0f 84 a5 00 
> [2455521.871376] RIP  [<ffffffff81204b62>] dquot_free_inode+0xa2/0x230
> [2455521.871674]  RSP <ffff881d5019bc48>
> [2455521.871897] CR2: ffffffffd9c01fb1
> 
> The crash again points to test_bit in info_idq_free.  I followed
> your advice to search for the address and here is what I got: 
> 
> crash> search -m ffffffff00000000 d9c01f11
> 
> ffff88000181e030: d9c01927d9c01f11 
> ffff880996894680: ffffffffd9c01f11 
> ffff881d5019b858: ffffffffd9c01f11 
> ffff881d5019b998: ffffffffd9c01f11 - <stack frame of crash_kexec>
> ffff881d5019bbe8: ffffffffd9c01f11 - <stack frame of page_fault>
> ffffffff8181e030: d9c01927d9c01f11
> 
> So two of the values are in the stack frames of the functions involved
> in the crash, so I'd say they are of no interest. What's interesting
> is that ffffffff8181e030 seems to be quota_magics: 
> 
> readelf -s vmlinux-4.4.10-clouder1 | grep ffffffff8181e030
> 15605: ffffffff8181e030    12 OBJECT  LOCAL  DEFAULT    4 quota_magics.24849
> 
> #define V2_INITQMAGICS {\
>         0xd9c01f11,     /* USRQUOTA */\
>         0xd9c01927,     /* GRPQUOTA */\
>         0xd9c03f14,     /* PRJQUOTA */\
> }
> 
> So it seems that somehow the USRQUOTA magic value overwrites
> the dquot pointer. Looking at the code, I'm not entirely 
> sure how this can happen though.
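
For anyone following the numbers, the arithmetic can be checked with a short sketch. It is written in Python purely for illustration (nothing in the thread uses it) and the function names are made up. The 0xa0 displacement is read off the faulting instruction bytes (48 8b 88 a0 00 00 00) in the Code: line. The sign-extension step only shows that the value in RAX has the same bit pattern as the 32-bit USRQUOTA magic widened to 64 bits; it makes no claim about how that value got into i_dquot[0].

```python
import struct

USRQUOTA_MAGIC = 0xd9c01f11   # from V2_INITQMAGICS quoted above
GRPQUOTA_MAGIC = 0xd9c01927   # from V2_INITQMAGICS quoted above
LOAD_DISPLACEMENT = 0xa0      # from "48 8b 88 a0 00 00 00" in the Code: line


def sign_extend_32_to_64(value32):
    """Widen a 32-bit pattern to 64 bits, copying bit 31 into the upper half."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:
        return value32 | 0xFFFFFFFF00000000
    return value32


def looks_like_slab_pointer(value):
    """Rough checks named in the thread: 4-byte aligned and in the
    ffff88.......... range. A heuristic only, not a validity test."""
    aligned = value % 4 == 0
    in_ffff88_range = (value >> 40) == 0xFFFF88
    return aligned and in_ffff88_range


# The value seen in RAX matches the sign-extended USRQUOTA magic.
bogus_pointer = sign_extend_32_to_64(USRQUOTA_MAGIC)

# Dereferencing it at offset 0xa0 gives the faulting address in CR2.
fault_address = bogus_pointer + LOAD_DISPLACEMENT

# Two adjacent little-endian 32-bit magics, read back as one 64-bit word,
# give the pattern that the search reported at ...181e030.
magics_as_u64 = struct.unpack(
    "<Q", struct.pack("<II", USRQUOTA_MAGIC, GRPQUOTA_MAGIC)
)[0]

print(hex(bogus_pointer))
print(hex(fault_address))
print(hex(magics_as_u64))
print(looks_like_slab_pointer(bogus_pointer))
print(looks_like_slab_pointer(0xffff880996894680))
```

The last two lines apply the same rough checks to the bogus value and to the address where it was found, which is the one passed to kmem -s below.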

This is indeed interesting. Can you dump the full struct ext4_inode_info of the
inode for which dquot_free_inode() was crashing? Command

kmem -s ffff880996894680

should show you that this address is part of an object in ext4_inode_cache
(please verify that) and give you pointer to the beginning of the object
which is ext4_inode... Thanks!
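
As a rough illustration of what "the beginning of the object" means here, the sketch below rounds an interior address down to the start of its containing object, assuming objects are laid out back to back at a fixed stride from the start of the slab. This is not how crash implements kmem -s, the function name is invented, and the numbers are toy values chosen for readability rather than anything taken from this dump.

```python
def containing_object(addr, slab_base, object_size):
    """Return (object_start, offset_within_object) for an interior address.

    Assumes objects are packed contiguously from slab_base with a fixed
    stride of object_size bytes (an illustrative simplification).
    """
    if addr < slab_base:
        raise ValueError("address lies before the slab")
    index, offset = divmod(addr - slab_base, object_size)
    return slab_base + index * object_size, offset


# Toy values only: a slab at 0x1000 holding 0x100-byte objects.
start, offset = containing_object(0x1234, slab_base=0x1000, object_size=0x100)
print(hex(start), hex(offset))
```

The offset is what tells you which struct member the interior address corresponds to once the object's type is known.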

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR