linux-kernel - Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6f5b9d18-2d04-495c-970c-eb5eada5f676@redhat.com>
Date: Mon, 25 Mar 2024 17:43:23 +0100
From: David Hildenbrand <david@...hat.com>
To: Xiubo Li <xiubli@...hat.com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org,
 Ceph Development <ceph-devel@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort

On 25.03.24 13:06, Xiubo Li wrote:
> 
> On 3/25/24 18:14, David Hildenbrand wrote:
>> On 25.03.24 08:45, Xiubo Li wrote:
>>> Hi guys,
>>>
>>> We are hitting the same crash frequently recently with the latest kernel
>>> when testing kceph, and the call trace will be something likes:
>>>
>>> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
>>> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
>>> [ 1580.045866] ------------[ cut here ]------------^M
>>> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
>>> ^M
>>> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
>>> Oops: (null)^M
>>> due to oops @ 0xffffffff8138cabd^M
>>> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
>>> RIP: 0010:usercopy_abort+0x6d/0x80^M
>>> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
>>> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
>>> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
>>> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
>>> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
>>> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
>>> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
>>> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
>>> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
>>> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
>>> knlGS:0000000000000000^M
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
>>> Call Trace:^M
>>>     <TASK>^M
>>>     ? kdb_main_loop+0x32c/0xa10^M
>>>     ? kdb_stub+0x216/0x420^M
>>> more>
>>>
>>> You can see more detail in ceph tracker
>>> https://tracker.ceph.com/issues/64471.
>>
>> Where is the full backtrace? Above contains only the backtrace of kdb.
>>
> Hi David,
> 
> The bad news is that there is no more backtrace. All the failures we hit
> are similar with the following logs:
> 

That's unfortunate :/

"exposure" in the message means we are in copy_to_user().

SLUB object 'kmalloc-192' means that we come from __check_heap_object() 
.. we have 192 bytes, but the length we want to access is 499712 ... 
488 KiB.

So we ended  up somehow in

__copy_to_user()->check_object_size()->__check_object_size()->
check_heap_object()->__check_heap_object()->usercopy_abort()


.. but the big question is which code tried to copy way too much memory 
out of a slab folio to user space.

> 
>> That link also contains:
>>
>> Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5
>> Oops: (null)^M
>> due to oops @ 0xfffffffface3a1d2^M
>> CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted
>> 5.14.0-421.el9.x86_64 #1^M
>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
>> RIP: 0010:usercopy_abort+0x74/0x76^M
>> Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7
>> c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b
>> 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
>> RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
>> RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
>> RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
>> RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
>> R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
>> R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
>> FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000)
>> knlGS:0000000000000000^M
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>> CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
>> Call Trace:^M
>>   <TASK>^M
>>   ? show_trace_log_lvl+0x1c4/0x2df^M


.. are we stuck in show_trace_log_lvl(), probably deadlocked not being 
able to print the actuall callstack? If so, that's nasty.

-- 
Cheers,

David / dhildenb