linux-kernel - Re: Hang on large copy_from_user with PREEMPT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55229D7E.6080309@oracle.com>
Date:	Mon, 06 Apr 2015 10:51:42 -0400
From:	Sasha Levin <sasha.levin@...cle.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	LKML <linux-kernel@...r.kernel.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Dave Jones <davej@...hat.com>, Michal Hocko <mhocko@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On 04/06/2015 07:24 AM, Borislav Petkov wrote:
> On Sun, Apr 05, 2015 at 11:59:52PM -0400, Sasha Levin wrote:
>> Hi all,
>>
>> I'm seeing an interesting hang when trinity is trying to load a large module, where
>> the size passed by userspace is larger than the amount of memory actually allocated
>> in userspace and passed in the 'from' parameter:
>>
>> [ 1549.080032] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [trinity-c11:12699]
>> [ 1549.080032] Modules linked in:
>> [ 1549.080032] hardirqs last enabled at (4202427): restore_args (arch/x86/kernel/entry_64.S:750)
>> [ 1549.080032] hardirqs last disabled at (4202428): apic_timer_interrupt (arch/x86/kernel/entry_64.S:890)
>> [ 1549.080032] softirqs last enabled at (4202426): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:300)
>> [ 1549.080032] softirqs last disabled at (4202421): irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [ 1549.080032] CPU: 11 PID: 12699 Comm: trinity-c11 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2125
>> [ 1549.080032] task: ffff880260f30000 ti: ffff88025ebe8000 task.ti: ffff88025ebe8000
>> [ 1549.080032] RIP: copy_user_handle_tail (arch/x86/lib/usercopy_64.c:85)
>> [ 1549.080032] RSP: 0000:ffff88025ebefe38  EFLAGS: 00010202
>> [ 1549.080032] RAX: 00000000356cb494 RBX: 1ffff1004bd7dfcb RCX: 0000000000000000
> 
> Wow, that's some serious len in %rax. Almost a gigabyte AFAICT.

Yup, you can load (or, the kernel would allow you to attempt and load) a module of up to 4GB in size.

>> [ 1549.080032] RDX: 000000009277d652 RSI: ffffc90078bba001 RDI: ffffc90078bba000
> 
> Btw, what's happening to that user pointer in %rdi, is it kosher?

Probably not. Trinity passes random addresses into that syscall, so it's probably one of it's magic
values.

> RSI: ffffc90078bba001
> RDI: ffffc90078bba000
> 
> Hohumm, @from and @to overlap by a byte... Interesting...
> 
>> [ 1549.080032] RBP: ffff88025ebefe38 R08: 0000000000000000 R09: 0000000002f75001
>> [ 1549.080032] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025ebefe78
>> [ 1549.080032] R13: ffff88025ebeff18 R14: 0000000094949494 R15: 0000000000da91be
>> [ 1549.080032] FS:  00007f1f669df700(0000) GS:ffff8803f2800000(0000) knlGS:0000000000000000
>> [ 1549.080032] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 1549.080032] CR2: 0000000002f75000 CR3: 000000025d031000 CR4: 00000000000007a0
>> [ 1549.080032] Stack:
>> [ 1549.080032]  ffff88025ebeff48 ffffffff963c5ef0 ffffc9001993c000 0000000000da91be
>> [ 1549.080032]  0000000041b58ab3 ffffffffa32918f3 ffffffff963c5da0 0000000000000286
>> [ 1549.080032]  ffffc9001993c000 0000000094949494 0000000000000000 0000000000000000
>> [ 1549.080032] Call Trace:
>> [ 1549.080032] SyS_init_module (kernel/module.c:2505 kernel/module.c:3401 kernel/module.c:3388)
>> [ 1549.080032] ? load_module (kernel/module.c:3388)
>> [ 1549.080032] ia32_do_call (arch/x86/ia32/ia32entry.S:501)
>> [ 1549.080032] Code: d0 66 66 90 8a 06 66 66 90 45 85 c0 74 d3 85 c9 74 23 89 d0 45 31 c0 eb 08 83 e8 01 48 89 f7 74 14 48 8d 77 01 44 89 c1 66 66 90 <c6> 07 00 66 66 90 85 c9 74 e4 89 d0 66 66 90 5d c3 0f 1f 44 00
>> All code
>> ========
>>    0:   d0 66 66                shlb   0x66(%rsi)
>>    3:   90                      nop
>>    4:   8a 06                   mov    (%rsi),%al
>>    6:   66 66 90                data32 xchg %ax,%ax
>>    9:   45 85 c0                test   %r8d,%r8d
>>    c:   74 d3                   je     0xffffffffffffffe1
>>    e:   85 c9                   test   %ecx,%ecx
>>   10:   74 23                   je     0x35
>>   12:   89 d0                   mov    %edx,%eax
>>   14:   45 31 c0                xor    %r8d,%r8d
>>   17:   eb 08                   jmp    0x21
>>   19:   83 e8 01                sub    $0x1,%eax
>>   1c:   48 89 f7                mov    %rsi,%rdi
>>   1f:   74 14                   je     0x35
>>   21:   48 8d 77 01             lea    0x1(%rdi),%rsi
>>   25:   44 89 c1                mov    %r8d,%ecx
>>   28:   66 66 90                data32 xchg %ax,%ax
>>   2b:*  c6 07 00                movb   $0x0,(%rdi)              <-- trapping instruction
>>   2e:   66 66 90                data32 xchg %ax,%ax
>>   31:   85 c9                   test   %ecx,%ecx
>>   33:   74 e4                   je     0x19
>>   35:   89 d0                   mov    %edx,%eax
>>   37:   66 66 90                data32 xchg %ax,%ax
>>   3a:   5d                      pop    %rbp
>>   3b:   c3                      retq
>>   3c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> 
> Hmm, lotsa alternatives patching around that area even though I can't
> see anything wrong from looking at your dump. The "66 66 90" nops are
> the STAC/CLAC things optimized to NOPs which are asm volatile within the
> __put_user_asm()'s own asm volatile. And I thought the labels might be
> fudged but my usercopy_64.s version here looks ok.
> 
> Can you boot that box with "debug-alternative" and send me dmesg? Also,
> vmlinux too please. Privately's fine too.

I've attached dmesg with debug-alternative, and my .config. vmlinux is pretty
large (1GB compressed) so hopefully the config would be enough - let me know
if it isn't.

>> Code starting with the faulting instruction
>> ===========================================
>>    0:   c6 07 00                movb   $0x0,(%rdi)
>>    3:   66 66 90                data32 xchg %ax,%ax
>>    6:   85 c9                   test   %ecx,%ecx
>>    8:   74 e4                   je     0xffffffffffffffee
>>    a:   89 d0                   mov    %edx,%eax
>>    c:   66 66 90                data32 xchg %ax,%ax
>>    f:   5d                      pop    %rbp
>>   10:   c3                      retq
>>   11:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>
>>
>> This is the result of getting copy_user_handle_tail to zero out a large block of
>> kernel memory very inefficiently:
>>
>>         for (c = 0, zero_len = len; zerorest && zero_len; --zero_len)
> 
> Btw, that zerorest is being tested on every loop iteration! AFAICT,
> 
> 	if (!zerorest) {
> 		clac();
> 		return len;
> 	}
> 
> before the loop should be nicer. Or am I missing something?

Looks good to me. I suspect the compiler optimizes that anyway though.


Thanks,
Sasha


View attachment "sasha-config" of type "text/plain" (163487 bytes)

Download attachment "dmesg.gz" of type "application/gzip" (121221 bytes)