Message-ID: <CAHk-=wi8U1PjW_L6Ng9_A80L_1keyEOKud3PVh-8bwPL9W0CNg@mail.gmail.com>
Date: Sun, 28 Apr 2024 13:22:26 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Hillf Danton <hdanton@...a.com>
Cc: syzbot <syzbot+83e7f982ca045ab4405c@...kaller.appspotmail.com>, 
	andrii@...nel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org, 
	syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [bpf?] [trace?] possible deadlock in force_sig_info_to_task

On Sun, 28 Apr 2024 at 13:01, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> The *problem* here is that the page fault doesn't actually happen on a
> user access, it happens on the *ret* instruction in
> rep_movs_alternative itself (which doesn't have an exception fixup,
> obviously, because no exception is supposed to happen there!):

Actually, there's another page fault deeper in that call chain:

   asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
  RIP: 0010:__put_user_handle_exception+0x0/0x10 arch/x86/lib/putuser.S:125
  Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 01 cb 48 89 01 31 c9 0f 01 ca c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 <0f> 01 ca b9 f2 ff ff ff c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90
  RSP: 0000:ffffc90004137d98 EFLAGS: 00050202
  RAX: 00000000662d5943 RBX: 0000000000000000 RCX: 0000000000000019
  RDX: 0000000000000000 RSI: ffffffff8bcaca20 RDI: ffffffff8c1eaba0
  RBP: ffffc90004137e50 R08: ffffffff8fa7cd6f R09: 1ffffffff1f4f9ad
  R10: dffffc0000000000 R11: fffffbfff1f4f9ae R12: ffffc90004137de0
  R13: dffffc0000000000 R14: 1ffff92000826fb8 R15: 0000000000000019
   __do_sys_gettimeofday kernel/time/time.c:147 [inline]
   __se_sys_gettimeofday+0xd9/0x240 kernel/time/time.c:140

which is also nonsensical, since that "<0f> 01 ca" code is just the
"CLAC" instruction (which is the first instruction of
__put_user_handle_exception, which is the exception fixup for the
__put_user() functions).
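
For anyone who wants to double-check that reading of the "Code:" line, here is a minimal user-space sketch. It is not kernel code, and the helper names parse_code_line() and smap_insn_at() are made up for illustration. It relies only on the oops convention that the byte in angle brackets is the one at the reported RIP, and on the x86 encodings 0f 01 ca (CLAC) and 0f 01 cb (STAC).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an oops "Code:" hex dump into bytes.
 * Returns the number of bytes parsed, or -1 on malformed input.
 * *marked receives the index of the byte written as <xx> (the byte
 * at the reported RIP), or -1 if no byte is marked.
 */
static int parse_code_line(const char *s, unsigned char *out, size_t cap,
                           int *marked)
{
    size_t n = 0;

    assert(s && out && marked);
    *marked = -1;
    while (*s) {
        if (isspace((unsigned char)*s) || *s == '>') {
            s++;
            continue;
        }
        if (*s == '<') {
            *marked = (int)n;   /* next byte parsed is the marked one */
            s++;
            continue;
        }
        if (!isxdigit((unsigned char)s[0]) ||
            !isxdigit((unsigned char)s[1]) || n >= cap)
            return -1;
        char hex[3] = { s[0], s[1], '\0' };
        out[n++] = (unsigned char)strtoul(hex, NULL, 16);
        s += 2;
    }
    return (int)n;
}

/* Hypothetical helper: recognise only the two SMAP toggle instructions. */
static const char *smap_insn_at(const unsigned char *p, size_t remaining)
{
    if (remaining >= 3 && p[0] == 0x0f && p[1] == 0x01) {
        if (p[2] == 0xca)
            return "clac";
        if (p[2] == 0xcb)
            return "stac";
    }
    return "other";
}
```

Feeding the hex bytes from the dump above to parse_code_line() and then calling smap_insn_at() at the marked index should report "clac", which matches the reading of the trace given here.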

So that seems to be the *first* problem spot, actually. It too is
incomprehensible to me. I must be missing something. A "clac"
instruction cannot take a page fault (except for the instruction fetch
itself, of course).

So if the page fault on the 'RET' instruction was odd, the page fault
on the CLAC is *really* odd.

That original page fault looks like it's just from one of the
put_user() calls in gettimeofday():

                if (put_user(ts.tv_sec, &tv->tv_sec) ||
                    put_user(ts.tv_nsec / 1000, &tv->tv_usec))

and yes, they can fault, but I'm not seeing how that then points to
the CLAC in the exception handler.
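
To spell out why that is surprising, here is a deliberately simplified user-space model of exception-table fixup. It is a sketch under stated assumptions, not the kernel's implementation: the struct, the function names and the addresses used with it are all hypothetical, and the real table (which, as far as I know, stores relative offsets plus a handler type) is more involved. The idea it models is only that a fault at a registered instruction address gets its saved IP rewritten to the fixup address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: absolute addresses, no handler type. */
struct fixup_entry {
    uintptr_t insn;   /* address of an instruction that may fault */
    uintptr_t fixup;  /* where to resume if it does */
};

/* Binary search; the table must be sorted by ascending insn. */
static const struct fixup_entry *find_fixup(const struct fixup_entry *tbl,
                                            size_t n, uintptr_t ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == ip)
            return &tbl[mid];
        if (tbl[mid].insn < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/*
 * Model of the fault path: if the faulting IP has an entry, rewrite the
 * saved IP to the fixup target and report the fault as handled (1).
 * Otherwise report it as unhandled (0) and leave the IP alone.
 */
static int handle_kernel_fault(const struct fixup_entry *tbl, size_t n,
                               uintptr_t *ip)
{
    const struct fixup_entry *e = find_fixup(tbl, n, *ip);

    if (!e)
        return 0;
    *ip = e->fixup;
    return 1;
}
```

In this model a faulting put_user() store resumes at its fixup target and the fault is done with. The fixup target itself has no entry of its own, so a second fault reported exactly at that address is unhandled, which is the situation the trace appears to show.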

                Linus