linux-kernel - Re: [syzbot] BUG: sleeping function called from invalid context in __fdget_pos

Message-ID: <CAMj1kXGTV+U-oy=wyHf2KmuzjmaaPJaLBY4mx09tWjL6gCC=rQ@mail.gmail.com>
Date:   Wed, 30 Jun 2021 09:42:14 +0200
From:   Ard Biesheuvel <ardb@...nel.org>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     syzbot <syzbot+5d1bad8042a8f0e8117a@...kaller.appspotmail.com>,
        Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>, jpa@....mail.kapsi.fi,
        kan.liang@...ux.intel.com,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Ingo Molnar <mingo@...hat.com>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
        Herbert Xu <herbert@...dor.apana.org.au>
Subject: Re: [syzbot] BUG: sleeping function called from invalid context in __fdget_pos

On Tue, 29 Jun 2021 at 16:46, Dave Hansen <dave.hansen@...el.com> wrote:
>
> ... adding Ard who was recently modifying some of the
> kernel_fpu_begin/end() sites in the AESNI crypto code.
>
> On 6/28/21 12:22 PM, syzbot wrote:
> > console output: https://syzkaller.appspot.com/x/log.txt?x=170e6c94300000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=42ecca11b759d96c
> > dashboard link: https://syzkaller.appspot.com/bug?extid=5d1bad8042a8f0e8117a
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> ...
> > BUG: sleeping function called from invalid context at kernel/locking/mutex.c:938
> > in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 29652, name: syz-executor.0
> > no locks held by syz-executor.0/29652.
> > Preemption disabled at:
> > [<ffffffff812aa454>] kernel_fpu_begin_mask+0x64/0x260 arch/x86/kernel/fpu/core.c:126
> > CPU: 0 PID: 29652 Comm: syz-executor.0 Not tainted 5.13.0-rc7-syzkaller #0
>
> There's a better backtrace in the log before the rather useless
> backtrace from lockdep:
>
> > [ 1341.360547][T29635] FAULT_INJECTION: forcing a failure.
> > [ 1341.360547][T29635] name failslab, interval 1, probability 0, space 0, times 0
> > [ 1341.374439][T29635] CPU: 1 PID: 29635 Comm: syz-executor.0 Not tainted 5.13.0-rc7-syzkaller #0
> > [ 1341.374712][T29630] FAT-fs (loop2): bogus number of reserved sectors
> > [ 1341.383571][T29635] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > [ 1341.383591][T29635] Call Trace:
> > [ 1341.383603][T29635]  dump_stack+0x141/0x1d7
> > [ 1341.383630][T29635]  should_fail.cold+0x5/0xa
> > [ 1341.383651][T29635]  ? skcipher_walk_next+0x6e2/0x1680
> > [ 1341.383673][T29635]  should_failslab+0x5/0x10
> > [ 1341.383691][T29635]  __kmalloc+0x72/0x330
> > [ 1341.383720][T29635]  skcipher_walk_next+0x6e2/0x1680
> > [ 1341.383744][T29635]  ? kfree+0xe5/0x7f0
> > [ 1341.383776][T29635]  skcipher_walk_first+0xf8/0x3c0
> > [ 1341.383805][T29635]  skcipher_walk_virt+0x523/0x760
> > [ 1341.445438][T29635]  xts_crypt+0x137/0x7f0
> > [ 1341.449689][T29635]  ? aesni_encrypt+0x80/0x80
>
> There's one suspect-looking site in xts_crypt():
>
> >       kernel_fpu_begin();
> >
> >       /* calculate first value of T */
> >       aesni_enc(aes_ctx(ctx->raw_tweak_ctx), walk.iv, walk.iv);
> >
> >       while (walk.nbytes > 0) {
> >               int nbytes = walk.nbytes;
> >
> >               ...
> >
> >               err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
> >
> >               kernel_fpu_end();
> >
> >               if (walk.nbytes > 0)
> >                       kernel_fpu_begin();
> >       }
>
> I wonder if a slab allocation failure could leave us with walk.nbytes==0.

The code is actually the other way around: kernel_fpu_end() comes
before the call to skcipher_walk_done().

So IIUC, the fault injection forces an allocation failure and checks
whether the code deals with it gracefully, right?

The skcipher walk API guarantees that walk.nbytes == 0 if an error is
returned, so the pairing of FPU begin/end looks correct to me. And
skcipher_walk_next() should not invoke anything that might sleep from
this particular context.
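
To make the pairing argument concrete, here is a minimal userspace sketch of
the loop shape described above. It is not the kernel code: every mock_* name,
the fixed 16-byte step and the fail_at_step knob are hypothetical stand-ins
invented for illustration. It only assumes the two properties stated in this
mail, namely that the FPU section is closed before the "done" call and that a
failing walk leaves nbytes == 0.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins; none of these are real kernel APIs. */
static int fpu_depth;           /* > 0 while the mock FPU section is open */
static int done_called_in_fpu;  /* set if mock_walk_done() ever runs inside it */

#define MOCK_STEP 16u           /* arbitrary step size chosen for this sketch */

struct mock_walk {
	unsigned int nbytes;    /* bytes available in the current step */
	unsigned int total;     /* bytes left overall */
	int step;               /* index of the current step */
	int fail_at_step;       /* step at which "done" fails, or -1 for never */
};

static void mock_fpu_begin(void)
{
	fpu_depth++;
}

static void mock_fpu_end(void)
{
	assert(fpu_depth > 0);  /* an unpaired end would trip here */
	fpu_depth--;
}

static void mock_walk_start(struct mock_walk *w, unsigned int total,
			    int fail_at_step)
{
	w->total = total;
	w->step = 0;
	w->fail_at_step = fail_at_step;
	w->nbytes = total < MOCK_STEP ? total : MOCK_STEP;
}

/* Models the assumed guarantee: on error, nbytes is left at 0. */
static int mock_walk_done(struct mock_walk *w, unsigned int leftover)
{
	if (fpu_depth > 0)
		done_called_in_fpu = 1;

	if (w->step++ == w->fail_at_step) {
		w->nbytes = 0;
		return -ENOMEM;
	}

	w->total -= w->nbytes - leftover;
	w->nbytes = w->total < MOCK_STEP ? w->total : MOCK_STEP;
	return 0;
}

static int mock_xts_crypt(unsigned int total, int fail_at_step)
{
	struct mock_walk w;
	int err = 0;

	mock_walk_start(&w, total, fail_at_step);

	/* This guard is the sketch's own simplification, not taken from the driver. */
	if (w.nbytes == 0)
		return 0;

	mock_fpu_begin();
	while (w.nbytes > 0) {
		/* ... the cipher work on w.nbytes bytes would go here ... */

		mock_fpu_end();                 /* leave the FPU section first */
		err = mock_walk_done(&w, 0);    /* "done" runs outside of it */

		if (w.nbytes > 0)
			mock_fpu_begin();       /* reopen only if work remains */
	}
	return err;
}
```

Calling mock_xts_crypt(64, 1) makes the second "done" call fail. The loop then
exits with fpu_depth back at 0 and with done_called_in_fpu still clear, which
is the property the real loop relies on if the walk API behaves as described.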

Herbert, any ideas?