linux-kernel - Re: [PATCH] x86_64: uninline TASK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrWtK4G7-o=pQoeyzLFTpiR9giGy2jYTU10xt6S8FsMk+Q@mail.gmail.com>
Date:   Tue, 23 Apr 2019 10:21:59 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Alexey Dobriyan <adobriyan@...il.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Al Viro <viro@...iv.linux.org.uk>
Subject: Re: [PATCH] x86_64: uninline TASK_SIZE

On Tue, Apr 23, 2019 at 4:13 AM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Andy Lutomirski <luto@...nel.org> wrote:
>
> > > Saving 2KB on a defconfig is quite a lot.
> >
> > Saving 2kB of text by adding 8 bytes to thread_info seems rather
> > dubious to me.  You only need 256 tasks before you lose.  My
> > not-particularly-loaded laptop has 865 tasks right now.
>
> I was suggesting current->task_size or thread_info->task_size as a way to
> 100% avoid the function call overhead. Worth a tiny amount of RAM - even
> with 1 million tasks it's only 4MB of RAM. ;-)
>
> Some TASK_SIZE users are prominent syscalls: mmap(),
>
> > As a general principle, the mere existence of TIF_ADDR32 is a bug. The
> > value of that flag is *wrong* under the 32-bit variant of CRIU. How
> > about instead making some more progress toward getting rid of dubious
> > TASK_SIZE users?  I'm working on a little series to get rid of most of
> > them.  Meanwhile: it sure looks like a large fraction of the users are
> > confused as to whether TASK_SIZE is the highest user address or the
> > lowest non-user address.
>
> I really like that, replacing TASK_SIZE with *nothing* would be even
> faster.
>
> In fact instead of just reducing its usage I'd suggest removing TASK_SIZE
> altogether and renaming TASK_SIZE_MAX back to TASK_SIZE, or something
> like that - the confusion from the deceptively macro-ish naming of
> TASK_SIZE is real.
>
> The original commit of making TASK_SIZE dynamic on the task's compat flag
> was done in:
>
>   84929801e14d: [PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes
>
> Here's the justification given:
>
>     Appended patch will setup compatibility mode TASK_SIZE properly.  This will
>     fix atleast three known bugs that can be encountered while running
>     compatibility mode apps.
>
>     a) A malicious 32bit app can have an elf section at 0xffffe000.  During
>        exec of this app, we will have a memory leak as insert_vm_struct() is
>        not checking for return value in syscall32_setup_pages() and thus not
>        freeing the vma allocated for the vsyscall page.  And instead of exec
>        failing (as it has addresses > TASK_SIZE), we were allowing it to
>        succeed previously.
>
>     b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area
>        may return addresses beyond 32bits, ultimately causing corruption
>        because of wrap-around and resulting in SEGFAULT, instead of returning
>        ENOMEM.
>
>     c) 32bit app doing this below mmap will now fail.
>
>       mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE,
>             MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0);
>
> I believe a) is addressed by getting rid of the vsyscall page - but it
> might also not be a current problem because the vsyscall page has its own
> gate-vma now.
>

I suspect that whatever issue this is predates my involvement in Linux
:)  The "gate" VMA, aka vsyscall, is at 0xffffffffff600000, and the
vDSO setup code shouldn't anywhere near that fragile.  Also, if this
really a bug, we have it for the 64-bit case, too, and TASK_SIZE isn't
going to help.


> b) shouldn't be an issue if the mmap allocator correctly treats the
> compat bit - this doesn't require generic TASK_SIZE variations though, as
> the mmap allocator is already specific to arch/x86/.

This was mostly fixed by the CRIU folks a while back, I think.

>
> c) is a variant of a) I believe, which should be fixed by now.
>
> I just looked through some of the current TASK_SIZE users, and *ALL* of
> them seem dubious to me, with the exception of the mmap allocators. In

Even the munmap one seems to be a bug.

> fact some of them seem to be active bugs:
>
> kernel/:
>
>  - PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE, prctl_set_mm() et al. Ugh, what a
>    nasty prctl()! But besides that, the TASK_SIZE restriction to the ABI
>    looks questionable: if we offer this CRIU functionality then why
>    should it be impossible for a 32-bit CRIU task to switch to 64-bit?
>
>  - kernel/events/core.c: TASK_SIZE_MAX should be a fine filter here, in
>    fact it's probably *wrong* to restrict the profiling data here just
>    because the task happens to be in 32-bit compat mode.

Yep.  I have a patch fo rthis.

>
>  - kernel/rseq.c: is this TASK_SIZE restriction even required, wouldn't
>    TASK_SIZE_MAX be sufficient?

Presumably.

> So I concur 100% that most TASK_SIZE uses are questionable. In fact think
> 84929801e14d was a mistake, and we should effectively revert it
> carefully, by:
>
>  - First by moving almost all TASK_SIZE users over to TASK_SIZE_MAX,
>    analyzing and justifying the impact case by case.
>
>  - Then making the mmap allocators compat compatible (ha) without relying
>    on TASK_SIZE.

I have a somewhat hacky patch for this right now.

>
>  - Renaming TASK_SIZE back to TASK_SIZE_MAX and getting rid of the
>    TASK_SIZE and TASK_SIZE_MAX differentiation.
>
> Or am I missing some complication?

Seems like a great idea to me.

BTW, what the heck is up with get_gate_page()?  I'm struggling to
understand what it's even trying to do.  If there's an architecture
that allows a user program to mremap() or otherwise position its gate
VMA between TASK_SIZE and TASK_SIZE_MAX, then that code is going to
explode horribly.

A whole bunch of work in this direction is here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes

It's almost entirely untested.