linux-kernel - Re: [PATCH] pidfd: fix test failure due to stack overflow on some arches

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <924d8fd8-7069-41d2-9887-f36160d9f598@linuxfoundation.org>
Date:   Wed, 2 Feb 2022 08:52:47 -0700
From:   Shuah Khan <skhan@...uxfoundation.org>
To:     Christian Brauner <brauner@...nel.org>,
        Axel Rasmussen <axelrasmussen@...gle.com>
Cc:     Christian Brauner <christian@...uner.io>,
        Shuah Khan <shuah@...nel.org>,
        Zach O'Keefe <zokeefe@...gle.com>,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
        Shuah Khan <skhan@...uxfoundation.org>
Subject: Re: [PATCH] pidfd: fix test failure due to stack overflow on some
 arches

On 1/28/22 1:56 AM, Christian Brauner wrote:
> On Thu, Jan 27, 2022 at 01:29:51PM -0800, Axel Rasmussen wrote:
>> When running the pidfd_fdinfo_test on arm64, it fails for me. After some
>> digging, the reason is that the child exits due to SIGBUS, because it
>> overflows the 1024 byte stack we've reserved for it.
>>
>> To fix the issue, increase the stack size to 8192 bytes (this number is
>> somewhat arbitrary, and was arrived at through experimentation -- I kept
>> doubling until the failure no longer occurred).
>>
>> Also, let's make the issue easier to debug. wait_for_pid() returns an
>> ambiguous value: it may return -1 in all of these cases:
>>
>> 1. waitpid() itself returned -1
>> 2. waitpid() returned success, but we found !WIFEXITED(status).
>> 3. The child process exited, but it did so with a -1 exit code.
>>
>> There's no way for the caller to tell the difference. So, at least log
>> which occurred, so the test runner can debug things.
>>
>> While debugging this, I found that we had !WIFEXITED(), because the
>> child exited due to a signal. This seems like a reasonably common case,
>> so also print out whether or not we have WIFSIGNALED(), and the
>> associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
>> clearly when it occurs.
>>
>> Finally, I'm suspicious of allocating the child's stack on our stack.
>> man clone(2) suggests that the correct way to do this is with mmap(),
>> and in particular by setting MAP_STACK. So, switch to doing it that way
>> instead.
> 
> Heh, yes. :)
> 
> commit 99c3a000279919cc4875c9dfa9c3ebb41ed8773e
> Author: Michael Kerrisk <mtk.manpages@...il.com>
> Date:   Thu Nov 14 12:19:21 2019 +0100
> 
>      clone.2: Allocate child's stack using mmap(2) rather than malloc(3)
> 
>      Christian Brauner suggested mmap(MAP_STACKED), rather than
>      malloc(), as the canonical way of allocating a stack for the
>      child of clone(), and Jann Horn noted some reasons why:
> 
>          Not on Linux, but on OpenBSD, they do use MAP_STACK now
>          AFAIK; this was announced here:
>          <http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
>          Basically they periodically check whether the userspace
>          stack pointer points into a MAP_STACK region, and if not,
>          they kill the process. So even if it's a no-op on Linux, it
>          might make sense to advise people to use the flag to improve
>          portability? I'm not sure if that's something that belongs
>          in Linux manpages.
> 
>          Another reason against malloc() is that when setting up
>          thread stacks in proper, reliable software, you'll probably
>          want to place a guard page (in other words, a 4K PROT_NONE
>          VMA) at the bottom of the stack to reliably catch stack
>          overflows; and you probably don't want to do that with
>          malloc, in particular with non-page-aligned allocations.
> 
>      And the OpenBSD 6.5 manual pages says:
> 
>          MAP_STACK
>              Indicate that the mapping is used as a stack. This
>              flag must be used in combination with MAP_ANON and
>              MAP_PRIVATE.
> 
>      And I then noticed that MAP_STACK seems already to be on
>      FreeBSD for a long time:
> 
>          MAP_STACK
>              Map the area as a stack.  MAP_ANON is implied.
>              Offset should be 0, fd must be -1, and prot should
>              include at least PROT_READ and PROT_WRITE.  This
>              option creates a memory region that grows to at
>              most len bytes in size, starting from the stack
>              top and growing down.  The stack top is the start‐
>              ing address returned by the call, plus len bytes.
>              The bottom of the stack at maximum growth is the
>              starting address returned by the call.
> 
>              The entire area is reserved from the point of view
>              of other mmap() calls, even if not faulted in yet.
> 
>      Reported-by: Jann Horn <jannh@...gle.com>
>      Reported-by: Christian Brauner <christian.brauner@...ntu.com>
>      Signed-off-by: Michael Kerrisk <mtk.manpages@...il.com>
> 
> 
>>
>> Signed-off-by: Axel Rasmussen <axelrasmussen@...gle.com>
>> ---
> 
> Yeah, stack handling - especially with legacy clone() - is yucky on the
> best of days. Thank you for the fix.
> 
> Acked-by: Christian Brauner <brauner@...nel.org>
> 

Thank you both. Will apply for 5.17-rc4 or so.

thanks,
-- Shuah