linux-kernel - Re: Spurious SIGSEGV with rseq/membarrier

Message-ID: <33732f5a-8446-401e-b5ec-0f7b68439f6e@efficios.com>
Date: Mon, 12 Feb 2024 12:11:31 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Dmitry Vyukov <dvyukov@...gle.com>, Peter Oskolkov <posk@...gle.com>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Chris Kennelly <ckennelly@...gle.com>
Subject: Re: Spurious SIGSEGV with rseq/membarrier

On 2024-02-12 05:02, Dmitry Vyukov wrote:
> Hi rseq/membarrier maintainers,
> 
> I've spent a bit of time debugging some spurious SIGSEGVs and it turned out to
> be an interesting interaction between page faults, rseq and
> membarrier. The manifestation is that membarrier(EXPEDITED_RSEQ) is
> effectively not working for a thread (doesn't restart its rseq
> critical section).
> 
> The real code is inside tcmalloc and relates to the "slabs resizing" procedure:
> 
> https://github.com/google/tcmalloc/blob/39775a2d57969eda9497f3673421766bc1e886a0/tcmalloc/internal/percpu_tcmalloc.cc#L176
> 
> The essence is:
> Threads use a data structure inside of rseq critical section.
> The resize procedure replaces the old data structure with a new one,
> uses a membarrier to ensure that threads don't use the old one any
> more and unmaps/mprotects pages that back the old data structure. At
> this point no threads use the old data structure anymore and no
> threads should get SIGSEGV.
> 
> However, what happens is as follows:
> A thread gets a minor page fault on the old data structure inside of
> rseq critical section.
> The page fault handler re-enables preemption and allows other threads
> to be scheduled (I am not sure this is actually important, but that's
> what I observed in all traces, and it makes the failure scenario much
> more likely).
> Now, the resize procedure is executed, replaces all pointers to the
> old data structure with pointers to the new one, executes the
> membarrier and unmaps the old data structure.
> Now the page fault handler resumes, verifies VMA protection and finds
> out that the VMA is indeed inaccessible and the page fault is not a
> minor one, but rather should result in SIGSEGV and sends SIGSEGV.
> Note: at this point the thread has rseq restart pending (from both
> preemption and membarrier), and the restart indeed happens as part of
> SIGSEGV delivery, but it's already too late.

Hi Dmitry,

Thanks for spending the time to analyze this issue and identify the
scenario causing it.

> 
> I think the page fault handling should give the rseq restart
> preference in this case, and realize the thread shouldn't be executing
> the faulting instruction in the first place. In such case the thread
> would be restarted, and access the new data structure after the
> restart.

The behavior you describe here does make sense.

> Unmapping/mprotecting the old data in this case is useful for 2 reasons:
> 1. It allows releasing memory (not possible to do reliably now).
> 2. It helps ensure there are no logical bugs in the user-space
> code and threads don't access the old data when they shouldn't. I was
> actually tracking a potential bug in user-space code, but after
> mprotecting old data, started seeing more and more confusing crashes
> (this spurious SIGSEGV).

So I think we are in a situation where, in the original rseq design,
we tried to eliminate "all" kernel-internal rseq restart corner cases
by preventing system calls from being issued within rseq critical
sections (to keep things nice and simple), but missed the fact that
page faults are another means of entering the kernel which can indeed
trigger preemption. The restart should happen either before the
side-effect of the page fault is decided (e.g. segmentation fault or
SIGBUS), or before the actual signal is delivered to user-space.

In order to solve this here is what I think we need:

1) Add a proper stress-test reproducer to the rseq selftests:

    - In a loop, conditionally issue the first memory access to a newly
      mmap'd area from a rseq critical section using a pointer
      dereference.
    - In another thread, in a loop:
      - mmap a new memory area,
      - set pointer to point to that memory area
      - wait a tiny while (cpu_relax busy-loop)
      - set pointer to NULL
      - membarrier EXPEDITED_RSEQ
      - munmap the area (could also be mprotect)

    We should probably provide hints to mmap so the same address is not
    re-used loop after loop to increase the chances of hitting the
    race.
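
A rough sketch of one pass of the updater-thread loop from the list in
1), assuming a userspace test built on the raw membarrier syscall. The
names shared_area, register_rseq_membarrier() and updater_iteration()
are made up for illustration, the reader side (the rseq critical
section doing the dereference) is left out, and a plain volatile loop
stands in for cpu_relax:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Command values as defined in include/uapi/linux/membarrier.h
 * (Linux 5.10+), duplicated here so the sketch also builds against
 * older headers.
 */
#define CMD_PRIVATE_EXPEDITED_RSEQ          (1 << 7)
#define CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ (1 << 8)

/* Pointer the reader thread would dereference in its rseq critical section. */
static _Atomic(char *) shared_area;

static int sys_membarrier(int cmd, unsigned int flags, int cpu_id)
{
	return (int)syscall(SYS_membarrier, cmd, flags, cpu_id);
}

/* Must succeed once per process before the expedited rseq command is used. */
static int register_rseq_membarrier(void)
{
	return sys_membarrier(CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
}

/*
 * One pass of the updater loop. hint is passed to mmap as the address
 * hint (NULL lets the kernel choose). Returns 0 on success, -1 on failure.
 */
static int updater_iteration(void *hint, size_t len, unsigned int spin)
{
	char *area = mmap(hint, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return -1;

	atomic_store(&shared_area, area);	/* set pointer to the new area */
	for (volatile unsigned int i = 0; i < spin; i++)
		;				/* stand-in for a cpu_relax busy-loop */
	atomic_store(&shared_area, NULL);	/* set pointer to NULL */

	/* Ask the kernel to restart rseq critical sections in sibling threads. */
	int ret = sys_membarrier(CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);

	if (munmap(area, len))			/* could also be mprotect */
		ret = -1;
	return ret;
}
```

A test would call register_rseq_membarrier() once, then call
updater_iteration() in a loop with a hint that advances on every pass,
while the reader thread watches for an unexpected SIGSEGV.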

2) We need to figure out where to implement this behavior change. Either
    at the page fault handler level, or at signal delivery. It would need
    to check whether:
    - t->rseq_event_mask is nonzero _and_
    - in_rseq_cs() would return true (it's tricky because this needs to
      read userspace memory, which can trigger a page fault).
    We should be careful not to prevent signals emitted by other
    sources from being delivered.
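
To make the two conditions in 2) concrete, here is a small userspace
model of the check. This is not kernel code: struct cs_model and
restart_takes_precedence() are hypothetical names, and the model
assumes the critical section descriptor has already been read safely,
which is exactly the tricky part mentioned above. The range test is
meant to mirror the unsigned comparison in_rseq_cs() does on start_ip
and post_commit_offset:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the critical section descriptor fields used here. */
struct cs_model {
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/*
 * Unsigned range test: true only for start_ip <= ip < start_ip +
 * post_commit_offset. An ip below start_ip wraps to a huge value and fails.
 */
static bool ip_in_cs(uint64_t ip, const struct cs_model *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Hypothetical predicate: should a pending rseq restart take precedence
 * over acting on a fault raised at fault_ip? cs is NULL when the thread
 * has no active critical section.
 */
static bool restart_takes_precedence(uint32_t event_mask, uint64_t fault_ip,
				     const struct cs_model *cs)
{
	return event_mask != 0 && cs != NULL && ip_in_cs(fault_ip, cs);
}
```

Signals raised outside the critical section, or with no restart event
pending, fall through unchanged in this model.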

3) This would be a new behavior. Even if it is something that would have
    been preferable from the beginning, all the pre-existing kernels
    implementing membarrier EXPEDITED_RSEQ do not have that behavior
    today. We would want to implement something that allows userspace to
    detect this "feature". Extending getauxval() to add a new "rseq
    feature" entry that would return feature flags would be an option.
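
On the userspace side, detection along the lines of 3) could look
roughly like the sketch below. getauxval() is the existing libc call;
the auxv entry and the flag bit do not exist today, so they are taken
as parameters rather than given invented values, and auxv_has_flag()
is a made-up helper name:

```c
#include <stdbool.h>
#include <sys/auxv.h>

/*
 * Returns true if the auxiliary vector entry auxv_type is present and has
 * any bit of flag set. getauxval() returns 0 for an entry the kernel did
 * not provide, so running on an older kernel reads as "feature absent".
 */
static bool auxv_has_flag(unsigned long auxv_type, unsigned long flag)
{
	return (getauxval(auxv_type) & flag) != 0;
}
```

An allocator could then call something like
auxv_has_flag(AT_RSEQ_FEATURES, RSEQ_FEATURE_FAULT_RESTART), both names
hypothetical, and only unmap or mprotect the old data when it returns
true.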

Thoughts?

Thanks,

Mathieu
-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com