lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+EESO7yc9k79TxyQk+XvWbMfhMmax5GtJTYbNhDrb-0VgJunA@mail.gmail.com>
Date:   Fri, 4 Sep 2020 17:36:02 -0700
From:   Lokesh Gidra <lokeshgidra@...gle.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        Nick Kralevich <nnk@...gle.com>,
        Jeffrey Vander Stoep <jeffv@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        Jonathan Corbet <corbet@....net>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Iurii Zaikin <yzaikin@...gle.com>,
        Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andy Shevchenko <andy.shevchenko@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Peter Xu <peterx@...hat.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Jerome Glisse <jglisse@...hat.com>, Shaohua Li <shli@...com>,
        linux-doc@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        Linux FS Devel <linux-fsdevel@...r.kernel.org>,
        Tim Murray <timmurray@...gle.com>,
        Minchan Kim <minchan@...gle.com>,
        Sandeep Patil <sspatil@...gle.com>, kernel@...roid.com,
        Daniel Colascione <dancol@...col.org>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        "Dr. David Alan Gilbert" <dgilbert@...hat.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@...hat.com> wrote:
>
> Hello,
>
> On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:
> > There has been an emphasis that Android is probably the only user for
> > the restriction of userfaults from kernel-space and that it wouldn’t
> > be useful anywhere else. I humbly disagree! There are various areas
> > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
> > user-space setting. Basically, any lazy, on-demand,
>
> For the record what I said is quoted below
> https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com :
>
> """It all boils down of how peculiar it is to be able to leverage only
> the acceleration [..] Right now there's a single user that can cope
> with that limitation [..] If there will be more users [..]  it'd be
> fine to add a value "2" later."""
>
> Specifically I never said "that it wouldn’t be useful anywhere else.".
>
Thanks a lot for clarifying.

> Also I'm only arguing about the sysctl visible kABI change in patch
> 2/2: the flag passed as parameter to the syscall in patch 1/2 is all
> great, because seccomp needs it in the scalar parameter of the syscall
> to implement a filter equivalent to your sysctl "2" policy with only
> patch 1/2 applied.
>
> I've two more questions now:
>
> 1) why don't you enforce the block of kernel initiated faults with
>    seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
>    an optimization to remove a few instructions per syscall in the bpf
>    execution of Android unprivileged apps? You should block a lot of
>    other syscalls by default to all unprivileged processes, including
>    vmsplice.
>
>    In other words if it's just for Android, why can't Android solve it
>    with only patch 1/2 by tweaking the seccomp filter?

I would let Nick (nnk@) and Jeff (jeffv@) respond to this.

The previous responses from both of them on this email thread
(https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/)
suggest that the performance overhead of seccomp-bpf is too much. Kees
also objected to it
(https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)

I'm not familiar with how seccomp-bpf works. All that I can add here
is that userfaultfd syscall is usually not invoked in a performance
critical code path. So, if the performance overhead of seccomp-bpf (if
enabled) is observed on all syscalls originating from a process, then
I'd say patch 2/2 is essential. Otherwise, it should be ok to let
seccomp perform the same functionality instead.

>
> 2) given that Android is secure enough with the sysctl at value 2, why
>    should we even retain the current sysctl 0 semantics? Why can't
>    more secure systems just use seccomp and block userfaultfd, as it
>    is already happens by default in the podman default seccomp
>    whitelist (for those containers that don't define a new json
>    whitelist in the OCI schema)? Shouldn't we focus our energy in
>    making containers more secure by preventing the OCI schema of a
>    random container to re-enable userfaultfd in the container seccomp
>    filter instead of trying to solve this with a global sysctl?
>
>    What's missing in my view is a kubernetes hard allowlist/denylist
>    that cannot be overridden with the OCI schema in case people has
>    the bad idea of running containers downloaded from a not fully
>    trusted source, without adding virt isolation and that's an
>    userland problem to be solved in the container runtime, not a
>    kernel issue. Then you'd just add userfaultfd to the json of the
>    k8s hard seccomp denylist instead of going around tweaking sysctl.
>
> What's your take in changing your 2/2 patch to just replace value "0"
> and avoid introducing a new value "2"?

SGTM. Disabling uffd completely for unprivileged processes can be
achieved either using seccomp-bpf, or via SELinux, once the following
patch series is upstreamed
https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.com/

>
> The value "0" was motivated by the concern that uffd can enlarge the
> race window for use after free by providing one more additional way to
> block kernel faults, but value "2" is already enough to solve that
> concern completely and it'll be the default on all Android.
>
> In other words by adding "2" you're effectively doing a more
> finegrined and more optimal implementation of "0" that remains useful
> and available to unprivileged apps and it already resolves all
> "robustness against side effects other kernel bugs" concerns. Clearly
> "0" is even more secure statistically but that would apply to every
> other syscall including vmsplice, and there's no
> /proc/sys/vm/unprivileged_vmsplice sysctl out there.
>
> The next issue we have now is with the pipe mutex (which is not a
> major concern but we need to solve it somehow for correctness). So I
> wonder if should make the default value to be "0" (or "2" if think we
> should not replace "0") and to allow only user initiated faults by
> default.
>
> Changing the sysctl default to be 0, will make live migration fail to
> switch to postcopy which will be (unnoticeable to the guest), instead
> of risking the VM to be killed because of network latency
> outlier. Then we wouldn't need to change the pipe code at all.
>
SGTM. I can change the default value to '0' (or '2') in the next
revision of patch 2/2, unless somebody objects to this.

> Alternatively we could still fix the pipe code so it runs better (but
> it'll be more complex) or to disable uffd faults only in the pipe
> code.
>
> One thing to keep in mind is that if we change the default, then
> hypervisor hosts running QEMU would need to set:
>
> vm.userfaultfd = 1
>
> in /etc/sysctl.conf if postcopy live migration is required, that's not
> particularly concerning constraint for qemu (there are likely other
> tweaks required and it looks less risky than an arbitrary timeout
> which could kill the VM: if the above is forgotten the postcopy live
> migration won't even start and it'll be unnoticeable to the guest).
>
> The main concern really are future apps that may want to use uffd for
> kernel initiated faults won't be allowed to do so by default anymore,
> those apps will be heavily incentivated to use bounce buffers before
> passing data to syscalls, similarly to the current use case of patch 2/2.
>
> Comments welcome,
> Andrea
>
> PS. Another usage of uffd that remains possible without privilege with
> the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
> acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
> will remain possible without requiring extra privileges, by adding a
> UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
> (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
> indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
> performance bottleneck by avoiding a wrprotect fault after the
> allocation and it will be THP capable too). Memory will be freed with
> MADV_DONTNEED, without ever having to call mmap/mumap. It could move
> memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
> which already exists.
>
UFFDIO_POPULATE sounds like a really useful feature. I don't see it in
the kernel yet. Is there a patch under work on this? If so, kindly
share.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ