netdev - Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <A7232EDD-C53C-4B7D-AD02-4A480F40538B@vmware.com>
Date:   Fri, 22 Feb 2019 22:39:18 +0000
From:   Nadav Amit <namit@...are.com>
To:     Jann Horn <jannh@...gle.com>
CC:     Andy Lutomirski <luto@...capital.net>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Changbin Du <changbin.du@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Andy Lutomirski <luto@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        Rick Edgecombe <rick.p.edgecombe@...el.com>,
        Dave Hansen <dave.hansen@...el.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Igor Stoppa <igor.stoppa@...il.com>
Subject: Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access
 kernel memory that can fault

> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@...are.com> wrote:
> 
>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@...gle.com> wrote:
>> 
>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@...are.com> wrote:
>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@...gle.com> wrote:
>>>> 
>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>> 
>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@...capital.net> wrote:
>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>> Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
>>>>>>> 
>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>> 
>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>> on a kernel access to user space.
>>>>>>>> 
>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>> 
>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>> it will break the expectations.
>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>> in the loop?
>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>> and shows up in their perf report.
>>>>>>> 
>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>> 
>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>> I don't think it's a good idea.
>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>> 
>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>> 
>>>>> What to do about existing scripts is a different question.
>>>> 
>>>> This lack of logical separation between user and kernel addresses
>>>> might interact interestingly with the text_poke series, specifically
>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>> task that has switched to the patching mm, could they use
>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>> overwrite kernel text that is mapped writable in the patching mm?
>>> 
>>> Yes, this is a good point. I guess text_poke() should be defined with
>>> “__kprobes” and open-code memcpy.
>>> 
>>> Does it sound reasonable?
>> 
>> Doesn't __text_poke() as implemented in the proposed patch use a
>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>> pte_clear() (which can be a call into a separate function on paravirt
>> kernels)?
> 
> I will move the pte_clear() to be done after the poking mm was unloaded.
> Give me a few minutes to send a sketch of what I think should be done.

Err.. You are right, I don’t see an easy way of preventing a kprobe from
being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.

The reasonable solution seems to me as taking all the relevant pieces of
code (and data) that might be used during text-poking and encapsulating them, so they
will be set in a memory area which cannot be kprobe'd. This can also be
useful to write-protect data structures of code that calls text_poke(),
e.g., static-keys. It can also protect data on that stack that is used
during text_poke() from being overwritten from another core.

This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
when doing write-rarely operations.

Right now, I think that text_poke() will keep being susceptible to such
an attack, unless you have a better suggestion.