[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b46a0eb9-f80f-e459-d31d-ed9074e38ede@redhat.com>
Date: Sat, 16 Jan 2021 12:42:10 +0100
From: David Hildenbrand <david@...hat.com>
To: John Hubbard <jhubbard@...dia.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Xu <peterx@...hat.com>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Minchan Kim <minchan@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Matthew Wilcox <willy@...radead.org>,
Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
Kees Cook <keescook@...omium.org>,
Leon Romanovsky <leonro@...dia.com>, Jan Kara <jack@...e.cz>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Nadav Amit <nadav.amit@...il.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On 16.01.21 04:40, John Hubbard wrote:
> On 1/15/21 11:46 AM, David Hildenbrand wrote:
>>>> 7) There is no easy way to detect if a page really was pinned: we might
>>>> have false positives. Further, there is no way to distinguish if it was
>>>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
>>>> we most probably would need more counters, which we cannot fit into
>>>> struct page. (AFAIU, for huge pages it's easier).
>>>
>>> I think this is the real issue. We can only store so much information,
>>> so we have to decide which things work and which things are broken. So
>>> far someone hasn't presented a way to record everything at least..
>>
>> I do wonder how many (especially long-term) GUP readers/writers we have
>> to expect, and especially, support for a single base page. Do we have a
>> rough estimate?
>>
>> With RDMA, I would assume we only need a single one (e.g., once RDMA
>> device; I'm pretty sure I'm wrong, sounds too easy).
>> With VFIO I guess we need one for each VFIO container (~ in the worst
>> case one for each passthrough device).
>> With direct I/O, vmsplice and other GUP users ?? No idea.
>>
>> If we could somehow put a limit on the #GUP we support, and fail further
>> GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the
>> refcount into something like (assume max #15 GUP READ and #15 GUP R/W,
>> which is most probably a horribly bad choice)
>>
>> [ GUP READ ][ GUP R/W ] [ ordinary ]
>> 31 ... 28 27 ... 24 23 .... 0
>>
>> But due to saturate handling in "ordinary", we would lose further 2 bits
>> (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea
>> how many bits we actually need in practice.
>>
>> Maybe we need less for GUP READ, because most users want GUP R/W? No idea.
>>
>> Just wild ideas. Most probably that has already been discussed, and most
>> probably people figured that it's impossible :)
>>
>
> I proposed this exact idea a few days ago [1]. It's remarkable that we both
> picked nearly identical values for the layout! :)
Heh! Somehow I missed that. But well, there were *a lot* of mails :)
>
> But as the responses show, security problems prevent pursuing that approach.
It still feels kind of wrong to waste valuable space in the memmap.
In an ideal world (well, one that still only allows for a 64 byte memmap
:) ), we would:
1) Partition the refcount into separate fields that cannot overflow into
each other, similar to my example above, but maybe add even more fields.
2) Reject attempts that would result in an overflow to everything except
the "ordinary" field (e.g., GUP fields in my example above).
3) Put an upper limit on the "ordinary" field that we ever expect for
sane workloads (E.g., 10 bits). In addition, reserve some bits (like the
saturate logic) that we handle as a "red zone".
4) For the "ordinary" field, as soon as we enter the red zone, we know
we have an attack going on. We continue on paths that we cannot fail
(e.g., get_page()) but eventually try stopping the attacker(s). AFAIU,
we know the attacker(s) are something (e.g., one ore multiple processes)
that has direct access to the page in their address space. Of course,
the more paths we can reject, the better.
Now, we would:
a) Have to know what sane upper limits on the "ordinary" field are. I
have no idea which values we can expect. Attacker vs. sane workload.
b) Need a way to identify the attacker(s). In the simplest case, this is
a single process. In the hard case, this involves many processes.
c) Need a way to stop the attacker(s). Doing that out of random context
is problematic. Last resort is doing this asynchronously from another
thread, which leaves more time for the attacker to do harm.
Of course, problem gets more involved as soon as we might have a
malicious child process that uses a page from a well-behaving parent
process for the attack.
Imagine we kill relevant processes, we might end up killing someone
who's not responsible. And even if we don't kill, but instead reject
try_get_page(), we might degrade the well-behaving parent process AFAIKS.
Alternatives to killing the process might be unmapping the problematic
page from the address space.
Reminds me a little about handling memory errors for a page, eventually
killing all users of that page. mm/memory-failure.c:kill_procs().
Complicated problem :)
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists