Message-ID: <28478de8-3028-48f2-b887-56149b6e324a@dustri.org>
Date: Fri, 3 May 2024 15:39:28 +0200
From: jvoisin <julien.voisin@...tri.org>
To: Kees Cook <keescook@...omium.org>, Matteo Rizzo <matteorizzo@...gle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>, Christoph Lameter <cl@...ux.com>,
Pekka Enberg <penberg@...nel.org>, David Rientjes <rientjes@...gle.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Hyeonggon Yoo <42.hyeyoo@...il.com>, "GONG, Ruiqi"
<gongruiqi@...weicloud.com>, Xiu Jianfeng <xiujianfeng@...wei.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Kent Overstreet <kent.overstreet@...ux.dev>, Jann Horn <jannh@...gle.com>,
Thomas Graf <tgraf@...g.ch>, Herbert Xu <herbert@...dor.apana.org.au>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator
On 4/28/24 19:02, Kees Cook wrote:
> On Sun, Apr 28, 2024 at 01:02:36PM +0200, jvoisin wrote:
>> On 4/24/24 23:40, Kees Cook wrote:
>>> Hi,
>>>
>>> Series change history:
>>>
>>> v3:
>>> - clarify rationale and purpose in commit log
>>> - rebase to -next (CONFIG_CODE_TAGGING)
>>> - simplify calling styles and split out bucket plumbing more cleanly
>>> - consolidate kmem_buckets_*() family introduction patches
>>> v2: https://lore.kernel.org/lkml/20240305100933.it.923-kees@kernel.org/
>>> v1: https://lore.kernel.org/lkml/20240304184252.work.496-kees@kernel.org/
>>>
>>> For the cover letter, I'm repeating commit log for patch 4 here, which has
>>> additional clarifications and rationale since v2:
>>>
>>> Dedicated caches are available for fixed size allocations via
>>> kmem_cache_alloc(), but for dynamically sized allocations there is only
>>> the global kmalloc API's set of buckets available. This means it isn't
>>> possible to separate specific sets of dynamically sized allocations into
>>> a separate collection of caches.
>>>
>>> This leads to a use-after-free exploitation weakness in the Linux
>>> kernel since many heap memory spraying/grooming attacks depend on using
>>> userspace-controllable dynamically sized allocations to collide with
>>> fixed size allocations that end up in same cache.
>>>
>>> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
>>> against these kinds of "type confusion" attacks, including for fixed
>>> same-size heap objects, we can create a complementary deterministic
>>> defense for dynamically sized allocations that are directly user
>>> controlled. Addressing these cases is limited in scope, so isolation these
>>> kinds of interfaces will not become an unbounded game of whack-a-mole. For
>>> example, pass through memdup_user(), making isolation there very
>>> effective.
>>
>> What does "Addressing these cases is limited in scope, so isolation
>> these kinds of interfaces will not become an unbounded game of
>> whack-a-mole." mean exactly?
>
> The number of cases where there is a user/kernel API for size-controlled
> allocations is limited. They don't get added very often, and most are
> (correctly) using memdup_user() as the basis of their allocations. This
> means we have a relatively well defined set of criteria for finding
> places where this is needed, and most newly added interfaces will use
> the existing (memdup_user()) infrastructure that will already be covered.
A simple CodeQL query returns 266 of them:
https://lookerstudio.google.com/reporting/68b02863-4f5c-4d85-b3c1-992af89c855c/page/n92nD?params=%7B%22df3%22:%22include%25EE%2580%25803%25EE%2580%2580T%22%7D
Is this number realistic and consistent with your own results/analysis?
>
>>> In order to isolate user-controllable sized allocations from system
>>> allocations, introduce kmem_buckets_create(), which behaves like
>>> kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
>>> kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
>>> where caller tracking is needed. Introduce kmem_buckets_valloc() for
>>> cases where vmalloc callback is needed.
>>>
>>> Allows for confining allocations to a dedicated set of sized caches
>>> (which have the same layout as the kmalloc caches).
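(For concreteness, here is a small userspace sketch of the size-class idea behind such a bucket set; the sizes and helper names below are illustrative assumptions for this mail, not code from the series:)

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the idea behind kmem_buckets_create() /
 * kmem_buckets_alloc(): a dedicated set of size-class caches, laid
 * out like the kmalloc buckets, selected by request size. The sizes
 * below mirror typical kmalloc size classes but are illustrative.
 */
static const size_t bucket_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};
#define NR_BUCKETS (sizeof(bucket_sizes) / sizeof(bucket_sizes[0]))

/* Return the index of the smallest bucket that fits @size, or -1. */
static int bucket_index(size_t size)
{
	for (size_t i = 0; i < NR_BUCKETS; i++)
		if (size <= bucket_sizes[i])
			return (int)i;
	return -1; /* too large: would fall through to the page allocator */
}
```

The point of the series is that this size-to-cache lookup happens against a *private* set of caches per call site, so user-sized allocations never share slabs with the global kmalloc buckets.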
>>>
>>> This can also be used in the future to extend codetag allocation
>>> annotations to implement per-caller allocation cache isolation[1] even
>>> for dynamic allocations.
>> Having per-caller allocation cache isolation looks like something that
>> has already been done in
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe
>> albeit in a randomized way. Why not piggy-back on the infra added by
>> this patch, instead of adding a new API?
>
> It's not sufficient because it is a static set of buckets. It cannot be
> adjusted dynamically (which is not a problem kmem_buckets_create() has).
> I had asked[1], in an earlier version of CONFIG_RANDOM_KMALLOC_CACHES, for
> exactly the API that is provided in this series, because that would be
> much more flexible.
>
> And for systems that use allocation profiling, the next step
> would be to provide per-call-site isolation (which would supersede
> CONFIG_RANDOM_KMALLOC_CACHES, which we'd keep for the non-alloc-prof
> cases).
>
>>> Memory allocation pinning[2] is still needed to plug the Use-After-Free
>>> cross-allocator weakness, but that is an existing and separate issue
>>> which is complementary to this improvement. Development continues for
>>> that feature via the SLAB_VIRTUAL[3] series (which could also provide
>>> guard pages -- another complementary improvement).
>>>
>>> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
>>> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
>>> Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
>>
>> To be honest, I think this series is close to useless without allocation
>> pinning. And even with pinning, it's still routinely bypassed in the
>> KernelCTF
>> (https://github.com/google/security-research/tree/master/pocs/linux/kernelctf).
>
> Sure, I can understand why you might think that, but I disagree. This
> adds the building blocks we need for better allocation isolation
> control, and stops existing (and similar) attacks today.
>
> But yes, given attackers with sufficient control over the entire system,
> all mitigations get weaker. We can't fall into the trap of "perfect
> security"; real-world experience shows that incremental improvements
> like this can strongly impact the difficulty of mounting attacks. Not
> all flaws are created equal; not everything is exploitable to the same
> degree.
It's not about "perfect security", but about wisely spending the
complexity/review/performance/churn/… budgets in my opinion.
>> Do you have some particular exploits in mind that would be completely
>> mitigated by your series?
>
> I link to like a dozen in the last two patches. :P
>
> This series immediately closes 3 well used exploit methodologies.
> Attackers exploiting new flaws that could have used the killed methods
> must now choose methods that have greater complexity, and this drives
> them towards cross-allocator attacks. Robust exploits there are more
> costly to develop as we narrow the scope of methods.
The exploits you linked were making use of the two structures that you
isolated; porting them to different structures would likely take a
couple of hours.
I was more interested in exploits that are effectively killed, as I'm
still not convinced that elastic structures are rare, nor that manually
isolating them one by one is attainable/sustainable/…
But if you have some proper analysis in this direction, then yes, I
completely agree that isolating all of them is a great idea.
>
> Bad analogy: we're locking the doors of a house. Yes, some windows may
> still be unlocked, but now they'll need a ladder. And it doesn't make
> sense to lock the windows if we didn't lock the doors first. This is
> what I mean by complementary defenses, and comes back to what I mentioned
> earlier: "perfect security" is a myth, but incremental security works.
>
>> Moreover, I'm not aware of any ongoing development of the SLAB_VIRTUAL
>> series: the last sign of life on its thread is from 7 months ago.
>
> Yeah, I know, but sometimes other things get in the way. Matteo assures
> me it's still coming.
>
> Since you're interested in seeing SLAB_VIRTUAL land, please join the
> development efforts. Reach out to Matteo (you, he, and I all work for
> the same company) and see where you can assist. Surely this can be
> something you can contribute to while "on the clock"?
I left Google a couple of weeks ago unfortunately, and I won't touch
anything with email-based development for less than a Google salary :D
>
>>> After the core implementation are 2 patches that cover the most heavily
>>> abused "repeat offenders" used in exploits. Repeating those details here:
>>>
>>> The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
>>> use-after-free type confusion flaws in the kernel for both read and
>>> write primitives. Avoid having a user-controlled size cache share the
>>> global kmalloc allocator by using a separate set of kmalloc buckets.
>>>
>>> Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
>>> Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
>>> Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
>>> Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
>>> Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
>>> Link: https://zplin.me/papers/ELOISE.pdf [6]
>>> Link: https://syst3mfailure.io/wall-of-perdition/ [7]
>>>
>>> Both memdup_user() and vmemdup_user() handle allocations that are
>>> regularly used for exploiting use-after-free type confusion flaws in
>>> the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
>>> respectively).
>>>
>>> Since both are designed for contents coming from userspace, it allows
>>> for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
>>> buckets so these allocations do not share caches with the global kmalloc
>>> buckets.
>>>
>>> Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
>>> Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
>>> Link: https://etenal.me/archives/1336 [3]
>>> Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
>>
>> What's the performance impact of this series? Did you run some benchmarks?
>
> I wasn't able to measure any performance impact at all. It does add a
> small bit of memory overhead, but it's on the order of a dozen pages
> used for the 2 extra sets of buckets. (E.g. it's well below the overhead
> introduced by CONFIG_RANDOM_KMALLOC_CACHES, which adds 16 extra sets
> of buckets.)
Nice!
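As a rough cross-check of that overhead comparison (the per-set cache count below is my assumption of ~14 kmalloc size classes; the 2-vs-16 set counts come from your mail):

```c
/*
 * Back-of-the-envelope estimate of the relative memory overhead:
 * each extra bucket set needs roughly one slab page per size class
 * once populated. The "14 size classes per set" figure is a rough
 * assumption, not a number from the series.
 */
static int extra_pages(int extra_sets, int caches_per_set)
{
	return extra_sets * caches_per_set;
}
```

With 2 extra sets versus the 16 of CONFIG_RANDOM_KMALLOC_CACHES, the series' worst case is about an eighth of that config's overhead, which matches your "well below" claim.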