linux-kernel - Re: [PATCH v4 1/1] exec: seal system mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a0a3b26b-2222-478c-a1cb-56514ec8d6ff@lucifer.local>
Date: Fri, 3 Jan 2025 21:38:10 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Kees Cook <kees@...nel.org>
Cc: jeffxu@...omium.org, akpm@...ux-foundation.org, keescook@...omium.org,
        jannh@...gle.com, torvalds@...ux-foundation.org,
        adhemerval.zanella@...aro.org, oleg@...hat.com,
        linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org,
        linux-mm@...ck.org, jorgelo@...omium.org, sroettger@...gle.com,
        ojeda@...nel.org, adobriyan@...il.com, anna-maria@...utronix.de,
        mark.rutland@....com, linus.walleij@...aro.org, Jason@...c4.com,
        deller@....de, rdunlap@...radead.org, davem@...emloft.net, hch@....de,
        peterx@...hat.com, hca@...ux.ibm.com, f.fainelli@...il.com,
        gerg@...nel.org, dave.hansen@...ux.intel.com, mingo@...nel.org,
        ardb@...nel.org, Liam.Howlett@...cle.com, mhocko@...e.com,
        42.hyeyoo@...il.com, peterz@...radead.org, ardb@...gle.com,
        enh@...gle.com, rientjes@...gle.com, groeck@...omium.org,
        mpe@...erman.id.au, Vlastimil Babka <vbabka@...e.cz>,
        Andrei Vagin <avagin@...il.com>, Dmitry Safonov <0x7f454c46@...il.com>,
        Mike Rapoport <mike.rapoport@...il.com>,
        Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>
Subject: Re: [PATCH v4 1/1] exec: seal system mappings

On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@...omium.org wrote:
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> > [...]
> >
> > +	exec.seal_system_mappings = [KNL]
> > +			Format: { no | yes }
> > +			Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > +			uprobe.
> > +			- 'no':  do not seal system mappings.
> > +			- 'yes': seal system mappings.
> > +			This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > +			If not specified or invalid, default is the value set by
> > +			CONFIG_SEAL_SYSTEM_MAPPINGS.
> > +			This option has no effect if CONFIG_64BIT=n
>
> I know there is a v5 coming, but I wanted to give my thoughts to help
> shape it based on the current discussion threads.
>
> The callers of _install_special_mapping() cover what is mentioned here.
> The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> sparc, x86). After that, I see a few extra things, in addition to
> sigpage and uprobes as mentioned already in the patch:
>
> arm sigpage
> arm64 compat vectors (what is this for arm?)
> arm64 compat sigreturn (what is this for arm?)
> nios2 kuser helpers
> uprobes

OK let's not get ahead of ourselves :)

VDSOs/gate VMAs are treated quite differently by different arches. So we
have to tread _very_ carefully here.

I believe PPC doe some 'tricky' things and may actually want to unmap, for
instance.

The problem with this kind of change is we're doing something fundamental
that impacts _every possible combinatorial combination of configs, arches,
and use cases_ for each of these which we seeming - just assume - will have
no issue with this.

This is insufficient, deeply. We need:

1. Strong justification (hand waving won't suffice).
2. Very extensive testing and checking, and _proof_ of this testing being
performed.
3. Buy-in from arch maintainers.

So far this series has provided none of those. This is why I am cautious
and pushing back here.

And I absolutely will not accept a user being able to turn on a switch in a
known-broken configuration. This is absolutely unacceptable.

It's equally unacceptable for a user to enable a feature that is
untested/confirmed on an architecture.

So let's be careful about Linus's edict here - the operative part being 'if
it doesn't break things'.

>
> As mentioned in the patch, there is also the x86_64 vsyscall mapping which
> eludes a regular grep since it's not using _install_special_mapping() :)

I mean I think we need to do better than grepping :)

This requires really careful testing and checking. So far only two
architectures have been checked, and no real evidence of the checking has
been provided.

I am ok, generally, with the concept of providing this as an opt-in
experimental feature.

But if we're going to look at default-on, which if we have confidence in
this, I'd actually back, we need a LOT more confidence that this is OK.

>
> So I guess the question is: can we mseal all of these universally under
> a common knob? Do the different uses mean we need finer granularity of
> knob, and do different architectures need flexibility here too? The
> patch handles the arch question with CONFIG_ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> (which I think will be renamed with s/SEAL/MSEAL/ if I am following the
> threads). This seems a good solution to me. My question is

But then immediately undoes this by enabling the feature to be turned on
regardless. This is emphatically _not_ good.


> about if sigpage, vectors, and sigreturn can also be included? (It seems
> like the answer is "yes", but I didn't see mention of the arm64 compat
> mappings.)

I oppose adding a bunch more things. Let's take this slowly, please.

And evidence of this being checked/tested really must be required.

>
> Linus has expressed the desire that security features be available by
> default if they don't break existing userspace and that they be compiled
> in if possible (rather than be behind a CONFIG) so that code paths are
> being exercised to gain the most exposure to finding bugs. To that end,
> it's best to have a kernel command line to control it if it isn't safe
> to have always enabled. This is how we've handled _many_ features so that
> the code is built into the kernel, but that end users (e.g. distro users)
> can enable/disable a feature without rebuilding the entire kernel.

See above - the operative part is 'if they don't break existing userspace',
and I would be so bold as to suggest he also means the kernel too :P

This isn't a green light to just enable a feature with little to no
testing/checking other than a verbal assurance because it involves
security.

>
> For a "built into the kernel but default disabled unless enabled at boot
> time" example see:
>
> config RANDOMIZE_KSTACK_OFFSET
>         bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
>         default y
>         depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> ...
> config RANDOMIZE_KSTACK_OFFSET_DEFAULT
>         bool "Default state of kernel stack offset randomization"
>         depends on RANDOMIZE_KSTACK_OFFSET
> ...
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
>                            randomize_kstack_offset);
> ...
> early_param("randomize_kstack_offset", early_randomize_kstack_offset);

>
>
> For an example of the older "not built into the kernel but when built in
> can be turned off at boot time" that predated Linus's recommendation see:
>
> config HARDENED_USERCOPY
>         bool "Harden memory copies between kernel and userspace"
> ...
> static DEFINE_STATIC_KEY_FALSE_RO(bypass_usercopy_checks);
> ...
> __setup("hardened_usercopy=", parse_hardened_usercopy);
>
> (This should arguably be "default y" in the kernel these days, but
> whatever.)
>
> So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> be "default y" since we have the ...ARCH_HAS... config already, and then
> add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> we expect there may be userspace impact) and tie _that_ to the kernel
> command-line so that end users can use it, or system builders can enable
> CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.

Again, I hate to push on this, but I am simply not going to allow users to
enable features we know break things.

Users might not be aware this feature is broken for CRIU, and X, and Y and
whatever else we've not thought about and enable it thinking it helps
security, and end up with a broken system.

>
> For the command line name, if a namespace is desired, I'd agree that
> naming this "mseal.special_mappings" is reasonable. It does change process
> behavior, so I'm also not opposed to "process.mseal_special_mappings", and
> it happens at exec, so "exec.mseal_special_mappings" is fine by me too.
> I think the main question would be: will there be other things under the
> proposed "mseal", "process", or "exec" namespace? I'd like to encourage
> things being logically grouped since we have SO MANY already. :)
>
> Also from discussions it sounds like there may need to be even finer-gain
> control, likely via prctl, for dealing with the CRIU case. The proposal
> is to provide an opt-out prctl with CAP_CHECKPOINT_RESTORE? I think this
> is reasonable and lets this all work without a new CONFIG. I imagine it
> would look like:
>
> criu process (which has CAP_CHECKPOINT_RESTORE):
> 	- prctl(GET_MSEAL_SYSTEM_MAPPINGS)
> 	- if set:
> 		- remember we need to mseal mappings
> 		- prctl(SET_MSEAL_SYSTEM_MAPPINGS, 0)
> 		- re-exec with --mseal-system-mappings (or something)
> 	- perform the "fork a tree to restore" work
> 	- in each child, move around all the mappings
> 		- if we need to mseal mappings:
> 			- prctl(SET_MSEAL_SYSTEM_MAPPINGS, 1)
> 			- mseal each system mapping
> 		- eventually drop CAP_CHECKPOINT_RESTORE
> 		- become the restored process
>

This seems like putting the onus on CRIU users to deal with a known-broken
thing? That seems really unreasonable? And people would just have to have
the right userland code to work in the kernel with mseal?

Yeah I oppose entirely this unless I'm missing something?

> Does that all sound right? If so I think Jeff has all the details needed
> to spin a v5.

I think a v5 would have to fulfill the 3 points I raised earlier on, and
(to risk repeating myself) I absolutely will not accept a version of this
that allows a user to break their kernel or userspace, even if they ask for
it, sorry.

Also let's not add anything extra, let's focus on the minimum to start
with. I'd prefer a v5 to be opt-in, or to provide truly copious evidence in
favour of default-yes.

The issue with this is we're changing something totally fundamental and
then assuming every combination of arch, config options, userland code,
etc. will work.

It might even be worthwhile just trying to default-y this in the minimal
case, perhaps x86-64, non-CRIU - with evidence provided to support that
this is fine (i.e. showing that all code interfacing with the VDSO at no
point requires mseal'd operations, running a bunch of configs on test
boxes, etc. etc.)?

Let's tread carefully here.

>
> -Kees
>
> --
> Kees Cook