linux-kernel - Re: [PATCH v2 1/2] mm: Implement memory-deny-write-execute as a prctl

Message-ID: <0ec67737-3c09-ba5c-f840-9ed02a0ea6bf@gmail.com>
Date:   Mon, 23 Jan 2023 19:48:40 +0200
From:   Topi Miettinen <toiwoton@...il.com>
To:     Catalin Marinas <catalin.marinas@....com>,
        David Hildenbrand <david@...hat.com>
Cc:     Joey Gouly <joey.gouly@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Lennart Poettering <lennart@...ttering.net>,
        Zbigniew Jędrzejewski-Szmek <zbyszek@...waw.pl>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Kees Cook <keescook@...omium.org>,
        Szabolcs Nagy <szabolcs.nagy@....com>,
        Mark Brown <broonie@...nel.org>,
        Jeremy Linton <jeremy.linton@....com>, linux-mm@...ck.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-abi-devel@...ts.sourceforge.net, nd@....com, shuah@...nel.org
Subject: Re: [PATCH v2 1/2] mm: Implement memory-deny-write-execute as a prctl

On 23.1.2023 18.04, Catalin Marinas wrote:
> On Mon, Jan 23, 2023 at 01:53:46PM +0100, David Hildenbrand wrote:
>> On 23.01.23 13:19, Catalin Marinas wrote:
>>> On Mon, Jan 23, 2023 at 12:45:50PM +0100, David Hildenbrand wrote:
>>>> On 19.01.23 17:03, Joey Gouly wrote:
>>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>>> index 58b3abd457a3..cee1e4b566d8 100644
>>>>> --- a/include/linux/mman.h
>>>>> +++ b/include/linux/mman.h
>>>>> @@ -156,4 +156,38 @@ calc_vm_flag_bits(unsigned long flags)
>>>>>     }
>>>>>     unsigned long vm_commit_limit(void);
>>>>> +
>>>>> +/*
>>>>> + * Denies creating a writable executable mapping or gaining executable permissions.
>>>>> + *
>>>>> + * This denies the following:
>>>>> + *
>>>>> + * 	a)	mmap(PROT_WRITE | PROT_EXEC)
>>>>> + *
>>>>> + *	b)	mmap(PROT_WRITE)
>>>>> + *		mprotect(PROT_EXEC)
>>>>> + *
>>>>> + *	c)	mmap(PROT_WRITE)
>>>>> + *		mprotect(PROT_READ)
>>>>> + *		mprotect(PROT_EXEC)
>>>>> + *
>>>>> + * But allows the following:
>>>>> + *
>>>>> + *	d)	mmap(PROT_READ | PROT_EXEC)
>>>>> + *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)
>>>>> + */
>>>>
>>>> Shouldn't we clear VM_MAYEXEC at mmap() time such that we cannot set VM_EXEC
>>>> anymore? In an ideal world, there would be no further mprotect changes
>>>> required.
>>>
>>> I don't think it works for this scenario. We don't want to disable
>>> PROT_EXEC entirely, only disallow it if the mapping is not already
>>> executable. The below should be allowed:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
>>>
>>> but IIUC what you meant, it fails if we cleared VM_MAYEXEC at mmap()
>>> time.
>>
>> Yeah, if you allow write access at mmap time, clear VM_MAYEXEC (and disallow
>> VM_EXEC of course).
> 
> This should work but it doesn't fully mimic systemd's MDWE behaviour
> (e.g. disallow mprotect(PROT_EXEC) even if the mmap was PROT_READ only).
> Topi wanted to stay close to that at least in the first incarnation of
> this control (can be extended later).
> 
>> But I guess we'd have to go one step further: if we allow exec access
>> at mmap time, clear VM_MAYWRITE (and disallow VM_WRITE of course).
> 
> Yes, both this and the VM_MAYEXEC clearing if VM_WRITE would be useful
> but as additional controls a process can enable.
> 
>> That at least would be then similar to how we handle mmaped files: if the
>> file is not executable, we clear VM_MAYEXEC. If the file is not writable, we
>> clear VM_MAYWRITE.
> 
> We still allow VM_MAYWRITE for private mappings, though we do clear
> VM_MAYEXEC if not executable.
> 
> It would be nice to use VM_MAY* flags for this logic but we can only
> emulate MDWE if we change the semantics of 'MAY': only check the 'MAY'
> flags for permissions being changed (e.g. allow PROT_EXEC if the vma is
> already VM_EXEC even if !VM_MAYEXEC). Another issue is that we end up
> with some weird combinations like having VM_EXEC without VM_MAYEXEC
> (maybe that's fine).
> 
>> Clearing VM_MAYWRITE would imply that also writes via /proc/self/mem to such
>> memory would be forbidden, which might also be what we are trying to
>> achieve, or is that expected to still work?
> 
> I think currently with systemd's MDWE it still works (I haven't tried
> though), unless there's something else forcing that file read-only.
> 
>> But clearing VM_MAYWRITE would mean that is_cow_mapping() would no
>> longer fire for some VMAs, and we'd have to check if that's fine in
>> all cases.
> 
> This will break __access_remote_vm() AFAICT since it can't do a CoW on
> read-only private mapping.
> 
>> Having that said, this patch handles the case when the prctl is applied to a
>> process after already having created some writable or executable mappings,
>> to at least forbid it afterwards on these mappings. What is expected to
>> happen if the process already has writable mappings that are executable at
>> the time we enable the prctl?
> 
> They are expected to continue to work. The prctl() is meant to be
> invoked by something like systemd so that any subsequent exec() will
> inherit the property.
> 
>> Clarifying what the expected semantics with /proc/self/mem are would be
>> nice.
> 
> Yeah, this series doesn't handle this. Topi, do you know if systemd does
> anything about /proc/self/mem? To me this option is more about catching
> inadvertent write|exec mappings rather than blocking programs that
> insist on doing this (they can always map a memfd file twice with
> separate write and exec attributes for example).
> 

I don't think systemd does anything about /proc/self/mem. For 100%
compatibility with systemd's seccomp-based MDWE, the same cases of
mprotect() use should be blocked regardless of the file descriptor used.
More relaxed PR_MDWE_* controls could be added in the future if needed.
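
As a rough illustration of the cases being discussed (not taken from the
patch or from systemd), the sketch below enables MDWE with prctl() and then
tries case b) and case d) from the quoted comment on anonymous mappings.
The constant values are the ones in mainline <linux/prctl.h> since Linux 6.3
and are an assumption relative to the v2 patch under review. The helper
names mdwe_enable(), try_mprotect(), exec_gain_denied() and
exec_keep_allowed() are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Assumed values: these match mainline <linux/prctl.h> (Linux 6.3+);
 * the v2 patch discussed in this thread may use different ones. */
#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN 1UL
#endif

/* Enable MDWE for the calling process. Returns 0 on success and -1 on
 * failure (for example EINVAL on a kernel without the prctl).
 * The unused arguments must be zero. */
static int mdwe_enable(void)
{
	return prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0UL, 0UL, 0UL);
}

/* Map one anonymous private page with initial_prot, then try to change it
 * to new_prot. Returns 1 if mprotect() succeeded, 0 if it was refused,
 * -1 if the initial mmap() itself failed. */
static int try_mprotect(int initial_prot, int new_prot)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, initial_prot,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int rc;

	if (p == MAP_FAILED)
		return -1;
	rc = mprotect(p, len, new_prot);
	munmap(p, len);
	return rc == 0 ? 1 : 0;
}

/* Case b): mmap(PROT_WRITE) followed by mprotect(PROT_EXEC).
 * Returns 1 if the exec gain was denied, 0 if allowed, -1 on mmap failure. */
static int exec_gain_denied(void)
{
	int r = try_mprotect(PROT_READ | PROT_WRITE, PROT_READ | PROT_EXEC);

	return r < 0 ? -1 : !r;
}

/* Case d): a mapping that is already executable keeps PROT_EXEC.
 * Returns 1 if allowed, 0 if refused, -1 on mmap failure. */
static int exec_keep_allowed(void)
{
	return try_mprotect(PROT_READ | PROT_EXEC, PROT_READ | PROT_EXEC);
}
```

If mdwe_enable() succeeds, I would expect exec_gain_denied() to return 1
while exec_keep_allowed() still returns 1. On a kernel without the prctl,
mdwe_enable() fails and the results depend on whatever other policy
(seccomp, SELinux) is in place. Note that the setting cannot be cleared
again once enabled, so try this in a throwaway process.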

Updated systemd PR: https://github.com/systemd/systemd/pull/25276

I wish there were highly granular access controls for /proc, including
/proc/self and /proc/sys/*. Currently the best options are mount
namespaces and/or SELinux, but neither is well suited to that.

-Topi