Message-ID: <1b6d27cf-2238-0c1c-c563-b38728fbabc2@redhat.com>
Date:   Thu, 12 Aug 2021 12:13:44 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Christian Brauner <christian.brauner@...ntu.com>
Cc:     linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Alexey Dobriyan <adobriyan@...il.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Petr Mladek <pmladek@...e.com>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Rasmus Villemoes <linux@...musvillemoes.dk>,
        Kees Cook <keescook@...omium.org>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Greg Ungerer <gerg@...ux-m68k.org>,
        Geert Uytterhoeven <geert@...ux-m68k.org>,
        Mike Rapoport <rppt@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vincenzo Frascino <vincenzo.frascino@....com>,
        Chinwen Chang <chinwen.chang@...iatek.com>,
        Michel Lespinasse <walken@...gle.com>,
        Catalin Marinas <catalin.marinas@....com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Huang Ying <ying.huang@...el.com>,
        Jann Horn <jannh@...gle.com>, Feng Tang <feng.tang@...el.com>,
        Kevin Brodsky <Kevin.Brodsky@....com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Shawn Anastasio <shawn@...stas.io>,
        Steven Price <steven.price@....com>,
        Nicholas Piggin <npiggin@...il.com>,
        Jens Axboe <axboe@...nel.dk>,
        Gabriel Krisman Bertazi <krisman@...labora.com>,
        Peter Xu <peterx@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Marco Elver <elver@...gle.com>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Nicolas Viennot <Nicolas.Viennot@...sigma.com>,
        Thomas Cedeno <thomascedeno@...gle.com>,
        Collin Fijalkovich <cfijalkovich@...gle.com>,
        Michal Hocko <mhocko@...e.com>,
        Miklos Szeredi <miklos@...redi.hu>,
        Chengguang Xu <cgxu519@...ernel.net>,
        Christian König <ckoenig.leichtzumerken@...il.com>,
        linux-unionfs@...r.kernel.org, linux-api@...r.kernel.org,
        x86@...nel.org, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        Andrei Vagin <avagin@...il.com>
Subject: Re: [PATCH v1 3/7] kernel/fork: always deny write access to current
 MM exe_file
On 12.08.21 12:05, Christian Brauner wrote:
> [+Cc Andrei]
> 
> On Thu, Aug 12, 2021 at 10:43:44AM +0200, David Hildenbrand wrote:
>> We want to remove VM_DENYWRITE, which is currently only used when mapping
>> the executable during exec. During exec, we already deny_write_access() the
>> executable; however, after exec completes, the VMAs mapped with
>> VM_DENYWRITE effectively keep write access denied via
>> deny_write_access().
>>
>> Let's deny write access when setting the MM exe_file. With this change, we
>> can remove VM_DENYWRITE for mapping executables.
>>
>> This represents a minor user space visible change:
>> sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
>> opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file
> 
> Just for completeness, this also affects PR_SET_MM_MAP when exe_fd is
> set.
Correct.
> 
>> cannot be opened writable. Note that we can already fail with -EACCES if
>> the file doesn't have execute permissions.
>>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
>> ---
> 
> The biggest user I know of, and one I'm involved in, is CRIU, which heavily
> uses PR_SET_MM_MAP (with a fallback to PR_SET_MM_EXE_FILE on older
> kernels) during restore. AFAIR, CRIU opens the exe fd with O_PATH
> during dump and thus will use the same flag when opening it during
> restore. So that should be fine.
Yes.
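
For reference, the userspace side of the call under discussion can be
sketched as below. This is only an illustration: set_exe_file_from_fd() is a
hypothetical wrapper name, and the error list in the comment is my reading
of prctl(2) plus the change described in this patch, not an exhaustive
contract.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: ask the kernel to use the file behind 'fd' as the
 * exe_file of the calling process' MM.
 *
 * Returns 0 on success, otherwise the errno reported by prctl(), e.g.:
 *   EPERM  - caller lacks the required capability
 *   EBADF  - 'fd' is not a valid file descriptor
 *   EACCES - file is not executable; with this patch, also when the file
 *            is currently opened writable
 *   EBUSY  - the old exe file is still mapped
 */
int set_exe_file_from_fd(int fd)
{
	if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0UL, 0UL) == -1)
		return errno;
	return 0;
}
```

A restorer would call this with an fd it opened on the target binary and
treat any non-zero return as a failed restore step.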
> 
> However, if I understand the consequences of this change correctly, a
> problem could be restoring workloads that hold a writable fd open to
> their exe file at dump time which would mean that during restore that fd
> would be reopened writable causing CRIU to fail when setting the exe
> file for the task to be restored.
If it's their exe file, then the existing VM_DENYWRITE handling would
have forbidden these workloads from opening their exe file writable,
right? At least before doing any PR_SET_MM_MAP/PR_SET_MM_EXE_FILE. That
should rule out quite a lot of the cases we might be worried about.
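
A quick way to check this from user space is to try opening the running
binary for writing. A minimal sketch, assuming Linux with /proc mounted;
try_open_writable() is a hypothetical helper, not an existing API:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open 'path' for writing, without creating or
 * truncating it. Returns 0 if the open succeeded, otherwise the errno value.
 */
int try_open_writable(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

With write access denied on the exe file, I would expect
try_open_writable("/proc/self/exe") to return ETXTBSY (or EACCES if the
caller has no write permission on the binary in the first place).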
> 
> Honestly, I have no idea how many such workloads exist. (I know that at
> least runC and LXC sometimes need to re-exec themselves (to protect
> against a weird bug that allows attacking the exe file) and thus re-open
> /proc/self/exe, but read-only.)
> 
>>   kernel/fork.c | 39 ++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 6bd2e52bcdfb..5d904878f19b 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -476,6 +476,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   {
>>   	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>>   	struct rb_node **rb_link, *rb_parent;
>> +	struct file *exe_file;
>>   	int retval;
>>   	unsigned long charge;
>>   	LIST_HEAD(uf);
>> @@ -493,7 +494,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
>>   
>>   	/* No ordering required: file already has been exposed. */
>> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>> +	exe_file = get_mm_exe_file(oldmm);
>> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
>> +	if (exe_file)
>> +		deny_write_access(exe_file);
>>   
>>   	mm->total_vm = oldmm->total_vm;
>>   	mm->data_vm = oldmm->data_vm;
>> @@ -638,8 +642,13 @@ static inline void mm_free_pgd(struct mm_struct *mm)
>>   #else
>>   static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>>   {
>> +	struct file *exe_file;
>> +
>>   	mmap_write_lock(oldmm);
>> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>> +	exe_file = get_mm_exe_file(oldmm);
>> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
>> +	if (exe_file)
>> +		deny_write_access(exe_file);
>>   	mmap_write_unlock(oldmm);
>>   	return 0;
>>   }
>> @@ -1163,11 +1172,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>>   	 */
>>   	old_exe_file = rcu_dereference_raw(mm->exe_file);
>>   
>> -	if (new_exe_file)
>> +	if (new_exe_file) {
>>   		get_file(new_exe_file);
>> +		/*
>> +		 * exec code is required to deny_write_access() successfully,
>> +		 * so this cannot fail
>> +		 */
>> +		deny_write_access(new_exe_file);
>> +	}
>>   	rcu_assign_pointer(mm->exe_file, new_exe_file);
>> -	if (old_exe_file)
>> +	if (old_exe_file) {
>> +		allow_write_access(old_exe_file);
>>   		fput(old_exe_file);
>> +	}
>>   }
>>   
>>   int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>> @@ -1194,10 +1211,22 @@ int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>>   	}
>>   
>>   	/* set the new file, lockless */
>> +	ret = deny_write_access(new_exe_file);
>> +	if (ret)
>> +		return -EACCES;
>>   	get_file(new_exe_file);
>> +
>>   	old_exe_file = xchg(&mm->exe_file, new_exe_file);
>> -	if (old_exe_file)
>> +	if (old_exe_file) {
>> +		/*
>> +		 * Don't race with dup_mmap() getting the file and disallowing
>> +		 * write access while someone might open the file writable.
>> +		 */
>> +		mmap_read_lock(mm);
>> +		allow_write_access(old_exe_file);
>>   		fput(old_exe_file);
>> +		mmap_read_unlock(mm);
>> +	}
>>   	return 0;
>>   }
>>   
>> -- 
>> 2.31.1
>>
> 
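
To make the counting argument behind the patch explicit, here is a small
userspace model of the write-access counter. It is a simplified sketch and
not kernel code: struct model_inode and the model_*() functions are
hypothetical names, the real counter (i_writecount) is an atomic_t, and the
real helpers take struct file / struct inode arguments.

```c
#include <errno.h>

/* Hypothetical stand-in for the inode's write counter:
 *   > 0: number of writers, < 0: number of "deny write" holders. */
struct model_inode {
	int writecount;
};

/* A writer may only be added while nobody denies write access. */
int model_get_write_access(struct model_inode *ino)
{
	if (ino->writecount < 0)
		return -ETXTBSY;
	ino->writecount++;
	return 0;
}

void model_put_write_access(struct model_inode *ino)
{
	ino->writecount--;
}

/* Write access may only be denied while there are no writers. */
int model_deny_write_access(struct model_inode *ino)
{
	if (ino->writecount > 0)
		return -ETXTBSY;
	ino->writecount--;
	return 0;
}

/* Must pair with a successful model_deny_write_access(). */
void model_allow_write_access(struct model_inode *ino)
{
	ino->writecount++;
}
```

In this model, exec takes the first deny reference, so the additional deny
in dup_mmap() cannot fail (the counter is already negative), and the file
only becomes writable again once every MM that holds it as exe_file has
dropped its reference via allow_write_access().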
-- 
Thanks,
David / dhildenb