Message-ID: <a44f794e-fe60-e261-3631-9107822d5c36@bytedance.com>
Date: Mon, 14 Nov 2022 00:41:21 +0800
From: Zhongkun He <hezhongkun.hzk@...edance.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: corbet@....net, mhocko@...e.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
linux-doc@...r.kernel.org
Subject: Re: [External] Re: [PATCH v2] mm: add new syscall
pidfd_set_mempolicy().
Hi Andrew, thanks for your reply.
> This sounds a bit suspicious. Please share much more detail about
> these races. If we proceed with this design then mpol_put_async()
> should have comments which fully describe the need for the async free.
>
> How do we *know* that these races are fully prevented with this
> approach? How do we know that mpol_put_async() won't free the data
> until the race window has fully passed?
A mempolicy can be associated either with a process or with a VMA.
All VMA manipulation is somewhat protected by a down_read on
mmap_lock. In the process context there has been no locking so far,
because only the task itself ever accessed its own mempolicy.
Now pidfd_set_mempolicy() lets another task replace the process
context mempolicy of the task referred to by the pidfd, so the old
mempolicy may be freed by pidfd_set_mempolicy() while alloc_pages()
in the target task is still using it; that is the race.
The process context mempolicy is used in:
alloc_pages()
alloc_pages_bulk_array_mempolicy()
policy_mbind_nodemask()
mempolicy_slab_node()
.....
Say something like the following:

pidfd_set_mempolicy():          target task stack:
                                alloc_pages():
                                  mpol = p->mempolicy;
  task_lock(task);
  old = task->mempolicy;
  task->mempolicy = new;
  task_unlock(task);
  mpol_put(old);
                                  /* old mpol has been freed. */
                                  policy_node(...., mpol);
                                  __alloc_pages(mpol);
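
For reference, the lockless fast path that makes the stale pointer
visible looks roughly like this (a simplified sketch of
get_task_policy() in mm/mempolicy.c; the boot-time/fallback details
are trimmed, so treat it as illustrative only):

    struct mempolicy *get_task_policy(struct task_struct *p)
    {
            struct mempolicy *pol = p->mempolicy;  /* no lock, no refcount */

            if (pol)
                    return pol;

            return &default_policy;         /* system default policy */
    }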
To avoid adding locks and atomic operations (mpol_get/put) to the
hot path, mpol_put_async() uses task_work: once the target task
returns to user mode it can no longer be using its process context
mempolicy, so mpol_free_async() is queued as task_work and frees the
old mempolicy in the target task's context at that point.
> Also, in some situations mpol_put_async() will free the data
> synchronously anyway, so aren't these races still present?
If the task has already run exit_task_work(), task_work_add() will
fail; in that case we can free the mempolicy directly because the
exiting task no longer uses it (see the sketch below).
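
A minimal sketch of the idea (not the exact patch code; it assumes
the usual mm/mempolicy.c environment — policy_cache, the refcnt in
struct mempolicy, <linux/task_work.h> — and that struct mempolicy
gains a callback_head member, here called free_work, for the
deferred free):

    static void mpol_free_async(struct callback_head *work)
    {
            struct mempolicy *pol = container_of(work, struct mempolicy,
                                                 free_work);

            kmem_cache_free(policy_cache, pol);
    }

    /* Drop a reference to @pol that may still be in use by @task. */
    static void mpol_put_async(struct task_struct *task,
                               struct mempolicy *pol)
    {
            if (!atomic_dec_and_test(&pol->refcnt))
                    return;

            init_task_work(&pol->free_work, mpol_free_async);
            /*
             * Defer the free until @task next returns to user mode; by
             * then it cannot be inside alloc_pages() using @pol anymore.
             */
            if (task_work_add(task, &pol->free_work, TWA_RESUME))
                    /* @task has passed exit_task_work(): free right away. */
                    kmem_cache_free(policy_cache, pol);
    }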
>
> Secondly, why was the `flags' argument added? We might use it one day?
> For what purpose? I mean, every syscall could have a does-nothing
> `flags' arg, but we don't do that. What's the plan here?
>
I noticed that some syscalls take a 'flags' argument for future
extensibility, such as process_madvise() and set_mempolicy_home_node().
Back to our case: this operation has per-thread rather than per-process
semantics, and a flags argument could later be used to switch between
the two, or for other extensions, but I'm not sure.
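
For illustration only, this is the common convention for a reserved
flags argument (names and argument order here are hypothetical, not
necessarily the exact prototype in the patch): unknown flags are
rejected now so the bits can gain meaning later without breaking
userspace.

    SYSCALL_DEFINE5(pidfd_set_mempolicy, int, pidfd, int, mode,
                    const unsigned long __user *, nmask,
                    unsigned long, maxnode, unsigned int, flags)
    {
            if (flags)
                    return -EINVAL;  /* reserve flag bits for the future */

            /* ... look up the task from @pidfd and swap its mempolicy ... */
            return 0;
    }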
Thanks.