[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8a5d8ac1-327d-cee0-150a-3ec152134351@colorfullife.com>
Date: Sat, 30 Oct 2021 15:11:44 +0200
From: Manfred Spraul <manfred@...orfullife.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>,
Alexander Mikhalitsyn <alexander.mikhalitsyn@...tuozzo.com>
Cc: linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Davidlohr Bueso <dave@...olabs.net>,
Greg KH <gregkh@...uxfoundation.org>,
Andrei Vagin <avagin@...il.com>,
Pavel Tikhomirov <ptikhomirov@...tuozzo.com>,
Vasily Averin <vvs@...tuozzo.com>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
stable@...r.kernel.org
Subject: Re: [PATCH 2/2] shm: extend forced shm destroy to support objects
from several IPC nses
On 10/30/21 06:26, Eric W. Biederman wrote:
> Alexander Mikhalitsyn <alexander.mikhalitsyn@...tuozzo.com> writes:
>
>> Currently, exit_shm function not designed to work properly when
>> task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
>>
>> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
>> it leads to use-after-free (reproducer exists).
>>
>> That particular patch is attempt to fix the problem by extending exit_shm
>> mechanism to handle shm's destroy from several IPC ns'es.
>>
>> To achieve that we do several things:
>> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel
>> 2. during new shm object creation (newseg()/shmget syscall) we initialize
>> this pointer by current task IPC ns
>> 3. exit_shm() fully reworked such that it traverses over all
>> shp's in task->sysvshm.shm_clist and gets IPC namespace not
>> from current task as it was before but from shp's object itself, then
>> call shm_destroy(shp, ns).
>>
>> Note. We need to be really careful here, because as it was said before
>> (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
>> special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
>> only if IPC ns not in the "state of destruction".
>
>
>
>> Q/A
>>
>> Q: Why we can access shp->ns memory using non-refcounted pointer?
>> A: Because shp object lifetime is always shorther
>> than IPC namespace lifetime, so, if we get shp object from the
>> task->sysvshm.shm_clist while holding task_lock(task) nobody can
>> steal our namespace.
> Not true. A struct shmid_kernel can outlive the namespace in which it
> was created. I you look at do_shm_rmid which is called when the
> namespace is destroyed for every shmid_kernel in the namespace that if
> the struct shmid_kernel still has users only ipc_set_key_private is
> called. The struct shmid_kernel continues to exist.
No, shm_nattach is always 0 when a namespace is destroyed.
Thus it is impossible that shmid_kernel continues to exist.
Let's check all shm_nattach modifications:
1) do_shmat:
shp->shm_nattach++;
sfd->ns = get_ipc_ns(ns);
shp->shm_nattach--;
pairs with
shm_release()
put_ipc_ns()
2) shm_open()
only shp->shm_nattach++
shm_open unconditionally accesses shm_file_data, i.e. sfd must be valid,
there must be a reference to the namespace
pairs with shm_close()
only shp->shm_nattach--;
shm_close unconditionally accesses shm_file_data, i.e. sfd must be
valid, there must be a reference to the namespace
As shm_open()/close "nests" inside do_shmat: there is always a get_ipc_ns().
Or, much simpler: Check shm_open() and shm_close():
These two functions address a shm segment by namespace and ID, not by a
shm pointer. Thus _if_ it is possible that shm_nattach is > 0 at
namespace destruction, then there would be far more issues.
Or: Attached is a log file, a test application, and a patch that adds
pr_info statements.
The namespace is destroyed immediately when no segments are mapped, the
destruction is delayed until exit() if there are mapped segments.
>> Q: Does this patch change semantics of unshare/setns/clone syscalls?
>> A: Not. It's just fixes non-covered case when process may leave
>> IPC namespace without getting task->sysvshm.shm_clist list cleaned up.
>
> Just reading through exit_shm the code is not currently safe.
>
> At a minimum do_shm_rmid needs to set the shp->ns to NULL. Otherwise
> the struct shmid_kernel can contain a namespace pointer after
> the namespace exits. Which results in a different use after free.
No [unless there are additional bugs]
>
> Beyond that there is dropping the task lock. The code holds a reference
> to the namespace which means that the code does not need to worry about
> free_ipcs. References from mappings are still possible.
>
> Which means that the code could see:
> exit_shm()
> task_lock()
> shp = ...;
> task_unlock()
> shm_close()
> down_write(&shm_ids(ns).rwsem);
> ...
> shm_destroy(shp);
> up_write(&shm_ids(ns).rwsem);
> down_write(&shm_ids(ns)->rwsem);
> shm_lock_by_ptr(shp); /* use after free */
>
>
> I am trying to imagine how to close that race with the current code
> structure. Maybe something could be done by looking at shm_nattach
> count and making it safe to look at that count under the task_lock.
There is no race. Before dropping task_lock, a reference to both the
namespace and the shp pointer is obtained.
Thus neither one can disappear.
> But even then because shmid_kernel is still in the hash table it could
> be mapped and unmapped in the window when task_lock was dropped.
We have ipc_valid_object(), i.e. perm->deleted. If set, then the pointer
and the spinlock are valid, even though the rest is already destroyed.
ipc_rmid() just sets deleted, the (rcu delayed) kfree is done via
ipc_rcu_putref().
> Alternatively shmctl(id, IPC_RMID) can be called in when task_lock is
> dropped. Much less code is involved than mapping and unmapping so it is
> much more likely to win the race.
>
> I don't see how that race can be closed.
>
> Am I missing something?
>
> Eric
>
>
>> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
>>
>> Cc: "Eric W. Biederman" <ebiederm@...ssion.com>
>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>> Cc: Davidlohr Bueso <dave@...olabs.net>
>> Cc: Greg KH <gregkh@...uxfoundation.org>
>> Cc: Andrei Vagin <avagin@...il.com>
>> Cc: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
>> Cc: Vasily Averin <vvs@...tuozzo.com>
>> Cc: Manfred Spraul <manfred@...orfullife.com>
>> Cc: Alexander Mikhalitsyn <alexander@...alicyn.com>
>> Cc: stable@...r.kernel.org
>> Co-developed-by: Manfred Spraul <manfred@...orfullife.com>
>> Signed-off-by: Manfred Spraul <manfred@...orfullife.com>
>> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@...tuozzo.com>
Should/can I mark that I have tested the code?
I would drop one change and one comment is incorrect, otherwise no
findings. See the attached 0002 patch
Tested-by: Manfred Spraul <manfred@...orfullife.com>
>> ---
>> include/linux/ipc_namespace.h | 15 +++
>> include/linux/sched/task.h | 2 +-
>> include/linux/shm.h | 2 +-
>> ipc/shm.c | 170 +++++++++++++++++++++++++---------
>> 4 files changed, 142 insertions(+), 47 deletions(-)
>>
>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>> index 05e22770af51..b75395ec8d52 100644
>> --- a/include/linux/ipc_namespace.h
>> +++ b/include/linux/ipc_namespace.h
>> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>> return ns;
>> }
>>
>> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
>> +{
>> + if (ns) {
>> + if (refcount_inc_not_zero(&ns->ns.count))
>> + return ns;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> extern void put_ipc_ns(struct ipc_namespace *ns);
>> #else
>> static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
>> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>> return ns;
>> }
>>
>> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
>> +{
>> + return ns;
>> +}
>> +
>> static inline void put_ipc_ns(struct ipc_namespace *ns)
>> {
>> }
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index ef02be869cf2..bfdf84dab4be 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
>> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>> * subscriptions and synchronises with wait4(). Also used in procfs. Also
>> * pins the final release of task.io_context. Also protects ->cpuset and
>> - * ->cgroup.subsys[]. And ->vfork_done.
>> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
>> *
>> * Nests both inside and outside of read_lock(&tasklist_lock).
>> * It must not be nested with write_lock_irq(&tasklist_lock),
>> diff --git a/include/linux/shm.h b/include/linux/shm.h
>> index d8e69aed3d32..709f6d0451c0 100644
>> --- a/include/linux/shm.h
>> +++ b/include/linux/shm.h
>> @@ -11,7 +11,7 @@ struct file;
>>
>> #ifdef CONFIG_SYSVIPC
>> struct sysv_shm {
>> - struct list_head shm_clist;
>> + struct list_head shm_clist;
>> };
>>
This is a whitespace only change. We can drop it.
>> long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index 748933e376ca..29667e17b12a 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
>> struct pid *shm_lprid;
>> struct ucounts *mlock_ucounts;
>>
>> - /* The task created the shm object. NULL if the task is dead. */
>> + /*
>> + * The task created the shm object, for looking up
>> + * task->sysvshm.shm_clist_lock
>> + */
>> struct task_struct *shm_creator;
>> - struct list_head shm_clist; /* list by creator */
>> +
>> + /*
>> + * list by creator. shm_clist_lock required for read/write
>> + * if list_empty(), then the creator is dead already
>> + */
shm_clist_lock was replaced by task_lock(->shm_creator).
>> + struct list_head shm_clist;
>> + struct ipc_namespace *ns;
>> } __randomize_layout;
>>
>> /* shm_mode upper byte flags */
>> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>> struct shmid_kernel *shp;
>>
>> shp = container_of(ipcp, struct shmid_kernel, shm_perm);
>> + WARN_ON(ns != shp->ns);
>>
>> if (shp->shm_nattch) {
>> shp->shm_perm.mode |= SHM_DEST;
>> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
>> kfree(shp);
>> }
>>
>> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>> +/*
>> + * It has to be called with shp locked.
>> + * It must be called before ipc_rmid()
>> + */
>> +static inline void shm_clist_rm(struct shmid_kernel *shp)
>> {
>> - list_del(&s->shm_clist);
>> - ipc_rmid(&shm_ids(ns), &s->shm_perm);
>> + struct task_struct *creator;
>> +
>> + /*
>> + * A concurrent exit_shm may do a list_del_init() as well.
>> + * Just do nothing if exit_shm already did the work
>> + */
>> + if (list_empty(&shp->shm_clist))
>> + return;
>> +
>> + /*
>> + * shp->shm_creator is guaranteed to be valid *only*
>> + * if shp->shm_clist is not empty.
>> + */
>> + creator = shp->shm_creator;
>> +
>> + task_lock(creator);
>> + list_del_init(&shp->shm_clist);
>> + task_unlock(creator);
> Lock ordering
> rwsem
> ipc_lock
> task_lock
>
correct.
>> +}
>> +
>> +static inline void shm_rmid(struct shmid_kernel *s)
>> +{
>> + shm_clist_rm(s);
>> + ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
>> }
>>
>>
>> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>> shm_file = shp->shm_file;
>> shp->shm_file = NULL;
>> ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> - shm_rmid(ns, shp);
>> + shm_rmid(shp);
>> shm_unlock(shp);
>> if (!is_file_hugepages(shm_file))
>> shmem_lock(shm_file, 0, shp->mlock_ucounts);
>> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>> *
>> * 2) sysctl kernel.shm_rmid_forced is set to 1.
>> */
>> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>> +static bool shm_may_destroy(struct shmid_kernel *shp)
>> {
>> return (shp->shm_nattch == 0) &&
>> - (ns->shm_rmid_forced ||
>> + (shp->ns->shm_rmid_forced ||
>> (shp->shm_perm.mode & SHM_DEST));
>> }
>>
>> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
>> ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>> shp->shm_dtim = ktime_get_real_seconds();
>> shp->shm_nattch--;
>> - if (shm_may_destroy(ns, shp))
>> + if (shm_may_destroy(shp))
>> shm_destroy(ns, shp);
>> else
>> shm_unlock(shp);
>> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>> *
>> * As shp->* are changed under rwsem, it's safe to skip shp locking.
>> */
>> - if (shp->shm_creator != NULL)
>> + if (!list_empty(&shp->shm_clist))
>> return 0;
>>
>> - if (shm_may_destroy(ns, shp)) {
>> + if (shm_may_destroy(shp)) {
>> shm_lock_by_ptr(shp);
>> shm_destroy(ns, shp);
>> }
>> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>> /* Locking assumes this will only be called with task == current */
>> void exit_shm(struct task_struct *task)
>> {
>> - struct ipc_namespace *ns = task->nsproxy->ipc_ns;
>> - struct shmid_kernel *shp, *n;
>> + for (;;) {
>> + struct shmid_kernel *shp;
>> + struct ipc_namespace *ns;
>>
>> - if (list_empty(&task->sysvshm.shm_clist))
>> - return;
>> + task_lock(task);
>> +
>> + if (list_empty(&task->sysvshm.shm_clist)) {
>> + task_unlock(task);
>> + break;
>> + }
>> +
>> + shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
>> + shm_clist);
>> +
>> + /* 1) unlink */
>> + list_del_init(&shp->shm_clist);
>>
>> - /*
>> - * If kernel.shm_rmid_forced is not set then only keep track of
>> - * which shmids are orphaned, so that a later set of the sysctl
>> - * can clean them up.
>> - */
>> - if (!ns->shm_rmid_forced) {
>> - down_read(&shm_ids(ns).rwsem);
>> - list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
>> - shp->shm_creator = NULL;
>> /*
>> - * Only under read lock but we are only called on current
>> - * so no entry on the list will be shared.
>> + * 2) Get pointer to the ipc namespace. It is worth to say
>> + * that this pointer is guaranteed to be valid because
>> + * shp lifetime is always shorter than namespace lifetime
>> + * in which shp lives.
>> + * We taken task_lock it means that shp won't be freed.
>> */
>> - list_del(&task->sysvshm.shm_clist);
>> - up_read(&shm_ids(ns).rwsem);
>> - return;
>> - }
>> + ns = shp->ns;
>>
>> - /*
>> - * Destroy all already created segments, that were not yet mapped,
>> - * and mark any mapped as orphan to cover the sysctl toggling.
>> - * Destroy is skipped if shm_may_destroy() returns false.
>> - */
>> - down_write(&shm_ids(ns).rwsem);
>> - list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
>> - shp->shm_creator = NULL;
>> + /*
>> + * 3) If kernel.shm_rmid_forced is not set then only keep track of
>> + * which shmids are orphaned, so that a later set of the sysctl
>> + * can clean them up.
>> + */
>> + if (!ns->shm_rmid_forced) {
>> + task_unlock(task);
>> + continue;
>> + }
>>
>> - if (shm_may_destroy(ns, shp)) {
>> + /*
>> + * 4) get a reference to the namespace.
>> + * The refcount could be already 0. If it is 0, then
>> + * the shm objects will be free by free_ipc_work().
>> + */
>> + ns = get_ipc_ns_not_zero(ns);
>> + if (ns) {
>> + /*
>> + * 5) get a reference to the shp itself.
>> + * This cannot fail: shm_clist_rm() is called before
>> + * ipc_rmid(), thus the refcount cannot be 0.
>> + */
>> + WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
>> + }
>> +
>> + task_unlock(task);
> <<<<<<<<< BOOM >>>>>>>
>
> I don't see anything that prevents another task from
> calling shm_destroy(ns, shp) here and freeing it before
> this task can take the rwsem for writing.
shm_destroy() can be called. But due to the ipc_rcu_getref(), the
structure will remain valid.
>> +
>> + if (ns) {
>> + down_write(&shm_ids(ns).rwsem);
>> shm_lock_by_ptr(shp);
>> - shm_destroy(ns, shp);
>> + /*
>> + * rcu_read_lock was implicitly taken in
>> + * shm_lock_by_ptr, it's safe to call
>> + * ipc_rcu_putref here
>> + */
>> + ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
>> +
>> + if (ipc_valid_object(&shp->shm_perm)) {
And this will return false if there was a shm_destroy().
>> + if (shm_may_destroy(shp))
>> + shm_destroy(ns, shp);
>> + else
>> + shm_unlock(shp);
>> + } else {
>> + /*
>> + * Someone else deleted the shp from namespace
>> + * idr/kht while we have waited.
>> + * Just unlock and continue.
>> + */
-> just do a NOP if shm_destroy() was alread performed.
Actually, the same design is used by find_alloc_undo() in ipc/sem.c.
>> + shm_unlock(shp);
>> + }
>> +
>> + up_write(&shm_ids(ns).rwsem);
>> + put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
>> }
>> }
>> -
>> - /* Remove the list head from any segments still attached. */
>> - list_del(&task->sysvshm.shm_clist);
>> - up_write(&shm_ids(ns).rwsem);
>> }
>>
>> static vm_fault_t shm_fault(struct vm_fault *vmf)
>> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>> if (error < 0)
>> goto no_id;
>>
>> + shp->ns = ns;
>> +
>> + task_lock(current);
>> list_add(&shp->shm_clist, ¤t->sysvshm.shm_clist);
>> + task_unlock(current);
>>
>> /*
>> * shmid gets reported as "inode#" in /proc/pid/maps.
>> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>> down_write(&shm_ids(ns).rwsem);
>> shp = shm_lock(ns, shmid);
>> shp->shm_nattch--;
>> - if (shm_may_destroy(ns, shp))
>> +
>> + if (shm_may_destroy(shp))
>> shm_destroy(ns, shp);
>> else
>> shm_unlock(shp);
View attachment "0002-shm-extend-forced-shm-destroy-to-support-objects-fro.patch" of type "text/x-patch" (11893 bytes)
View attachment "0003-DEBUG-CODE-instrummented-ipc-shm.c.patch" of type "text/x-patch" (3848 bytes)
View attachment "shmns4.c" of type "text/x-csrc" (3215 bytes)
View attachment "log-ns4.txt" of type "text/plain" (5028 bytes)
Powered by blists - more mailing lists