Message-Id: <20230309083523.66592-1-qiuxu.zhuo@intel.com>
Date: Thu, 9 Mar 2023 16:35:23 +0800
From: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
To: tglx@...utronix.de
Cc: arjan.van.de.ven@...el.com, arjan@...ux.intel.com,
boqun.feng@...il.com, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, linux-kernel@...r.kernel.org,
mark.rutland@....com, maz@...nel.org, netdev@...r.kernel.org,
pabeni@...hat.com, peterz@...radead.org,
torvalds@...uxfoundation.org, wangyang.guo@...el.com,
will@...nel.org, x86@...nel.org
Subject: Re: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
Hi Thomas,
Some comments on the comments.
If I'm wrong, please correct me ;-).
> From: Thomas Gleixner <tglx@...utronix.de>
> To: LKML <linux-kernel@...r.kernel.org>
> Cc: Linus Torvalds <torvalds@...uxfoundation.org>,
> x86@...nel.org, Wangyang Guo <wangyang.guo@...el.com>,
> Arjan van De Ven <arjan@...ux.intel.com>,
> "David S. Miller" <davem@...emloft.net>,
> Eric Dumazet <edumazet@...gle.com>,
> Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
> netdev@...r.kernel.org, Will Deacon <will@...nel.org>,
> Peter Zijlstra <peterz@...radead.org>,
> Boqun Feng <boqun.feng@...il.com>,
> Mark Rutland <mark.rutland@....com>,
> Marc Zyngier <maz@...nel.org>,
> Arjan Van De Ven <arjan.van.de.ven@...el.com>
> Subject: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
>
> atomic_t based reference counting, including refcount_t, uses
> atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
> implemented with a atomic_try_cmpxchg() loop. High contention of the
> reference count leads to retry loops and scales badly. There is nothing to
> improve on this implementation as the semantics have to be preserved.
>
> Provide rcuref as a scalable alternative solution which is suitable for RCU
> managed objects. Similar to refcount_t it comes with overflow and underflow
> detection and mitigation.
>
> rcuref treats the underlying atomic_t as an unsigned integer and partitions
> this space into zones:
>
> 0x00000000 - 0x7FFFFFFF valid zone (1 .. INT_MAX references)
From rcuref_read()'s point of view:
0x00000000 encodes 1 reference, ..., so 0x7FFFFFFF should encode INT_MAX + 1 references (not INT_MAX).
> 0x80000000 - 0xBFFFFFFF saturation zone
> 0xC0000000 - 0xFFFFFFFE dead zone
> 0xFFFFFFFF no reference
>
> rcuref_get() unconditionally increments the reference count with
> atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
> reference count with atomic_add_negative_release().
>
> This unconditional increment avoids the inc_not_zero() problem, but
> requires a more complex implementation on the put() side when the count
> drops from 0 to -1.
>
> When this transition is detected then it is attempted to mark the reference
> count dead, by setting it to the midpoint of the dead zone with a single
> atomic_cmpxchg_release() operation. This operation can fail due to a
> concurrent rcuref_get() elevating the reference count from -1 to 0 again.
>
> If the unconditional increment in rcuref_get() hits a reference count which
> is marked dead (or saturated) it will detect it after the fact and bring
> back the reference count to the midpoint of the respective zone. The zones
> provide enough tolerance which makes it practically impossible to escape
> from a zone.
[...]
> + * Why not refcount?
> + * =================
> + *
> + * In principle it should be possible to make refcount use the rcuref
> + * scheme, but the destruction race described below cannot be prevented
> + * unless the protected object is RCU managed.
> + *
> + * Theory of operation
> + * ===================
> + *
> + * rcuref uses an unsigned integer reference counter. As long as the
> + * counter value is greater than or equal to RCUREF_ONEREF and not larger
> + * than RCUREF_MAXREF the reference is alive:
> + *
> + * ONEREF MAXREF SATURATED RELEASED DEAD NOREF
> + * 0 0x7FFFFFFF 0x80000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
> + * <---valid --------> <-------saturation zone-------> <-----dead zone----->
> + *
> + * The get() and put() operations do unconditional increments and
> + * decrements. The result is checked after the operation. This optimizes
> + * for the fast path.
> + *
> + * If the reference count is saturated or dead, then the increments and
> + * decrements are not harmful as the reference count still stays in the
> + * respective zones and is always set back to SATURATED resp. DEAD. The
> + * zones have room for 2^28 racing operations in each direction, which
> + * makes it practically impossible to escape the zones.
> + *
> + * Once the last reference is dropped the reference count becomes
> + * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
> + * slowpath then tries to set the reference count from RCUREF_NOREF to
> + * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
> + * concurrent rcuref_get() can acquire the reference count and bring it
> + * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
> + *
> + * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
> + * DEAD + 1, which is inside the dead zone. If that happens the reference
> + * count is put back to DEAD.
> + *
> + * The actual race is possible due to the unconditional increment and
> + * decrements in rcuref_get() and rcuref_put():
> + *
> + * T1 T2
> + * get() put()
> + * if (atomic_add_negative(1, &ref->refcnt))
For T2 put() here:
"if (atomic_add_negative(1, &ref->refcnt))" ->
"if (atomic_add_negative(-1, &ref->refcnt))"
> + * succeeds-> atomic_cmpxchg(&ref->refcnt, -1, DEAD);
Would it be more readable with 's/-1/NOREF/g'?
> + *
> + * atomic_add_negative(1, &ref->refcnt); <- Elevates refcount to DEAD + 1
> + *
> + * As the result of T1's add is negative, the get() goes into the slow path
> + * and observes refcnt being in the dead zone which makes the operation fail.
> + *
> + * Possible critical states:
> + *
> + * Context Counter References Operation
> + * T1 0 1 init()
> + * T2 1 2 get()
> + * T1 0 1 put()
> + * T2 -1 0 put() tries to mark dead
> + * T1 0 1 get()
> + * T2 0 1 put() mark dead fails
> + * T1 -1 0 put() tries to mark dead
> + * T1 DEAD 0 put() mark dead succeeds
> + * T2 DEAD+1 0 get() fails and puts it back to DEAD
> + *
> + * Of course there are more complex scenarios, but the above illustrates
> + * the working principle. The rest is left to the imagination of the
> + * reader.
> + *
> + * Deconstruction race
> + * ===================
> + *
> + * The release operation must be protected by prohibiting a grace period in
> + * order to prevent a possible use after free:
> + *
> + * T1 T2
> + * put() get()
> + * // ref->refcnt = ONEREF
> + * if (atomic_add_negative(-1, &ref->cnt))
For T1 put() here:
"if (atomic_add_negative(-1, &ref->cnt))" ->
"if (!atomic_add_negative(-1, &ref->cnt))"
> + * return false; <- Not taken
> + *
> + * // ref->refcnt == NOREF
> + * --> preemption
> + * // Elevates ref->c to ONEREF
s/ref->c/ref->refcnt/g
> + * if (!atomic_add_negative(1, &ref->refcnt))
> + * return true; <- taken
> + *
> + * if (put(&p->ref)) { <-- Succeeds
> + * remove_pointer(p);
> + * kfree_rcu(p, rcu);
> + * }
> + *
> + * RCU grace period ends, object is freed
> + *
> + * atomic_cmpxchg(&ref->refcnt, NONE, DEAD); <- UAF
s/NONE/NOREF/g
[...]