Message-ID: <77b8d1d8-4ad9-49c7-9c42-612e9de29881@mojatatu.com>
Date: Tue, 5 Dec 2023 17:19:21 -0300
From: Pedro Tammela <pctammela@...atatu.com>
To: Vlad Buslov <vladbu@...dia.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, edumazet@...gle.com,
 kuba@...nel.org, pabeni@...hat.com, jhs@...atatu.com,
 xiyou.wangcong@...il.com, jiri@...nulli.us, marcelo.leitner@...il.com
Subject: Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in
 tcf_idr_check_alloc

On 05/12/2023 15:34, Vlad Buslov wrote:
> On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@...atatu.com> wrote:
>> Instead of relying only on the idrinfo->lock mutex for
>> bind/alloc logic, rely on a combination of rcu + mutex + atomics
>> to better scale the case where multiple rtnl-less filters are
>> binding to the same action object.
>>
>> Action binding happens when an action index is specified explicitly and
>> an action exists which such index exists. Example:
> 
> Nit: the first sentence looks mangled, extra 'exists' word and probably
> 'which' should be 'with'.
> 
>>    tc actions add action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter ls ...
>>       filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>> When no index is specified, as before, grab the mutex and allocate
>> the next available id in the idr. In this version, as opposed to before,
>> this is simplified to storing the -EBUSY pointer directly instead of the
>> previous alloc + replace combination.
>>
>> When an index is specified, rely on rcu to find if there's an object at
>> such index. If there's none, fall back to the above, serializing on the
>> mutex and reserving the specified id. If there's one, it can be an -EBUSY
>> pointer, in which case we just try again until it's an action, or it can
>> already be an action.
>> Given the rcu guarantees, the action found could be dead and therefore
>> we need to bump the refcount if it's not 0, handling the case where it's
>> in fact 0.
>>
>> As bind and the action refcount are already atomics, these increments can
>> happen without the mutex protection while many tcf_idr_check_alloc calls
>> race to bind to the same action instance.
>>
>> Signed-off-by: Pedro Tammela <pctammela@...atatu.com>
>> ---
>>   net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>>   1 file changed, 34 insertions(+), 22 deletions(-)
>>
>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> index abec5c45b5a4..79a044d2ae02 100644
>> --- a/net/sched/act_api.c
>> +++ b/net/sched/act_api.c
>> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>>   	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>   	struct tc_action *p;
>>   	int ret;
>> +	u32 max;
>>   
>> -again:
>> -	mutex_lock(&idrinfo->lock);
>>   	if (*index) {
>> +again:
>> +		rcu_read_lock();
>>   		p = idr_find(&idrinfo->action_idr, *index);
>> +
>>   		if (IS_ERR(p)) {
>>   			/* This means that another process allocated
>>   			 * index but did not assign the pointer yet.
>>   			 */
>> -			mutex_unlock(&idrinfo->lock);
>> +			rcu_read_unlock();
>>   			goto again;
>>   		}
>>   
>> -		if (p) {
>> -			refcount_inc(&p->tcfa_refcnt);
>> -			if (bind)
>> -				atomic_inc(&p->tcfa_bindcnt);
>> -			*a = p;
>> -			ret = 1;
>> -		} else {
>> -			*a = NULL;
>> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -					    *index, GFP_KERNEL);
>> -			if (!ret)
>> -				idr_replace(&idrinfo->action_idr,
>> -					    ERR_PTR(-EBUSY), *index);
>> +		if (!p) {
>> +			/* Empty slot, try to allocate it */
>> +			max = *index;
>> +			rcu_read_unlock();
>> +			goto new;
>> +		}
>> +
>> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
>> +			/* Action was deleted in parallel */
>> +			rcu_read_unlock();
>> +			return -ENOENT;
> 
> The current version doesn't return ENOENT since it is synchronous. You are
> now introducing basically a change to UAPI, since users of this function
> (individual actions) are not prepared to retry on ENOENT and will
> propagate the error up the call chain. I guess you need to try to create
> a new action with the specified index instead.

I see.
So you are saying that in the case where action foo is deleted and a
binding in parallel observes the deleted action, it should fall back into
trying to allocate the index.

We could goto again and hope that idr_find will observe the idr index
being freed, in which case it falls back into action allocation, or
otherwise simply go via the same path as before (jumping to 'again').
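
Roughly (untested sketch on top of this patch, replacing the -ENOENT
return):

	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		/* Action was deleted in parallel: retry the lookup
		 * instead of returning -ENOENT. Either idr_find()
		 * observes the freed index and we fall through to the
		 * allocation path, or it returns the new pointer (or
		 * -EBUSY) and we go through 'again' as usual.
		 */
		rcu_read_unlock();
		goto again;
	}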

I don't see many problems here; it seems to converge in this scenario,
as it eventually transforms into a race for action allocation (more below)
if you have an unfortunate delete with many bindings in flight.

> 
>>   		}
>> +
>> +		if (bind)
>> +			atomic_inc(&p->tcfa_bindcnt);
>> +		*a = p;
>> +
>> +		rcu_read_unlock();
>> +
>> +		return 1;
>>   	} else {
>> +		/* Find a slot */
>>   		*index = 1;
>> -		*a = NULL;
>> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -				    UINT_MAX, GFP_KERNEL);
>> -		if (!ret)
>> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
>> -				    *index);
>> +		max = UINT_MAX;
>>   	}
>> +
>> +new:
>> +	*a = NULL;
>> +
>> +	mutex_lock(&idrinfo->lock);
>> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
>> +			    GFP_KERNEL);
> 
> What if multiple concurrent tasks don't find the action by index with
> rcu and get here, synchronizing on the idrinfo->lock? It looks like
> after the one that got the lock first successfully allocates the index,
> everyone else will fail (also propagating ENOSPC to the user).

Correct

> I guess you need some mechanism to account for such case and retry.

OK, so if I'm binding and a free index is observed (which means "try
to allocate"), and I then get an ENOSPC after jumping to 'new', try again,
but this time bind to the allocated action.

In this scenario, when we come back to 'again' we will wait until -EBUSY
is replaced with the real pointer. It seems like a big enough window that
any race to allocate from a binding would most probably end up in this
contention loop.
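
Something like this (untested sketch; 'requested' is a hypothetical flag
meaning the user passed an explicit index, i.e. we came from the bind
path):

	mutex_lock(&idrinfo->lock);
	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
			    GFP_KERNEL);
	mutex_unlock(&idrinfo->lock);
	/* Lost the allocation race while binding: the index is now taken,
	 * so go back and bind to whoever won it.
	 */
	if (ret == -ENOSPC && requested)
		goto again;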

However, I think that when we have these two retry mechanisms there's an
extremely small window for an infinite loop if an action delete is timed
just right, between when the action pointer is found and when we grab the
tcfa_refcnt:

	idr_find (pointer)
	tcfa_refcnt (0)  <-------|
	again:                   |
	idr_find (free index!)   |
	new:                     |
	idr_alloc_u32 (ENOSPC)   |
	again:                   |
	idr_find (EBUSY)         |
	again:                   |
	idr_find (pointer)       |
	<evil delete happens>    |
	------->>>>--------------|

Another potential problem is that this will race with non-binding
actions, so the ENOSPC could actually come from another, unrelated
action. A practical example would be a race between a binding to an
'action drop index 1' and an 'action ok' allocation. Actually it's a
general problem and not particular to this case, but it seems like we
could be amplifying it.

I'm conflicted here. If I were to choose one of the two, I would pick
the action respawning, as to me it seems to converge much quicker and
removes the uapi change (my bad! :).

As for usability, if all of a sudden there's a huge influx of ENOSPC
errors because users are abusing 'tc filter add ... action index 1 ...'
in parallel _before_ actually creating the action object, the fix is to
just:
tc actions add action index 1 ...
tc filter add ...

as tc has always supported.
