[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALzJLG-ZQxF0t1N5ry2OUqcBjrs+7XzuWb_dRDOjEnRf1975Hg@mail.gmail.com>
Date: Thu, 17 Nov 2016 11:48:55 +0200
From: Saeed Mahameed <saeedm@....mellanox.co.il>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: "David S. Miller" <davem@...emloft.net>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Brenden Blanco <bblanco@...mgrid.com>,
Zhiyi Sun <zhiyisun@...il.com>,
Rana Shahout <ranas@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
On Wed, Nov 16, 2016 at 5:26 PM, Daniel Borkmann <daniel@...earbox.net> wrote:
> On 11/16/2016 03:30 PM, Saeed Mahameed wrote:
>>
>> On Wed, Nov 16, 2016 at 3:54 PM, Daniel Borkmann <daniel@...earbox.net>
>> wrote:
>>>
>>> On 11/16/2016 01:25 PM, Saeed Mahameed wrote:
>>>>
>>>> On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann <daniel@...earbox.net>
>>>> wrote:
>>>>>
>>>>>
>>>>> There are multiple issues in mlx5e_xdp_set():
>>>>>
>>>>> 1) The batched bpf_prog_add() is currently not checked for errors! When
>>>>> doing so, it should be done at an earlier point in time to makes
>>>>> sure
>>>>> that we cannot fail anymore at the time we want to set the program
>>>>> for
>>>>> each channel. This only means that we have to undo the
>>>>> bpf_prog_add()
>>>>> in case we return due to required reset.
>>>>>
>>>>> 2) When swapping the priv->xdp_prog, then no extra reference count must
>>>>> be
>>>>> taken since we got that from call path via dev_change_xdp_fd()
>>>>> already.
>>>>> Otherwise, we'd never be able to free the program. Also,
>>>>> bpf_prog_add()
>>>>> without checking the return code could fail.
>>>>>
>>>>> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs
>>>>> support")
>>>>> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
>>>>> ---
>>>>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25
>>>>> ++++++++++++++++++-----
>>>>> 1 file changed, 20 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>> index ab0c336..cf26672 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>> @@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device
>>>>> *netdev, struct bpf_prog *prog)
>>>>> goto unlock;
>>>>> }
>>>>>
>>>>> + if (prog) {
>>>>> + /* num_channels is invariant here, so we can take the
>>>>> + * batched reference right upfront.
>>>>> + */
>>>>> + prog = bpf_prog_add(prog, priv->params.num_channels);
>>>>> + if (IS_ERR(prog)) {
>>>>> + err = PTR_ERR(prog);
>>>>> + goto unlock;
>>>>> + }
>>>>> + }
>>>>> +
>>>>
>>>>
>>>> With this approach you will end up taking a ref count twice per RQ! on
>>>> the first time xdp_set is called i.e (old_prog = NULL, prog != NULL).
>>>> One ref will be taken per RQ/Channel from the above code, and since
>>>> reset will be TRUE mlx5e_open_locked will be called and another ref
>>>> count will be taken on mlx5e_create_rq.
>>>>
>>>> The issue here is that we have two places for ref count accounting,
>>>> xdp_set and mlx5e_create_rq. Having ref-count updates in
>>>> mlx5e_create_rq is essential for num_channels configuration changes
>>>> (mlx5e_set_ringparam).
>>>>
>>>> Our previous approach made sure that only one path will do the ref
>>>> counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when
>>>> reset == false).
>>>
>>>
>>> That is correct, for a short time bpf_prog_add() was charged also when
>>> we reset. When we reset, we will then jump to unlock_put and safely undo
>>> this since we took ref from mlx5e_create_rq() already in that case, and
>>> return successfully. That was intentional, so that eventually we end up
>>> just taking a /single/ ref per RQ when we exit mlx5e_xdp_set(), but more
>>> below ...
>>>
>>> [...]
>>>>
>>>>
>>>> 2. Keep the current approach and make sure to not update the ref count
>>>> twice, you can batch update only if (!reset && was_open) otherwise you
>>>> can rely on mlx5e_open_locked to take the num_channels ref count for
>>>> you.
>>>>
>>>> Personally I prefer option 2, since we will keep the current logic
>>>> which will allow configuration updates such as (mlx5e_set_ringparam)
>>>> to not worry about ref counting since it will be done in the reset
>>>> flow.
>>>
>>>
>>> ... agree on keeping current approach. I actually like the idea, so we'd
>>> end up with this simpler version for the batched ref then.
>>
>>
>> Great :).
>>
>> So let's do the batched update only if we are not going to reset (we
>> already know that in advance), instead of the current patch where you
>> batch update unconditionally and then
>> unlock_put in case reset was performed (which is just redundant and
>> confusing).
>>
>> Please note that if open_locked fails you need to goto unlock_put.
>
>
> Sorry, I'm not quite sure I follow you here; are you /now/ commenting on
> the original patch or on the already updated diff I did from my very last
> email, that is, http://patchwork.ozlabs.org/patch/695564/ ?
>
I am commenting on this patch "V2 2/4".
this patch will take batched ref count unconditionally and release the
extra refs "unlock_put" if reset was performed.
I am asking to take the batched ref only if we are not going to reset
in advance to save the extra "unlock_put"
>>> Note, your "bpf_prog_add(prog, 1) // one ref for the device." is not
>>> needed
>>> since this we already got that one through dev_change_xdp_fd() as
>>> mentioned.
>>
>>
>> If it is not needed then why we need bpf_prog_put on mlx5e_nic_cleanup
>> in your next patch? this doesn't look symmetric (right) !
>> you shouldn't release a reference you didn't take.
>> Overall with this series the driver can take num_channels refs and
>> will release num_channels refs on mlx5e_close. we shouldn't release
>> one extra ref on NIC cleanup.
>
>
> I already explained this in the commit description; when dev_change_xdp_fd()
> is called and sees a valid fd, it does bpf_prog_get_type(), which calls the
> __bpf_prog_get() taking a ref on the program (bpf_prog_inc()). That is then
> passed down to ops->ndo_xdp(). Only in case of error from the ->ndo_xdp()
> callback, we bpf_prog_put() this reference from dev_change_xdp_fd() side.
>
> For drivers that implement against this ndo, it means that you need N-1 refs
> in addition. Have a look at the other two in-tree users, which do it
> correctly,
> that is mlx4_xdp_set() and nfp_net_xdp_setup().
>
> It's documented here (include/linux/netdevice.h) with ndo_xdp referring to
> it:
>
> /* These structures hold the attributes of xdp state that are being passed
> * to the netdevice through the xdp op.
> */
> enum xdp_netdev_command {
> /* Set or clear a bpf program used in the earliest stages of packet
> * rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The
> callee
> * is responsible for calling bpf_prog_put on any old progs that are
> * stored. In case of error, the callee need not release the new
> prog
> * reference, but on success it takes ownership and must
> bpf_prog_put
> * when it is no longer used.
> */
> XDP_SETUP_PROG,
> [...]
> };
>
> I think reason why this is rather confusing would be that initially, it was
> just a single prog for all queues, but it was requested along the way to
> move
> prog pointer down to queues instead and let them have their own ref, so that
> some time in future individual progs from a subset of the queues can be
> exchanged.
>
> Since the way xdp in mlx5 is implemented, you currently have the
> priv->xdp_prog
> as the control prog. That's okay right now, but requires to drop the last
> ref
> on priv->xdp_prog via bpf_prog_put() when netdev is dismantled and
> priv->xdp_prog
> still present.
Got it, but i would rather make the stack cleanup those refs once it
receives an unregester_netdev event instead of counting on netdev to
do it, as i said before, like any other resource added to the netdev
by the stack (vlan/mac/etc..).
but this is a general discussion of xdp API, it won't block this series :).
Saeed.
Powered by blists - more mailing lists