lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 1 Feb 2023 21:36:01 +0000
From:   Vadim Fedorenko <vadim.fedorenko@...ux.dev>
To:     Saeed Mahameed <saeed@...nel.org>,
        Vadim Fedorenko <vadfed@...a.com>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        Rahul Rameshbabu <rrameshbabu@...dia.com>,
        Tariq Toukan <ttoukan.linux@...il.com>,
        Gal Pressman <gal@...dia.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net v4 2/2] mlx5: fix possible ptp queue fifo
 use-after-free

On 01/02/2023 18:19, Saeed Mahameed wrote:
> On 01 Feb 04:26, Vadim Fedorenko wrote:
>> Fifo indexes were not checked during pop operations and it leads to
>> potential use-after-free when poping from empty queue. Such case was
>> possible during re-sync action.
>>
>> There were out-of-order cqe spotted which lead to drain of the queue and
>> use-after-free because of lack of fifo pointers check. Special check
>> is added to avoid resync operation if SKB could not exist in the fifo
>> because of OOO cqe (skb_id must be between consumer and producer index).
>>
>> Fixes: 58a518948f60 ("net/mlx5e: Add resiliency for PTP TX port 
>> timestamp")
>> Signed-off-by: Vadim Fedorenko <vadfed@...a.com>
>> ---
>> .../net/ethernet/mellanox/mlx5/core/en/ptp.c  | 23 ++++++++++++++-----
>> .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  4 +++-
>> 2 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
>> index b72de2b520ec..5df726185192 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
>> @@ -86,7 +86,7 @@ static bool mlx5e_ptp_ts_cqe_drop(struct mlx5e_ptpsq 
>> *ptpsq, u16 skb_cc, u16 skb
>>     return (ptpsq->ts_cqe_ctr_mask && (skb_cc != skb_id));
>> }
>>
>> -static void mlx5e_ptp_skb_fifo_ts_cqe_resync(struct mlx5e_ptpsq 
>> *ptpsq, u16 skb_cc,
>> +static bool mlx5e_ptp_skb_fifo_ts_cqe_resync(struct mlx5e_ptpsq 
>> *ptpsq, u16 skb_cc,
>>                          u16 skb_id, int budget)
>> {
>>     struct skb_shared_hwtstamps hwts = {};
>> @@ -94,14 +94,23 @@ static void 
>> mlx5e_ptp_skb_fifo_ts_cqe_resync(struct mlx5e_ptpsq *ptpsq, u16 skb_
>>
>>     ptpsq->cq_stats->resync_event++;
>>
>> -    while (skb_cc != skb_id) {
>> -        skb = mlx5e_skb_fifo_pop(&ptpsq->skb_fifo);
>> +    if (skb_cc > skb_id || PTP_WQE_CTR2IDX(ptpsq->skb_fifo_pc) < 
>> skb_id) {
> 
> To avoid returning boolean and add more functionality to this function,
> I prefer to put this check in mlx5e_ptp_handle_ts_cqe(), see below.
>

Let's discuss this point, see comments below.

>> +        mlx5_core_err_rl(ptpsq->txqsq.mdev, "out-of-order ptp cqe\n");
> 
> it's better to add a counter for this, eg: ptpsq->cq_stats->ooo_cqe_drop++;

Sure, will do.

> 
>> +        return false;
>> +    }
>> +
>> +    while (skb_cc != skb_id && (skb = 
>> mlx5e_skb_fifo_pop(&ptpsq->skb_fifo))) {
>>         hwts.hwtstamp = mlx5e_skb_cb_get_hwts(skb)->cqe_hwtstamp;
>>         skb_tstamp_tx(skb, &hwts);
>>         ptpsq->cq_stats->resync_cqe++;
>>         napi_consume_skb(skb, budget);
>>         skb_cc = PTP_WQE_CTR2IDX(ptpsq->skb_fifo_cc);
>>     }
>> +
>> +    if (!skb)
>> +        return false;
>> +
>> +    return true;
>> }
>>
>> static void mlx5e_ptp_handle_ts_cqe(struct mlx5e_ptpsq *ptpsq,
>> @@ -111,7 +120,7 @@ static void mlx5e_ptp_handle_ts_cqe(struct 
>> mlx5e_ptpsq *ptpsq,
>>     u16 skb_id = PTP_WQE_CTR2IDX(be16_to_cpu(cqe->wqe_counter));
>>     u16 skb_cc = PTP_WQE_CTR2IDX(ptpsq->skb_fifo_cc);
>>     struct mlx5e_txqsq *sq = &ptpsq->txqsq;
>> -    struct sk_buff *skb;
>> +    struct sk_buff *skb = NULL;
>>     ktime_t hwtstamp;
>>
>>     if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
>> @@ -120,8 +129,10 @@ static void mlx5e_ptp_handle_ts_cqe(struct 
>> mlx5e_ptpsq *ptpsq,
>>         goto out;
>>     }
>>
>> -    if (mlx5e_ptp_ts_cqe_drop(ptpsq, skb_cc, skb_id))
>> -        mlx5e_ptp_skb_fifo_ts_cqe_resync(ptpsq, skb_cc, skb_id, budget);
> 
> you can check here:
>      /* ignore ooo cqe as it was already handled by a previous resync */
>      if (ooo_cqe(cqe))
>          return;

I assume that mlx5e_ptp_skb_fifo_ts_cqe_resync() is error-recovery path 
and should not happen frequently. If we move this check to 
mlx5e_ptp_handle_ts_cqe() we will have additional if with 2 checks for 
every cqe coming from ptp queue - the fast path. With our load of 1+Mpps 
it could be countable. Another point is that 
mlx5e_ptp_skb_fifo_ts_cqe_resync() will assume that skb_id must always 
be within fifo indexes and any other (future?) code has to implement 
this check. Otherwise it will cause use-after-free, double-free and then 
crash, especially if we remove check in mlx5e_skb_fifo_pop() that you 
commented below. I think it's ok to have additional check in error path 
to prevent anything like that in the future.

If you stongly against converting mlx5e_ptp_skb_fifo_ts_cqe_resync() to 
return bool, I can add the check to 'if mlx5e_ptp_ts_cqe_drop()' scope 
before calling resync(), but it will not remove problems from my second 
point and I just would like to see explicit 'yes, we agree to have 
dangerous code in our driver' from you or other maintainers in this case.

>> +    if (mlx5e_ptp_ts_cqe_drop(ptpsq, skb_cc, skb_id) &&
>> +        !mlx5e_ptp_skb_fifo_ts_cqe_resync(ptpsq, skb_cc, skb_id, 
>> budget)) {
>> +        goto out;
>> +    }
>>
>>     skb = mlx5e_skb_fifo_pop(&ptpsq->skb_fifo);
>>     hwtstamp = mlx5e_cqe_ts_to_ns(sq->ptp_cyc2time, sq->clock, 
>> get_cqe_ts(cqe));
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> index d5afad368a69..e599b86d94b5 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> @@ -295,13 +295,15 @@ static inline
>> void mlx5e_skb_fifo_push(struct mlx5e_skb_fifo *fifo, struct sk_buff 
>> *skb)
>> {
>>     struct sk_buff **skb_item = mlx5e_skb_fifo_get(fifo, (*fifo->pc)++);
>> -
> 
> redundant change.

yep, will remove this artefact.

> 
>>     *skb_item = skb;
>> }
>>
>> static inline
>> struct sk_buff *mlx5e_skb_fifo_pop(struct mlx5e_skb_fifo *fifo)
>> {
>> +    if (*fifo->pc == *fifo->cc)
>> +        return NULL;
>> +
> 
> I think this won't be necessary if you check for ooo early on
> mlx5e_ptp_handle_ts_cqe() like i suggested above.
> 
And again the only concern here is that we don't have any checks whether 
FIFO has anything to pop before actually poping the value. That can 
easily bring use-after-free in the futuee, especially because the 
function is not ptp specific and is already used for other fifos. I can 
add unlikely() for this check if it helps, but I would like to have this 
check in the code.

>>     return *mlx5e_skb_fifo_get(fifo, (*fifo->cc)++);
>> }
>>
>> -- 
>> 2.30.2
>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ