[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <7f5fd10d-aaf9-4259-9505-3fbabc3ba102@redhat.com>
Date: Wed, 27 Nov 2024 13:11:05 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Parthiban.Veerasooran@...rochip.com
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
UNGLinuxDriver@...rochip.com, jacob.e.keller@...el.com,
andrew+netdev@...n.ch, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org
Subject: Re: [PATCH net v2 2/2] net: ethernet: oa_tc6: fix tx skb race
condition between reference pointers
On 11/27/24 11:49, Parthiban.Veerasooran@...rochip.com wrote:
> On 26/11/24 4:11 pm, Paolo Abeni wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On 11/22/24 11:21, Parthiban Veerasooran wrote:
>>> There are two skb pointers to manage tx skb's enqueued from n/w stack.
>>> waiting_tx_skb pointer points to the tx skb which needs to be processed
>>> and ongoing_tx_skb pointer points to the tx skb which is being processed.
>>>
>>> SPI thread prepares the tx data chunks from the tx skb pointed by the
>>> ongoing_tx_skb pointer. When the tx skb pointed by the ongoing_tx_skb is
>>> processed, the tx skb pointed by the waiting_tx_skb is assigned to
>>> ongoing_tx_skb and the waiting_tx_skb pointer is assigned with NULL.
>>> Whenever there is a new tx skb from n/w stack, it will be assigned to
>>> waiting_tx_skb pointer if it is NULL. Enqueuing and processing of a tx skb
>>> handled in two different threads.
>>>
>>> Consider a scenario where the SPI thread processed an ongoing_tx_skb and
>>> it moves next tx skb from waiting_tx_skb pointer to ongoing_tx_skb pointer
>>> without doing any NULL check. At this time, if the waiting_tx_skb pointer
>>> is NULL then ongoing_tx_skb pointer is also assigned with NULL. After
>>> that, if a new tx skb is assigned to waiting_tx_skb pointer by the n/w
>>> stack and there is a chance to overwrite the tx skb pointer with NULL in
>>> the SPI thread. Finally one of the tx skb will be left as unhandled,
>>> resulting packet missing and memory leak.
>>> To overcome the above issue, protect the moving of tx skb reference from
>>> waiting_tx_skb pointer to ongoing_tx_skb pointer so that the other thread
>>> can't access the waiting_tx_skb pointer until the current thread completes
>>> moving the tx skb reference safely.
>>
>> A mutex looks overkill. Why don't you use a spinlock? why locking only
>> one side (the writer) would be enough?
> Ah my bad, missed to protect tc6->waiting_tx_skb = skb. So that it will
> become like below,
>
> mutex_lock(&tc6->tx_skb_lock);
> tc6->waiting_tx_skb = skb;
> mutex_unlock(&tc6->tx_skb_lock);
>
> As both are not called from atomic context and they are allowed to
> sleep, I used mutex rather than spinlock.
>>
>> Could you please report the exact sequence of events in a time diagram
>> leading to the bug, something alike the following?
>>
>> CPU0 CPU1
>> oa_tc6_start_xmit
>> ...
>> oa_tc6_spi_thread_handler
>> ...
> Good case:
> ----------
> Consider waiting_tx_skb is NULL.
>
> Thread1 (oa_tc6_start_xmit) Thread2 (oa_tc6_spi_thread_handler)
> --------------------------- -----------------------------------
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL
> - ongoing_tx_skb = waiting_tx_skb
> - waiting_tx_skb = NULL
> ...
> - ongoing_tx_skb = NULL
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL
> - ongoing_tx_skb = waiting_tx_skb
> - waiting_tx_skb = NULL
> ...
> - ongoing_tx_skb = NULL
> ....
>
> Bad case:
> ---------
> Consider waiting_tx_skb is NULL.
>
> Thread1 (oa_tc6_start_xmit) Thread2 (oa_tc6_spi_thread_handler)
> --------------------------- -----------------------------------
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL
AFAICS, if 'waiting_tx_skb == NULL and Thread2 is in
oa_tc6_spi_thread_handler()/oa_tc6_prepare_spi_tx_buf_for_tx_skbs()
then ongoing_tx_skb can not be NULL, due to the previous check in:
https://elixir.bootlin.com/linux/v6.12/source/drivers/net/ethernet/oa_tc6.c#L1064
This looks like a single reader/single write scenarios that does not
need any lock to ensure consistency.
Do you observe any memory leak in real life scenarios?
BTW it looks like both oa_tc6_start_xmit and oa_tc6_spi_thread_handler
are possibly lacking memory barriers to avoid missing wake-ups.
Cheers,
Paolo
Powered by blists - more mailing lists