netdev - Re: [PATCH net] can: isotp: isotp_rcv

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <b4be7878-e461-e2c3-2aaf-89598ac8a64f@hartkopp.net>
Date:   Wed, 9 Feb 2022 08:54:09 +0100
From:   Oliver Hartkopp <socketcan@...tkopp.net>
To:     Marc Kleine-Budde <mkl@...gutronix.de>
Cc:     "Ziyang Xuan (William)" <william.xuanziyang@...wei.com>,
        davem@...emloft.net, kuba@...nel.org, linux-can@...r.kernel.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Hi Marc,

On 07.02.22 09:11, Marc Kleine-Budde wrote:
> On 28.01.2022 15:48:05, Oliver Hartkopp wrote:
>> Hello Marc, hello William,
>>
>> On 28.01.22 09:46, Marc Kleine-Budde wrote:
>>> On 28.01.2022 09:32:40, Oliver Hartkopp wrote:
>>>>
>>>>
>>>> On 28.01.22 09:07, Marc Kleine-Budde wrote:
>>>>> On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
>>>>>> I've seen the frame processing sometimes freezes for one second when
>>>>>> stressing the isotp_rcv() from multiple sources. This finally freezes
>>>>>> the entire softirq which is either not good and not needed as we only
>>>>>> need to fix this race for stress tests - and not for real world usage
>>>>>> that does not create this case.
>>>>>
>>>>> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?
>>
>>
>>>> #
>>>> # Lock Debugging (spinlocks, mutexes, etc...)
>>>> #
>>>> CONFIG_LOCK_DEBUGGING_SUPPORT=y
>>>> # CONFIG_PROVE_LOCKING is not set
>>> CONFIG_PROVE_LOCKING=y
>>
>> Now enabled even more locking (seen relevant kernel config at the end).
>>
>> It turns out that there is no visible difference when using spin_lock() or
>> spin_trylock().
>>
>> I only got some of these kernel log entries
>>
>> Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long
>> (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250
>> Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long
>> (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500
>> Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long
>> (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750
>> Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long
>> (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750
>> Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long
>> (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750
>>
>> But I get these sporadically anyway. No other LOCKDEP splat.
>>
>> At least the issue reported by William should be fixed now - but I'm still
>> unclear whether spin_lock() or spin_trylock() is the best approach here in
>> the NET_RX softirq?!?
> 
> With the !spin_trylock() -> return you are saying if something
> concurrent happens, drop it. This doesn't sound correct.

Yes, I had the same feeling and did some extensive load tests using both 
variants.

It turned out the standard spin_lock() works excellent to fix the issue.

Thanks for taking it for upstream here:
https://lore.kernel.org/linux-can/20220209074818.3ylfz4zmuhit7orc@pengutronix.de/T/#t

Best regards,
Oliver