netdev - Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <27025084-eee4-d851-7473-ea456a5e8168@cogentembedded.com>
Date:   Sat, 26 May 2018 20:31:53 +0300
From:   Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
To:     Vladimir Zapolskiy <vladimir_zapolskiy@...tor.com>,
        "David S. Miller" <davem@...emloft.net>
Cc:     netdev@...r.kernel.org, linux-renesas-soc@...r.kernel.org
Subject: Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared
 ethtool handlers

On 05/25/2018 09:25 AM, Vladimir Zapolskiy wrote:

>>>> For ages trivial changes to RAVB and SuperH ethernet links by means of
>>>> standard 'ethtool' trigger a 'sleeping function called from invalid
>>>> context' bug, to visualize it on r8a7795 ULCB:
>>>>
>>>>   % ethtool -r eth0
>>>>   BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
>>>>   in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool
>>>>   INFO: lockdep is turned off.
>>>>   irq event stamp: 0
>>>>   hardirqs last  enabled at (0): [<0000000000000000>]           (null)
>>>>   hardirqs last disabled at (0): [<ffff0000080e1d3c>] copy_process.isra.7.part.8+0x2cc/0x1918
>>>>   softirqs last  enabled at (0): [<ffff0000080e1d3c>] copy_process.isra.7.part.8+0x2cc/0x1918
>>>>   softirqs last disabled at (0): [<0000000000000000>]           (null)
>>>>   CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33
>>>>   Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT)
>>>>   Call trace:
>>>>    dump_backtrace+0x0/0x198
>>>>    show_stack+0x24/0x30
>>>>    dump_stack+0xb8/0xf4
>>>>    ___might_sleep+0x1c8/0x1f8
>>>>    __might_sleep+0x58/0x90
>>>>    __mutex_lock+0x50/0x890
>>>>    mutex_lock_nested+0x3c/0x50
>>>>    phy_start_aneg_priv+0x38/0x180
>>>>    phy_start_aneg+0x24/0x30
>>>>    ravb_nway_reset+0x3c/0x68
>>>>    dev_ethtool+0x3dc/0x2338
>>>>    dev_ioctl+0x19c/0x490
>>>>    sock_do_ioctl+0xe0/0x238
>>>>    sock_ioctl+0x254/0x460
>>>>    do_vfs_ioctl+0xb0/0x918
>>>>    ksys_ioctl+0x50/0x80
>>>>    sys_ioctl+0x34/0x48
>>>>    __sys_trace_return+0x0/0x4
>>>>
>>>> The root cause is that an attempt to modify ECMR and GECMR registers
>>>> only when RX/TX function is disabled was too overcomplicated in its
>>>> original implementation, also processing of an optional Link Change
>>>> interrupt added even more complexity, as a result the implementation
>>>> was error prone.
>>>>
>>>> The new locking scheme is confirmed to be correct by dumping driver
>>>> specific and generic PHY framework function calls with aid of ftrace
>>>> while running more or less advanced tests.
>>>>
>>>> Please note that sh_eth patches from the series were built-tested only.
>>>>
>>>> On purpose I do not add Fixes tags, the reused PHY handlers were added
>>>> way later than the fixed problems were firstly found in the drivers.
>>>
>>>    I think you went one step too far with these fixes. On the first glance,
>>> the real fixes are to remove grabbing/releasing the spinlock for the duration
>>> of the phylib calls. Am I right? If so, making use of the new phylib APIs
>>> would be a further enhancement, it's not needed for fixing the splats per se...
>>
>>    Note that I hadn't looked at the patches #3/#6 at the time of writing this;
>> those seem to be more complicated than the rest.
> 
> Right, the simplistic approach of just removing the held spinlock does
> not fit well into the overall lame locking model found in the driver.

   Yet you only try fixing it in the patches #3 and #6. I was talking about
the patches #1 and #4 mostly (#2 and #5 turned out to be non-fixes).

> The thing is that I would prefer to exhibit 'remove custom callbacks'
> side of the changes as it is done now, and fixing severe 'invalid contex'
> bugs is left as a valuable side effect. I may attempt to find enough
> free time to follow your instructions, but frankly speaking I don't
> see it beneficial to split a single good all-sufficient change into
> three or more: removal of spinlocks, replacement of phy_start_aneg(),
> then a non-functional clean-up.
   Yes, I would prefer these step-by-step changes.

> Bikeshedding isn't my preference,

   This is not about bikeshedding. What you are trying to do clearly
violates the 2 basic principles of the kernel development: "don't mix
fixes and enhancements" and "do one thing per patch". 

> but a report about technical flaws related to the published changes
> is appreciated, otherwise let me ask you to accept the changes as is,
> secondary optimizations can be done on top of them.

   No, I'll certainly have to NAK patches #1/#3 in their current form.
I'm yet to review patches #3/#6... anyway, if you lack the time to do things
properly, I'll have to take this burden on my shoulders (giving you credits).
Yet I'm basically is in the same situation as you -- I have to spend my copiuos
free time on the large patch sets (like yours) and I'm still having some cleanups
to sh_eth cooking here (which I'll most probably have to defer)...

> --
> With best wishes,
> Vladimir

MBR, Sergei