netdev - 转发: Re: [PATCH] phy: fix the bug when remove the phy driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <F95AC9340317A84688A5F0DF0246F3F201515AF3@szxeml504-mbs.china.huawei.com>
Date:   Tue, 29 Nov 2016 13:28:13 +0000
From:   maowenan <maowenan@...wei.com>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Florian Fainelli <f.fainelli@...il.com>
CC:     Dingtianhong <dingtianhong@...wei.com>,
        "weiyongjun (A)" <weiyongjun1@...wei.com>
Subject: 转发: Re: [PATCH] phy: fix the bug when remove the phy driver

Hi
When remove marvell, the call trace as below:
rmmod marvell -> module_exit(phy_module_exit) -> phy_drivers_unregister -> phy_driver_unregister -> driver_unregister -> bus_remove_driver -> driver_detach -> __device_release_driver -> drv->remove(dev), phy_remove -> phydev->drv = NULL; 

static int phy_remove(struct device *dev)
{
	struct phy_device *phydev = to_phy_device(dev);

	mutex_lock(&phydev->lock);
	phydev->state = PHY_DOWN;
	mutex_unlock(&phydev->lock);

	if (phydev->drv->remove)
		phydev->drv->remove(phydev);
	phydev->drv = NULL;

	return 0;
}

I found there is no reference count check when call drv->remove in __device_release_driver()
		if (dev->bus && dev->bus->remove)
			dev->bus->remove(dev);
		else if (drv->remove) {
			pr_info("refcount:%d\n",dev->kobj.kref.refcount.counter);
			drv->remove(dev);
		}

so I add log to trace this, found the ref count of this dev is 5, unfortunately then call phy_remove and set phydev->drv = NULL and kernel panic happen.

@fainelli, What’s your comments about checking reference counting in driver_detach(), If there is someone remain using this driver, we could not remove the phy driver kernel module.
As bellow:
		if (dev->driver == drv && (dev->kobj.kref.refcount.counter == 1))
			__device_release_driver(dev);




-----邮件原件-----
发件人: Dingtianhong 
发送时间: 2016年11月29日 10:03
收件人: maowenan
主题: Fwd: Re: [PATCH] phy: fix the bug when remove the phy driver




-------- Forwarded Message --------
Subject: Re: [PATCH] phy: fix the bug when remove the phy driver
Date: Wed, 3 Aug 2016 09:21:52 +0800
From: Ding Tianhong <dingtianhong@...wei.com>
To: Florian Fainelli <f.fainelli@...il.com>, Netdev <netdev@...r.kernel.org>, David S. Miller <davem@...emloft.net>, linux-kernel@...r.kernel.org <linux-kernel@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>

On 2016/8/3 0:42, Florian Fainelli wrote:
> Le 02/08/2016 à 06:00, Ding Tianhong a écrit :
>> The nic in my board use the phy dev from marvell, and the system will 
>> load the marvell phy driver automatically, but when I remove the phy 
>> drivers, the system immediately panic:
>>
>> localhost login: [ 2582.052465] Unable to handle kernel NULL pointer 
>> dereference at virtual address 000000c0 [ 2582.061429] pgd = 
>> ffff800001226000 [ 2582.065277] [000000c0] *pgd=0000003f7f893003, 
>> *pud=0000003f7f894003, *pmd=0000003f7f895003, *pte=006000006d000707 [ 
>> 2582.076681] Internal error: Oops: 96000006 [#1] SMP [ 2582.082049] Modules linked in: sr_mod(E) cdrom(E) ses(E) enclosure(E) shpchp(E) crc32_arm64(E) aes_ce_blk(E) ablk_helper(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E) sha1_ce(E) usb_storage(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: marvell]
>> [ 2582.109226] CPU: 21 PID: 1514 Comm: kworker/21:1 Tainted: G            E   4.1.27-vhulk3.6.5.aarch64 #1
>> [ 2582.119415] Hardware name: Huawei Taishan 2180 /BC11SPCC, BIOS 
>> 1.31 06/23/2016 [ 2582.127346] Workqueue: events_power_efficient 
>> phy_state_machine [ 2582.133796] task: ffff803f6f41bac0 ti: 
>> ffff803f6eca4000 task.ti: ffff803f6eca4000 [ 2582.141910] PC is at 
>> phy_state_machine+0x3c/0x438 [ 2582.147081] LR is at 
>> phy_state_machine+0x34/0x438 [ 2582.152173] pc : [<ffff800000715384>] 
>> lr : [<ffff80000071537c>] pstate: 60000145 [ 2582.160189] sp : 
>> ffff803f6eca7d30 [ 2582.163825] x29: ffff803f6eca7d30 x28: 
>> 0000000000000000 [ 2582.169689] x27: ffff805fdfd0aa00 x26: 
>> 0000000000000008 [ 2582.175482] x25: 0000000000000000 x24: 
>> ffff80000112c57c [ 2582.181391] x23: ffff805f4fc8a800 x22: 
>> ffff805fdfd0f700 [ 2582.187241] x21: ffff805f4fc8abf8 x20: 
>> ffff805f4fc8a770 [ 2582.193035] x19: ffff805f4fc8ab70 x18: 
>> 0000000000000007 [ 2582.198914] x17: 000000000000000e x16: 
>> 0000000000000001 [ 2582.204710] x15: 0000000000000007 x14: 
>> 000000000000000e [ 2582.210584] x13: 0000000000000200 x12: 
>> 0000000055555556 [ 2582.216373] x11: 0000000000001c00 x10: 
>> 0000000000000000 [ 2582.222166] x9 : ffff800000a36880 x8 : 
>> ffff803f6f41c060 [ 2582.227994] x7 : 000000010008b39b x6 : 
>> ffff80000101e690 [ 2582.233788] x5 : 0000000000000000 x4 : 
>> 0000000000800000 [ 2582.239612] x3 : 0000000000000000 x2 : 
>> 0000000000000000 [ 2582.245404] x1 : 0000000000000000 x0 : 
>> 0000000000000000 [ 2582.265813] [ 2582.282971] Process kworker/21:1 
>> (pid: 1514, stack limit = 0xffff803f6eca4020) [ 2582.307284] Stack: 
>> (0xffff803f6eca7d30 to 0xffff803f6eca8000)
>> [ 2582.331183] 7d20:                                   ffff803f6eca7d70 ffff8000000db3b8
>> [ 2582.354788] 7d40: ffff803f6e696000 ffff805f4fc8ab70 
>> ffff805fdfd0aa00 ffff805fdfd0f700 [ 2582.378341] 7d60: 
>> 0000000000000000 ffff80000112c57c ffff803f6eca7dc0 ffff8000000db7d4 [ 
>> 2582.403700] 7d80: ffff803f6e696000 ffff803f6e696030 ffff805fdfd0aa18 
>> ffff805fdfd0aa00 [ 2582.428385] 7da0: ffff803f6eca4000 
>> ffff80000112c57c 0000000000000000 0000000000000008 [ 2582.451222] 
>> 7dc0: ffff803f6eca7e20 ffff8000000e1d0c ffff805f4fc15000 
>> ffff80000115f188 [ 2582.474577] 7de0: ffff800000d1cf88 
>> ffff803f6e696000 ffff8000000db690 0000000000000000 [ 2582.497281] 
>> 7e00: 0000000000000000 0000000000000000 0000000000000000 
>> 0000000000000000 [ 2582.522992] 7e20: 0000000000000000 
>> ffff800000083dd0 ffff8000000e1c10 ffff805f4fc15000 [ 2582.546945] 
>> 7e40: 0000000000000000 0000000000000000 0000000000000000 
>> 0000000000000000 [ 2582.568550] 7e60: 0000000000000000 
>> ffff8000000ee33c ffff803f6e696000 ffff805f00000000 [ 2582.589157] 
>> 7e80: 0000000000000000 ffff803f6eca7e88 ffff803f6eca7e88 
>> 0000000000000000 [ 2582.609767] 7ea0: 0000000000000000 
>> ffff803f6eca7ea8 ffff803f6eca7ea8 cb88537fdc8ba31b [ 2582.633228] 
>> 7ec0: 0000000000000000 0000000000000000 0000000000000000 
>> 0000000000000000 [ 2582.655386] 7ee0: 0000000000000000 
>> 0000000000000000 0000000000000000 0000000000000000 [ 2582.674223] 
>> 7f00: 0000000000000000 0000000000000000 0000000000000000 
>> 0000000000000000 [ 2582.692994] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.711765] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.730809] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.748601] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.769388] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.786756] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000 [ 2582.804134] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2582.821488] Call trace:
>> [ 2582.834493] [<ffff800000715384>] phy_state_machine+0x3c/0x438 [ 
>> 2582.851754] [<ffff8000000db3b8>] process_one_work+0x150/0x428 [ 
>> 2582.868188] [<ffff8000000db7d4>] worker_thread+0x144/0x4b0 [ 
>> 2582.883882] [<ffff8000000e1d0c>] kthread+0xfc/0x110
>>
>> ============================cut here===============================
>>
>> The phy_state_machine was still running and didn't know whether the 
>> phydev's driver was removed or not, then occur this problem, so we 
>> need to stop the phy_state_machine when removing the phy driver.
> 
> Your explanation of the problem is unclear to me, unless a network 
> driver attached to the PHY and started it, and then never stopped the 
> PHY state machine, this should not happen, also there should be proper 
> reference counting in place to avoid that. Your trace is based on 4.1 
> is this still reproducible with latest vanilla? Is this with a 
> mainline Ethernet driver?
> 

Yes，the network driver already attached to the PHY and started it, and all of them could work well if I didn't do anything, but if I suddenly remove the marvall.ko, the phy_state_machine was still running, but the phydev->drv is NULL at this time, than oops, I found this problem at 4.1 lts, and didn't see any effective improvement in the kernel 4.8, so report this bug and fix it.

Thanks.
Ding