lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20201008125214.wiorgfcsd5hjkt4z@Rk>
Date:   Thu, 8 Oct 2020 20:52:14 +0800
From:   Coiby Xu <coiby.xu@...il.com>
To:     Shung-Hsi Yu <shung-hsi.yu@...e.com>
Cc:     netdev@...r.kernel.org,
        Benjamin Poirier <benjamin.poirier@...il.com>,
        Manish Chopra <manishc@...vell.com>,
        GR-Linux-NIC-Dev@...vell.com,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Michal Kubecek <mkubecek@...e.cz>
Subject: Re: [RFC 0/3] staging: qlge: Re-writing the debugging features

On Thu, Aug 27, 2020 at 05:54:43PM +0800, Shung-Hsi Yu wrote:
>On Wed, Aug 26, 2020 at 03:52:06PM +0800, Shung-Hsi Yu wrote:
>> On Sat, Aug 15, 2020 at 12:05:58AM +0800, Coiby Xu wrote:
>> > This patch set aims to avoid dumping registers, data structures and
>> > coredump to dmesg and also to reduce the code size of the qlge driver.
>> >
>> > As pointed out by Benjamin [1],
>> >
>> > > At 2000 lines, qlge_dbg.c alone is larger than some entire ethernet
>> > > drivers. Most of what it does is dump kernel data structures or pci
>> > > memory mapped registers to dmesg. There are better facilities for that.
>> > > My thinking is not simply to delete qlge_dbg.c but to replace it, making
>> > > sure that most of the same information is still available. For data
>> > > structures, crash or drgn can be used; possibly with a script for the
>> > > latter which formats the data. For pci registers, they should be
>> > > included in the ethtool register dump and a patch added to ethtool to
>> > > pretty print them. That's what other drivers like e1000e do. For the
>> > > "coredump", devlink health can be used.
>> >
>> > So the debugging features are re-written following Benjamin's advice,
>> >    - use ethtool to dump registers
>> >    - dump kernel data structures in drgn
>> >    - use devlink health to do coredump
>> >
>> > The get_regs ethtool_ops has already implemented. What lacks is a patch
>> > for the userland ethtool to do the pretty-printing. I haven't yet provided
>> > a patch to the userland ethtool because I'm aware ethtool is moving towards
>> > the netlink interface [2]. I'm curious if a generalized mechanism of
>> > pretty-printing will be implemented thus making pretty-printing for a
>> > specific driver unnecessary. As of this writing, `-d|--register-dump`
>> > hasn't been implemented for the netlink interface.
>> >
>> >
>> > To dump kernel data structures, the following Python script can be used
>> > in drgn,
>> >
>> >
>> >     ```python
>> >     def align(x, a):
>> >         """the alignment a should be a power of 2
>> >         """
>> >         mask = a - 1
>> >         return (x+ mask) & ~mask
>> >
>> >     def struct_size(struct_type):
>> >         struct_str = "struct {}".format(struct_type)
>> >         return sizeof(Object(prog, struct_str, address=0x0))
>> >
>> >     def netdev_priv(netdevice):
>> >         NETDEV_ALIGN = 32
>> >         return netdevice.value_() + align(struct_size("net_device"), NETDEV_ALIGN)
>> >
>> >     name = 'xxx'
>> >     qlge_device = None
>> >     netdevices = prog['init_net'].dev_base_head.address_of_()
>> >     for netdevice in list_for_each_entry("struct net_device", netdevices, "dev_list"):
>> >         if netdevice.name.string_().decode('ascii') == name:
>> >             print(netdevice.name)
>> >
>> >     ql_adapter = Object(prog, "struct ql_adapter", address=netdev_priv(qlge_device))
>> >     ```
>> >
>> > The struct ql_adapter will be printed in drgn as follows,
>> >     >>> ql_adapter
>> >     (struct ql_adapter){
>> >             .ricb = (struct ricb){
>> >                     .base_cq = (u8)0,
>> >                     .flags = (u8)120,
>> >                     .mask = (__le16)26637,
>> >                     .hash_cq_id = (u8 [1024]){ 172, 142, 255, 255 },
>> >                     .ipv6_hash_key = (__le32 [10]){},
>> >                     .ipv4_hash_key = (__le32 [4]){},
>> >             },
>> >             .flags = (unsigned long)0,
>> >             .wol = (u32)0,
>> >             .nic_stats = (struct nic_stats){
>> >                     .tx_pkts = (u64)0,
>> >                     .tx_bytes = (u64)0,
>> >                     .tx_mcast_pkts = (u64)0,
>> >                     .tx_bcast_pkts = (u64)0,
>> >                     .tx_ucast_pkts = (u64)0,
>> >                     .tx_ctl_pkts = (u64)0,
>> >                     .tx_pause_pkts = (u64)0,
>> >                     ...
>> >             },
>> >             .active_vlans = (unsigned long [64]){
>> >                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52780853100545, 18446744073709551615,
>> >                     18446619461681283072, 0, 42949673024, 2147483647,
>> >             },
>> >             .rx_ring = (struct rx_ring [17]){
>> >                     {
>> >                             .cqicb = (struct cqicb){
>> >                                     .msix_vect = (u8)0,
>> >                                     .reserved1 = (u8)0,
>> >                                     .reserved2 = (u8)0,
>> >                                     .flags = (u8)0,
>> >                                     .len = (__le16)0,
>> >                                     .rid = (__le16)0,
>> >                                     ...
>> >                             },
>> >                             .cq_base = (void *)0x0,
>> >                             .cq_base_dma = (dma_addr_t)0,
>> >                     }
>> >                     ...
>> >             }
>> >     }
>> >
>> >
>> > And the coredump obtained via devlink in json format looks like,
>> >
>> >     $ devlink health dump show DEVICE reporter coredump -p -j
>> >     {
>> >         "Core Registers": {
>> >             "segment": 1,
>> >             "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
>> >         },
>> >         "Test Logic Regs": {
>> >             "segment": 2,
>> >             "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
>> >         },
>> >         "RMII Registers": {
>> >             "segment": 3,
>> >             "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
>> >         },
>> >         ...
>> >         "Sem Registers": {
>> >             "segment": 50,
>> >             "values": [ 0,0,0,0 ]
>> >         }
>> >     }
>> >
>> > Since I don't have a QLGE device and neither could I find a software
>> > simulator, I put some functions into e1000 to get the above result.
>>
>> I'm not familiar with qlge, but I do happen to have one accessible. You can
>> add me to CC for future version of this patch.
>>
>> Testing with the refreshed patch set I got from Benjamin (apply over commit
>> fc80c51fd4b2) with debugging options KASAN, UBSAN, DEBUG_ATOMIC_SLEEP,
>> PROVE_LOCKING and DEBUG_KMEMLEAK enabled, I can verify coredump through
>> devlink works.
>
>Benjamin reminded me to check for errors in the remove callpaths as well,
>and some error showed up in the remove callpath indeed.
>
>  $ modprobe qlge
>  [ 1314.256249] qlge: module is from the staging directory, the quality is unknown, you have been warned.
>  [ 1320.184708] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
>  [ 1320.185619] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
>  [ 1320.196987] qlge 0000:00:04.0 eth0: Link is down.
>  [ 1320.197830] qlge 0000:00:04.0 eth0: Clearing MAC address
>  [ 1320.199184] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1320.200793] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
>  [ 1320.276835] qlge 0000:00:04.0 ens4: renamed from eth0
>  [ 1326.920224] PCI Interrupt Link [LNKB] enabled at IRQ 10
>  [ 1326.943443] qlge 0000:00:05.0 eth0: Link is down.
>  [ 1326.944188] qlge 0000:00:05.0 eth0: Clearing MAC address
>  [ 1326.944975] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1326.946396] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
>  [ 1326.997071] qlge 0000:00:05.0 ens5: renamed from eth0
>  $ rmmod qlge
>
>Simple load and unload works fine.
>
>  $ modprobe qlge
>  [ 1348.642124] qlge: module is from the staging directory, the quality is unknown, you have been warned.
>  [ 1355.090448] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
>  [ 1355.091439] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
>  [ 1355.098890] qlge 0000:00:04.0 eth0: Link is down.
>  [ 1355.099647] qlge 0000:00:04.0 eth0: Clearing MAC address
>  [ 1355.100708] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1355.102301] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
>  [ 1355.170258] qlge 0000:00:04.0 ens4: renamed from eth0
>  [ 1361.809918] qlge 0000:00:05.0 eth0: Link is down.
>  [ 1361.810674] qlge 0000:00:05.0 eth0: Clearing MAC address
>  [ 1361.811431] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1361.812594] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
>  [ 1361.874453] qlge 0000:00:05.0 ens5: renamed from eth0
>  $ ip link set ens4 up
>  [ 1370.994196] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
>  [ 1371.199630] qlge 0000:00:04.0 ens4: Clearing MAC address
>  [ 1371.299084] qlge 0000:00:04.0 ens4: Passed Get Port Configuration.
>  $ devlink health dump show pci/0000:00:04.0 reporter coredump -p -j >log.jsonho
>  $ rmmod qlge
>  [ 1413.537710] qlge 0000:00:04.0 ens4: Link is down.
>  [ 1413.538570] qlge 0000:00:04.0 ens4: Clearing MAC address
>  [ 1413.539819] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1413.541068] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1413.649713] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1413.761705] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1413.873716] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1413.985746] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1414.097743] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>
>Unloading the module after coredump through devlink shows the above error.
>At first glance this look quite similar to the errors I get when trying to
>coredump when the interface is down, causing the device to go into an
>invalid state as well.
>
>  $ modprobe qlge
>  [ 1606.229089] qlge: module is from the staging directory, the quality is unknown, you have been warned.
>  [ 1612.423342] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
>  [ 1612.424551] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
>  [ 1612.432718] qlge 0000:00:04.0 eth0: Link is down.
>  [ 1612.433627] qlge 0000:00:04.0 eth0: Clearing MAC address
>  [ 1612.434621] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1612.436582] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
>  [ 1612.494647] qlge 0000:00:04.0 ens4: renamed from eth0
>  [ 1619.078325] qlge 0000:00:05.0 eth0: Link is down.
>  [ 1619.079119] qlge 0000:00:05.0 eth0: Clearing MAC address
>  [ 1619.080165] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
>  [ 1619.081938] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
>  [ 1619.154477] qlge 0000:00:05.0 ens5: renamed from eth0
>  $ ip link set ens4 up
>  [ 1629.717532] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1629.720692] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1629.827878] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1629.939891] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1630.051855] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1630.163885] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1630.275872] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1630.387893] qlge 0000:00:04.0 ens4: Command not supported by firmware.
>  [ 1630.412685] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
>  [ 1630.420868] qlge 0000:00:04.0 ens4: Failed about firmware command
>  [ 1630.425110] qlge 0000:00:04.0 ens4: Failed to start port.
>  [ 1630.427966] qlge 0000:00:04.0 ens4: Clearing MAC address
>  [ 1630.487603] qlge 0000:00:04.0 ens4: Failed Get Port Configuration.
>
>I'll report when I've more findings.
>

Thank you for the testing! I will assume these errors are caused by
coredump when the interface is down. Please let me know it still occurs
in v1.
>
<snip>
>

--
Best regards,
Coiby

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ