Date:   Fri, 31 Mar 2017 13:55:42 -0600
From:   "Goel, Sameer" <sgoel@...eaurora.org>
To:     Daniel Jurgens <danielj@...lanox.com>,
        Saeed Mahameed <saeedm@....mellanox.co.il>,
        Mohamad Haj Yahia <mohamad@...lanox.com>
Cc:     Saeed Mahameed <saeedm@...lanox.com>,
        Matan Barak <matanb@...lanox.com>,
        Leon Romanovsky <leonro@...lanox.com>,
        Linux Netdev List <netdev@...r.kernel.org>,
        linux-rdma@...r.kernel.org, shankerd@...eaurora.org,
        rruigrok@...eaurora.org
Subject: Re: [BUG] ethernet:mellanox:mlx5: Oops in health_recover
 get_nic_state(dev)



On 3/28/2017 7:25 AM, Daniel Jurgens wrote:
> On 3/28/2017 4:11 AM, Saeed Mahameed wrote:
>> On Tue, Mar 28, 2017 at 2:45 AM, Goel, Sameer <sgoel@...eaurora.org> wrote:
>>> Stack frame:
>>> [ 1744.418958] [<ffff00000328936c>] get_nic_state+0x24/0x40 [mlx5_core]
>>> [ 1744.425273] [<ffff0000032899c0>] health_recover+0x28/0x80 [mlx5_core]
>>> [ 1744.431496] [<ffff0000080e3280>] process_one_work+0x150/0x460
>>> [ 1744.437218] [<ffff0000080e35e0>] worker_thread+0x50/0x4b8
>>> [ 1744.442609] [<ffff0000080e9b98>] kthread+0xd8/0xf0
>>> [ 1744.447377] [<ffff000008083330>] ret_from_fork+0x10/0x20
>>>
>>> Summary:
>>> This issue was seen on a QDF2400 system about 30 minutes into a speccpu 2006 run. During the test, a recoverable PCIe error was seen that gave the following log:
>>> [ 1673.170969] pcieport 0002:00:00.0: aer_status: 0x00004000, aer_mask: 0x00400000
>>> [ 1673.177961] pcieport 0002:00:00.0: aer_layer=Transaction Layer, aer_agent=Requester ID
>>> [ 1673.185832] pcieport 0002:00:00.0: aer_uncor_severity: 0x00462030
>>> [ 1675.536391] mlx5_core 0002:01:00.0: assert_var[0] 0xffffffff
>>> [ 1675.541093] mlx5_core 0002:01:00.0: assert_var[1] 0xffffffff
>>> [ 1675.546750] mlx5_core 0002:01:00.0: assert_var[2] 0xffffffff
>>> [ 1675.552377] mlx5_core 0002:01:00.0: assert_var[3] 0xffffffff
>>> [ 1675.558040] mlx5_core 0002:01:00.0: assert_var[4] 0xffffffff
>>> [ 1675.563661] mlx5_core 0002:01:00.0: assert_exit_ptr 0xffffffff
>>> [ 1675.569488] mlx5_core 0002:01:00.0: assert_callra 0xffffffff
>>> [ 1675.575120] mlx5_core 0002:01:00.0: fw_ver 15.4095.65535
>>> [ 1675.580426] mlx5_core 0002:01:00.0: hw_id 0xffffffff
>>> [ 1675.585363] mlx5_core 0002:01:00.0: irisc_index 255
>>> [ 1675.590242] mlx5_core 0002:01:00.0: synd 0xff: unrecognized error
>>> [ 1675.596301] mlx5_core 0002:01:00.0: ext_synd 0xffff
>>> [ 1675.601209] mlx5_core 0002:01:00.0: mlx5_enter_error_state:120:(pid 7205): start
>>> [ 1675.608613] mlx5_core 0002:01:00.0: mlx5_enter_error_state:127:(pid 7205): end
>>>
>>> After the above log we see the stack frame shown at the top and a page fault due to an invalid dev pointer.
>>>
>>> So the recovery work is queued and the timer is stopped. Somehow the workqueue is not cleared, and when the work runs the dev pointer is invalid.
>>>
>>> This issue was difficult to reproduce and was seen only once in multiple runs on a specific device.
>> Hi Sameer,
>>
>> Thanks for the report.
>> Adding more relevant people.
>>
>> Mohamad/Daniel, does the above ring a bell?
>> Can you check?
>>
>> Thanks
>> Saeed.
> 
> Hi Sameer, Can you tell me if you have these 2 patches?
> 
> 5e44fca504705 ('net/mlx5: Only cancel recovery work when cleaning up device')
> 
> 689a248df83b ("net/mlx5: Cancel recovery work in remove flow")
> 
> 

Hi Daniel, 
No, I do not have these patches. I can pick up the cleanups from the mailing list.
Are these patches acked for 4.12?
Thanks,
Sameer
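
For reference, here is a minimal userspace sketch of the failure mode described
in the report: a recovery work item is still queued when the device structure
goes away, so the handler later runs against an invalid dev pointer. This is
not mlx5 code. Every name below (fake_dev, fake_work, fake_queue_work,
fake_cancel_work, fake_run_pending, fake_dev_remove) is hypothetical, and the
one-entry queue only stands in for a real kernel workqueue. The patch titles
quoted above suggest the fix is to cancel the recovery work in the remove or
cleanup path, which in the kernel would normally go through the
cancel_work_sync()/cancel_delayed_work_sync() helpers. I have not checked the
patches themselves, so treat that as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel work item. */
struct fake_work {
    bool pending;                       /* queued but not yet run */
    void (*fn)(struct fake_work *w);    /* handler to call when run */
};

/* Hypothetical stand-in for the device; the work item is embedded in it. */
struct fake_dev {
    int recoveries;                     /* times the recovery handler ran */
    struct fake_work recover_work;
};

/* One-entry "workqueue": enough to show the ordering problem. */
static struct fake_work *queue_slot;

/* Queue the work unless it is already pending. */
bool fake_queue_work(struct fake_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    queue_slot = w;
    return true;
}

/* Drop the work from the queue; returns true if it was pending. */
bool fake_cancel_work(struct fake_work *w)
{
    bool was_pending = w->pending;

    if (queue_slot == w)
        queue_slot = NULL;
    w->pending = false;
    return was_pending;
}

/* Run whatever is queued; returns the number of handlers executed. */
int fake_run_pending(void)
{
    struct fake_work *w = queue_slot;

    if (!w)
        return 0;
    queue_slot = NULL;
    w->pending = false;
    w->fn(w);
    return 1;
}

/* Recovery handler: finds the device from the embedded work item. */
void fake_health_recover(struct fake_work *w)
{
    struct fake_dev *dev = (struct fake_dev *)
        ((char *)w - offsetof(struct fake_dev, recover_work));

    dev->recoveries++;
}

struct fake_dev *fake_dev_create(void)
{
    struct fake_dev *dev = calloc(1, sizeof(*dev));

    if (dev)
        dev->recover_work.fn = fake_health_recover;
    return dev;
}

/*
 * Remove path: cancel the work BEFORE freeing the device. Without the
 * cancel, queue_slot would keep pointing into freed memory and a later
 * fake_run_pending() would touch an invalid dev pointer.
 */
void fake_dev_remove(struct fake_dev *dev)
{
    fake_cancel_work(&dev->recover_work);
    free(dev);
}
```

With this ordering, queueing the work and then calling fake_dev_remove() leaves
nothing for fake_run_pending() to execute. Dropping the fake_cancel_work() call
from fake_dev_remove() reproduces the stale-pointer situation in miniature.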
 
-- 
 Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
