linux-kernel - Re: [PATCH rdma-rc 1/3] RDMA/hns: Fix the Oops during rmmod or insmod ko when reset occurs

Message-ID: <20190115220259.GH22045@ziepe.ca>
Date:   Tue, 15 Jan 2019 15:02:59 -0700
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     "Wei Hu (Xavier)" <xavier.huwei@...wei.com>
Cc:     dledford@...hat.com, linux-rdma@...r.kernel.org,
        lijun_nudt@....com, oulijun@...wei.com, liudongdong3@...wei.com,
        liuyixian@...wei.com, zhangxiping3@...wei.com, linuxarm@...wei.com,
        linux-kernel@...r.kernel.org, xavier_huwei@....com
Subject: Re: [PATCH rdma-rc 1/3] RDMA/hns: Fix the Oops during rmmod or
 insmod ko when reset occurs

On Tue, Jan 15, 2019 at 09:48:01AM +0800, Wei Hu (Xavier) wrote:
> 
> 
> On 2019/1/15 6:06, Jason Gunthorpe wrote:
> > On Sat, Jan 12, 2019 at 03:55:31PM +0800, Wei Hu (Xavier) wrote:
> >>
> >> On 2019/1/12 5:34, Jason Gunthorpe wrote:
> >>> On Thu, Jan 10, 2019 at 09:57:41PM +0800, Wei Hu (Xavier) wrote:
> >>>> +	/* Check the status of the current software reset process, if in
> >>>> +	 * software reset process, wait until software reset process finished,
> >>>> +	 * in order to ensure that reset process and this function will not call
> >>>> +	 * __hns_roce_hw_v2_uninit_instance at the same time.
> >>>> +	 * If a timeout occurs, it indicates that the network subsystem has
> >>>> +	 * encountered a serious error and cannot be recovered from the reset
> >>>> +	 * processing.
> >>>> +	 */
> >>>> +	if (ops->ae_dev_resetting(handle)) {
> >>>> +		dev_warn(dev, "Device is busy in resetting state. waiting.\n");
> >>>> +		end = msecs_to_jiffies(HNS_ROCE_V2_RST_PRC_MAX_TIME) + jiffies;
> >>>> +		while (ops->ae_dev_resetting(handle) &&
> >>>> +		       time_before(jiffies, end))
> >>>> +			msleep(20);
> >>> Really? Does this have to be so ugly? Why isn't there just a simple
> >>> lock someplace that is held during reset?
> >>>
> >>> I'm skeptical that all this strange looking stuff is properly locked
> >>> and concurrency safe.
> >> Hi, Jason
> >>
> >> The hns3 NIC driver notifies the hns RoCE driver to perform
> >> reset related processing by calling the .reset_notify() interface
> >> registered by the RoCE driver.
> >>
> >> There is a constraint on the hip08 chip: the NIC driver needs to
> >> stop the flow before the hardware reset starts, otherwise the chip
> >> may hang up.
> >>
> >> We've also thought about using locks, but found that using locks can
> >> lead to more serious problems because of that restriction of the
> >> chip.
> >> If locks are used here, reset processing may wait for uninstallation
> >> to complete. The NIC driver may then fail to stop the flow in time
> >> during the reset process, causing the chip to hang up.
> > If you are sleeping then I'm sure a lock can be used instead, how
> > would it be any different?
> Hi, Jason
>     If we use locks here, the reset process may wait until uninstallation
> completes, which may trigger the chip constraint and cause the chip to
> hang up.
>     But if we sleep here, the reset process never waits for
> uninstallation to complete, so the chip constraint is not triggered.

But how is this even right? If ops->ae_dev_resetting can change at any
time, and you need to wait for it here, without locks can't it just
change instantly after the if statement?

I think it shows the concurrency and locking are not done right when I
see loops reading shared data and spinning on them with msleep.
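
To make the concern concrete, here is a minimal userspace sketch (pthreads,
not the hns driver code) of one way to avoid the check-then-sleep window.
Every name in it (dev_ctx, reset_begin, reset_end, remove_begin) is
hypothetical, and it assumes the reset path is allowed to skip the RoCE
teardown when removal already owns the device. The state is only read and
changed under one mutex, so it cannot flip between the check and the claim,
and the reset path never blocks waiting for removal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical device states; none of these names exist in the hns driver. */
enum dev_state { DEV_IDLE, DEV_RESETTING, DEV_REMOVING };

struct dev_ctx {
	pthread_mutex_t lock;    /* protects state */
	pthread_cond_t  changed; /* signalled when a reset finishes */
	enum dev_state  state;
};

void dev_ctx_init(struct dev_ctx *ctx)
{
	pthread_mutex_init(&ctx->lock, NULL);
	pthread_cond_init(&ctx->changed, NULL);
	ctx->state = DEV_IDLE;
}

/*
 * Reset path: claim the device only if it is idle. The lock is held for a
 * few instructions, so this never waits for a removal to complete.
 */
bool reset_begin(struct dev_ctx *ctx)
{
	bool claimed = false;

	pthread_mutex_lock(&ctx->lock);
	if (ctx->state == DEV_IDLE) {
		ctx->state = DEV_RESETTING;
		claimed = true;
	}
	pthread_mutex_unlock(&ctx->lock);
	return claimed;
}

/* Must only be called after reset_begin() returned true. */
void reset_end(struct dev_ctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	assert(ctx->state == DEV_RESETTING);
	ctx->state = DEV_IDLE;
	pthread_cond_broadcast(&ctx->changed);
	pthread_mutex_unlock(&ctx->lock);
}

/*
 * Removal path: wait up to timeout_ms for a running reset to finish, then
 * claim the device in the same critical section as the check.
 * Returns 0 on success, ETIMEDOUT if the reset did not finish in time,
 * EBUSY if another removal already owns the device.
 */
int remove_begin(struct dev_ctx *ctx, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	/* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&ctx->lock);
	while (ctx->state == DEV_RESETTING && rc == 0)
		rc = pthread_cond_timedwait(&ctx->changed, &ctx->lock, &deadline);

	if (ctx->state == DEV_IDLE) {
		ctx->state = DEV_REMOVING; /* check and claim are atomic */
		rc = 0;
	} else if (ctx->state == DEV_RESETTING) {
		rc = ETIMEDOUT;
	} else {
		rc = EBUSY;
	}
	pthread_mutex_unlock(&ctx->lock);
	return rc;
}
```

In a kernel driver the equivalent building blocks would be a mutex or
spinlock around the state plus a completion or wait queue in place of the
condition variable; whether skipping teardown in the reset path is
acceptable on hip08 is a question for the hns authors.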

Jason