linux-kernel - Re: [PATCH v12 0/6] Address error and recovery for AER and DPC


Message-ID: <20180312175630.GF18494@localhost.localdomain>
Date:   Mon, 12 Mar 2018 11:56:30 -0600
From:   Keith Busch <keith.busch@...el.com>
To:     Sinan Kaya <okaya@...eaurora.org>
Cc:     poza@...eaurora.org, Bjorn Helgaas <helgaas@...nel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Philippe Ombredanne <pombredanne@...b.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Kate Stewart <kstewart@...uxfoundation.org>,
        linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        Dongdong Liu <liudongdong3@...wei.com>,
        Wei Zhang <wzhang@...com>, Timur Tabi <timur@...eaurora.org>,
        linux-pci-owner@...r.kernel.org
Subject: Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

On Mon, Mar 12, 2018 at 01:41:07PM -0400, Sinan Kaya wrote:
> I was just writing a reply to you. You acted first :)
> 
> On 3/12/2018 1:33 PM, Keith Busch wrote:
> >>> After releasing a slot from DPC, the link is allowed to retrain. If
> >>> there is a working device on the other side, a link up event occurs.
> >>> That event is handled by the pciehp driver, and that schedules
> >>> enumeration no matter what you do to the DPC driver.
> >> Yes, that is the current behavior, but this patch set makes DPC aware of
> >> the error handling driver callbacks.
> > I've been questioning the utility of doing that since the very first
> > version of this patch set.
> > 
> 
> I think we should all agree that shutting down the device drivers with active
> work is not safe. There could be outstanding work that the endpoint driver
> needs to take care of. 
> 
> That was the motivation for this change so that we give endpoint drivers an 
> error callback when something goes wrong. 
> 
> The rest is implementation detail that we can all figure out.

I'm not sure if I agree here. All Linux device drivers are supposed to
cope with sudden/unexpected loss of communication at any time. This
includes cleaning up appropriately when requested to unbind from an
inaccessible device with active outstanding work.
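
To illustrate the expectation described above, here is a minimal userspace
sketch, not kernel code. Every name in it (struct mock_dev, mock_read32,
mock_dev_present, mock_remove) is hypothetical and exists only for this
example. It assumes the common PCIe behavior that a read from an
inaccessible device returns all ones, and it shows an unbind path that
fails outstanding work back to its submitters instead of waiting on
hardware that will never answer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock device for illustration; not a kernel structure. */
struct mock_dev {
    bool link_up;     /* false once the device has dropped off the bus */
    uint32_t status;  /* register value returned while the link is up */
    int outstanding;  /* requests submitted but not yet completed */
    int failed;       /* requests completed with an error */
};

/* Assumption: a read from an inaccessible device returns all ones. */
static uint32_t mock_read32(const struct mock_dev *dev)
{
    return dev->link_up ? dev->status : UINT32_MAX;
}

/* Treat an all-ones read as "device is gone". */
static bool mock_dev_present(const struct mock_dev *dev)
{
    return mock_read32(dev) != UINT32_MAX;
}

/* Unbind path: must finish even if the device stopped responding. */
static void mock_remove(struct mock_dev *dev)
{
    if (mock_dev_present(dev)) {
        /*
         * Device is reachable. A real driver would quiesce the
         * hardware and wait for completions; the mock simply
         * marks everything as completed successfully.
         */
        dev->outstanding = 0;
        return;
    }

    /*
     * Device is gone. Do not wait on hardware; fail every
     * outstanding request so its submitter is not left hanging.
     */
    while (dev->outstanding > 0) {
        dev->outstanding--;
        dev->failed++;
    }
}
```

Calling mock_remove() on a device with link_up == false and three
outstanding requests leaves outstanding at 0 and failed at 3, without
touching the hardware again.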