netdev - Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Sat, 20 Jan 2018 07:45:17 +0900
From:   Benjamin Poirier <benjamin.poirier@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Shrikrishna Khare <skhare@...are.com>,
        Jeff Kirsher <jeffrey.t.kirsher@...el.com>,
        Netdev <netdev@...r.kernel.org>,
        intel-wired-lan <intel-wired-lan@...ts.osuosl.org>,
        linux-kernel@...r.kernel.org
Subject: Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

On 2018/01/19 08:22, Alexander Duyck wrote:
> On Fri, Jan 19, 2018 at 5:36 AM, Benjamin Poirier
> <benjamin.poirier@...il.com> wrote:
> > On 2018/01/19 17:59, Benjamin Poirier wrote:
> >> On 2018/01/18 07:51, Alexander Duyck wrote:
> >> > On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoirier@...e.com> wrote:
> >> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> >> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> >> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> >> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> >> > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> >> > > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> >> >
> >> > Isn't 0x80000004 (_INT_ASSERTED | _LSC)? The assumption I based my
> >>
> >> Yes. The numeric value is correct. I made a mistake when writing down
> >> the flag names.
> >>
> >> > patch on was that the VMware code was sending _OTHER instead of _LSC
> >> > to trigger LSC events. As such in my version of the workaround I just
> >>
> >> It's not so deterministic, sadly. In my tests, upon entry into
> >> e1000_msix_other() after e1000e_trigger_lsc(), with no workaround patch
> >> I've seen:
> >> icr=0x0
> >> icr=0x3d
> >>       Reserved RXDMT0 Reserved LSC TXDW
> >> icr=0x46
> >>       RXO LSC TXQE
> >> and someone else reported:
> >> icr=0x3c
> >>       Reserved RXDMT0 Reserved LSC
> >>
> >> > went through and did the testing if the _RXO bit was set, otherwise I
> >> > assumed that whatever event was received must have been meant to
> >> > trigger an _LSC type event since that worked in the past.
> >                                ^...
> >
> > Re-reading that part, my thoughts are that it worked in the past, before
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1),
> > because (presumably) Other was not set in EIAC. It worked after
> > 16ecba59bc33 but before 4aea7a5c5e94 ("e1000e: Avoid receiver overrun
> > interrupt bursts", v4.15-rc1) because e1000_msix_other() didn't read the
> > value of icr. If it had, it would've found a bogus value, which is
> > what's happening after 4aea7a5c5e94. I wonder if we're not just getting
> > some uninitialized value from the emulation code... which makes me kind
> > of uneasy about your approach of trying to make sense of the value.
> > Maybe Shrikrishna can clarify where the icr value is coming from when
> > Other is set in EIAC.
> 
> For now I still say we let my current patch go as-is in order to
> address the immediate issue. We can follow-up and do a more refined
> version of things once we get the final word from VMware on how this
> actually works. If nothing else the current patch appears to resolve
> the currently reported issue and is already submitted.
> 
> I'm of the mind that we need to cut down on the code thrash.  This
> driver is supposed to have been in a "maintenance" mode for the last
> year or so as there aren't being any new parts added is my
> understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
> Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
> accepted in the first place. I don't see any notes about it fixing any
> bug or addressing any issue and it seems like that is the start of all
> the issues we have been having recently with RXO triggering more
> interrupts, various link issues, and this most recent VMware issue.

I'm sorry to say but you're the one who suggested that change:

http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html

On 2015/10/28 23:08, Alexander Duyck wrote:
> On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
[...]
> 
> I would argue your first patch probably didn't go far enough to remove dead
> code.  Specifically you should only ever get into this function if LSC is
> set.  There are no other causes that should trigger this.  As such you could
> probably remove the ICR read, and instead replace it with an ICR write of
> the LSC bit since OTHER is already cleared via EIAC.
>