netdev - Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAM_iQpXWofVOg+u3fLYgFfU9-aTd7CadYfGF_q2hz16Af4_xGw@mail.gmail.com>
Date:   Fri, 16 Nov 2018 12:06:00 -0800
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Herbert Xu <herbert@...dor.apana.org.au>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Tom Herbert <tom@...bertland.com>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: [Patch net] net: invert the check of detecting hardware RX
 checksum fault

On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu <herbert@...dor.apana.org.au> wrote:
>
> On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
> >
> > > Normally if the hardware's partial checksum is valid then we just
> > > trust it and send the packet along.  However, if the partial
> > > checksum is invalid we don't trust it and we will compute the
> > > whole checksum manually which is what ends up in sum.
> >
> > Not sure if I understand partial checksum here, but it is the
> > CHECKSUM_COMPLETE case which I am trying to fix, not
> > CHECKSUM_PARTIAL.
>
> What I meant by partial checksum is the checksum produced by the
> hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
> CHECKSUM_PARTIAL is the absence of the substantial part of the
> checksum which is something we use in the kernel primarily for TX.
>
> Yes the names are confusing :)

Yeah, understood. The hardware provides skb->csum in this case, but
we keep adjusting it each time when we change skb->data.

>
> > So, in other word, a checksum *match* is the intended to detect
> > this HW RX checksum fault?
>
> Correct.  Or more likely it's probably a bug in either the driver
> or if there are overlaying code such as VLAN then in that code.
>
> Basically if the RX checksum is buggy, it's much more likely to
> cause a valid packet to be rejected than to cause an invalid packet
> to be accepted, because we still verify that checksum against the
> pseudoheader.  So we only attempt to catch buggy hardware/drivers
> by doing a second manual verification for the case where the packet
> is flagged as invalid.

Hmm, now I see how it works. Actually it uses the differences between
these two check's as the difference between hardware checksum with
skb_checksum().

I will send a patch to add a comment there to avoid confusion.

>
> > Sure, my case is nearly same with Pawel's, except I have no vlan:
> > https://marc.info/?l=linux-netdev&m=154086647601721&w=2
>
> Can you please provide your backtrace?

I already did:
https://marc.info/?l=linux-netdev&m=154092211305599&w=2

Note, the offending commit has been backported to 4.14, which
is why I saw this warning. I have no idea why it is backported
from the beginning, it is just an optimization, doesn't fix any bug,
IMHO.

Also, it is much harder for me to reproduce it than Pawel who
saw the warning every second. Sometimes I need 1 hour to trigger
it, sometimes other people here needs 10+ hours to trigger it.

Let me see if I can add vlan on my side to make it more reproducible,
it seems hard as our switch doesn't use vlan either.

We have warnings with conntrack involved too, I can provide it too
if you are interested.

I tend to revert it for -stable, at least that is what I plan to do
on my side unless there is a fix coming soon.

Thanks.