netdev - Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU

Message-ID: <CAEP_g=8bCR=PeSoi09jLWLtNUrxhzx45h1Wm=9D=R57AqUac2w@mail.gmail.com>
Date:	Tue, 6 Jan 2015 14:11:24 -0500
From:	Jesse Gross <jesse@...ira.com>
To:	Fan Du <fengyuleidian0615@...il.com>
Cc:	"Du, Fan" <fan.du@...el.com>, Thomas Graf <tgraf@...g.ch>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	Jason Wang <jasowang@...hat.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"fw@...len.de" <fw@...len.de>,
	"dev@...nvswitch.org" <dev@...nvswitch.org>,
	"pshelar@...ira.com" <pshelar@...ira.com>
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU

On Tue, Jan 6, 2015 at 4:34 AM, Fan Du <fengyuleidian0615@...il.com> wrote:
>
> On 2015/1/6 1:58, Jesse Gross wrote:
>>
>> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@...il.com>
>> wrote:
>>>
>>> On 2014-12-03 10:31, Du, Fan wrote:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Thomas Graf [mailto:tgr@...radead.org] On Behalf Of Thomas Graf
>>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>>> To: Michael S. Tsirkin
>>>>> Cc: Du, Fan; 'Jason Wang'; netdev@...r.kernel.org; davem@...emloft.net;
>>>>> fw@...len.de; dev@...nvswitch.org; jesse@...ira.com; pshelar@...ira.com
>>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>>>>>
>>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>>>
>>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>>>
>>>>>>>> What about containers or any other virtualization environment that
>>>>>>>> doesn't use Virtio?
>>>>>>>
>>>>>>>
>>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>>> internal which would be primary container plumbing techniques.
>>>>>>
>>>>>>
>>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>>> real ethernet uses a fixed MTU.
>>>>>>
>>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>>> unrelated optimization.
>>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>>
>>>>>
>>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>>> inside the VM. This may not be the case at all.
>>>>
>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> Some thoughts here:
>>>>
>>>> On the contrary, this is exactly where the host stack should forge an
>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message with the _inner_ skb's network
>>>> and transport headers, for whatever type of encapsulation is in use, and
>>>> then push that packet up to the guest/container, so that it appears to
>>>> have been sent by an intermediate node or the peer. PMTU discovery should
>>>> then work correctly. The same behavior should be shared by every other
>>>> encapsulation technology that suffers from this problem.
>>>
>>>
>>> Hi David, Jesse and Thomas
>>>
>>> As discussed here:
>>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4
>>> to quote Jesse:
>>> My proposal would be something like this:
>>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>>>     segment.
>>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>>>     normal routing behavior).
>>>   * As a last resort (such as if using an old version of virtio in the
>>>     guest), fragment the tunnel packet.
>>>
>>>
>>> For L2, it's an administrative action.
>>> For L3, the PMTU approach looks better: once the sender is alerted to the
>>> reduced MTU, the packet size after encapsulation will not exceed the
>>> physical MTU, so no additional fragmentation is needed.
>>> For "As a last resort... fragment the tunnel packet", the original patch
>>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job,
>>> but it seems it was not welcomed.
>>
>> This needs to be properly integrated into IP processing if it is to
>> work correctly.
>
> Do you mean the original patch in this thread? Yes, it works correctly
> in my cloud env. If you have any other concerns, please let me know. :)

Ok...but that doesn't actually address the points that I made.

>> One of the reasons for only doing path MTU discovery
>> for L3 is that it operates seamlessly as part of normal operation -
>> there is no need to forge addresses or potentially generate ICMP when
>> on an L2 network. However, this ignores the IP handling that is going
>> on (note that in OVS it is possible for L3 to be implemented as a set
>> of flows coming from a controller).
>>
>> It also should not be VXLAN specific or duplicate VXLAN encapsulation
>> code. As this is happening before encapsulation, the generated ICMP
>> does not need to be encapsulated either if it is created in the right
>> location.
>
> Yes, I agree. GRE shares the same issue, judging from the code flow.
> Pushing the ICMP message back without encapsulation, and without passing it
> down to the physical device, is possible. The "right location", as far as I
> know, could only be ovs_vport_send. In addition, this probably requires a
> wrapper around the route lookup for GRE/VXLAN; once the underlay device MTU
> is obtained from the routing information, calculating the reduced MTU
> becomes feasible.
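
The "reduced MTU" calculation mentioned above can be sketched roughly as
follows. This is only an illustration of the arithmetic, assuming VXLAN over
an IPv4 underlay with no IP options and no VLAN tag; the helper names
vxlan_inner_mtu() and vxlan_exceeds_mtu() are hypothetical and are not kernel
or OVS functions.

```c
#include <assert.h>

/* Header sizes in bytes; assumes an IPv4 underlay without IP options. */
enum {
    OUTER_IPV4_HLEN = 20,
    OUTER_UDP_HLEN  = 8,
    VXLAN_HLEN      = 8,
    INNER_ETH_HLEN  = 14,
    VXLAN_OVERHEAD  = OUTER_IPV4_HLEN + OUTER_UDP_HLEN +
                      VXLAN_HLEN + INNER_ETH_HLEN
};

static_assert(VXLAN_OVERHEAD == 50, "IPv4 VXLAN overhead is 50 bytes");

/*
 * Hypothetical helper: largest inner IP packet that still fits in a single
 * underlay packet after encapsulation. Returns 0 if the underlay MTU cannot
 * even hold the encapsulation headers.
 */
static unsigned int vxlan_inner_mtu(unsigned int underlay_mtu)
{
    return underlay_mtu > VXLAN_OVERHEAD ? underlay_mtu - VXLAN_OVERHEAD : 0;
}

/* Nonzero if an inner IP packet of inner_len bytes is too big to encapsulate. */
static int vxlan_exceeds_mtu(unsigned int inner_len, unsigned int underlay_mtu)
{
    return inner_len > vxlan_inner_mtu(underlay_mtu);
}
```

With a 1500-byte underlay MTU this gives an inner MTU of 1450, so a guest
sending 1500-byte IP packets would exceed it. A GRE variant would differ only
in the overhead constant.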

As I said, it needs to be integrated into L3 processing. In OVS this
would mean adding some primitives to the kernel and then exposing the
functionality upwards into userspace/controller.
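
For reference, the ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message discussed in
this thread has the layout given in RFC 792 and RFC 1191: type 3, code 4, a
checksum, an unused 16-bit field, the next-hop MTU, and then the original IP
header plus the first 8 bytes of its payload. The following is a userspace
sketch of that layout only, not the kernel or OVS implementation;
build_frag_needed() and inet_checksum() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_TYPE_DEST_UNREACH 3 /* RFC 792 */
#define ICMP_CODE_FRAG_NEEDED  4 /* RFC 1191 */
#define ICMP_HDR_LEN           8

/* RFC 1071 Internet checksum, computed bytewise so it is endian-neutral. */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                      /* odd trailing byte, zero-padded */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical helper: write a "fragmentation needed" ICMP message into out.
 * orig points at the offending inner IP header followed by the first 8 bytes
 * of its payload. Returns the message length, or 0 if out is too small.
 */
static size_t build_frag_needed(uint8_t *out, size_t out_len,
                                uint16_t next_hop_mtu,
                                const uint8_t *orig, size_t orig_len)
{
    size_t total = ICMP_HDR_LEN + orig_len;
    uint16_t csum;

    assert(orig != NULL || orig_len == 0);
    if (out_len < total)
        return 0;

    out[0] = ICMP_TYPE_DEST_UNREACH;
    out[1] = ICMP_CODE_FRAG_NEEDED;
    out[2] = out[3] = 0;              /* checksum, filled in below */
    out[4] = out[5] = 0;              /* unused */
    out[6] = (uint8_t)(next_hop_mtu >> 8);
    out[7] = (uint8_t)(next_hop_mtu & 0xff);
    if (orig_len)
        memcpy(out + ICMP_HDR_LEN, orig, orig_len);

    csum = inet_checksum(out, total);
    out[2] = (uint8_t)(csum >> 8);
    out[3] = (uint8_t)(csum & 0xff);
    return total;
}
```

The sketch builds only the ICMP portion. Choosing source and destination
addresses for the enclosing IP header is the part that depends on the L3
integration described above.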