Message-ID: <CA+FuTSfyNMP59ugcmvwNV+sZiW2=4hcwqn_NOzVEvj0rGK=eBw@mail.gmail.com>
Date: Thu, 27 Nov 2014 19:32:54 -0500
From: Willem de Bruijn <willemb@...gle.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Network Development <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Eric Dumazet <eric.dumazet@...il.com>,
Daniel Borkmann <dborkman@...hat.com>
Subject: Re: [PATCH rfc] packet: zerocopy packet_snd
On Thu, Nov 27, 2014 at 2:27 AM, Michael S. Tsirkin <mst@...hat.com> wrote:
> On Wed, Nov 26, 2014 at 06:05:16PM -0500, Willem de Bruijn wrote:
>> On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@...hat.com> wrote:
>> > On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>> >> > The main problem with zero copy ATM is with queueing disciplines
>> >> > which might keep the socket around essentially forever.
>> >> > The case was described here:
>> >> > https://lkml.org/lkml/2014/1/17/105
>> >> > and of course this will make it more serious now that
>> >> > more applications will be able to do this, so
>> >> > chances that an administrator enables this
>> >> > are higher.
>> >>
>> >> The denial of service issue raised there, that a single queue can
>> >> block an entire virtio-net device, is less problematic in the case of
>> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>> >> application can increase the limit or use separate sockets for
>> >> separate flows.
>> >
>> > Sounds like this interface is very hard to use correctly.
>>
>> Actually, this socket alloc issue is the same for zerocopy and
>> non-zerocopy. Packets can be held in deep queues at which point
>> the packet socket is blocked. This is accepted behavior.
>>
>> From the above thread:
>>
>> "It's ok for non-zerocopy packet to be blocked since VM1 thought the
>> packets has been sent instead of pending in the virtqueue. So VM1 can
>> still send packet to other destination."
>>
>> This is very specific to virtio and vhost-net. I don't think that that
>> concern applies to a packet interface.
>
> Well, you are obviously building the interface with some use-case in mind.
> Let's try to make it work for multiple use-cases.
>
> So at some level, you are right. The issue is not running out of wmem.
> But I think I'm right too - this is hard to use correctly.
>
> I think the difference is that with your patch, application
> can't reuse the memory until packet is transmitted, otherwise junk goes
> out on the network.
The packet headers are in linear skb memory, so the integrity
of the kernel is not affected. But the payload (and derived data,
like checksums) may be junk.
> Even closing the socket won't help.
> Is this true?
Good point. Indeed. Your approach in the follow-up email,
to let the process release its mappings, sounds like a good
answer.
In general, I think that leaving this under process control is
preferable to kernel guarantees using a timer, because it
is more predictable from the process's point of view. Both
the state of the system at any point in time and the per-packet
cycle cost are easier to reason about (was a packet sent out
with zero copy, or did a timer trigger a deep copy? this
question would be difficult to answer otherwise, and it is
important, as deep copies are likely more expensive).
>
> I see this as a problem.
>
> I'm trying to figure out how would one use this interface, one obvious
> use would be to tunnel out raw packets directly from VM memory.
> For this application, a zero copy packet never completing is a problem:
> at minimum, you want to be able to remove the device, which
> translates to a requirement that closing the socket effectively stops
> using userspace memory.
The application can ensure this by releasing the relevant
pages. For mmap()ed pages, munmap() is sufficient. The approach
you mentioned with madvise may be more widely applicable. I had
so far only targeted mmap()ed (incl. MAP_HUGETLB) pages.
Alternatively, the kernel could enforce this by keeping a reference
on all in-flight ubuf_info structs associated with the sk. Though this
adds non-trivial accounting overhead and potentially massive
copying at socket close.
If outstanding ubuf_info are tracked, then the forced copy can
be triggered by other actions as well, such as in response to a
sendmsg cmsg or setsockopt.
Because there is per-call cost associated with maintaining
kernel state, but not necessarily with userspace strategies,
I would choose to leave this responsibility to the user. That
is already the behavior with vmsplice gifting, I think.
> In case you want to be able to run e.g. a watchdog,
> the ability to specify a deadline also seems beneficial.
>> Another issue, though, is that the method currently really only helps
>> TSO, because all other paths cause a deep copy. There are more use
>> cases once it can send up to 64KB MTU over loopback or send out
>> GSO datagrams without triggering skb_copy_ubufs. I have not looked
>> into how (or if) that can be achieved yet.
>
> I think this was done intentionally at some point,
> try to look at git history to find out the reasons.
>
>> >
>> >> > One possible solution is some kind of timer orphaning frags
>> >> > for skbs that have been around for too long.
>> >>
>> >> Perhaps this can be approximated without an explicit timer by calling
>> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>> >
>> > Not sure. I'll have to see that patch to judge.