netdev - Re: [RFC]: Support for zero-copy TCP transmit of user space data

Message-ID: <495137DC.8050204@vlnb.net>
Date:	Tue, 23 Dec 2008 22:11:24 +0300
From:	Vladislav Bolkhovitin <vst@...b.net>
To:	Jens Axboe <jens.axboe@...cle.com>
CC:	"David M. Lloyd" <dmlloyd@...rg.com>, linux-mm@...ck.org,
	Christoph Hellwig <hch@...radead.org>,
	James Bottomley <James.Bottomley@...senPartnership.com>,
	linux-scsi@...r.kernel.org, linux-kernel@...r.kernel.org,
	scst-devel@...ts.sourceforge.net,
	Bart Van Assche <bart.vanassche@...il.com>,
	netdev@...r.kernel.org
Subject: Re: [RFC]: Support for zero-copy TCP transmit of user space data

Jens Axboe, on 12/19/2008 10:27 PM wrote:
>>>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset 
>>>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
>>>>>> have TCP zero-copy transmit of user space data was implemented. The patch
>>>>>> implementing this optimization was also sent in the patchset, see
>>>>>> http://lkml.org/lkml/2008/12/10/296.
>>>>> I'm probably ignorant of about 90% of the context here, but isn't this 
>>>>> the sort of problem that was supposed to have been solved by vmsplice(2)?
>>>> No, vmsplice can't help here. iSCSI-SCST is a kernel space driver. But
>>>> even if it were a user space driver, vmsplice wouldn't change much.
>>>> It gives the user no way to know when transmission of the data has
>>>> finished. So it is intended to be used as:
>>>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer ->
>>>> vmsplice() it. But at the mmap() stage the kernel has to zero all the
>>>> newly mapped pages, and zeroing memory isn't much faster than copying
>>>> it. Hence, there would be no considerable performance increase.
>>> vmsplice() isn't the right choice, but splice() very well could be. You
>>> could easily use splice internally as well. The vmsplice() part sort-of
>>> applies in the sense that you want to fill pages into a pipe, which is
>>> essentially what vmsplice() does. You'd need some helper to do that.
>> Sorry, Jens, but splice() works only if there is a file handle on the
>> other side, so user space doesn't see the data buffers. But SCST needs
>> to serve a wider range of use cases, like reading data with
>> decompression from a virtual tape, where decompression is done in user
>> space. For those, only the complete zero-copy network send, which I
>> implemented, can give the best performance.
> 
> __splice_from_pipe() takes a pipe, a descriptor and an actor. There's
> absolutely ZERO reason you could not reuse most of that for this
> implementation. The big bonus here is that getting the put correct from
> networking would even make splice() better for everyone. Win for Linux,
> win for you since it'll make it MUCH easier for you to get this stuff
> in. Looking at your original patch and I almost think it's a flame bait
> to induce discussion (nothing wrong with that, that approach works quite
> well and has been used before). There's no way in HELL that it'd ever be
> a merge candidate. And I suspect you know that, at least I hope you do
> or you are farther away from going forward with this than you think.
> 
> So don't look at splice() the system call, look at the infrastructure
> and check if that could be useful for your case. To me it looks
> absolutely like it could, if your goal is just zero-copy transmit.

I looked at the splice code again to make sure I wasn't missing anything.
__splice_from_pipe() leads to pipe_to_sendpage(), which leads to
sock_sendpage(), and then to sock->sendpage(). Sorry, but I don't see the
point of going through all the complicated splice infrastructure instead
of calling sock->sendpage() directly, as I do.

> The
> only missing piece is dropping the reference and signalling page
> consumption at the right point, which is when the data is safe to be
> reused. That very bit is missing, but that should be all as far as I can
> tell.

This is exactly what I implemented in the patch we are discussing.
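
For readers without the patch at hand, the "release on ack" idea described
in the quoted text can be illustrated with a rough user-space toy model.
This is only a sketch under stated assumptions: it is not code from the
patch or from the kernel, and struct tx_page, tx_page_get(), tx_page_put()
and the on_done hook are hypothetical names invented for the illustration.
The point it shows is that the buffer owner is notified only when the last
reference is dropped, not when the send call returns.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page under transmission; NOT the kernel's struct page. */
struct tx_page {
	int refs;                          /* outstanding references */
	void (*on_done)(struct tx_page *); /* hypothetical completion hook */
	int done;                          /* set once the hook has fired */
};

/* Take a reference, e.g. when the page is queued for transmit. */
static void tx_page_get(struct tx_page *p)
{
	p->refs++;
}

/*
 * Drop a reference, e.g. when the data has been acknowledged.
 * The owner is signalled only when the last reference goes away,
 * which is the point at which the buffer is safe to reuse.
 */
static void tx_page_put(struct tx_page *p)
{
	assert(p->refs > 0);
	if (--p->refs == 0 && p->on_done != NULL)
		p->on_done(p);
}

/* Example hook: just record that the page may be reused. */
static void mark_done(struct tx_page *p)
{
	p->done = 1;
}
```

In this model a caller would take one reference per queued fragment with
tx_page_get() and treat the buffer as reusable only after mark_done() has
run, that is, after the matching number of tx_page_put() calls.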

>>> And
>>> the ack-on-xmit-done bits is something that splice-to-socket needs
>>> anyway, so I think it'd be quite a suitable choice for this.
>> So, are you writing that splice() could also benefit from the zero-copy 
>> transmit feature, like I implemented?
> 
> I like how you want to reinvent everything, perhaps you should spend a
> little more time looking into various other approaches? splice() already
> does zero-copy network transmit, there are no copies going on. Ideally,
> you'd have zero copies moving data into your pipe, but migrate/move
> isn't quite there yet. But that doesn't apply to your case at all.
> 
> What is missing, as I wrote, is the 'release on ack' and not on pipe
> buffer release. This is similar to the get_page/put_page stuff you did
> in your patch, but don't go claiming that zero-copy transmit is a
> Vladislav original - the ->sendpage() does no copies.

Jens, I have never claimed that I reinvented ->sendpage(). Quite the
opposite, I use it. I only extended it with a missing feature. Still,
since it seems you were misled, I should apologize for the poor
description of the patch.

Thanks,
Vlad


