linux-kernel - Re: Integration of SCST in the mainstream Linux kernel

Message-Id: <1202152756.11265.581.camel@haakon2.linux-iscsi.org>
Date:	Mon, 04 Feb 2008 11:19:16 -0800
From:	"Nicholas A. Bellinger" <nab@...ux-iscsi.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	James Bottomley <James.Bottomley@...senPartnership.com>,
	Vladislav Bolkhovitin <vst@...b.net>,
	Bart Van Assche <bart.vanassche@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	FUJITA Tomonori <fujita.tomonori@....ntt.co.jp>,
	linux-scsi@...r.kernel.org, scst-devel@...ts.sourceforge.net,
	linux-kernel@...r.kernel.org, Mike Christie <michaelc@...wisc.edu>,
	CBE-OSS-DEV <cbe-oss-dev@...abs.org>
Subject: Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 11:06 -0800, Nicholas A. Bellinger wrote:
> On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> > 
> > On Mon, 4 Feb 2008, James Bottomley wrote:
> > > 
> > > The way a user space solution should work is to schedule mmapped I/O
> > > from the backing store and then send this mmapped region off for target
> > > I/O.
> > 
> > mmap'ing may avoid the copy, but the overhead of a mmap operation is 
> > quite often much *bigger* than the overhead of a copy operation.
> > 
> > Please do not advocate the use of mmap() as a way to avoid memory copies. 
> > It's not realistic. Even if you can do it with a single "mmap()" system 
> > call (which is not at all a given, considering that block devices can 
> > easily be much larger than the available virtual memory space), the fact 
> > is that page table games along with the fault (and even just TLB miss) 
> > overhead is easily more than the cost of copying a page in a nice 
> > streaming manner.
> > 
> > Yes, memory is "slow", but dammit, so is mmap().
> > 
> > > You also have to pull tricks with the mmap region in the case of writes 
> > > to prevent useless data being read in from the backing store.  However, 
> > > none of this involves data copies.
> > 
> > "data copies" is irrelevant. The only thing that matters is performance. 
> > And if avoiding data copies is more costly (or even of a similar cost) 
> > than the copies themselves would have been, there is absolutely no upside, 
> > and only downsides due to extra complexity.
> > 
> 
> The iSER spec (RFC-5046) says the following about direct data
> placement in the TCP case:
> 
> "  Out-of-order TCP segments in the Traditional iSCSI model have to be
>    stored and reassembled before the iSCSI protocol layer within an end
>    node can place the data in the iSCSI buffers.  This reassembly is
>    required because not every TCP segment is likely to contain an iSCSI
>    header to enable its placement, and TCP itself does not have a
>    built-in mechanism for signaling Upper Level Protocol (ULP) message
>    boundaries to aid placement of out-of-order segments.  This TCP
>    reassembly at high network speeds is quite counter-productive for the
>    following reasons: wasted memory bandwidth in data copying, the need
>    for reassembly memory, wasted CPU cycles in data copying, and the
>    general store-and-forward latency from an application perspective."
> 
> While this has nothing directly to do with the kernel vs. user
> discussion for a target mode storage engine, the scaling and latency
> case is easy enough to make if we are talking about scaling TCP for
> 10 Gb/sec storage fabrics.
> 
> > If you want good performance for a service like this, you really generally 
> > *do* need to be in kernel space. You can play games in user space, but you're 
> > fooling yourself if you think you can do as well as doing it in the 
> > kernel. And you're *definitely* fooling yourself if you think mmap() 
> > solves performance issues. "Zero-copy" does not equate to "fast". Memory 
> > speeds may be slower than core CPU speeds, but not infinitely so!
> > 
> 
> From looking at this problem from a kernel space perspective for a
> number of years, I would be inclined to believe this is true for
> software and hardware data-path cases.  The benefit of moving various
> control statemachines for something like traditional iSCSI to
> userspace has always been debatable.  The most obvious candidate for
> userspace is authentication, especially anything more complex than
> CHAP.  However, I have always thought that recovery from failures of
> the communication path (iSCSI connections) or of entire nexuses (iSCSI
> sessions) becomes very problematic if it means potentially having to
> push IO state down to userspace.
> 
> Protocol and/or fabric specific statemachines (CSM-E and CSM-I from
> connection recovery in iSCSI and iSER are the obvious ones) are the
> best candidates for residing in kernel space.
> 
> > (That said: there *are* alternatives to mmap, like "splice()", that really 
> > do potentially solve some issues without the page table and TLB overheads. 
> > But while splice() avoids the costs of paging, I strongly suspect it would 
> > still have easily measurable latency issues. Switching between user and 
> > kernel space multiple times is definitely not going to be free, although 
> > it's probably not a huge issue if you have big enough requests).
> > 
> 

Then again, having some of the data-path for software and hardware bulk
IO operation of a storage fabric protocol / statemachine in userspace
would be really interesting for something like an SPU enabled engine for
the Cell Broadband Engine Architecture.
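
To put the mmap()-versus-copy argument quoted above in concrete terms,
here is a minimal userspace sketch of the two access patterns being
compared.  It is only an illustration under stated assumptions, not code
from SCST or any target implementation: the function names
sum_with_read() and sum_with_mmap() and the 64 KiB buffer size are made
up for this example.  Both functions compute the same byte sum over a
file; one streams the data through a buffer with read(), the other maps
the file and lets page faults bring the data in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sum all bytes of a file by copying it through a
 * fixed user buffer with read().  This is the "streaming copy" path. */
static int sum_with_read(const char *path, uint64_t *sum)
{
    unsigned char buf[65536];   /* buffer size is an arbitrary choice */
    ssize_t n;
    int fd;

    assert(sum != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    *sum = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (ssize_t i = 0; i < n; i++)
            *sum += buf[i];
    }
    close(fd);
    return n < 0 ? -1 : 0;
}

/* Hypothetical helper: same sum, but the file is mapped with mmap().
 * There is no explicit copy; the cost moves to page table setup,
 * page faults and TLB misses instead. */
static int sum_with_mmap(const char *path, uint64_t *sum)
{
    struct stat st;
    int fd;

    assert(sum != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    *sum = 0;
    if (st.st_size > 0) {       /* mmap() of length 0 is an error */
        size_t len = (size_t)st.st_size;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (size_t i = 0; i < len; i++)
            *sum += p[i];
        munmap(p, len);
    }
    close(fd);
    return 0;
}
```

Timing the two functions on the same large file (for example with
clock_gettime() around each call) is one way to check, on a given
machine, whether avoiding the copy actually wins.  The sketch makes no
claim about which one does.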

--nab


