Open Source and information security mailing list archives
 
Date:	Thu, 18 Feb 2016 11:21:27 +0100
From:	Jan Kara <jack@...e.cz>
To:	Shaun Tancheff <shaun.tancheff@...gate.com>
Cc:	Dave Chinner <david@...morbit.com>, Jan Kara <jack@...e.cz>,
	Changho Choi-SSI <changho.c@....samsung.com>,
	"lsf-pc@...ts.linux-foundation.org" 
	<lsf-pc@...ts.linux-foundation.org>, linux-fsdevel@...r.kernel.org,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] multi-stream IO hint implementation
 proposal for LSF/MM 2016

On Wed 17-02-16 21:51:56, Shaun Tancheff wrote:
> On Wed, Feb 17, 2016 at 5:36 PM, Dave Chinner <david@...morbit.com> wrote:
> 
>     On Wed, Feb 17, 2016 at 04:21:55PM +0100, Jan Kara wrote:
>     > On Sat 13-02-16 01:50:09, Changho Choi-SSI wrote:
>     > > Dear Program committee,
>     > >
>     > > I wanted to propose a technical discussion.
>     > > Please let me know if there is anything else that I have to submit
>     > > and/or prepare.
>     >
>     > As a side note: It is good to CC other relevant mailing lists so that
>     > corresponding developers can react to the proposal.
>     >
>     > > ==
>     > > Linux Kernel Multi-stream I/O Hint Implementation
>     > >
>     > > Enterprise, datacenter, and client systems increasingly deploy NAND
>     > > flash-based SSDs. However, in use, SSDs cannot avoid garbage
>     > > collection, which inevitably causes write amplification that
>     > > decreases device performance. Unfortunately, write amplification
>     > > also shortens SSD lifetime. With multi-stream, this unavoidable
>     > > garbage collection overhead (i.e., write amplification) can be
>     > > significantly reduced. For multi-stream devices, the host tags
>     > > device I/O write requests with a stream ID (an I/O hint). The SSD
>     > > controller places the data in media erase blocks according to the
>     > > stream ID. For example, an SSD controller stores data with the same
>     > > stream ID in an associated physical location inside the SSD. In
>     > > this way, multi-stream depends on host I/O hints, so it is useful
>     > > to work out how to implement multi-stream I/O hints under the
>     > > protocol's constraints. The T10 SCSI standards group has already
>     > > standardized the multi-stream feature, and NVMe standardization is
>     > > anticipated in March 2016. Many Linux users want to leverage
>     > > multi-stream as a mainstream Linux feature, since they have seen
>     > > performance improvement and SSD lifetime extension when evaluating
>     > > multi-stream enabled devices. Hence, the multi-stream feature is a
>     > > good Linux community development candidate and should be discussed
>     > > within the community. I propose this multi-stream topic (i.e., I/O
>     > > write hint implementation) as a discussion session. I can briefly
>     > > present the multi-stream system architecture and answer any
>     > > technical questions.
>     >
>     > So a key question for a feature like this is: How many stream IDs are
>     > devices going to support? Because AFAIR so far the answer was "it
>     > depends on the device". However, the design of how stream IDs can be
>     > used differs greatly between "a couple of stream IDs" and e.g. 2^32
>     > stream IDs. Without this information I don't think the discussion
>     > would be very useful. So can you provide some rough numbers?
> 
> I think we start with the spec granularity of 16-bit IDs and let the
> hw/fw funnel it down to what is useful for that device. For a small fast
> SSD in the 500G range, 64k IDs is rather unwieldy. I am confident that
> the firmware will be able to funnel that 64k down to a manageable size
> for what it needs. After all, once you have 50T or so, 64k IDs isn't
> that many anymore.

OK, makes sense.
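A minimal sketch of the "funnel" described above, collapsing the 16-bit stream ID space from the spec down to the handful of streams a given device actually supports. The modulo mapping and the function name are assumptions for illustration; real firmware would presumably track per-stream write heat rather than hash blindly:

```c
#include <stdint.h>

/* Illustrative funnel: map a 16-bit host stream ID onto one of the
 * dev_streams streams the device actually implements. */
static uint16_t funnel_stream_id(uint16_t host_id, uint16_t dev_streams)
{
    if (dev_streams == 0)
        return 0;               /* device exposes no streams */
    return host_id % dev_streams;
}
```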

>     To me, the biggest problem these hint proposals have had in the past
>     is with the user-facing API. Passing hints through the kernel IO
>     stack isn't a huge issue - it's how to get them into the kernel,
>     what defaults should be used when they are not provided, whether the
>     kernel can reserve streams for its own use (i.e. journal and
>     metadata streams), how to assign valid stream IDs outside of the
>     IO call interface consistently across different filesystems, whether
>     stream IDs should be persistent for an inode, error behaviour when
>     an invalid stream ID is used, etc.
> 
> 
> I would like and support the kernel having well-known stream IDs for its
> own use.
> Knowing that the incoming data is a journal, or other metadata, is very
> helpful. I think they should even be well-known IDs.

Well, I agree that the kernel will use some IDs internally. I'd just be
careful about making any IDs well-known. Different filesystems will have
different ideas of how to use IDs, and publicizing any IDs would invite
firmware writers to try to optimize for particular IDs, which would then
screw up all the other filesystems (remember the optimizations flash
people did for FAT). So I'd say it's at the filesystem's discretion to
use some IDs for its own purposes and remap user requests for these IDs
to somewhere else...
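A hypothetical sketch of that remapping: the filesystem keeps the first few stream IDs for its own journal/metadata writes and folds any user-supplied ID into the remaining range, so userspace can never land on a reserved stream. `FS_RESERVED_STREAMS` and the shift-and-wrap scheme are assumptions, not existing kernel interfaces:

```c
#include <stdint.h>

/* Assumed number of streams the fs keeps for journal/metadata. */
#define FS_RESERVED_STREAMS 4u

/* Fold a user-supplied stream ID into [FS_RESERVED_STREAMS, UINT16_MAX]
 * so user writes can never share a stream with fs-internal writes. */
static uint16_t fs_remap_stream_id(uint16_t user_id)
{
    uint32_t span = (uint32_t)UINT16_MAX + 1 - FS_RESERVED_STREAMS;
    return (uint16_t)(FS_RESERVED_STREAMS + user_id % span);
}
```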

> As for what to use when you have nothing provided, I'm thinking of
> something that hashes up the originating process in some predictable way
> as a fallback. I am not yet convinced the stream ID needs to be
> persisted in the inode.

Well, if I get it right, the purpose of stream IDs is to group blocks with
similar lifetimes into the same stream (so that write amplification due to
garbage collection of partially used erase blocks is reduced). I'm not sure
grouping by PID makes much sense. I think that using the inode number, or
even the parent directory's inode number, as a base for the stream ID would
make more sense. Filesystems try to group blocks with similar lifetimes to
reduce fragmentation as well, and historically the best general heuristic
we have come up with was to use the parent directory as a key. It is a
crude heuristic but it is pretty simple.
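The parent-directory heuristic could be sketched as below: derive a default stream ID from the parent directory's inode number, so files created in the same directory tend to share a stream. The golden-ratio multiplier is just a generic integer mixer; nothing here comes from an existing filesystem:

```c
#include <stdint.h>

/* Illustrative default-stream heuristic: hash the parent directory's
 * inode number into one of nr_streams streams. */
static uint16_t default_stream_id(uint64_t parent_ino, uint16_t nr_streams)
{
    uint64_t h = parent_ino * 0x9E3779B97F4A7C15ULL;  /* mix the bits */
    return (uint16_t)((h >> 48) % nr_streams);
}
```

Files in one directory map deterministically to one stream; different directories scatter across streams.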

Also the choice of stream ID when userspace doesn't provide it will likely
depend on the fs anyway - e.g. btrfs with its copy-on-write has very
different block lifetimes from e.g. ext4.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
