linux-kernel - [RFC][PATCH] Immutable bio vecs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20120829064117.GB22214@moria.home.lan>
Date:	Tue, 28 Aug 2012 23:41:17 -0700
From:	Kent Overstreet <koverstreet@...gle.com>
To:	linux-bcache@...r.kernel.org, linux-kernel@...r.kernel.org,
	dm-devel@...hat.com, linux-raid@...r.kernel.org
Cc:	tj@...nel.org, martin.petersen@...cle.com
Subject: [RFC][PATCH] Immutable bio vecs

Ok, I lied about patches. I got something working yesterday, but the
code is ugly and in flux and not _remotely_ complete.

Well, if you want to see actual code, it's up at
http://evilpiepirate.org/git/linux-bcache.git/log/?h=block_stuff-immutable_bio_vecs
http://evilpiepirate.org/git/linux-bcache.git/log/?h=block_stuff-immutable_bio_vecs

Anyways: IMMUTABLE BVECS

This is an idea I've been kicking around for ages, but it turns out it
had occured to other people besides me, so the other day I finally bit
the bullet and started writing code and trying to figure out how to make
it work sanely.

THE PROBLEM:

This mainly (but not exclusively!) affects stacking block devices, in a
variety of ways.

 * Stacking block devices pretty much universally have to split bios.
Bio splitting is inherently problematic when bvecs are mutable.
Primarily when the split isn't aligned on a bvec boundary, but the block
layer adds other difficulties.

 * CLONING/RETRYING
Many stacking block drivers like to handle errors. You know, things like
raid.

Well, mutable bvecs mean they have to clone not just the bio, but the
entire bvec to retry failed IOs. Awesome.

I'm sure this is less of a problem for certain filesystems, but I bet
their error handling code could benefit from this.

 * CONFUSING/INCONSISTENT SEMANTICS
This isn't so much something that immutable bio vecs will solve, but
rather something that has to be solved to make immutable bio vecs work.

Clearly separating the fields in struct bio into things that change
after submission and things that don't, and making the mutable things
mutate consistently has the potential to simplify and clarify a lot of
code.

I've been staring at a _lot_ of block layer code lately.

BENEFITS:

In the shorter term, what this really enables is writing/changing code
in the block layer to be much more incremental/flexible - instead of
going to ridiculous lengths to set things up ahead of time to adhere to
all the concievable limits, we simply handle bios incrementally and just
always make forward progress.

Arbitrary efficient bio splitting means (to pick one thing of many) we
can get rid of merge_bvec_fn. I did that in one of my branches, and I
deleted over 800 lines of code.

800 lines of HORRIBLE HORRIBLE code.

It isn't just stacking block devices that are benefited by this stuff,
normal block drivers should see improvements to, I just haven't spent as
much time with that kind of code. A lot of code related to
merging/counting segments can be gotten rid of, though - and that is
tricky code.

FURTHER OFF:

Something I've heard Tejun complain about (and I agree with) is the
ridiculous number of forms scatter/gather list an IO goes through on the
way down the stack.

By my count, it's as many as 4. Say you're doing aio:

 * kiocb page vector
 * the direct io code has its own page vector
 * struct bio/bvec
 * struct scatterlist

The immutable bvec stuff helps because it helps with lifting the
restrictions on bio sizes; the dio code can just _start_ by allocating a
bio that holds however many pages it needs, instead of creating its own
page vector then looping over that.

I think unifying bvecs and struct scatterlist is feasible too, and this
should help with that. Haven't thought about that one much though.

HOW DO WE GET TO THIS MAGICAL PLACE?

Pain, misery, and the tears of innocent children.

Well, there's a nontrivial amount of nontrivial restructing to do in a
a fair driver coder, but it looks doable and a lot can be done
incrementally.

THE PLAN:

Step 2: Add a field to struct bio - bi_bvec_done.

This is the number of bytes done in the current bvec; starts at 0 and
when bi_bvec_done == bv_len, set it to 0 and increment bi_idx.

That's really the gist of it, but getting to the point where that can be
done without breaking any existing code is going to take awhile.

Step 1: Introduce a bio iterator

struct bvec_iter {
	sector_t		bi_sector;	/* device address in 512 byte
						   sectors */
	unsigned int		bi_size;	/* residual I/O count */

	unsigned int		bi_idx;		/* current index into bvl_vec */
};

To this gets add bi_bvec_done later.

Initially, I just shoved those fields from struct bio into that struct,
and added a bi_iter field to struct bio - that was just a mass renaming.

This struct bvec_iter is more than just shuffling things around though,
it becomes the only way for driver code to do anything with the bvecs.

IMPLICATIONS:

The big one is that it is no longer possible for code to just poke into
bi_io_vec, without going through the abstractions. You can't use bv_len
or bv_offset on the first bvec without consulting bi_bvec_done, but you
also can't use bv_len on the _last_ bvec.

The reason for this is non aligned splits; if we split we can't modify
the bvec so the last bvec in the first split is going to have a bv_len
that has more bytes than belong to the bio.

So the only way to use the bvec is to iterate from the start,
decrementing bi_size as you go along - bi_size, _not_ bi_vcnt is what
tells you when you're done.

But with the bvec_iter we can construct a new bio_for_each_segment()
that works pretty much just like the old one:

#define __bio_for_each_segment(bvl, bio, iter, start)			\
	for (iter = start, bvl = bio_iovec_iter(bio, iter);		\
	     bvl.bv_len = min(bvl.bv_len, iter.bi_size),		\
		iter.bi_size;						\
	     iter.bi_size -= bvl.bv_len, iter.bi_idx++,			\
		bvl = *__bio_iovec_iter(bio, iter))

#define bio_for_each_segment(bvl, bio, iter)				\
	__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)

I'm not including the definition of bio_iovec_iter, but it just returns
a bio_vec (a literal struct, not a pointer) that represents the start of
the bvec array, after taking into account bi_bvec_done.

So from the caller's perspective, they now just use a struct bvec to
iterate, not a pointer - and they use a struct bvec_iter instead of an
integer. For most callers it's a trivial conversion (I've already
converted all uses in my tree).

The non trivial conversions are basically any driver code that looks
into bi_io_vec directly. Besides bi_bvec_done, arbitrary splitting means
that when a driver gets a bio bi_idx may not be 0 - that is a major
concern, it turns out; because of that it just isn't reasonable or safe
for driver code to munge about in bi_io_vec directly.

Everything's got to be converted over to standard abstractions, which is
definitely going to require new utility code in fs/bio.c for various
types of cloning and whatnot (really to complement what's already
there).

So far, it seems the best way to find all the nontrivial stuff to
convert is to just grep around for bi_idx:

$ git grep bi_idx|wc -l
80

Not that bad... almost exactly half in drivers/md.

WHAT ABOUT THE IMMUTABLE BVEC ARRAY ITSELF?

Well, it turns out when we've got all this done, there actually doesn't
seem to be much code that modifies bi_idx or the bvec array itself -
that stuff is pretty well centralized/abstracted.

In practice it's mostly done from blk_update_request()/req_bio_endio()
in blk-core.c, which just does the obvious/simple thing as far as
consistently incrementing/decrementing all the various fields.

I abstracted this out into a new function, bio_advance() - this nicely
simplified blk_update_request(), and bio_advance() is going to be useful
elsewhere.

bio_advance() just takes a bio and some number of sectors to increment
the iterator by - aside from bi_bvec_done and the fields in the bvec
array, bi_size/bi_sector mean exactly what they used to and a bio is
complete when bi_size == 0.

This is really awesome for the new splitting code. That is the coolest
part of all this, if you ask me:

struct bio *split = bio_clone_bioset(bio, gfp, bio_set);
split->bi_size = sectors << 9;
bio_advance(bio, sectors);

Except 2 lines of code for the bio integrity stuff, that is the _entire_
new implementation of arbitrary bio splitting, and I'm going to get rid
of the extra bits for bio integrity too.

So anyways, I'm making excellent progress but I doubt I'm going to be
able to finish 100% of this on my own. I can do everything in the core
block layer code - including the stuff for merging/counting segments,
which is nasty - but for driver code I think I'm going to need help from
people working in the relevant areas to finish this stuff off.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/