Date:	Mon, 8 Mar 2010 04:01:18 -0500 (EST)
From:	Mikulas Patocka <mpatocka@...hat.com>
To:	Jens Axboe <jens.axboe@...cle.com>
cc:	device-mapper development <dm-devel@...hat.com>,
	Dmitry Monakhov <dmonakhov@...nvz.org>,
	linux-kernel@...r.kernel.org, Mike Snitzer <snitzer@...hat.com>
Subject: Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks

> > The point of building a big bio and splitting it is code size reduction.
> > 
> > You must realize that you need to split the request anyway. If the caller 
> > needs to read/write M sectors and the device can process at most N 
> > sectors, where N<M, then the i/o must be split. You can't get rid of that 
> > split.
> 
> If we consider the basic (and most used) file system IO path, then
> those 'M' sectors will usually be submitted in units that are smaller
> than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
> (those get big), only to split it up again shortly.

I agree that the majority of IOs are page cache reads/writes, and they
shouldn't be split.

> The point is not to build it too big to begin with, and keep the
> splitting as the slow path. That's how it was originally designed, and
> it'll surely stay that way. Doing the split handling in generic code is
> a good idea, it definitely should be. But I'm never going to make
> splitting a normal part of IO operations just because we allow
> arbitrarily large bios, sorry but that's insane.
>
> > There is just one case where it is not desirable to create larger 
> > requests than the physical transfer size because of performance --- that 
> > is page cache readahead. That could use the current approach of querying 
> > the queue for limits and not stuffing more data than the device can 
> > handle.
> > 
> > In all the other (non-readahead) cases, the caller is waiting for all the
> > data anyway, so there is no performance problem with creating larger
> > requests and splitting them.
> 
> But this I still think is crazy, sorry.

It doesn't look crazy to me:

Case 1: the caller constructs a request to dm-io. Dm-io allocates its 
internal structure, splits the request into several bios, submits them, 
waits for them, and calls the completion routine when they all finish.

Case 2: the caller constructs a big bio (without regard for device 
limits) and submits it to the block layer. The block layer splits the bio 
into several smaller bios (it doesn't even have to allocate the vector if 
it splits at page boundaries), submits the smaller bios to the device, and 
signals the completion of the big bio when they all finish.

Now tell me, why do you think that "Case 1" is better than "Case 2"? If 
the bio gets split, you get exactly the same number of allocations in both 
cases. If the bio isn't split, "Case 2" is even better (it saves dm-io 
overhead).
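
To make the comparison concrete, here is a minimal sketch of what "Case 2"
would look like from the caller's side. The helper name submit_big_bio() is
made up for this example, and the "generic code splits it" part is the
proposal under discussion, not current behaviour --- today bio_add_page()
consults the queue limits and merge_bvec_fn and can refuse pages, which is
exactly what the caller would no longer have to worry about:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/*
 * Sketch only: build one big bio covering all the pages and submit it,
 * without looking at the device's transfer limits.  Under "Case 2" the
 * block layer would split this bio at page boundaries if the device
 * cannot take it in one piece, and end_io would still be called exactly
 * once, when the whole thing completes.
 */
static void submit_big_bio(struct block_device *bdev, sector_t sector,
			   struct page **pages, unsigned int nr_pages,
			   int rw, bio_end_io_t *end_io, void *private)
{
	struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
	unsigned int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;
	bio->bi_private = private;
	for (i = 0; i < nr_pages; i++)
		/* today this can return 0 when queue limits are hit;
		 * under the proposal the limits would be enforced by
		 * splitting at submit time instead */
		bio_add_page(bio, pages[i], PAGE_SIZE, 0);
	submit_bio(rw, bio);
}

Compare that with what every dm-io-style caller has to do now: allocate a
per-request structure, walk the queue limits, and carve the transfer into
several bios itself.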

> > > Alasdair had an idea for doing a map/unmap like interface instead, which
> > > I think is a lot better. I don't think he ever coded it up, though.
> > > Perhaps talk to him, if you want to improve our situation in this area.
> > 
> > It would increase code size, not reduce it.
> 
> What kind of answer is that?

If you increase the code size of the interface, you make the interface more 
complex, and any further changes become harder. That kills the software 
slowly.

> > BTW, see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock 
> > if the devices depend on each other. You need to use a per-queue mempool.
> 
> Only if you need to split the same original bio twice on the same IO
> path.

That can happen if you use stacked MD devices with different small stripe 
sizes. But it's not common, because people usually set the stripe size to a 
multiple of the page size.
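
For illustration, a minimal sketch of the per-queue mempool idea: the
bio_pair reserve in fs/bio.c is global, so if two stacked devices on the
same IO path both need to split and the shared pool is exhausted, the lower
device can wait forever for an element the upper device is holding. Giving
each splitting device its own small reserve avoids that. The split_ctx
structure and function names are invented for this example; only
mempool_create_kmalloc_pool(), mempool_destroy() and struct bio_pair are
existing interfaces:

#include <linux/bio.h>
#include <linux/mempool.h>

/* per-device state owned by whoever performs the splits on this queue */
struct split_ctx {
	mempool_t *split_pool;	/* private reserve of struct bio_pair */
};

static int split_ctx_init(struct split_ctx *ctx)
{
	/*
	 * A couple of reserved elements per device guarantee forward
	 * progress for an in-flight split on this queue, independently
	 * of what any other (stacked) device does with its own pool.
	 */
	ctx->split_pool = mempool_create_kmalloc_pool(2,
						sizeof(struct bio_pair));
	return ctx->split_pool ? 0 : -ENOMEM;
}

static void split_ctx_exit(struct split_ctx *ctx)
{
	mempool_destroy(ctx->split_pool);
}

bio_split() as it stands allocates from the shared bio_split_pool, so it
would need some way of being pointed at a private pool like this before the
idea could be used.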

Mikulas

> -- 
> Jens Axboe
> 