Date:	Thu, 02 Apr 2009 17:59:39 +0900
From:	Tejun Heo <tj@...nel.org>
To:	Boaz Harrosh <bharrosh@...asas.com>
CC:	axboe@...nel.dk, linux-kernel@...r.kernel.org,
	fujita.tomonori@....ntt.co.jp
Subject: Re: [PATCH 08/17] bio: reimplement bio_copy_user_iov()

Hello, Boaz.

Boaz Harrosh wrote:
>> Yeah, we can switch to dealing with bvecs instead of sgls.  For kernel
>> mappings, it wouldn't make any difference.  For user single mappings,
>> it'll remove an intermediate step.  For user sg mappings, if we make
>> gup use sgl, it won't make much difference.
> 
> "gup use sgl" was just a stupid suggestion to try eliminate one more
> allocation. Realy Realy stupid.
> 
> For gup there is a much better, safer way: just allocate a constant-size
> array on the stack, say 64 pointers, and add an inner loop over those 64
> pages. The extra loop is much cheaper than an allocation and is one less
> point of failure.

Great.
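
Yeah, something along these lines, I suppose (a rough, not even
compile-tested sketch; the function name is made up and error
unwinding is omitted):

/*
 * Chunked gup: a fixed-size array of page pointers on the stack,
 * filled GUP_CHUNK pages at a time, so nothing is allocated in
 * proportion to the request size.  The caller is assumed to have
 * set up the bio (bi_bdev etc.) already.
 */
#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/sched.h>

#define GUP_CHUNK	64

static int map_user_chunked(struct bio *bio, unsigned long uaddr,
                            unsigned long nr_pages, int write)
{
        struct page *pages[GUP_CHUNK];  /* constant size, on the stack */
        unsigned long done = 0;

        while (done < nr_pages) {
                int chunk = min_t(unsigned long, nr_pages - done, GUP_CHUNK);
                int got, i;

                down_read(&current->mm->mmap_sem);
                got = get_user_pages(current, current->mm,
                                     uaddr + done * PAGE_SIZE, chunk,
                                     write, 0, pages, NULL);
                up_read(&current->mm->mmap_sem);
                if (got <= 0)
                        return got ? got : -EFAULT;

                for (i = 0; i < got; i++)
                        if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                                return -EINVAL; /* bio full; sketch gives up */

                done += got;
        }
        return 0;
}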

> So now that we got gup out of the way, will you go directly to bvecs?

As I said before, I don't think exposing bio at the driver interface
is a good idea, copy or not.  For the internal implementation, if
there's enough benefit, yeah.

>> Frankly, for copy paths, I don't think all this matters at all.  If
>> each request is small, other overheads will dominate.  For large
>> requests, copying to and from sgl isn't something worth worrying
>> about.
> 
> It is. Again, I tested that. Each new allocation on the order of
> the number of pages in a request adds 3-5% on RAM I/O.

Sure, if you benchmark against RAM, anything will show up.  Do you
have something more realistic?

> And it adds a point of failure. The less the better.

That is not the only measure.  Things are always traded off using
multiple evaluation metrics.  The same goes for the _NO_ allocation
thing.

>> For mapping paths, if we change gup to use sgl (we really can't make
>> it swallow bvec), I don't think it will make a whole lot of difference
>> one way or the other.  It might end up using slightly more cachelines,
>> but that will be about it.
> 
> gup using sgl is out of the way; that was a stupid idea to try to
> eliminate one more dynamic allocation.

Yeah, I do agree gup is much better off with a pages array.

>> I do agree using bvec as the internal page carrier would be nicer (sans
>> having to implement stuff manually) but I'm not convinced it'll be
>> worth the added code.
> 
> We can reach a point where there are no intermediate dynamic allocations
> and no points of failure, except for the allocation of the final bio+biovec.

There is no need to obsess about no point of failure.  Obsessing about
single evaluation metric is a nice and fast way toward a mess.

> And that "implement stuff manually" will also enable internal bio
> code cleanups since now biovec is that much smarter.
>
> <snip> 
>>> And I have one more question.  Are you sure kmalloc of
>>> bigger-than-page buffers is safe?
>> It has a higher chance of failure but is definitely safe.
> 
> Much, much higher, not just 100% higher. And it will kill the use
> of the block layer on nommu systems.

I didn't know nommu would kill high-order allocations, or were you
talking about vmalloc?

>>> As I understood it, that tries to allocate physically contiguous
>>> pages, which degrades as time passes, and last time I tried this with
>>> a kmem_cache (due to a bug) it crashed the kernel randomly after 2
>>> minutes of use.
>> Then, that's a bug.  
> 
> Fixing the bug will not help; the allocation will fail and the IO will
> not get through.

Under enough memory pressure, non-fs IOs are gonna fail somewhere
along the line.  If you don't like that, use mempool-backed allocation
or preallocated data structures.  Also, I believe it is generally
considered better to fail IO than to oops.  :-P
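
For example, something like this (illustrative only; the cache and
pool names are made up): a small mempool reserve guarantees forward
progress under memory pressure instead of failing the allocation
outright.

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *copy_state_cache;
static mempool_t *copy_state_pool;

static int __init copy_state_pool_init(void)
{
        copy_state_cache = kmem_cache_create("copy_state", 256, 0, 0, NULL);
        if (!copy_state_cache)
                return -ENOMEM;
        /* keep two elements in reserve; mempool_alloc() with GFP_NOIO
         * then waits for a free element rather than failing */
        copy_state_pool = mempool_create_slab_pool(2, copy_state_cache);
        if (!copy_state_pool) {
                kmem_cache_destroy(copy_state_cache);
                return -ENOMEM;
        }
        return 0;
}

static void *copy_state_get(void)
{
        return mempool_alloc(copy_state_pool, GFP_NOIO);
}

static void copy_state_put(void *p)
{
        mempool_free(p, copy_state_pool);
}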

>> Yeah, high order allocations fail more easily as time passes.  But
>> low order (1 maybe 2) allocations aren't too bad.
> 
> No, we are not talking about 1 or 2. Today, since Jens added scatterlist
> chaining, we can have lots and lots of them chained together. At the time,
> Jens said that bio has had the chaining option for a long time; only
> scatterlists limit the size of the request.

Alright, let's then chain bios, but let's please do it inside bio
proper.
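
E.g. something like the following, hidden behind the bio interface so
callers don't have to care (hand-waving sketch only; the helper is
made up and cleanup of a partially built chain is omitted):

#include <linux/bio.h>

static struct bio *make_bio_chain(struct block_device *bdev, sector_t sector,
                                  struct page **pages, int nr_pages)
{
        struct bio *head = NULL, **tail = &head;
        int i = 0;

        while (i < nr_pages) {
                int chunk = min_t(int, nr_pages - i, BIO_MAX_PAGES);
                struct bio *bio = bio_kmalloc(GFP_KERNEL, chunk);
                int added = 0;

                if (!bio)
                        return NULL;            /* leaks the partial chain */

                bio->bi_bdev = bdev;
                bio->bi_sector = sector + ((sector_t)i << (PAGE_SHIFT - 9));

                while (added < chunk &&
                       bio_add_page(bio, pages[i + added], PAGE_SIZE, 0))
                        added++;
                if (!added) {
                        bio_put(bio);
                        return NULL;            /* queue limits hit */
                }
                i += added;

                /* link each per-chunk bio through bi_next */
                *tail = bio;
                tail = &bio->bi_next;
        }
        return head;
}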

>> If it matters we can make bio_kmalloc() use vmalloc() for bvecs if it
>> goes over PAGE_SIZE but again given that nobody reported the
> 
> No, that is a regression; vmalloc is an order of magnitude slower than
> just plain BIO chaining.

It all depends on how frequent those multiple allocations will be.

>> spurious GFP_DMA in bounce_gfp, which triggers a frenzied OOM-killing
>> spree for large SG_IOs, I don't think this matters all that much.
>>
> 
> This BUG was only for HW, i.e. ata. But stuff like iscsi, FB, and sas
> did fine.

The bug was for all controllers which couldn't do 64bit DMA on 64bit
machines.

> The fact of the matter is that people, me included, have been running
> systems with very large requests for a while now, large like 4096 pages,
> which is 16 chained bios. Do you want to allocate that with vmalloc?

The patchset was written the way it's written because I couldn't see
any pressing performance requirement from the existing users.
Optimizing for SG_IO against memory is simply not a good idea.
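
(For scale, assuming a 16-byte struct bio_vec and BIO_MAX_PAGES of 256:
a 4096-page request needs a 4096 * 16 byte = 64KB bvec array if done as
one flat allocation, i.e. an order-4 page allocation, versus 16 bios of
256 pages each if chained.)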

If you have in-kernel users which often require large transfers, sure,
chaining bios would be better.  Still, let's do it _inside_ bio.  Would
your use case be happy with something like blk_rq_map_kern_bvec(rq,
bvec) which can append to rq?  If we convert internal implementation
over to bvec, blk_rq_map_kern_sg() can be sg->bvec converting wrapper
around blk_rq_map_kern_bvec().
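
To make that concrete, this is only the rough shape I have in mind,
not an existing API:

/* appends the given bvec segments to @rq */
int blk_rq_map_kern_bvec(struct request_queue *q, struct request *rq,
                         struct bio_vec *bvec, int nr_segs, gfp_t gfp_mask);

/* blk_rq_map_kern_sg() then becomes a thin converting wrapper: walk the
 * sgl with for_each_sg(), copy sg_page()/length/offset of each entry
 * into a bio_vec array, and call blk_rq_map_kern_bvec() on it. */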

> I love your cleanups, and your courage, which I don't have.

Thanks a lot but it's not like the code and Jens are scary monsters I
need to slay, so I don't think I'm being particularly courageous.  :-)

> I want to help in any way I can. If you need anything, just tell me
> what, and I'll be glad to try or test it. This is fun; I like working
> on and thinking about these things.

Great, your reviews and comments are very helpful, so please keep them
coming.  :-)

Jens, Tomo, what do you guys think?

Thanks.

-- 
tejun
