lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <200708121637.00144.phillips@phunq.net>
Date:	Sun, 12 Aug 2007 16:36:59 -0700
From:	Daniel Phillips <phillips@...nq.net>
To:	Jens Axboe <jens.axboe@...cle.com>
Cc:	Evgeniy Polyakov <johnpol@....mipt.ru>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: Distributed storage.

On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> I don't like structure bloat, but I do like nice design. Overloading
> is a necessary evil sometimes, though. Even today, there isn't enough
> room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> so that concern can be scratched. If you read bio.h, that much is
> obvious.

Sixteen bits in bi_rw are consumed by queue priority.  Is there a reason 
this lives in struct bio instead of struct request?

> If you check up on the iommu virtual merging, you'll understand the
> front and back size members. They may smell dubious to you, but
> please take the time to understand why it looks the way it does.

Virtual merging is only needed at the physical device, so why do these 
fields live in struct bio instead of struct request?

> Changing the number of bvecs is integral to how bio buildup current
> works.

Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can 
derive efficiently from BIO_POOL_IDX() provided the bio was allocated 
in the standard way.  This leaves a little bit of clean up to do for 
bios not allocated from a standard pool.

Incidentally, why does the bvl need to be memset to zero on allocation?  
bi_vcnt already tells you which bvecs are valid and the only field in a 
bvec that can reasonably default to zero is the offset, which ought to 
be set set every time a bvec is initialized anyway.

> > bi_destructor could be combined.  I don't see a lot of users of
> > bi_idx,
>
> bi_idx is integral to partial io completions.

Struct request has a remaining submission sector count so what does 
bi_idx do that is different?

> > that looks like a soft target.  See what happened to struct page
> > when a couple of folks got serious about attacking it, some really
> > deep hacks were done to pare off a few bytes here and there.  But
> > struct bio as a space waster is not nearly in the same ballpark.
>
> So show some concrete patches and examples, hand waving and
> assumptions is just a waste of everyones time.

Average struct bio memory footprint ranks near the bottom of the list of 
things that suck most about Linux storage.  At idle I see 8K in use 
(reserves); during updatedb it spikes occasionally to 50K; under a 
heavy  load generated by ddsnap on a storage box it sometimes goes to 
100K with bio throttling in place.  Really not moving the needle.

On the other hand, vm writeout deadlock ranks smack dab at the top of 
the list, so that is where the patching effort must go for the 
forseeable future.  Without bio throttling, the ddsnap load can go to 
24 MB for struct bio alone.  That definitely moves the needle.  in 
short, we save 3,200 times more memory by putting decent throttling in 
place than by saving an int in struct bio.

That said, I did a little analysis to get an idea of where the soft 
targets are in struct bio, and to get to know the bio layer a little 
better.  Maybe these few hints will get somebody interested enough to 
look further.

> > It would be interesting to see if bi_bdev could be made read only.
> > Generally, each stage in the block device stack knows what the next
> > stage is going to be, so why do we have to write that in the bio? 
> > For error reporting from interrupt context?  Anyway, if Evgeniy
> > wants to do the patch, I will happily unload the task of convincing
> > you that random fields are/are not needed in struct bio :-)
>
> It's a trade off, otherwise you'd have to pass the block device
> around a lot.

Which costs very little, probably less than trashing an extra field's 
worth of cache.

> And it's, again, a design issue. A bio contains 
> destination information, that means device/offset/size information.
> I'm all for shaving structure bytes where it matters, but not for the
> sake of sacrificing code stability or design. I consider struct bio
> quite lean and have worked hard to keep it that way. In fact, iirc,
> the only addition to struct bio since 2001 is the iommu front/back
> size members. And I resisted those for quite a while.

You did not comment on the one about putting the bio destructor in 
the ->endio handler, which looks dead simple.  The majority of cases 
just use the default endio handler and the default destructor.  Of the 
remaining cases, where a specialized destructor is needed, typically a 
specialized endio handler is too, so combining is free.  There are few 
if any cases where a new specialized endio handler would need to be 
written.

As far as code stability goes, current kernels are horribly unstable in 
a variety of contexts because of memory deadlock and slowdowns related 
to the attempt to fix the problem via dirty memory limits.  Accurate 
throttling of bio traffic is one of the two key requirements to fix 
this instability, the other other is accurate writeout path reserve 
management, which is only partially addressed by BIO_POOL.

Nice to see you jumping in Jens.  Now it is over to the other side of 
the thread where Evgeniy has posted a patch that a) grants your wish to 
add no new field in struct bio and b) does not fix the problem.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ