netdev - Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4D35C2D7.6090008@intel.com>
Date:	Tue, 18 Jan 2011 08:41:59 -0800
From:	John Fastabend <john.r.fastabend@...el.com>
To:	"Oleg V. Ukhno" <olegu@...dex-team.ru>
CC:	Jay Vosburgh <fubar@...ibm.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for
 single TCP session balancing

On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> 	Can somebody (John?) more knowledgable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away then your immediate link your going to break
>> completely? Your bonding hash will continue to round robin the iscsi
>> packets and half them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not bad out right but
>> I wouldn't consider this robust either.
> 
> John, I agree - this bonding mode should be used in quite limited number 
> of situations, but as for failure farther away then immediate link - 
> every bonding mode will suffer same problems in this case - bonding 
> detects only L2 failures, other is done by upper-layer mechanisms. And 
> almost all bonding modes depend on equal RTT on slaves. And, there is 
> already similar load balancing mode - balance-alb - what I did is 
> approximately the same, but for 802.3ad bonding mode and provides 
> "better"(more equal and non-conditional layser2) load striping for tx 
> and _rx_ .
> 
> I think I shouldn't mention the particular use case of this patch - when 
> I wrote it I tried to make a more general solution - my goal was "make 
> equal or near-equal load striping for TX and (most important part) RX 
> within single ethernet(layer 2) domain for  TCP transmission". This 
> bonding mode  just introduces ability to stripe rx and tx load for 
> single TCP connection between hosts inside of one ethernet segment. 
> iSCSI is just an example. It is possible to stripe load between a 
> linux-based router and linux-based web/ftp/etc server as well in the 
> same manner. I think this feature will be useful in some number of 
> network configurations.
> 
>   Also, I looked into net-next code - it seems to me that it can be 
> implemented(adapted to net-next bonding code) without any difficulties 
> and hashing function change makes no problem here.
> 
> What I've written below is just my personal experience and opinion after 
> 5 years of using Oracle +iSCSI +mpath(later - patched bonding).
> 
>  From my personal experience I just can say that most iSCSI failures are 
> caused by link failures, and also I would never send any significant 
> iSCSI traffic via router - router would be a bottleneck in this case.
> So, in my case iSCSI traffic flows within one ethernet domain and in 
> case of link failure bonding driver simply fails one slave(in case of 
> bonding) , instead of checking and failing hundreths of paths (in case 
> of mpath) and first case significantly less cpu, net and time 
> consuming(if using default mpath checker - readsector0).
> Mpath is good for me, when I use it to "merge" drbd mirrors from 
> different hosts, but for just doing simple load striping within single 
> L2 network switch  between 2 .. 16 hosts is some overkill(particularly 
> in maintaining human-readable device naming) :).
> 
> John, what is you opinion on such load balancing method in general, 
> without referring to particular use cases?
> 

This seems reasonable to me, but I'll defer to Jay on this. As long as the
limitations are documented and it looks like they are this may be fine.

Mostly I was interested to know what led you down this path and why MPIO
was not working as at least I expected it should. When I get some time I'll
see if we can address at least some of these issues. Even so it seems like
this bonding mode may still be useful for some use cases perhaps even none
storage use cases.

> 
>>
>> You could tweak your scsi timeout values and fail_fast values, set the io
>> retry to 0 to cause the fail over to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when device quantity is below 
> 30-40 devices with two paths, 150-200 devices with 2+ paths can make 
> life far more interesting :)

OK admittedly this gets ugly fast.

>>
>> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true
>> maybe its best to the address the real issue here and not avoid it by
>> using the bonding layer.
> 
> I do not remember exact number now, but switching one of my databases , 
> about 2 years ago to bonding increased read throughput for the entire db 
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and 
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged in 
> one switch) because of "full" bandwidth use. Also, bonding usage 
> simplifies network and application setup greatly(compared to mpath)
> 
>>
>> OVU - it handles any link failures bad, because of it's command queue
>> limitation(all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true but only link failures with the immediate peer are handled
>> with a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again I
>> know when we were looking at this sometime ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very bad when there are many devices and maтy paths(I was
>> unable to utilize more that 2Gbps of 4 even with 100 disks with 4 paths
>> per each disk)
> 
> Well, I think that behavior can be explained in such a way:
> when balancing by I/Os number per path(rr_min_io), and there is a huge 
> number of devices, mpath is doing load-balaning per-device, and it is 
> not possible to quarantee equal device use for all devices, so there 
> will be imbalance over network interface(mpath is unaware of it's 
> existence, etc), and it is likely it becomes more imbalanced when there 
> are many devices. Also, counting I/O's for many devices and paths 
> consumes some CPU resources and also can cause excessive context switches.
> 

hmm I'll get something setup here and see if this is the case.

>>
>> Hmm well that seems like something is broken. I'll try this setup when
>> I get some time next few days. This really shouldn't be the case dm-multipath
>> should not add a bunch of extra latency or effect throughput significantly.
>> By the way what are you seeing without mpio?
> 
> And one more obsevation from my 2-years old tests - reading device(using 
> dd) (rhel 5 update 1 kernel, ramdisk via ISCSI via loopback ) as mpath 
> device with single path was done at approximately 120-150mb/s, and same 
> test on non-mpath device at 800-900mb/s. Here I am quite sure, it was a 
> kind of revelation to me that time.
> 

Similarly I'll have a look. Thanks for the info.

>>
>> Thanks,
>> John
>>
> 
> 
 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html