netdev - Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing

Date:	Tue, 18 Jan 2011 15:40:39 +0300
From:	"Oleg V. Ukhno" <olegu@...dex-team.ru>
To:	John Fastabend <john.r.fastabend@...el.com>
CC:	Jay Vosburgh <fubar@...ibm.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for
 single TCP session balancing

On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> 	Can somebody (John?) more knowledgeable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If there
> is a failure farther away than your immediate link, you're going to break
> completely? Your bonding hash will continue to round robin the iscsi
> packets and half of them will get dropped on the floor. dm-multipath handles
> this reasonably gracefully. Also in this bonding environment you seem to
> be very sensitive to RTT times on the network. Maybe not bad outright but
> I wouldn't consider this robust either.

John, I agree that this bonding mode should be used only in a fairly
limited number of situations. As for a failure farther away than the
immediate link: every bonding mode suffers from the same problem in
that case. Bonding detects only L2 failures; everything else is handled
by upper-layer mechanisms. Almost all bonding modes also depend on
equal RTT across the slaves. And there is already a similar load
balancing mode, balance-alb. What I did is approximately the same, but
for the 802.3ad bonding mode, and it provides "better" (more equal and
unconditional, at layer 2) load striping for tx and _rx_.

I think I shouldn't have mentioned the particular use case of this
patch. When I wrote it I tried to make a more general solution; my goal
was to "make equal or near-equal load striping for TX and (most
important part) RX within a single ethernet (layer 2) domain for TCP
transmission". This bonding mode just introduces the ability to stripe
rx and tx load for a single TCP connection between hosts inside one
ethernet segment. iSCSI is just an example. It is possible to stripe
load between a Linux-based router and a Linux-based web/ftp/etc. server
in the same manner. I think this feature will be useful in a number of
network configurations.
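
To illustrate the difference for a single TCP connection, here is a
minimal Python sketch. It is not the patch code: the hash is a
simplified layer2-style policy (XOR of the last MAC bytes modulo the
slave count), the MAC addresses are made up, and all function names are
hypothetical. Per-packet striping also relies on the slaves having
similar RTT, as noted above.

```python
from collections import Counter

def layer2_slave(src_mac: bytes, dst_mac: bytes, n_slaves: int) -> int:
    # Simplified layer2-style policy (an assumption, not the kernel code):
    # XOR the last byte of each MAC, then take it modulo the slave count.
    return (src_mac[5] ^ dst_mac[5]) % n_slaves

def spread_by_hash(n_packets: int, src_mac: bytes, dst_mac: bytes,
                   n_slaves: int) -> Counter:
    # Every packet of one flow carries the same MAC pair, so the hash
    # result never changes and one slave carries the whole flow.
    return Counter(layer2_slave(src_mac, dst_mac, n_slaves)
                   for _ in range(n_packets))

def spread_round_robin(n_packets: int, n_slaves: int) -> Counter:
    # Per-packet round-robin: packet i goes out on slave i mod n_slaves.
    return Counter(i % n_slaves for i in range(n_packets))

src = bytes.fromhex("001122334455")  # made-up addresses for illustration
dst = bytes.fromhex("0011223344aa")

print("hash policy :", dict(spread_by_hash(1000, src, dst, 4)))
print("round-robin :", dict(spread_round_robin(1000, 4)))
```

With the hash policy all 1000 packets of the flow land on one slave;
with round-robin each of the 4 slaves carries 250.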

Also, I looked into the net-next code. It seems to me that the patch
can be adapted to the net-next bonding code without any difficulty, and
the hashing function change poses no problem there.

What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later, patched bonding).

From my personal experience I can only say that most iSCSI failures are
caused by link failures. I would also never send any significant iSCSI
traffic via a router, since the router would be a bottleneck.
So in my case iSCSI traffic flows within one ethernet domain, and on a
link failure the bonding driver simply fails one slave, instead of
checking and failing hundreds of paths (as mpath does). The first case
consumes significantly less CPU, network and time (when using the
default mpath checker, readsector0).
Mpath works well for me when I use it to "merge" drbd mirrors from
different hosts, but for simple load striping within a single L2
network switch between 2 .. 16 hosts it is overkill (particularly
when it comes to maintaining human-readable device naming) :).

John, what is your opinion on such a load balancing method in general,
without reference to particular use cases?


>
> You could tweak your scsi timeout values and fail_fast values, set the io
> retry to 0 to cause the fail over to occur faster. I suspect you already
> did this and still it is too slow? Maybe adding a checker in multipathd to
> listen for link events would be fast enough. The checker could then fail
> the path immediately.
>
> I'll try to address your comments from the other thread here. In general I
> wonder if it would be better to solve the problems in dm-multipath rather than
> add another bonding mode?
Of course I did this, but mpath is fine when the device count is below
30-40 devices with two paths; 150-200 devices with 2+ paths can make
life far more interesting :)
>
> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize
> latency)
>
> The dm-multipath layer is adding latency? How much? If this is really true
> maybe it's best to address the real issue here and not avoid it by
> using the bonding layer.

I do not remember the exact numbers now, but switching one of my
databases to bonding about 2 years ago increased read throughput for
the entire db from 15-20 Tb/day to approximately 30-35 Tb/day (4 iSCSI
initiators and 8 iSCSI targets, 4 ethernet links for iSCSI on each
host, all plugged into one switch) because of "full" bandwidth use.
Bonding also greatly simplifies network and application setup
(compared to mpath).
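
To put the daily figures into per-second terms, here is a minimal
conversion sketch. It assumes "Tb" means terabytes (10^12 bytes) and
that the load is spread evenly over 24 hours; both are assumptions, and
the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def tb_per_day_to_gbit_per_s(tb_per_day: float) -> float:
    # Assumption: 1 Tb = 10**12 bytes (terabytes), 1 Gbit = 10**9 bits.
    bits_per_day = tb_per_day * 10**12 * 8
    return bits_per_day / SECONDS_PER_DAY / 10**9

for tb in (15, 20, 30, 35):
    rate = tb_per_day_to_gbit_per_s(tb)
    print(f"{tb} Tb/day ~ {rate:.2f} Gbit/s on average")
```

These are whole-database averages under the stated assumptions, not
per-host or peak rates.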

>
> OVU - it handles any link failures badly, because of its command queue
> limitation (all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again I
> know when we were looking at this sometime ago there was some talk about
> improving this behavior. I need to take some time to go back through the
> error recovery stuff to remember how this works.
>
> OVU - it performs very badly when there are many devices and many paths (I
> was unable to utilize more than 2Gbps of 4 even with 100 disks with 4 paths
> per disk)

Well, I think that behavior can be explained as follows:
when balancing by the number of I/Os per path (rr_min_io) with a huge
number of devices, mpath does its load balancing per device, and it is
not possible to guarantee equal use of all devices. So there will be
an imbalance across the network interfaces (mpath is unaware of their
existence), and the imbalance is likely to grow as the number of
devices grows. Also, counting I/Os for many devices and paths
consumes some CPU resources and can cause excessive context switches.
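
The effect can be sketched with a toy model in Python. This is not how
dm-multipath is implemented; it only assumes that each device rotates
over its own paths independently, that path k of every device uses
NIC k, and that every device starts at path 0. The function name and
the numbers (100 devices, 4 paths, rr_min_io of 1000) are illustrative.

```python
def nic_load(io_per_device, n_paths, rr_min_io):
    """Toy model: total I/Os per NIC when each device balances on its own."""
    load = [0] * n_paths
    for ios in io_per_device:
        path = 0  # assumption: every device starts its rotation at path 0
        remaining = ios
        while remaining > 0:
            # Send up to rr_min_io I/Os down the current path, then rotate.
            batch = min(rr_min_io, remaining)
            load[path] += batch
            remaining -= batch
            path = (path + 1) % n_paths
    return load

# 100 devices, each issuing a short burst of 1500 I/Os.
print(nic_load([1500] * 100, n_paths=4, rr_min_io=1000))

# The same devices with long bursts that complete whole rotations.
print(nic_load([8000] * 100, n_paths=4, rr_min_io=1000))
```

In the first case two NICs carry all of the load and two stay idle,
although every device balanced "correctly" from its own point of view;
only in the second case does the per-device rotation even out over the
NICs.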

>
> Hmm well that seems like something is broken. I'll try this setup when
> I get some time in the next few days. This really shouldn't be the case;
> dm-multipath should not add a bunch of extra latency or affect throughput
> significantly. By the way what are you seeing without mpio?

And one more observation from my 2-year-old tests: reading a device
with dd (RHEL 5 update 1 kernel, ramdisk via iSCSI via loopback) as an
mpath device with a single path ran at approximately 120-150mb/s, and
the same test on a non-mpath device at 800-900mb/s. Of this I am quite
sure; it was a kind of revelation to me at the time.

>
> Thanks,
> John
>


-- 
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.
