lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sat, 20 Sep 2014 12:19:57 +0200
From:	Nikolay Aleksandrov <nikolay@...hat.com>
To:	Mahesh Bandewar <maheshb@...gle.com>
CC:	Jay Vosburgh <j.vosburgh@...il.com>,
	Veaceslav Falico <vfalico@...hat.com>,
	Andy Gospodarek <andy@...yhouse.net>,
	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Eric Dumazet <edumazet@...gle.com>,
	Maciej Zenczykowski <maze@...gle.com>
Subject: Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for
 modes that use xmit_hash

On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@...hat.com> wrote:
>>
>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>> performance advantage. So extending the same logic to all other modes
>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>> Also consolidating this with the earlier TLB change.
>>>>
>>>> The main idea is to build the usable slaves array in the control path
>>>> and use that array for slave selection during xmit operation.
>>>>
>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>
>>>> Mode        TPS-Before   TPS-After
>>>>
>>>> 802.3ad   : 468,694      493,101
>>>> TLB (lb=0): 392,583      392,965
>>>> XOR       : 475,696      484,517
>>>>
>>>> Signed-off-by: Mahesh Bandewar <maheshb@...gle.com>
>>>> ---
>>>> v1:
>>>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>       the slave that need to be removed.
>>>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>       transition gracefully.
>>>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>       failure.
>>>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>       will populate the array even if these parameters are not used.
>>>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>> v2:
>>>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>   (b) Slave link-events now refresh array for all these modes.
>>>>   (c) Moved free-array call from bond_close() to bond_uninit().
>>>> v3:
>>>>   (a) Fixed null pointer dereference.
>>>>   (b) Removed bond->lock lockdep dependency.
>>>> v4:
>>>>   (a) Made to changes to comply with Nikolay's locking changes
>>>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>   (c) Array refresh happens ONLY with RTNL now.
>>>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>
>> <<<snip>>>
>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>      struct bonding *bond = netdev_priv(bond_dev);
>>>>      struct list_head *iter;
>>>>      struct slave *slave;
>>>> +    struct bond_up_slave *arr;
>>>>
>>>>      bond_netpoll_cleanup(bond_dev);
>>>>
>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>              __bond_release_one(bond_dev, slave->dev, true);
>>>>      netdev_info(bond_dev, "Released all slaves\n");
>>>>
>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>> device has been closed so all works should've been stopped) so we might
>> leak memory and access freed memory after all since it'll keep
>> re-scheduling itself until it can acquire rtnl which is after the bond
>> device has been destroyed.
>>
> This should not be a problem. ndo_close (bond_close()) is called
> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
> there so if rearm tries to schedule some work after queue gets
> cancelled, it can't do much and wont harm anything.
> Hence there wont be any arrays built once it's free-ed completely and
> therefore no memory leak. I addded some instrumentation and tried
> following sequence -
> 
> # modprobe bonding mode=4
> # ip link set bond0 up
> # [Add ip]
> # [Add default route]
> # ifenslave bond0 eth0 eth1 eth2 eth3
> ....
> [Run some backgound traffic. I used netperf.]
> 
> # ip link bond0 down
> 
> I did not see anything "bad" happening. Did your trial produced
> something unpleasant?
> 
The test you've done is irrelevant to the situation that I described
because ndo_uninit() is called when the device is being destroyed. Moreover
the case I told you about would require to have an active aggregator and an
inactive one (i.e. so agg selection logic will get called), here is the result:
[  428.916586] bond1 (unregistering): Removing an active aggregator
[  428.916589] Failed to build slave-array.
[  428.916849] bond1 (unregistering): Releasing active interface eth1
[  428.920342] bond1 (unregistering): Released all slaves
[  428.923043] Failed to update slave array from WT
[  428.924098] Failed to update slave array from WT
[  428.925125] Failed to update slave array from WT
[  428.926120] Failed to update slave array from WT
[  428.927096] Failed to update slave array from WT
[  428.928101] Failed to update slave array from WT
[  428.929120] Failed to update slave array from WT
[  428.930086] BUG: unable to handle kernel NULL pointer dereference at
       (null)
[  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
[  428.930946] PGD 0
[  428.931053] Oops: 0000 [#1] SMP
[  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
virtio_pci virtio_ring virtio
[  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
3.17.0-rc4+ #30
[  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
ffffffff81c00000
[  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
__queue_work+0x7b/0x350
[  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
[  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
0000000000000000
[  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
ffff88005a4fbd58
[  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
ffffffff8225abc2
[  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
ffff88005a4fbd58
[  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
00000000000102f0
[  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
knlGS:0000000000000000
[  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
00000000000406f0
[  428.935022] Stack:
[  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
ffff88005a4fbd78
[  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
ffff88005f003e70
[  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
ffffffff8225abc0
[  428.935022] Call Trace:
[  428.935022]  <IRQ>
[  428.935022]
[  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
[  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
[  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
[  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
[  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
[  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
[  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
[  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
[  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
[  428.935022]  <EOI>
[  428.935022]
[  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
[  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
[  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
[  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
[  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
[  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
[  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
[  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
[  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
[  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
[  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
<48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
[  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
[  428.935022]  RSP <ffff88005f003e28>
[  428.935022] CR2: 0000000000000000

This is because it keeps trying to re-schedule even though the interface's
memory has been freed.

While testing this I spotted another issue as well - Failed to build
slave_arr message has been printed too many times because you print it in
3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
check in bond_update_slave_arr) which leads to re-scheduling which also
lead to a deadlock.

>>>> +    arr = rtnl_dereference(bond->slave_arr);
>>>> +    if (arr) {
>>>> +            kfree_rcu(arr, rcu);
>>>> +            RCU_INIT_POINTER(bond->slave_arr, NULL);
>>>> +    }
>>>> +
>>>>      list_del(&bond->bond_list);
>>>>
>>>>      bond_debug_unregister(bond);
>>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>>> index 98dc0d7ad731..4635b175256a 100644
>>>> --- a/drivers/net/bonding/bonding.h
>>>> +++ b/drivers/net/bonding/bonding.h
>>>> @@ -177,6 +177,12 @@ struct slave {
>>>>      struct kobject kobj;
>>>>  };
>>>>
>>>> +struct bond_up_slave {
>>>> +    unsigned int    count;
>>>> +    struct rcu_head rcu;
>>>> +    struct slave    *arr[0];
>>>> +};
>>>> +
>>>>  /*
>>>>   * Link pseudo-state only used internally by monitors
>>>>   */
>>>> @@ -191,6 +197,7 @@ struct bonding {
>>>>      struct   slave __rcu *curr_active_slave;
>>>>      struct   slave __rcu *current_arp_slave;
>>>>      struct   slave __rcu *primary_slave;
>>>> +    struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
>>>>      bool     force_primary;
>>>>      s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>>>>      int     (*recv_probe)(const struct sk_buff *, struct bonding *,
>>>> @@ -220,6 +227,7 @@ struct bonding {
>>>>      struct   delayed_work alb_work;
>>>>      struct   delayed_work ad_work;
>>>>      struct   delayed_work mcast_work;
>>>> +    struct   delayed_work slave_arr_work;
>>>>  #ifdef CONFIG_DEBUG_FS
>>>>      /* debugging support via debugfs */
>>>>      struct   dentry *debug_dir;
>>>> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
>>>>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
>>>>                                            struct net_device *end_dev,
>>>>                                            int level);
>>>> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
>>>> +void bond_slave_arr_work_rearm(struct bonding *bond);
>>>>
>>>>  #ifdef CONFIG_PROC_FS
>>>>  void bond_create_proc_entry(struct bonding *bond);
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@...r.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ