netdev - Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <877dv6jw8p.fsf@mellanox.com>
Date:   Tue, 14 Jul 2020 11:12:38 +0200
From:   Petr Machata <petrm@...lanox.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Jiri Pirko <jiri@...lanox.com>,
        Ido Schimmel <idosch@...lanox.com>
Subject: Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks


Cong Wang <xiyou.wangcong@...il.com> writes:

> On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@...lanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@...il.com> writes:
>>
>> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@...lanox.com> wrote:
>> >>
>> >>
>> >> Petr Machata <petrm@...lanox.com> writes:
>> >>
>> >> > Cong Wang <xiyou.wangcong@...il.com> writes:
>> >> >
>> >> > I'll think about it some more. For now I will at least fix the lack of
>> >> > locking.
>> >>
>> >> I guess I could store smp_processor_id() that acquired the lock in
>> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> >> the stored value. I'll need to be careful about the race between
>> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> >> change after it is read. I'll probe this tomorrow.
>> >
>> > Like __netif_tx_lock(), right? Seems doable.
>>
>> Good to see it actually used, I wasn't sure if the idea made sense :)
>>
>> Unfortunately it is not enough.
>>
>> Consider two threads (A, B) and two netdevices (eth0, eth1):
>>
>> - "A" takes eth0's root lock and proceeds to classification
>> - "B" takes eth1's root lock and proceeds to classification
>> - "A" invokes mirror to eth1, waits on lock held by "B"
>> - "B" invakes mirror to eth0, waits on lock held by "A"
>> - Some say they are still waiting to this day.
>
> Sure, AA or ABBA deadlock.
>
>>
>> So one option that I see is to just stash the mirrored packet in a queue
>> instead of delivering it right away:
>>
>> - s/netif_receive_skb/netif_rx/ in act_mirred
>>
>> - Reuse the RX queue for TX packets as well, differentiating the two by
>>   a bit in SKB CB. Then process_backlog() would call either
>>   __netif_receive_skb() or dev_queue_transmit().
>>
>> - Drop mirred_rec_level guard.
>
> I don't think I follow you, the root qdisc lock is on egress which has
> nothing to do with ingress, so I don't see how netif_rx() is even involved.

netif_rx() isn't, but __netif_receive_skb() is, and that can lead to the
deadlock as well when another mirred redirects it back to the locked
egress queue.

So a way to solve "mirred ingress dev" action deadlock is to
s/netif_receive_skb/netif_rx/. I.e. don't resolve the mirror right away,
go through the per-CPU queue.

Then "mirred egress dev" could be fixed similarly by repurposing the
queue for both ingress and egress, differentiating ingress packets from
egress ones by a bit in SKB CB.

>>
>> This seems to work, but I might be missing something non-obvious, such
>> as CB actually being used for something already in that context. I would
>> really rather not introduce a second backlog queue just for mirred
>> though.
>>
>> Since mirred_rec_level does not kick in anymore, the same packet can end
>> up being forwarded from the backlog queue, to the qdisc, and back to the
>> backlog queue, forever. But that seems OK, that's what the admin
>> configured, so that's what's happening.
>>
>> If this is not a good idea for some reason, this might work as well:
>>
>> - Convert the current root lock to an rw lock. Convert all current
>>   lockers to write lock (which should be safe), except of enqueue, which
>>   will take read lock. That will allow many concurrent threads to enter
>>   enqueue, or one thread several times, but it will exclude all other
>>   users.
>
> Are you sure we can parallelize enqueue()? They all need to move
> skb into some queue, which is not able to parallelize with just a read
> lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
> for enqueue().

That's why the second spin lock is for. In guards private data,
including the queues.

>>
>>   So this guards configuration access to the qdisc tree, makes sure
>>   qdiscs don't go away from under one's feet.
>>
>> - Introduce another spin lock to guard the private data of the qdisc
>>   tree, counters etc., things that even two concurrent enqueue
>>   operations shouldn't trample on. Enqueue takes this spin lock after
>>   read-locking the root lock. act_mirred drops it before injecting the
>>   packet and takes it again afterwards.
>>
>> Any opinions y'all?
>
> I thought about forbidding mirror/redirecting to the same device,
> but there might be some legitimate use cases of such. So, I don't

Yes, and also that's not enough:

- A chain of mirreds can achieve the deadlocks as well (i.e. mirror to
  eth1, redirect back to eth0). Or the ABA case shown above, where it's
  two actions that don't even work with the same packets causing the
  deadlock.

- I suspect general forwarding could cause this deadlock as well. E.g.
  redirecting to ingress of a device, where bridge, router take over and
  bring the packet back to egress. I have not tried reproducing this
  though, maybe there's a queue or delayed work etc. somewhere in there
  that makes this not an issue.

> have any other ideas yet, perhaps there is some way to refactor
> dev_queue_xmit() to avoid this deadlock.