netdev - Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower

Message-ID: <vbfimp5oig9.fsf@mellanox.com>
Date:   Thu, 3 Oct 2019 17:19:22 +0000
From:   Vlad Buslov <vladbu@...lanox.com>
To:     John Hurley <john.hurley@...ronome.com>
CC:     Vlad Buslov <vladbu@...lanox.com>, Jiri Pirko <jiri@...lanox.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "simon.horman@...ronome.com" <simon.horman@...ronome.com>,
        "jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
        "oss-drivers@...ronome.com" <oss-drivers@...ronome.com>
Subject: Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower


On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@...ronome.com> wrote:
> On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@...lanox.com> wrote:
>>
>>
>> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@...ronome.com> wrote:
>> > Hi,
>> >
>> > Putting this out as an RFC built on net-next. It fixes some issues
>> > discovered in testing when using the TC API of OvS to generate flower
>> > rules and subsequently offloading them to HW. Rules seen contain the same
>> > match fields or may be rule modifications run as a delete plus an add.
>> > We're seeing race conditions whereby the rules present in kernel flower
>> > are out of sync with those offloaded. Note that there are some issues
>> > that will need to be fixed in the RFC before it becomes a patch, such as
>> > potential races between releasing locks and re-taking them. However, I'm
>> > putting this out for comments or potential alternative solutions.
>> >
>> > The main cause of the races seems to be in the chain table of cls_api. If
>> > a tcf_proto is destroyed then it is removed from its chain. If a new
>> > filter is then added to the same chain with the same priority and protocol
>> > a new tcf_proto will be created - this may happen before the first is
>> > fully removed and the hw offload message sent to the driver. In cls_flower
>> > this means that the fl_ht_insert_unique() function can pass as its
>> > hashtable is associated with the tcf_proto. We are then in a position
>> > where the 'delete' and the 'add' are in a race to get offloaded. We also
>> > noticed that doing an offload add, then checking if a tcf_proto is
>> > concurrently deleting, then removing the offload if it is, can extend the
>> > out of order messages. Drivers do not expect to get duplicate rules.
>> > However, in the kernel TC datapath they are not duplicates, so we can get
>> > out of sync here.
>> >
>> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
>> > when a tcf_proto is signaled to be destroyed but before it is removed from
>> > its chain (which is essentially the lock for allowing duplicates in
>> > flower). Flower then uses this new hook to send the hw delete messages
>> > from tcf_proto destroys, preventing them racing with duplicate adds. It
>> > also moves the check for 'deleting' to before sending the hw add
>> > message.
>> >
>> > John Hurley (2):
>> >   net: sched: add tp_op for pre_destroy
>> >   net: sched: fix tp destroy race conditions in flower
>> >
>> >  include/net/sch_generic.h |  3 +++
>> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
>> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
>> >  3 files changed, 61 insertions(+), 26 deletions(-)
>>
>> Hi John,
>>
>> Thanks for working on this!
>>
>> Are there any other sources for race conditions described in this
>> letter? When you describe tcf_proto deletion you say "main cause" but
>> don't provide any others. If tcf_proto is the only problematic part,
>
> Hi Vlad,
> Thanks for the input.
> The tcf_proto deletion was the cause from the tests we ran. That's not
> to say there are not more I wasn't seeing in my analysis.
>
>> then it might be worth looking into alternative ways to force concurrent
>> users to wait for proto deletion/destruction to be properly finished.
>> Maybe having some table that maps chain id + prio to a completion would
>> be a simpler approach? With such infra tcf_proto_create() can wait for the
>> previous proto with the same prio and chain to be fully destroyed
>> (including offloads) before creating a new one.
>
> I think a problem with this is that the chain removal functions call
> tcf_proto_put() (which calls destroy when ref is 0) so, if other
> concurrent processes (like a dump) have references to the tcf_proto
> then we may not get the hw offload even by the time the chain deletion
> function has finished. We would need to make sure this was tracked -
> say after the tcf_proto_destroy function has completed.
> How would you suggest doing the wait? With a replay flag as happens in
> some other places?
>
> To me it seems the main problem is that the tcf_proto being in a chain
> almost acts like the lock to prevent duplicate filters getting to the
> driver. We need some mechanism to ensure a delete has made it to HW
> before we release this 'lock'.

Maybe something like:

1. Extend block with a hash table keyed by chain id and prio combined,
whose value is a structure that contains a struct completion (completed
in tcf_proto_destroy(), where we are sure that all rules were removed
from hw) and a reference counter.

2. When cls API wants to delete a proto instance
(tcf_chain_tp_delete_empty(), chain flush, etc.), a new member is added
to the table from 1. with the chain+prio of the proto that is being
deleted (atomically with detaching the proto from the chain).

3. When inserting a new proto, verify that there is no corresponding
entry in the hash table with the same chain+prio. If there is, increment
the reference counter and wait for completion. Release the reference
when completed.
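
The three steps above could look roughly like the following userspace
model. This is only a sketch under assumptions: a pthread mutex and
condition variable stand in for the kernel's locking and struct
completion, a plain chained table stands in for whatever hash table the
block would really use, and every name in it (block_model, pending_del,
pending_del_begin() and so on) is made up for illustration and is not an
existing cls_api symbol.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PENDING_BUCKETS 64

/* Step 1: one entry per proto detached from its chain but not yet destroyed. */
struct pending_del {
	uint32_t chain_id, prio;   /* combined key */
	bool done;                 /* stands in for struct completion */
	unsigned int refcnt;       /* deleter + any waiters */
	struct pending_del *next;
};

struct block_model {               /* hypothetical stand-in for the block */
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct pending_del *buckets[PENDING_BUCKETS];
};

unsigned int pending_hash(uint32_t chain_id, uint32_t prio)
{
	return (chain_id * 31u + prio) % PENDING_BUCKETS;
}

void block_model_init(struct block_model *b)
{
	pthread_mutex_init(&b->lock, NULL);
	pthread_cond_init(&b->cond, NULL);
	for (int i = 0; i < PENDING_BUCKETS; i++)
		b->buckets[i] = NULL;
}

/* Step 2: called together with detaching the proto from its chain. */
struct pending_del *pending_del_begin(struct block_model *b,
				      uint32_t chain_id, uint32_t prio)
{
	struct pending_del *e = calloc(1, sizeof(*e));
	unsigned int h = pending_hash(chain_id, prio);

	if (!e)
		return NULL;
	e->chain_id = chain_id;
	e->prio = prio;
	e->refcnt = 1;             /* reference owned by the deleter */
	pthread_mutex_lock(&b->lock);
	e->next = b->buckets[h];
	b->buckets[h] = e;
	pthread_mutex_unlock(&b->lock);
	return e;
}

/* End of destroy, after hw rules are gone: unlink, complete, drop deleter ref. */
void pending_del_complete(struct block_model *b, struct pending_del *e)
{
	struct pending_del **pp;

	pthread_mutex_lock(&b->lock);
	pp = &b->buckets[pending_hash(e->chain_id, e->prio)];
	while (*pp && *pp != e)
		pp = &(*pp)->next;
	if (*pp)
		*pp = e->next;
	e->done = true;
	pthread_cond_broadcast(&b->cond);
	if (--e->refcnt == 0)
		free(e);
	pthread_mutex_unlock(&b->lock);
}

/* Step 3a: on proto insert, look for a pending delete and take a reference. */
struct pending_del *pending_del_find_get(struct block_model *b,
					 uint32_t chain_id, uint32_t prio)
{
	struct pending_del *e;

	pthread_mutex_lock(&b->lock);
	for (e = b->buckets[pending_hash(chain_id, prio)]; e; e = e->next) {
		if (e->chain_id == chain_id && e->prio == prio) {
			e->refcnt++;
			break;
		}
	}
	pthread_mutex_unlock(&b->lock);
	return e;                  /* NULL means nothing to wait for */
}

/* Step 3b: wait for completion, then release the reference taken in 3a. */
void pending_del_wait_put(struct block_model *b, struct pending_del *e)
{
	pthread_mutex_lock(&b->lock);
	while (!e->done)
		pthread_cond_wait(&b->cond, &b->lock);
	if (--e->refcnt == 0)
		free(e);
	pthread_mutex_unlock(&b->lock);
}
```

In this model the insert path would call pending_del_find_get() and, if
it returns non-NULL, pending_del_wait_put() before creating the new
proto. The reference counter is what keeps the entry alive for waiters
after the deleter has already unlinked and completed it.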

>
>>
>> Regards,
>> Vlad