[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200306134539.GI2546@localhost.localdomain>
Date: Fri, 6 Mar 2020 10:45:39 -0300
From: Marcelo Ricardo Leitner <marcelo.leitner@...il.com>
To: Paul Blakey <paulb@...lanox.com>
Cc: Edward Cree <ecree@...arflare.com>,
Saeed Mahameed <saeedm@...lanox.com>,
Oz Shlomo <ozsh@...lanox.com>,
Jakub Kicinski <jakub.kicinski@...ronome.com>,
Vlad Buslov <vladbu@...lanox.com>,
David Miller <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jiri Pirko <jiri@...lanox.com>, Roi Dayan <roid@...lanox.com>
Subject: Re: [PATCH net-next ct-offload 02/13] net/sched: act_ct: Instantiate
flow table entry actions
On Fri, Mar 06, 2020 at 03:22:01PM +0200, Paul Blakey wrote:
>
> On 06/03/2020 13:35, Edward Cree wrote:
> > On 05/03/2020 15:34, Paul Blakey wrote:
> >> NF flow table API associate 5-tuple rule with an action list by calling
> >> the flow table type action() CB to fill the rule's actions.
> >>
> >> In action CB of act_ct, populate the ct offload entry actions with a new
> >> ct_metadata action. Initialize the ct_metadata with the ct mark, label and
> >> zone information. If ct nat was performed, then also append the relevant
> >> packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).
> > On one hand, the mangle actions are what's already there and they're general
> > enough to cover this. But on the other hand, an explicit NAT flow_action
> > would mean drivers didn't have to grovel through the mangles to figure out
> > that NAT is what they're doing, in the case of HW that supports NAT but not
> > arbitrary pedit mangles. On the gripping hand, if the 'NAT recogniser' can
> > be wrapped up in a library function that drivers can use, that would
> > probably be OK too.
> >
> >> Drivers that offload the ft entries may match on the 5-tuple and perform
> >> the action list.
> >>
> >> Signed-off-by: Paul Blakey <paulb@...lanox.com>
> >> Reviewed-by: Jiri Pirko <jiri@...lanox.com>
> >> ---<snip>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> >> index 23eba61..0773456 100644
> >> --- a/net/sched/act_ct.c
> >> +++ b/net/sched/act_ct.c
> >> @@ -55,7 +55,199 @@ struct tcf_ct_flow_table {
> >> .automatic_shrinking = true,
> >> };
> >>
> >> +static inline struct flow_action_entry *
> >> +tcf_ct_flow_table_flow_action_get_next(struct flow_action *flow_action)
> >> +{
> >> + int i = flow_action->num_entries++;
> >> +
> >> + return &flow_action->entries[i];
> >> +}
> >> +
> >> +static void
> >> +tcf_ct_flow_table_add_action_nat_ipv4(const struct nf_conntrack_tuple *tuple,
> >> + struct nf_conntrack_tuple target,
> >> + struct flow_action *action)
> > This function could do with a comment explaining what it's doing. On
> > first reading I wondered whether those memcmp() were meant to be
> > !memcmp(). (Though that could also just mean I need more caffeine.)
These memcmp() had caught me as well. My reading of it is: "if the
addresses are different, it needs translation".
Similar situation happens with tcp/udp translations below, but there
it's just a port so '!=' was used. (like in
tcf_ct_flow_table_add_action_nat_tcp)
> Sure I'll add one.
> >> +{
> >> + struct flow_action_entry *entry;
> >> +
> >> + if (memcmp(&target.src.u3, &tuple->src.u3, sizeof(target.src.u3))) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP4;
> >> + entry->mangle.mask = ~(0xFFFFFFFF);
> > These parens are unnecessary.
> > In fact, mask is a u32, so '0' would be equivalent, though I can see a
> > documentational argument for keeping the ~0xffffffff spelling.
> Yes its this way because mangles masks are weird for some reason. ill remove the ().
> >
> >> + entry->mangle.offset = offsetof(struct iphdr, saddr);
> >> + entry->mangle.val = htonl(target.src.u3.ip);
> > AFAICT u3.ip is defined as __be32, so this htonl() is incorrect (did
> > sparse not warn about it?). It would rather be ntohl(), but in any
> > case normal kernel practice is be32_to_cpu().
> Will do.
> >
> >> + } else if (memcmp(&target.dst.u3, &tuple->dst.u3,
> >> + sizeof(target.dst.u3))) {
> > There have been mutterings from OVS about doing both SNAT and DNAT in a
> > single rule. I'm not sure if anything got merged, but it might be
> > worth at least checking that the branches aren't both true, rather than
> > having an elseif that skips the dst check if the src changed.
> right, it is possible as the recent changes to act ct allows the same,ill change this to an if.
> >
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP4;
> >> + entry->mangle.mask = ~(0xFFFFFFFF);
> >> + entry->mangle.offset = offsetof(struct iphdr, daddr);
> >> + entry->mangle.val = htonl(target.dst.u3.ip);
> >> + }
> >> +}
> >> +
> >> +static void
> >> +tcf_ct_flow_table_add_action_nat_ipv6(const struct nf_conntrack_tuple *tuple,
> >> + struct nf_conntrack_tuple target,
> >> + struct flow_action *action)
> >> +{
> >> + struct flow_action_entry *entry;
> >> + union nf_inet_addr *addr;
> >> + u32 next_offset = 0;
> >> + int i;
> >> +
> >> + if (memcmp(&target.src.u3, &tuple->src.u3, sizeof(target.src.u3))) {
> >> + addr = &target.src.u3;
> >> + next_offset = offsetof(struct iphdr, saddr);
> > Instead of setting parameters for the function tail (which rules out the
> > both-src-and-dst case), you could factor out the 'make the entries' loop
> > and just call it from here.
>
> right now its needed with src and dst
>
> >
> >> + } else if (memcmp(&target.dst.u3, &tuple->dst.u3,
> >> + sizeof(target.dst.u3))) {
> >> + addr = &target.dst.u3;
> >> + next_offset = offsetof(struct iphdr, daddr);
> >> + } else {
> >> + return;
> >> + }
> >> +
> >> + for (i = 0; i < sizeof(struct in6_addr) / sizeof(u32); i++) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP6;
> >> + entry->mangle.mask = ~(0xFFFFFFFF);
> >> + entry->mangle.val = htonl(addr->ip6[i]);
> >> + entry->mangle.offset = next_offset;
> > You don't need to perform strength reduction, the compiler is smart
> > enough to do that itself. Just using 'offset + i * sizeof(u32)' here
> > would be clearer imho.
> >
> Not my intention :) but will do.
> >> +
> >> + next_offset += sizeof(u32);
> >> + }
> >> +}
> >> +
> >> +static void
> >> +tcf_ct_flow_table_add_action_nat_tcp(const struct nf_conntrack_tuple *tuple,
> >> + struct nf_conntrack_tuple target,
> >> + struct flow_action *action)
> >> +{
> >> + struct flow_action_entry *entry;
> >> +
> >> + if (target.src.u.tcp.port != tuple->src.u.tcp.port) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_TCP;
> >> + entry->mangle.mask = ~(0xFFFF);
> > More unnecessary parens.
> will remove the all .
> >
> >> + entry->mangle.offset = offsetof(struct tcphdr, source);
> >> + entry->mangle.val = htons(target.src.u.tcp.port);
> >> + } else if (target.dst.u.tcp.port != tuple->dst.u.tcp.port) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_TCP;
> >> + entry->mangle.mask = ~(0xFFFF);
> >> + entry->mangle.offset = offsetof(struct tcphdr, dest);
> >> + entry->mangle.val = htons(target.dst.u.tcp.port);
> >> + }
> >> +}
> >> +
> >> +static void
> >> +tcf_ct_flow_table_add_action_nat_udp(const struct nf_conntrack_tuple *tuple,
> >> + struct nf_conntrack_tuple target,
> >> + struct flow_action *action)
> >> +{
> >> + struct flow_action_entry *entry;
> >> +
> >> + if (target.src.u.udp.port != tuple->src.u.udp.port) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_UDP;
> >> + entry->mangle.mask = ~(0xFFFF);
> >> + entry->mangle.offset = offsetof(struct udphdr, source);
> >> + entry->mangle.val = htons(target.src.u.udp.port);
> >> + } else if (target.dst.u.udp.port != tuple->dst.u.udp.port) {
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_MANGLE;
> >> + entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_UDP;
> >> + entry->mangle.mask = ~(0xFFFF);
> >> + entry->mangle.offset = offsetof(struct udphdr, dest);
> >> + entry->mangle.val = htons(target.dst.u.udp.port);
> >> + }
> >> +}
> > This is all very boilerplatey; I wonder if factoring it into some
> > preprocessor [ab]use would improve matters. Pro: less risk of a
> > src/dst or udp/tcp typo hiding in there. Con: have to read macros.
>
> like ADD_MANGLE_ENTRY(action, htype,....,val)...
>
> entry = tcf_ct_flow_table_flow_action_get_next(action);
> entry->id = FLOW_ACTION_MANGLE;
> entry->mangle.htype = htype;
> entry->mangle.mask = mask;
> entry->mangle.offset = offset;
> entry->mangle.val = val;
> ?
> then im for it.
>
> >
> >> +
> >> +static void tcf_ct_flow_table_add_action_meta(struct nf_conn *ct,
> >> + enum ip_conntrack_dir dir,
> >> + struct flow_action *action)
> >> +{
> >> + struct nf_conn_labels *ct_labels;
> >> + struct flow_action_entry *entry;
> >> + u32 *act_ct_labels;
> >> +
> >> + entry = tcf_ct_flow_table_flow_action_get_next(action);
> >> + entry->id = FLOW_ACTION_CT_METADATA;
> >> + entry->ct_metadata.zone = nf_ct_zone(ct)->id;
> > I'm not quite sure what the zone is doing in the action. Surely it's
> > a property of the match. Or does this set a ct_zone for a potential
> > *second* conntrack lookup?
>
> this is part of the metadata that driver should mark the with, as it
> can be matched against in following hardware tables/rules. consider
> this set of offloaded rules:
IOW, it's how it adds the zone information to packets just received,
so that it can be matched later on.
>
> tc filter add ...... chain 0 flower ct_state -trk action ct zone 5 goto chain 1
>
> tc filter add ...... chain 0 flower ct_state -trk action ct zone 3 goto chain 1
>
> tc filter add ...... chain 1 flower ct_state +trk+new action ct zone 3 commit pipe action mirred redirect dev1
>
> tc filter add ...... chain 1 flower ct_state +trk+new action ct zone 5 commit pipe action mirred redirect dev2
>
> tc filter add ...... chain 1 flower ct_state +trk+est ct_zone 3 action mirred redirect dev1
>
> tc filter add ...... chain 1 flower ct_state +trk+est ct_zone 5 action mirred redirect dev2
>
>
> so both offloaded +est rules match on packet metadata zone field to figure out the output port,
>
> this is what this action tell hardware to do, mark the packet with this zone, so it can be matched against.
>
>
> >> +#if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
> >> + entry->ct_metadata.mark = ct->mark;
> >> +#endif
> >> +
> >> + act_ct_labels = entry->ct_metadata.labels;
> >> + ct_labels = nf_ct_labels_find(ct);
> >> + if (ct_labels)
> >> + memcpy(act_ct_labels, ct_labels->bits, NF_CT_LABELS_MAX_SIZE);
> >> + else
> >> + memset(act_ct_labels, 0, NF_CT_LABELS_MAX_SIZE);
> >> +}
> >> +
> >> +static void tcf_ct_flow_table_add_action_nat(struct net *net,
> >> + struct nf_conn *ct,
> >> + enum ip_conntrack_dir dir,
> >> + struct flow_action *action)
> >> +{
> >> + const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
> >> + struct nf_conntrack_tuple target;
> >> +
> >> + nf_ct_invert_tuple(&target, &ct->tuplehash[!dir].tuple);
> >> +
> >> + tuple->src.l3num == NFPROTO_IPV4 ?
> >> + tcf_ct_flow_table_add_action_nat_ipv4(tuple, target, action) :
> >> + tcf_ct_flow_table_add_action_nat_ipv6(tuple, target, action);
> > I don't think this kind of ternary [ab]use is kernel style. Also it
> > doesn't let you check for the "not IPV6 either" case.
> > I'd suggest a switch statement. (And this whole tree of functions
> > should be able to return EOPNOTSUPPs for such "can't happen" / "we
> > are confused" cases, rather than being void.)
> we check the proto support earlier. i can change this to a switch and move the check here.
> >
> >> +
> >> + nf_ct_protonum(ct) == IPPROTO_TCP ?
> >> + tcf_ct_flow_table_add_action_nat_tcp(tuple, target, action) :
> >> + tcf_ct_flow_table_add_action_nat_udp(tuple, target, action);
> >> +}
> >> +
> >> +static int tcf_ct_flow_table_fill_actions(struct net *net,
> >> + const struct flow_offload *flow,
> >> + enum flow_offload_tuple_dir tdir,
> >> + struct nf_flow_rule *flow_rule)
> >> +{
> >> + struct flow_action *action = &flow_rule->rule->action;
> >> + const struct nf_conntrack_tuple *tuple;
> >> + struct nf_conn *ct = flow->ct;
> >> + enum ip_conntrack_dir dir;
> >> +
> >> + switch (tdir) {
> >> + case FLOW_OFFLOAD_DIR_ORIGINAL:
> >> + dir = IP_CT_DIR_ORIGINAL;
> >> + break;
> >> + case FLOW_OFFLOAD_DIR_REPLY:
> >> + dir = IP_CT_DIR_REPLY;
> >> + break;
> >> + default:
> >> + return -EOPNOTSUPP;
> >> + }
> >> +
> >> + tuple = &ct->tuplehash[dir].tuple;
> >> + if (tuple->src.l3num != NFPROTO_IPV4 &&
> >> + tuple->src.l3num != NFPROTO_IPV6)
> >> + return -EOPNOTSUPP;
> > Ah, is the proto check here rather than in
> > tcf_ct_flow_table_add_action_nat() to ensure that you don't
> > write *any* flow_action_entries in the unsupported case? In
> > that case maybe the real answer is to add a way to roll back
> > entry additions.
> > Since tcf_ct_flow_table_flow_action_get_next() doesn't appear
> > to do any allocation (or bounds-checking of num_entries!) it
> > seems all that would be needed is to save the old num_entries,
> > and restore it on failure exit.
> >
> > -ed
>
> ill add the bounds check so there is reason for this functions to fail :)
>
> and memset the new entries on fail.
>
> thanks for the review.
>
> Paul.
>
> >
> >> +
> >> + if (nf_ct_protonum(ct) != IPPROTO_TCP &&
> >> + nf_ct_protonum(ct) != IPPROTO_UDP)
> >> + return -EOPNOTSUPP;
> >> +
> >> + tcf_ct_flow_table_add_action_meta(ct, dir, action);
> >> + tcf_ct_flow_table_add_action_nat(net, ct, dir, action);
> >> + return 0;
> >> +}
> >> +
> >> static struct nf_flowtable_type flowtable_ct = {
> >> + .action = tcf_ct_flow_table_fill_actions,
> >> .owner = THIS_MODULE,
> >> };
> >>
> >>
>
Powered by blists - more mailing lists