netdev - Re: [PATCH net-next ct-offload 02/13] net/sched: act_ct: Instantiate flow table entry actions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ce72a853-a416-4162-5ffb-c719c98fb7cc@solarflare.com>
Date:   Fri, 6 Mar 2020 11:35:18 +0000
From:   Edward Cree <ecree@...arflare.com>
To:     Paul Blakey <paulb@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        "Oz Shlomo" <ozsh@...lanox.com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Vlad Buslov <vladbu@...lanox.com>,
        David Miller <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Jiri Pirko <jiri@...lanox.com>, Roi Dayan <roid@...lanox.com>
Subject: Re: [PATCH net-next ct-offload 02/13] net/sched: act_ct: Instantiate
 flow table entry actions

On 05/03/2020 15:34, Paul Blakey wrote:
> NF flow table API associate 5-tuple rule with an action list by calling
> the flow table type action() CB to fill the rule's actions.
> 
> In action CB of act_ct, populate the ct offload entry actions with a new
> ct_metadata action. Initialize the ct_metadata with the ct mark, label and
> zone information. If ct nat was performed, then also append the relevant
> packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).
On one hand, the mangle actions are what's already there and they're general
 enough to cover this.  But on the other hand, an explicit NAT flow_action
 would mean drivers didn't have to grovel through the mangles to figure out
 that NAT is what they're doing, in the case of HW that supports NAT but not
 arbitrary pedit mangles.  On the gripping hand, if the 'NAT recogniser' can
 be wrapped up in a library function that drivers can use, that would
 probably be OK too.

> Drivers that offload the ft entries may match on the 5-tuple and perform
> the action list.
> 
> Signed-off-by: Paul Blakey <paulb@...lanox.com>
> Reviewed-by: Jiri Pirko <jiri@...lanox.com>
> ---<snip>> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 23eba61..0773456 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -55,7 +55,199 @@ struct tcf_ct_flow_table {
>  	.automatic_shrinking = true,
>  };
>  
> +static inline struct flow_action_entry *
> +tcf_ct_flow_table_flow_action_get_next(struct flow_action *flow_action)
> +{
> +	int i = flow_action->num_entries++;
> +
> +	return &flow_action->entries[i];
> +}
> +
> +static void
> +tcf_ct_flow_table_add_action_nat_ipv4(const struct nf_conntrack_tuple *tuple,
> +				      struct nf_conntrack_tuple target,
> +				      struct flow_action *action)
This function could do with a comment explaining what it's doing.  On
 first reading I wondered whether those memcmp() were meant to be
 !memcmp().  (Though that could also just mean I need more caffeine.)

> +{
> +	struct flow_action_entry *entry;
> +
> +	if (memcmp(&target.src.u3, &tuple->src.u3, sizeof(target.src.u3))) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP4;
> +		entry->mangle.mask = ~(0xFFFFFFFF);
These parens are unnecessary.
In fact, mask is a u32, so '0' would be equivalent, though I can see a
 documentational argument for keeping the ~0xffffffff spelling.

> +		entry->mangle.offset = offsetof(struct iphdr, saddr);
> +		entry->mangle.val = htonl(target.src.u3.ip);
AFAICT u3.ip is defined as __be32, so this htonl() is incorrect (did
 sparse not warn about it?).  It would rather be ntohl(), but in any
 case normal kernel practice is be32_to_cpu().

> +	} else if (memcmp(&target.dst.u3, &tuple->dst.u3,
> +			  sizeof(target.dst.u3))) {
There have been mutterings from OVS about doing both SNAT and DNAT in a
 single rule.  I'm not sure if anything got merged, but it might be
 worth at least checking that the branches aren't both true, rather than
 having an elseif that skips the dst check if the src changed.

> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP4;
> +		entry->mangle.mask = ~(0xFFFFFFFF);
> +		entry->mangle.offset = offsetof(struct iphdr, daddr);
> +		entry->mangle.val = htonl(target.dst.u3.ip);
> +	}
> +}
> +
> +static void
> +tcf_ct_flow_table_add_action_nat_ipv6(const struct nf_conntrack_tuple *tuple,
> +				      struct nf_conntrack_tuple target,
> +				      struct flow_action *action)
> +{
> +	struct flow_action_entry *entry;
> +	union nf_inet_addr *addr;
> +	u32 next_offset = 0;
> +	int i;
> +
> +	if (memcmp(&target.src.u3, &tuple->src.u3, sizeof(target.src.u3))) {
> +		addr = &target.src.u3;
> +		next_offset = offsetof(struct iphdr, saddr);
Instead of setting parameters for the function tail (which rules out the
 both-src-and-dst case), you could factor out the 'make the entries' loop
 and just call it from here.

> +	} else if (memcmp(&target.dst.u3, &tuple->dst.u3,
> +			  sizeof(target.dst.u3))) {
> +		addr = &target.dst.u3;
> +		next_offset = offsetof(struct iphdr, daddr);
> +	} else {
> +		return;
> +	}
> +
> +	for (i = 0; i < sizeof(struct in6_addr) / sizeof(u32); i++) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_IP6;
> +		entry->mangle.mask = ~(0xFFFFFFFF);
> +		entry->mangle.val = htonl(addr->ip6[i]);
> +		entry->mangle.offset = next_offset;
You don't need to perform strength reduction, the compiler is smart
 enough to do that itself.  Just using 'offset + i * sizeof(u32)' here
 would be clearer imho.

> +
> +		next_offset += sizeof(u32);
> +	}
> +}
> +
> +static void
> +tcf_ct_flow_table_add_action_nat_tcp(const struct nf_conntrack_tuple *tuple,
> +				     struct nf_conntrack_tuple target,
> +				     struct flow_action *action)
> +{
> +	struct flow_action_entry *entry;
> +
> +	if (target.src.u.tcp.port != tuple->src.u.tcp.port) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_TCP;
> +		entry->mangle.mask = ~(0xFFFF);
More unnecessary parens.

> +		entry->mangle.offset = offsetof(struct tcphdr, source);
> +		entry->mangle.val = htons(target.src.u.tcp.port);
> +	} else if (target.dst.u.tcp.port != tuple->dst.u.tcp.port) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_TCP;
> +		entry->mangle.mask = ~(0xFFFF);
> +		entry->mangle.offset = offsetof(struct tcphdr, dest);
> +		entry->mangle.val = htons(target.dst.u.tcp.port);
> +	}
> +}
> +
> +static void
> +tcf_ct_flow_table_add_action_nat_udp(const struct nf_conntrack_tuple *tuple,
> +				     struct nf_conntrack_tuple target,
> +				     struct flow_action *action)
> +{
> +	struct flow_action_entry *entry;
> +
> +	if (target.src.u.udp.port != tuple->src.u.udp.port) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_UDP;
> +		entry->mangle.mask = ~(0xFFFF);
> +		entry->mangle.offset = offsetof(struct udphdr, source);
> +		entry->mangle.val = htons(target.src.u.udp.port);
> +	} else if (target.dst.u.udp.port != tuple->dst.u.udp.port) {
> +		entry = tcf_ct_flow_table_flow_action_get_next(action);
> +		entry->id = FLOW_ACTION_MANGLE;
> +		entry->mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_UDP;
> +		entry->mangle.mask = ~(0xFFFF);
> +		entry->mangle.offset = offsetof(struct udphdr, dest);
> +		entry->mangle.val = htons(target.dst.u.udp.port);
> +	}
> +}
This is all very boilerplatey; I wonder if factoring it into some
 preprocessor [ab]use would improve matters.  Pro: less risk of a
 src/dst or udp/tcp typo hiding in there.  Con: have to read macros.

> +
> +static void tcf_ct_flow_table_add_action_meta(struct nf_conn *ct,
> +					      enum ip_conntrack_dir dir,
> +					      struct flow_action *action)
> +{
> +	struct nf_conn_labels *ct_labels;
> +	struct flow_action_entry *entry;
> +	u32 *act_ct_labels;
> +
> +	entry = tcf_ct_flow_table_flow_action_get_next(action);
> +	entry->id = FLOW_ACTION_CT_METADATA;
> +	entry->ct_metadata.zone = nf_ct_zone(ct)->id;
I'm not quite sure what the zone is doing in the action.  Surely it's
 a property of the match.  Or does this set a ct_zone for a potential
 *second* conntrack lookup?

> +#if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
> +	entry->ct_metadata.mark = ct->mark;
> +#endif
> +
> +	act_ct_labels = entry->ct_metadata.labels;
> +	ct_labels = nf_ct_labels_find(ct);
> +	if (ct_labels)
> +		memcpy(act_ct_labels, ct_labels->bits, NF_CT_LABELS_MAX_SIZE);
> +	else
> +		memset(act_ct_labels, 0, NF_CT_LABELS_MAX_SIZE);
> +}
> +
> +static void tcf_ct_flow_table_add_action_nat(struct net *net,
> +					     struct nf_conn *ct,
> +					     enum ip_conntrack_dir dir,
> +					     struct flow_action *action)
> +{
> +	const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
> +	struct nf_conntrack_tuple target;
> +
> +	nf_ct_invert_tuple(&target, &ct->tuplehash[!dir].tuple);
> +
> +	tuple->src.l3num == NFPROTO_IPV4 ?
> +		tcf_ct_flow_table_add_action_nat_ipv4(tuple, target, action) :
> +		tcf_ct_flow_table_add_action_nat_ipv6(tuple, target, action);
I don't think this kind of ternary [ab]use is kernel style.  Also it
 doesn't let you check for the "not IPV6 either" case.
I'd suggest a switch statement.  (And this whole tree of functions
 should be able to return EOPNOTSUPPs for such "can't happen" / "we
 are confused" cases, rather than being void.)

> +
> +	nf_ct_protonum(ct) == IPPROTO_TCP ?
> +		tcf_ct_flow_table_add_action_nat_tcp(tuple, target, action) :
> +		tcf_ct_flow_table_add_action_nat_udp(tuple, target, action);
> +}
> +
> +static int tcf_ct_flow_table_fill_actions(struct net *net,
> +					  const struct flow_offload *flow,
> +					  enum flow_offload_tuple_dir tdir,
> +					  struct nf_flow_rule *flow_rule)
> +{
> +	struct flow_action *action = &flow_rule->rule->action;
> +	const struct nf_conntrack_tuple *tuple;
> +	struct nf_conn *ct = flow->ct;
> +	enum ip_conntrack_dir dir;
> +
> +	switch (tdir) {
> +	case FLOW_OFFLOAD_DIR_ORIGINAL:
> +		dir = IP_CT_DIR_ORIGINAL;
> +		break;
> +	case FLOW_OFFLOAD_DIR_REPLY:
> +		dir = IP_CT_DIR_REPLY;
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	tuple = &ct->tuplehash[dir].tuple;
> +	if (tuple->src.l3num != NFPROTO_IPV4 &&
> +	    tuple->src.l3num != NFPROTO_IPV6)
> +		return -EOPNOTSUPP;
Ah, is the proto check here rather than in
 tcf_ct_flow_table_add_action_nat() to ensure that you don't
 write *any* flow_action_entries in the unsupported case?  In
 that case maybe the real answer is to add a way to roll back
 entry additions.
Since tcf_ct_flow_table_flow_action_get_next() doesn't appear
 to do any allocation (or bounds-checking of num_entries!) it
 seems all that would be needed is to save the old num_entries,
 and restore it on failure exit.

-ed

> +
> +	if (nf_ct_protonum(ct) != IPPROTO_TCP &&
> +	    nf_ct_protonum(ct) != IPPROTO_UDP)
> +		return -EOPNOTSUPP;
> +
> +	tcf_ct_flow_table_add_action_meta(ct, dir, action);
> +	tcf_ct_flow_table_add_action_nat(net, ct, dir, action);
> +	return 0;
> +}
> +
>  static struct nf_flowtable_type flowtable_ct = {
> +	.action		= tcf_ct_flow_table_fill_actions,
>  	.owner		= THIS_MODULE,
>  };
>  
>