[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXJAmzVYDQtBVwdhazf9R2UgMCOOwppD+EM2-NY25t+N1vJhA@mail.gmail.com>
Date: Tue, 17 Dec 2024 21:46:42 -0800
From: John Ousterhout <ouster@...stanford.edu>
To: Edward Cree <ecree.xilinx@...il.com>
Cc: netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 03/12] net: homa: create shared Homa header files
(note: these comments arrived after I posted the v4 patch series, so
fixes will appear in v5)
On Mon, Dec 16, 2024 at 10:36 PM Edward Cree <ecree.xilinx@...il.com> wrote:
>
> On 09/12/2024 17:51, John Ousterhout wrote:
> > oma_impl.h defines "struct homa", which contains overall information
> > about the Homa transport, plus various odds and ends that are used
> > throughout the Homa implementation.
>
> Should parts of 'struct homa' be per network namespace, rather than
> global, so that in systems hosting multiple containers each netns can
> configure Homa for the way it wants to use it?
Possibly. I haven't addressed the issue of customizing the
configuration very thoroughly yet, but I can imagine it might happen
at multiple levels (e.g. for a network namespace, a socket, etc.). I'd
like to defer this a bit if possible.
> > +struct homa_interest {
> > + /**
> > + * @thread: Thread that would like to receive a message. Will get
> > + * woken up when a suitable message becomes available.
> > + */
> > + struct task_struct *thread;
> > +
> > + /**
> > + * @ready_rpc: This is actually a (struct homa_rpc *) identifying the
> > + * RPC that was found; NULL if no RPC has been found yet. This
> > + * variable is used for synchronization to handoff the RPC, and
> > + * must be set only after @locked is set.
> > + */
> > + atomic_long_t ready_rpc;
> > +
> > + /**
> > + * @locked: Nonzero means that @ready_rpc is locked; only valid
> > + * if @ready_rpc is non-NULL.
> > + */
> > + int locked;
>
> These feel weird; what kind of synchronisation is this for and why
> aren't any of Linux's existing locking primitives suitable? In
> particular the non-typesafe casting of ready_rpc is unpleasant.
> I looked at sync.txt and didn't find an explanation, and it wasn't
> obvious from reading homa_register_interests() either. (Are plain
> writes to int even guaranteed to be ordered wrt the atomics on
> rpc->flags or ready_rpc?)
> My best guess from looking at how `thread` is used is that all this
> is somehow simulating a completion? You shouldn't need to manually
> do stuff like sleeping and waking threads from within something as
> generic as a protocol implementation.
This is a lock-free mechanism to hand off a complete message to a
receiver thread (which may be polling, though the polling code has
been removed from this stripped down patch series). I couldn't find an
"atomic pointer" structure, which is why the code uses atomic_long_t
(which I agree is a bit ugly).Memory ordering w.r.t. @locked is
ensured by using atomic_long_set_release to modify @ready_rpc. I'm not
sure I understand your comment about not manually sleeping and waking
threads from within Homa; is there a particular mechanism for this
that you have in mind?
I improved the comment a bit, plus added accessor functions
homa_interest_get_rpc and homa_interest_set_rpc; hopefully this will
make the code a bit less mysterious.
> > + interest->request_links.next = LIST_POISON1;
> > + interest->response_links.next = LIST_POISON1;
>
> Any particular reason why you're opencoding poisoning, rather than
> using the list helpers (which distinguish between a list_head that
> has been inited but never added, so list_empty() returns true, and
> one which has been list_del()ed and thus poisoned)?
> It would likely be easier for others to debug any issues that arise
> in Homa if when they see a list_head in an oops or crashdump they
> can relate it to the standard lifecycle.
I couldn't find any other way to do this: I want to initialize the
links to be the same state as if list_del had been called, but
list_del requires the links already to be initialized. I couldn't find
a function that just poisons the links. I suppose I could first
initialize them and then call list_del, but that felt a bit awkward
also?
> > +/**
> > + * struct homa - Overall information about the Homa protocol implementation.
> > + *
> > + * There will typically only exist one of these at a time, except during
> > + * unit tests.
> > + */
> > +struct homa {
> > + /**
> > + * @next_outgoing_id: Id to use for next outgoing RPC request.
> > + * This is always even: it's used only to generate client-side ids.
> > + * Accessed without locks.
> > + */
> > + atomic64_t next_outgoing_id;
>
> Does the ID need to be unique for the whole machine or just per-
> interface? I would imagine it should be sufficient for the
> (id, source address) or even (id, saddr, sport) tuple to be
> unique.
IDs are unique to the client machine. I've updated the comment to indicate this.
> And are there any security issues here; ought we to do anything
> like TCP does with sequence numbers to try to ensure they aren't
> guessable by an attacker?
There probably are, and the right solutions may well be similar to
TCP. I'm fairly ignorant on the potential security issues; is there
someplace where I can learn more?
> > + /**
> > + * @throttled_rpcs: Contains all homa_rpcs that have bytes ready
> > + * for transmission, but which couldn't be sent without exceeding
> > + * the queue limits for transmission. Manipulate only with "_rcu"
> > + * functions.
> > + */
> > + struct list_head throttled_rpcs;
>
> I'm not sure exactly how it works but I believe you can annotate
> the declaration with __rcu to get sparse to enforce this.
I poked around and it appears to me that list_head's don't get
declared '__rcu' (this designation is intended for pointers, if I'm
understanding correctly). Instead, the list_head is manipulated with
rcu functions such as list_for_each_entry_rcu. Let me know if I'm
missing something?
> > + /**
> > + * @next_client_port: A client port number to consider for the
> > + * next Homa socket; increments monotonically. Current value may
> > + * be in the range allocated for servers; must check before using.
> > + * This port may also be in use already; must check.
> > + */
> > + __u16 next_client_port __aligned(L1_CACHE_BYTES);
>
> Again, does guessability by an attacker pose any security risks
> here?
Potentially; how can I learn more?
> > + /**
> > + * @link_bandwidth: The raw bandwidth of the network uplink, in
> > + * units of 1e06 bits per second. Set externally via sysctl.
> > + */
> > + int link_mbps;
>
> What happens if a machine has two uplinks and someone wants to
> use Homa on both of them? I wonder if most of the granting and
> pacing part of Homa ought to be per-netdev rather than per-host.
> (Though in an SDN case with a bunch of containers issuing their
> RPCs through veths you'd want a Homa-aware bridge that could do
> the SRPT rather than bandwidth sharing, and having everything go
> through a single Homa stack instance does give you that for free.
> But then a VM use-case still needs the clever bridge anyway.)
Yes, I'm planning to implement a Homa-specific qdisc that will
implement packet pacing on a per-netdev basis, and that will eliminate
the need for this variable. Granting may also need to be per-netdev...
I'll need to think more about that. Virtual machines are more complex;
to do Homa right, pacing (and potentially granting also) must be done
in a central place that knows about all traffic on a given link.
> > + /**
> > + * @timeout_ticks: abort an RPC if its silent_ticks reaches this value.
> > + */
> > + int timeout_ticks;
>
> This feels more like a socket-level option, perhaps? Just
> thinking out loud.
Possibly. Several other configuration options may also fit in this
category. As I said above, at some point I need to think more
comprehensively about managing options; hopefully this won't be a
show-stopper for upstreaming Homa?
> > + /**
> > + * @gso_force_software: A non-zero value will cause Home to perform
> > + * segmentation in software using GSO; zero means ask the NIC to
> > + * perform TSO. Set externally via sysctl.
> > + */
>
> "Home" appears to be a typo.
Fixed.
> > + /**
> > + * @temp: the values in this array can be read and written with sysctl.
> > + * They have no officially defined purpose, and are available for
> > + * short-term use during testing.
> > + */
> > + int temp[4];
>
> I don't think this belongs in upstream. At best maybe under an ifdef
> like CONFIG_HOMA_DEBUG?
I've now arranged for this to be stripped from the upstream version of
Homa. I'll keep it in my development version.
> > +/**
> > + * homa_get_skb_info() - Return the address of Homa's private information
> > + * for an sk_buff.
> > + * @skb: Socket buffer whose info is needed.
> > + */
> > +static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
> > +{
> > + return (struct homa_skb_info *)(skb_end_pointer(skb)
> > + - sizeof(struct homa_skb_info));
> > +}
> > +
> > +/**
> > + * homa_next_skb() - Compute address of Homa's private link field in @skb.
> > + * @skb: Socket buffer containing private link field.
> > + *
> > + * Homa needs to keep a list of buffers in a message, but it can't use the
> > + * links built into sk_buffs because Homa wants to retain its list even
> > + * after sending the packet, and the built-in links get used during sending.
> > + * Thus we allocate extra space at the very end of the packet's data
> > + * area to hold a forward pointer for a list.
> > + */
> > +static inline struct sk_buff **homa_next_skb(struct sk_buff *skb)
> > +{
> > + return (struct sk_buff **)(skb_end_pointer(skb) - sizeof(char *));
> > +}
>
> This is confusing — why doesn't homa_next_skb(skb) equal
> &homa_get_skb_info(skb)->next_skb? Is one used on TX and the other
> on RX, or something?
>
> And could these subtractions be written as first casting to the
> appropriate pointer type and then subtracting 1, instead of
> subtracting sizeof from the unsigned char *end_pointer?
> (Particularly as here you've taken sizeof a different kind of
> pointer — I know sizeof(char *) == sizeof(struct sk_buff *), but
> it's still kind of unclean.)
Good points. I have reworked homa_get_skb_info to use "- 1" as you
suggested. homa_next_skb is no longer used and incorrect; I have
removed it.
> > +
> > +/**
> > + * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
> > + * @h: Packet header whose doff field is to be set.
> > + * @size: Size of the "header", bytes (must be a multiple of 4). This
> > + * information is used only for TSO; it's the number of bytes
> > + * that should be replicated in each segment. The bytes after
> > + * this will be distributed among segments.
> > + */
> > +static inline void homa_set_doff(struct data_header *h, int size)
> > +{
> > + h->common.doff = size << 2;
>
> Either put a comment here about the data offset being the high 4
> bits of doff, or use "(size >> 2) << 4" (or both!); at first
> glance this looks like a typo shifting the wrong way.
> (TCP avoids this by playing games with bitfields in struct tcphdr.)
I added a comment.
> > +/**
> > + * ipv4_to_ipv6() - Given an IPv4 address, return an equivalent IPv6 address
> > + * (an IPv4-mapped one).
> > + * @ip4: IPv4 address, in network byte order.
> > + */
> > +static inline struct in6_addr ipv4_to_ipv6(__be32 ip4)
> > +{
> > + struct in6_addr ret = {};
> > +
> > + if (ip4 == htonl(INADDR_ANY))
> > + return in6addr_any;
> > + ret.in6_u.u6_addr32[2] = htonl(0xffff);
> > + ret.in6_u.u6_addr32[3] = ip4;
> > + return ret;
> > +}
> > +
> > +/**
> > + * ipv6_to_ipv4() - Given an IPv6 address produced by ipv4_to_ipv6, return
> > + * the original IPv4 address (in network byte order).
> > + * @ip6: IPv6 address; assumed to be a mapped IPv4 address.
> > + */
> > +static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
> > +{
> > + return ip6.in6_u.u6_addr32[3];
> > +}
> ...
> > +/**
> > + * is_mapped_ipv4() - Return true if an IPv6 address is actually an
> > + * IPv4-mapped address, false otherwise.
> > + * @x: The address to check.
> > + */
> > +static inline bool is_mapped_ipv4(const struct in6_addr x)
> > +{
> > + return ((x.in6_u.u6_addr32[0] == 0) &&
> > + (x.in6_u.u6_addr32[1] == 0) &&
> > + (x.in6_u.u6_addr32[2] == htonl(0xffff)));
> > +}
>
> These probably belong in some general inet header rather than being
> buried inside Homa. There's __ipv6_addr_type() but that might be a
> bit heavyweight; also ipv6_addr_v4mapped() and
> ipv6_addr_set_v4mapped() in include/net/ipv6.h.
I was able to replace is_mapped_ipv4 with ipv6_addr_v4mapped, and I
found ipv6_addr_set_v4mapped to replace ipv4_to_ipv6, but I couldn't
find replacements for ipv6_to_ipv4 or skb_is_ipv6, so I left those
functions in place.
-John-
Powered by blists - more mailing lists