Message-ID: <5196458f-202c-4ced-b2fa-08eae468233e@redhat.com>
Date: Thu, 8 May 2025 10:35:23 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: John Ousterhout <ouster@...stanford.edu>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org
Subject: Re: [PATCH net-next v8 04/15] net: homa: create homa_pool.h and
homa_pool.c
On 5/7/25 1:28 AM, John Ousterhout wrote:
> On Mon, May 5, 2025 at 2:51 AM Paolo Abeni <pabeni@...hat.com> wrote:
>> On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
>>> +        pool->check_waiting_invoked = 0;
>>> +
>>> +        return 0;
>>> +
>>> +error:
>>> +        kfree(pool->descriptors);
>>> +        free_percpu(pool->cores);
>>
>> The above assumes that 'pool' will be zeroed at allocation time, but the
>> allocator does not do that. You should probably add the __GFP_ZERO flag
>> to the pool allocator.
>
> The pool is allocated with kzalloc; that zeroes it, no?
You are right, I misread the alloc function used.
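
Just for the record (no change needed here, and the exact call site in
your code may differ from this illustration):

        /* equivalent, both return zeroed memory: */
        pool = kzalloc(sizeof(*pool), GFP_KERNEL);
        pool = kmalloc(sizeof(*pool), GFP_KERNEL | __GFP_ZERO);
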
>>> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
>>> +                        int set_owner)
>>> +{
>>> +        int core_num = smp_processor_id();
>>> +        struct homa_pool_core *core;
>>> +        u64 now = sched_clock();
>>
>> From sched_clock() documentation:
>>
>> sched_clock() has no promise of monotonicity or bounded drift between
>> CPUs, use (which you should not) requires disabling IRQs.
>>
>> Can't be used for an expiration time. You could use 'jiffies' instead,
>
> Jiffies are *really* coarse-grained (4 ms on my servers). It's
> possible that I could make them work in this situation, but in
> general jiffies won't work for Homa. Homa needs to make decisions at
> microsecond scale, and an RPC that takes one jiffy to complete is
> completely unacceptable. Homa needs a fine-grained (e.g. cycle-level)
> clock that is monotonic and synchronous across cores, and as far as I
> know, such a clock is available on every server where Homa is likely
> to run. For example, I believe that the TSC counters on both Intel and
> AMD chips have had the right properties for at least 10-15 years. And
> isn't sched_clock based on TSC where it's available? So even though
> sched_clock makes no official promises, isn't the reality actually
> fine? Can I simply stipulate that Homa is not appropriate for any
> machine where sched_clock doesn't have the properties Homa needs (this
> won't be a significant limitation in practice)?
>
> Ultimately I think Linux needs to bite the bullet and provide an
> official fine-grain clock with ns precision.
If you need ns precision, use a ktime_get() variant. The point here is
that you should not use sched_clock().
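
E.g. something along these lines (untested; pick the variant that
matches the calling context):

        /* instead of: u64 now = sched_clock(); */
        u64 now = ktime_get_ns();       /* CLOCK_MONOTONIC, ns resolution */

        /* or, if this can also run from NMI/any context: */
        u64 now = ktime_get_mono_fast_ns();
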
>>> +
>>> +                /* Figure out whether this candidate is free (or can be
>>> +                 * stolen). Do a quick check without locking the page, and
>>> +                 * if the page looks promising, then lock it and check again
>>> +                 * (must check again in case someone else snuck in and
>>> +                 * grabbed the page).
>>> +                 */
>>> +                if (!homa_bpage_available(bpage, now))
>>> +                        continue;
>>
>> homa_bpage_available() accesses bpage without the lock, so it needs
>> READ_ONCE() annotations on the relevant fields, and you need to add
>> paired WRITE_ONCE() when updating them.
>
> I think the code is safe as is. Even though some of the fields
> accessed by homa_bpage_available() are not atomic, it's not a disaster
> if they return stale values, since homa_bpage_available() is invoked
> again after acquiring the lock before making any final decisions. The
> worst that can happen is (a) skipping over a bpage that's actually
> available or (b) acquiring the lock only to discover the bpage wasn't
> actually available (and then skipping it). Neither of these is
> problematic.
>
> Also, I'm not sure what you mean by "READ_ONCE() annotations on the
> relevant fields". Do I need something additional in the field
> declaration, in addition to using READ_ONCE() and WRITE_ONCE() to
> access the field?
If you have concurrent access without any lock protection, fuzzers will
complain. If that is safe - I'm not sure at this point, it looks like a
double-read anti-pattern could be possible - you should document it with:
READ_ONCE(<field>)
to read the relevant field outside the lock, and:
WRITE_ONCE(<field>)
to write to such fields. Have a look at Documentation/memory-barriers.txt
for more details.
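
Roughly something like this (untested sketch; 'owner' and 'expiration'
are only placeholders for whatever fields homa_bpage_available()
actually inspects):

        /* lockless peek, e.g. inside homa_bpage_available(): */
        if (READ_ONCE(bpage->owner) >= 0 &&
            READ_ONCE(bpage->expiration) > now)
                return false;

        /* paired updates, done while holding bpage->lock: */
        WRITE_ONCE(bpage->owner, core_num);
        WRITE_ONCE(bpage->expiration, now + timeout);
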
>>> +                if (!spin_trylock_bh(&bpage->lock))
>>
>> Why only trylock? I have a vague memory of some discussion on this
>> point in a previous revision. You should at least add a comment here
>> or in the commit message explaining why a plain spin_lock does not fit.
>
> I think the reasoning is different here than in other situations we
> may have discussed. I have added the following comment:
> "Rather than wait for a locked page to become free, just go on to the
> next page. If the page is locked, it probably won't turn out to be
> available anyway."
The main point here (and elsewhere) is that unusual constructs that
sparked a question/comment on a prior revision should be documented, see:
https://elixir.bootlin.com/linux/v6.14.5/source/Documentation/process/6.Followthrough.rst#L80
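
Carrying that rationale next to the code is enough, e.g.:

        /* Rather than waiting for a locked page to become free, just
         * go on to the next candidate: if the page is locked, it
         * probably won't turn out to be available anyway.
         */
        if (!spin_trylock_bh(&bpage->lock))
                continue;
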
I'm sorry, I have spent a considerable slice of my time on this series
recently, starving a lot of other patches. I can't spend more time here
for a while.
I would like to stress that there are critical points still outstanding.
The locking scheme still looks problematic. I *think* that moving the
RPCs into a global hash (as opposed to per-socket containers), with its
own lock, could simplify the problem and help avoid the "AB"/"BA"
deadlocks.
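
Just to sketch the idea (very rough and untested, all names are made
up, not a drop-in design): a single table keyed by the RPC id, with
per-bucket locks, so that RPC lookup never needs the socket lock:

        #include <linux/hash.h>
        #include <linux/list.h>
        #include <linux/spinlock.h>

        #define HOMA_RPC_HASH_BITS      10

        struct homa_rpc_bucket {
                spinlock_t lock;
                struct hlist_head rpcs;
        };

        static struct homa_rpc_bucket homa_rpc_table[1 << HOMA_RPC_HASH_BITS];

        static struct homa_rpc_bucket *homa_rpc_bucket_get(u64 rpc_id)
        {
                return &homa_rpc_table[hash_64(rpc_id, HOMA_RPC_HASH_BITS)];
        }

        /* lookup/insert/delete take only the bucket lock, never the
         * socket lock, so the rpc-lock vs socket-lock ordering problem
         * goes away.
         */
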
Kernel API usage and the integration with the socket interface should
be improved.
I haven't yet looked in detail at the patches after 8/15, but I suspect
relevant rework is needed there, too.
/P