linux-kernel - Re: [PATCH] mailbox: Document the behavior of mbox_send

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=V_5b0RNkyfHEniMCWT+taYUUt_0EaCABQDCLB4R34tfw@mail.gmail.com>
Date: Mon, 5 Jan 2026 11:00:06 -0800
From: Doug Anderson <dianders@...omium.org>
To: Jassi Brar <jassisinghbrar@...il.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/
 NULL mssg

Hi,

On Sat, Dec 20, 2025 at 3:26 PM Jassi Brar <jassisinghbrar@...il.com> wrote:
>
> On Thu, Dec 18, 2025 at 3:40 PM Douglas Anderson <dianders@...omium.org> wrote:
> >
> > The way the mailbox core behaves when you pass a NULL `mssg` parameter
> > to mbox_send_message() seems a little questionable. Specifically, the
> > mailbox core stores the currently active message directly in its
> > `active_req` field. In at least two places it decides that if this
> > field is `NULL` then there is no active request. That means if `mssg`
> > is NULL it will always think there is no active request. The two
> > places where it does this are:
> >
> > 1. If a client calls mbox_send_message(), if `active_req` is NULL then
> >    it will call the mailbox controller to send the new message even if
> >    the mailbox controller hasn't yet called mbox_chan_txdone() on the
> >    previous (NULL) message.
> > 2. The mailbox core will never call the client's `tx_done()` callback
> >    with a NULL message because `tx_tick()` returns early whenever the
> >    message is NULL.
> >
> > The above could be seen as bugs and perhaps could be fixed. However,
> > doing a `git grep mbox_send_message.*NULL` shows 14 hits in mainline
> > today and people may be relying on the current behavior. It is,
> > perhaps, better to accept the current behavior.
> >
> > The current behavior can actually serve the purpose of providing a
> > simple way to assert an edge-triggered interrupt to the remote
> > processor on the other side of the mailbox. Specifically:
> >
> > 1. Like a normal edge-triggered interrupt, if multiple edges arrive
> >    before the interrupt is Acked they are coalesced.
> > 2. Like a normal edge-triggered interrupt, as long as the receiver
> >    (the remote processor in this case) "Ack"s the interrupt _before_
> >    checking for work and the sender (the mailbox client in this case)
> >    posts the interrupt _after_ adding new work then we can always be
> >    certain that new work will be noticed. This assumes that the
> >    mailbox clienut and remote processor have some out-of-band way to
> >    communicate work and the mailbox is just being used as an
> >    interrupt.
> >
> > Document the current behavior so that people can rely on it and know
> > that it will keep working the same way.
> >
> > NOTE: if a given mailbox client mixes and matches some NULL and some
> > non-NULL messages, things could get loopy without additional code
> > changes and rules. Without code changes, if we transfer a non-NULL
> > message then we'd stop coalescing future NULL messages until the queue
> > clears. Also: if we were transferring a NULL message and a non-NULL
> > came in, we'd send it right away but potentially report `tx_done()`
> > too early. For now, document mixing and matching NULL and non-NULL
> > messages as undefined.
> >
> So we are talking about  'doorbell' type controllers that don't transfer any
> data but only raise an irq to the other end.

Essentially. I'm still a bit new to the mailbox world, but at least in
the mailbox controller I'm more familiar with (and planning to
upstream soon) it's not actually a different type of controller,
though. The controller has shared memory to store a message and then
has a doorbell feature. If the remote processor doesn't need any data
but just needs an interrupt then we can have the mailbox controller
just ring the doorbell without any associated message.


> Ideally the client would pass the
> doorbell info via mssg in runtime, but there may be a driver that does not need
> that info from the client.

Right. Since 'mssg' is defined as arbitrary data it does seem
legitimate for clients to pass NULL for it. It's just unfortunate that
the core behaves differently when `mssg` is NULL...


> I believe some doorbell clients do call
>  mbox_send_message(chan, 0) , the arbitrary mssg 0 is unused by the driver
> and everything works fine.

Right (well, "NULL" instead of "0"). It's actually quite common. Using
`git grep -A2 mbox_send_message.*NULL`, I see:

* drivers/acpi/acpi_pcc.c
* drivers/firmware/arm_scmi/transports/mailbox.c
* drivers/firmware/imx/imx-dsp.c
* drivers/firmware/tegra/bpmp-tegra186.c
* drivers/irqchip/irq-qcom-mpm.c
* drivers/remoteproc/xlnx_r5_remoteproc.c
* drivers/rpmsg/qcom_glink_rpm.c
* drivers/rpmsg/qcom_glink_smem.c
* drivers/rpmsg/qcom_smd.c
* drivers/soc/qcom/qcom_aoss.c
* drivers/soc/qcom/smp2p.c
* drivers/soc/qcom/smsm.c
* drivers/soc/ti/wkup_m3_ipc.c
* drivers/soc/xilinx/zynqmp_power.c

...or did you mean something different when you said drivers are
passing "0" as an argument?


>  I agree that is not ideal - we should differentiate between nothing and
> 'null' data. Is that the issue you want addressed?

I guess I'm just trying to clarify what the behavior _should_ be when
`mssg` is NULL. At first I saw the current behavior as a bug. ...but
then I saw the number of people that were using NULL `mssg` and I
decided that fixing the current behavior might cause problems.

Maybe it's good just to clarify the difference in behavior so we're
all on the same page.

Let's imagine that a mailbox client quickly queues up 3 messages:
  static const u8 bogus_data[] = { 0 };
  mbox_send_message(chan, (void*) bogus_data);
  mbox_send_message(chan, (void*) bogus_data);
  mbox_send_message(chan, (void*) bogus_data);

What will that do? Assuming that the Apps Processor (AP) is much
faster than the remote processor, it will start sending the first
message right away and the other two will be queued. When the first
message is sent, the second message will be sent. When the second
message is sent, the third message will be sent. The remote processor
is guaranteed to see 3 messages.

Writing out a big diagram, I believe this will happen:

AP                                  remote
----------------------------------- -------------------
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    NOP because active_req
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    NOP because active_req
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()
      msg_submit()
        ops->send_data()
          ...
          drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()
      msg_submit()
        ops->send_data()
          ...
          drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()


Now, let's do the same but with NULL:
  mbox_send_message(chan, NULL);
  mbox_send_message(chan, NULL);
  mbox_send_message(chan, NULL);

Again, assuming that the AP is much faster and that the doorbell is a
normal edge-triggered interrupt, today this will happen:

AP                                  remote
----------------------------------- -------------------
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()


If the remote processor is much faster than the AP or the AP gets
interrupted between the `mbox_send_message(NULL)` calls then obviously
things could look a bit different, but hopefully the above makes it
clear enough?

As I said, at first I viewed the difference between the "NULL" and
"bogus_data" cases as a bug and thought maybe I should fix it, but
there truly could be cases where someone might want the existing
behavior.

Presumably anyone who is passing NULL for `mssg` has some other OOB
(out of band) way to pass data to the remote processor and the
doorbell is just telling the remote proc to look at the new OOB data.
With that, let's look at a (made up) example where the AP is trying to
let the remote processor know whenever the "event_counter" changed.
I'd imagine something like this:

AP                                  remote
----------------------------------- -------------------
shared_oob_event_counter++;
mbox_send_message(NULL)
shared_oob_event_counter++;
mbox_send_message(NULL)
shared_oob_event_counter++;
mbox_send_message(NULL)
                                    handle_doorbell()
                                      read_and_handle_event_counter()
                                      ack_doorbell()

In other words the remote processor sees the doorbell, handles all the
changes to the event counter, and then acks. If we instead fixed the
NULL case to actually queue up additional messages then the remote
processor would get 3 doorbells and the "event_counter" wouldn't
change at all for the last two...

I don't know for sure, but I'm imagining that at least some of the
existing 14 drivers that are passing NULL for `mssg` may be relying on
the existing behavior... From some quick spot checking:

smp2p_update_bits() - Updates (out of band) shared memory and uses
mbox_send_message(NULL) to kick the remote processor. That sounds very
similar to my made up "event_counter" example.

smsm_update_bits() seems similar

Many of the other examples I looked at looked like they weren't
relying on the existing behavior, but I didn't look at every one and
it's harder to trace all the usage of some of the drivers...


Anyway, I'm not 100% set on any one solution, I just see that the NULL
case is (unexpectedly) different and either want it documented or to
fix it so it's not different (ideally without angry people with
pitchforks coming after me because I changed behavior). :-)

-Doug