Message-ID: <6ff00a46-07f6-7be2-8e75-c87448568aa4@oracle.com>
Date: Mon, 1 Jul 2019 22:11:01 -0700
From: Gerd Rausch <gerd.rausch@...cle.com>
To: santosh.shilimkar@...cle.com, netdev@...r.kernel.org
Cc: David Miller <davem@...emloft.net>
Subject: Re: [PATCH net-next 3/7] net/rds: Wait for the FRMR_IS_FREE (or
FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV
Hi Santosh,
On 01/07/2019 19.28, santosh.shilimkar@...cle.com wrote:
>>
> Below. All command timeouts are 60 seconds.
>
> enum {
> MLX4_CMD_TIME_CLASS_A = 60000,
> MLX4_CMD_TIME_CLASS_B = 60000,
> MLX4_CMD_TIME_CLASS_C = 60000,
> };
>
Thank you for the pointer.
> But having said that, I re-looked the code you are patching
> and thats actually only FRWR code which is purely work-request
> based so this command timeout shouldn't matter.
>
Which brings us full circle to the question of
what the timeout ought to be.
Please keep in mind that prior to this fix,
the RDS code didn't wait at all:
it simply posted those registration (IB_WR_REG_MR)
and invalidation (IB_WR_LOCAL_INV)
work-requests, with no regard to when the firmware
would execute them.
Arguably, waiting any amount of time greater than zero
for the operation to complete is better than not waiting at all.
We can change the timeout to a high value, or even make it infinite
by using "wait_event" instead of "wait_event_timeout".
For the registration work-requests there is a benefit to waiting
only a short amount of time (the trade-off described in patch #1 of this series).
For de-registration work-requests, it is beneficial to wait
until they are truly done.
But: the function "rds_ib_unreg_frmr", both before and after this change,
simply moves on after a failed de-registration attempt
and releases the pages owned by the memory region.
This patch does _not_ change that behavior.
> If the work request fails, then it will lead to flush errors and
> MRs will be marked as STALE. So this wait may not be necessary
>
This wait is necessary to avoid the two scenarios described
in the commit-log message:
#1) Memory regions bouncing between "drop_list" and "clean_list"
as items on the "clean_list" aren't really clean until
their state transitions to "FRMR_IS_FREE".
#2) Prevent an access error, as "rds_ib_post_inv" is called
    just prior to dereferencing pages via "__rds_ib_teardown_mr".
    And you certainly don't want those pages populated in the
    HCA's memory-translation-table with full access, while
    the Linux kernel 'thinks' you gave them back already
    and starts repurposing them.
> RDS_GET_MR case is what actually showing the issue you saw
> and the fix for that Avinash has it in production kernel.
Actually, no:
socket option RDS_GET_MR wasn't even in the code path of the
tests I performed.
It was the RDS_CMSG_RDMA_MAP / RDS_CMSG_RDMA_DEST control
messages that ended up calling "__rds_rdma_map".
>
> I believe with that change, registration issue becomes non-issue
> already.
>
Please explain how that is related to this fix suggestion.
I submitted this patch #3 and the others in this series in order
to fix bugs in the RDS code that currently ships with Linux.
It may very well be the case that there are other changes
that Avinash put into production kernels that would be better
suited to fix this and other problems.
But that should not eliminate the need to fix what is currently broken.
Fixing what's broken does not preclude replacing the fixed code
with newer or better versions of the same.
> And as far as invalidation concerned with proxy qp, it not longer
> races with data path qp.
>
I don't understand, please elaborate.
> May be you can try those changes if not already to see if it
> addresses the couple of cases where you ended up adding
> timeouts.
>
I don't understand, please elaborate:
a) Are you saying this issue should not be fixed?
b) Or are you suggesting to replace this fix with a different fix?
If it's the latter, please point out what you have in mind.
c) ???
Thanks,
Gerd