1446336 Commits

Author SHA1 Message Date
Arnd Bergmann 0ee8ac903e RDMA/hfi1: Open-code rvt_set_ibdev_name()
clang warns about a function missing a printf attribute:

include/rdma/rdma_vt.h:457:47: error: diagnostic behavior may be improved by adding the 'format(printf, 2, 3)' attribute to the declaration of 'rvt_set_ibdev_name' [-Werror,-Wmissing-format-attribute]
  447 | static inline void rvt_set_ibdev_name(struct rvt_dev_info *rdi,
      | __attribute__((format(printf, 2, 3)))
  448 |                                       const char *fmt, const char *name,
  449 |                                       const int unit)

The helper was originally added as an abstraction for the hfi1 and
qib drivers needing the same thing, but now qib is gone, and hfi1
is the only remaining user of rdma_vt.

Avoid the warning and allow the compiler to check the format string by
open-coding the helper and directly assigning the device name.
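
As a rough userspace illustration of the point above (not the hfi1 code itself), the sketch below contrasts a variadic helper that carries the printf format attribute with an open-coded call whose literal format string the compiler can check directly. The function names and the "%s_%d" format are assumptions made for the example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical variadic helper: the attribute tells the compiler that
 * argument 3 is a printf format and the values start at argument 4. */
__attribute__((format(printf, 3, 4)))
static int set_name_checked(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}

/* Open-coded variant: the format is a literal at the call site, so no
 * attribute is needed for the compiler to check it. */
static int set_name_open_coded(char *buf, size_t len, const char *name,
			       int unit)
{
	return snprintf(buf, len, "%s_%d", name, unit);
}
```

With a single remaining caller, the open-coded form removes the need for a helper that forwards a format string at all.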

Fixes: 5084c8ff21 ("IB/{rdmavt, hfi1, qib}: Self determine driver name")
Link: https://patch.msgid.link/r/20260602140453.3542427-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05 12:38:42 -03:00
Jason Gunthorpe 55d984dae6 RDMA/umem: Make ib_umem_is_contiguous() safe on 32 bit
Sashiko points out the roundup_pow_of_two() only uses unsigned long but
dma_addr_t can be u64.

Simplify the algorithm: compute the best page size and, if one is found
and it covers the umem in a single block, the umem is contiguous.
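
The truncation hazard described above can be reproduced in userspace. This is only a sketch: uint32_t stands in for unsigned long on a 32-bit kernel, and both helper names are made up for the example.

```c
#include <stdint.h>

/* Round up to a power of two in 32-bit arithmetic, standing in for a
 * helper that only handles unsigned long on a 32-bit build. */
static uint32_t roundup_pow2_u32(uint32_t v)
{
	uint32_t r = 1;

	while (r != 0 && r < v)
		r <<= 1;
	return r;	/* 0 means the result did not fit */
}

/* The same computation carried out in 64 bits. */
static uint64_t roundup_pow2_u64(uint64_t v)
{
	uint64_t r = 1;

	while (r != 0 && r < v)
		r <<= 1;
	return r;
}
```

Passing a 64-bit value such as 4 GiB + 4 KiB through the 32-bit variant silently drops the upper bits before rounding, which is the class of bug the commit avoids.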

Link: https://patch.msgid.link/r/3-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05 12:36:33 -03:00
Jason Gunthorpe 09ea6837a0 RDMA/umem: Be careful about boundary conditions in ib_umem_find_best_pgsz()
Several corner cases, especially important on 32 bits:

- umem->iova is u64, the function argument should pass in u64 or
  iova will be truncated
- Check that the length is not too large for the iova
- Check that lengths > 4G don't overflow the GENMASK
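
The checks listed above can be sketched in plain C as follows. This is an illustration under assumptions, not the kernel implementation: the helper names are hypothetical, and the mask helper only mirrors the idea of building the bit mask in 64 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when [iova, iova + length) is non-empty and does not wrap past
 * the top of the 64-bit address space. */
static bool iova_range_ok(uint64_t iova, uint64_t length)
{
	return length != 0 && length - 1 <= UINT64_MAX - iova;
}

/* Mask with bits high..low set, computed in 64 bits so that lengths
 * above 4G are not truncated. Requires 0 <= low <= high <= 63. */
static uint64_t genmask_u64(unsigned int high, unsigned int low)
{
	return (UINT64_MAX >> (63 - high)) & (UINT64_MAX << low);
}
```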

Link: https://patch.msgid.link/r/2-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05 12:36:33 -03:00
Jason Gunthorpe bad4e98893 RDMA: Update the query_device() op
This op hasn't followed the normal pattern of passing NULL for udata when
invoked by the kernel. Instead the kernel caller creates a dummy ib_udata
on the stack and passes that in. It does not seem to currently be a bug,
but this flow should be modernized to use the new API flow and in the
process accept NULL as well.

Only mlx4 uses an input request structure, so have every other driver call
ib_is_udata_in_empty() to enforce the lack of request structs.

Use ib_respond_empty_udata() in every driver that does not use a response
struct.

Ensure a check for NULL udata before calling ib_respond_udata() in
bnxt_re, efa, and mlx5.

Make mlx4 safe to be called with NULL.

Link: https://patch.msgid.link/r/2-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-03 15:12:43 -03:00
Jason Gunthorpe 43b57d73eb RDMA/core: Don't make a dummy ib_udata on the stack in create_qp
Sashiko points out the udata for destruction has to be created using
uverbs_get_cleared_udata(). Move it to ib_core_uverbs.c so that the core
qp code can call it. Rework the call chain to pass the struct
uverbs_attr_bundle right up to the driver op callback.

Fixes a possible wild stack reference in drivers during error unwinding,
mlx5 can call rdma_udata_to_drv_context() from destroy_qp() when
destroying a QP.

Fixes: 00a79d6b99 ("RDMA/core: Configure selinux QP during creation")
Link: https://patch.msgid.link/r/1-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-03 15:12:43 -03:00
Cyrill Gorcunov b548a6c4ee RDMA/irdma: Fix typo in SQ completions generation
When we generate a completion for the SQ, the opcode is read correctly from
the ring buffer but is ignored when written back to the completion. This
seems to be a simple typo.

Link: https://patch.msgid.link/r/ahjB87k54bYdFbft@grain
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-03 15:05:29 -03:00
Maoyi Xie ba7c4912f7 RDMA/hns: drop dead empty check in setup_root_hem()
setup_root_hem() reads the first entry of head->root and checks
the returned pointer against NULL:

    root_hem = list_first_entry(&head->root,
                                struct hns_roce_hem_item, list);
    if (!root_hem)
        return -ENOMEM;

list_first_entry() never returns NULL. On an empty list it returns
container_of(head, ..., list), a non-NULL garbage pointer that
aliases the head. So the check is dead.
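
A minimal userspace re-creation of the list macros shows the behaviour. It is a sketch, not the kernel headers: struct item and first_item_or_null() are invented for the example, and the list member is placed first so the empty-list result is exactly the head pointer.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

struct item {
	struct list_head list;	/* first member, so its offset is 0 */
	int value;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* An emptiness test has to look at the list, not at the entry pointer. */
static struct item *first_item_or_null(struct list_head *head)
{
	if (list_empty(head))
		return NULL;
	return list_first_entry(head, struct item, list);
}
```

The kernel provides list_first_entry_or_null() for callers that really can see an empty list; here the caller guarantees a non-empty list, so no check is needed.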

The only caller adds an entry to head.root right before invoking
setup_root_hem():

    list_add(&root_hem->list, &head.root);
    ret = setup_root_hem(..., &head, ...);

So head.root is guaranteed non-empty on entry. Drop the check.

Link: https://patch.msgid.link/r/20260526054653.2054800-1-maoyixie.tju@gmail.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-03 15:04:40 -03:00
Tristan Madani d6ab440240 RDMA/rxe: Copy WQE to local buffer in non-SRQ receive path
For non-SRQ QPs, the responder reads WQE fields directly from the
shared queue buffer mapped into userspace. This allows a malicious
user to modify fields like num_sge or sge entries while the kernel
is processing the WQE, leading to out-of-bounds reads in
rxe_resp_check_length() and copy_data().

Introduce get_recv_wqe() that validates num_sge and copies the WQE
to a kernel-local buffer before processing, matching the approach
already used for SRQ WQEs in get_srq_wqe(). The srq_wqe buffer is
reused since SRQ and non-SRQ paths are mutually exclusive per QP.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://patch.msgid.link/r/20260518215040.1598586-3-tristan@talencesecurity.com
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:33:59 -03:00
Tristan Madani 22b8fbded6 RDMA/rxe: Fix TOCTOU heap overflow in get_srq_wqe
get_srq_wqe() reads wqe->dma.num_sge from the shared receive queue
buffer, which is mapped into userspace. It validates num_sge against
max_sge, but then re-reads the same field to calculate the memcpy
size. A concurrent userspace thread can modify num_sge between
validation and use, causing a heap buffer overflow when copying the
WQE into qp->resp.srq_wqe.

Read num_sge into a local variable and use it for both the bounds
check and the size calculation.
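
The single-fetch pattern can be sketched in userspace as below. The structure layouts, MAX_SGE and copy_wqe() are assumptions for illustration only and do not match the rxe definitions.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAX_SGE 4

struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Stand-in for a WQE living in memory another thread can write. */
struct shared_wqe {
	uint32_t num_sge;
	struct sge sge[16];
};

/* Kernel-private copy with room for at most MAX_SGE entries. */
struct local_wqe {
	uint32_t num_sge;
	struct sge sge[MAX_SGE];
};

static int copy_wqe(struct local_wqe *dst, const struct shared_wqe *src)
{
	/* Fetch the count exactly once; the volatile access keeps the
	 * compiler from re-reading the shared field later. */
	uint32_t num_sge = *(const volatile uint32_t *)&src->num_sge;

	if (num_sge > MAX_SGE)
		return -EINVAL;

	/* The validated snapshot, not the shared field, sizes the copy. */
	dst->num_sge = num_sge;
	memcpy(dst->sge, src->sge, num_sge * sizeof(src->sge[0]));
	return 0;
}
```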

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://patch.msgid.link/r/20260518215040.1598586-2-tristan@talencesecurity.com
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:32:48 -03:00
Jiri Pirko 3b6384dac1 RDMA/umem: Block plain userspace memory registration under CoCo bounce
When a device requires DMA bounce buffering inside a Confidential
Computing guest, __ib_umem_get_va() cannot work. The DMA mapping layer
redirects all mappings through swiotlb bounce buffers, so the device
receives DMA addresses pointing to bounce buffer memory rather than the
user's pages. Since RDMA devices access registered memory directly without
CPU involvement, there is no opportunity for swiotlb to synchronize
between the bounce buffer and the original pages.

The registration would already fail later on, since the umem mapping is
requested with DMA_ATTR_REQUIRE_COHERENT and gets rejected under
is_swiotlb_force_bounce() with -EIO. Fail early with -EOPNOTSUPP instead,
so the user gets a specific error code to react to.

Link: https://patch.msgid.link/r/20260517141311.2409230-3-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:27:29 -03:00
Jiri Pirko d7a40b5194 RDMA/uverbs: Expose CoCo DMA bounce requirement to userspace
In CoCo guests, guest memory is encrypted and untrusted (T=0) devices
cannot DMA to it directly; such transfers must go through unencrypted
bounce buffers. RDMA registers user pages for direct device access,
bypassing the DMA layer and thus any bouncing, so registered memory does
not work in this configuration.

Until trusted (T=1) device detection is available, conservatively flag
every device attached to a CoCo guest. Expose the condition to userspace
as IB_UVERBS_DEVICE_CC_DMA_BOUNCE in device_cap_flags_ex so applications
can avoid memory registration and fall back to copying buffers through
send/recv.

Link: https://patch.msgid.link/r/20260517141311.2409230-2-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:27:29 -03:00
Jiri Pirko 93ce776a5b RDMA/mlx5: Use UMEM attribute for QP doorbell record
Add an optional mlx5 driver-namespace UMEM attribute on QP
create so userspace can supply the doorbell record umem
explicitly, symmetric to the CQ side. Resolve it inside
mlx5_ib_db_map_user() and use it as a private DBR page when
present; otherwise take the existing UHW share-or-pin path
that preserves per-page DBR sharing across CQ/QP/SRQ in the
same process.

Add mlx5's first UVERBS_OBJECT_QP UAPI definition chain to
attach the new attr.

Link: https://patch.msgid.link/r/20260529134312.2836341-17-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:20:00 -03:00
Jiri Pirko d6ab1d2439 RDMA/mlx5: Use UMEM attribute for CQ doorbell record
Add an optional mlx5 driver-namespace UMEM attribute on CQ
create so userspace can supply the doorbell record buffer
explicitly. mlx5_ib_db_map_user() resolves the attribute (or
falls back to the legacy UHW VA) into a struct
ib_uverbs_buffer_desc and runs a unified lookup-then-pin:
VA-typed descriptors share a per-page umem across CQ/QP/SRQ
in the same process, FD-typed descriptors are pinned per call.

Link: https://patch.msgid.link/r/20260529134312.2836341-16-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko 2cc10972f5 RDMA/umem: Add ib_umem_is_contiguous() stub for !CONFIG_INFINIBAND_USER_MEM
ib_umem_is_contiguous() is defined under #ifdef
CONFIG_INFINIBAND_USER_MEM, but the #else branch lacks a stub.

Add the missing inline to fix a potentially broken build.

Fixes: c897c2c8b8 ("RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers")
Link: https://patch.msgid.link/r/20260529134312.2836341-15-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko 38fc5bab6c RDMA/mlx5: Use UMEM attributes for QP buffers in create_qp
Use the per-attribute UMEM helpers to pin QP buffer umems on
demand. The QP-type predicate selects between the BUF and RQ_BUF
attrs; raw-packet SQ uses its own dedicated SQ_BUF attr.

Link: https://patch.msgid.link/r/20260529134312.2836341-14-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko cd767d980c RDMA/uverbs: Use UMEM attributes for QP creation
Apply the per-attribute UMEM model to the QP create method. Add
three optional UMEM attributes that drivers pick from based on
how their user ABI lays out the QP rings:

- CREATE_QP_BUF_UMEM is a single user buffer that backs both
  the SQ and RQ of one QP. This is the common case where
  userspace pins one contiguous WQE region for the QP.
- CREATE_QP_SQ_BUF_UMEM and CREATE_QP_RQ_BUF_UMEM are a pair
  of user buffers backing the SQ and RQ independently, used
  when the two rings live in physically distinct user
  allocations and must be pinned and addressed separately.

Existing drivers would map their current umems as follows:

- mlx5: BUF for normal QPs (one ucmd->buf_addr covers SQ+RQ);
  for IB_QPT_RAW_PACKET and IB_QP_CREATE_SOURCE_QPN, the RQ
  side comes from ucmd->buf_addr (RQ-sized) via RQ_BUF and
  the SQ from ucmd->sq_buf_addr via SQ_BUF.
- mlx4: BUF, single ucmd.buf_addr covering SQ+RQ.
- hns: BUF, single ucmd.buf_addr covering SQ + ext-SGE + RQ.
- erdma: BUF, single ureq.qbuf_va sliced by the kernel into
  SQ at offset 0 and RQ at rq_offset.
- bnxt_re: SQ_BUF (ureq->qpsva) + RQ_BUF (ureq->qprva, the
  RQ side is skipped when the QP uses an SRQ).
- vmw_pvrdma: SQ_BUF (sbuf_addr) + RQ_BUF (rbuf_addr, the RQ
  side is skipped when the QP uses an SRQ).
- qedr: SQ_BUF (sq_addr) + RQ_BUF (rq_addr) for whichever
  side the QP type actually has (no SQ for XRC_TGT/GSI; no
  RQ for XRC_INI/XRC_TGT/SRQ).
- ionic: SQ_BUF (req.sq.addr) + RQ_BUF (req.rq.addr); both
  are skipped when the rings are placed in CMB instead of
  host memory.
- mana: raw-packet QP uses SQ_BUF (sq_buf_addr) only; the RC
  path uses multiple per-queue user buffers (ucmd.queue_buf[])
  that do not fit the SQ/RQ pair semantics of these attrs and
  stays on the legacy UHW path.
- efa, irdma, hfi1, ocrdma, mthca, cxgb4 and usnic do not pin
  a QP WQE buffer via umem; none of these attributes apply.

Link: https://patch.msgid.link/r/20260529134312.2836341-13-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko 7c5bbaaf44 RDMA/uverbs: Remove legacy umem field from struct ib_cq
Now that all drivers use a helper to get the umem and manage its lifetime,
the legacy umem field in struct ib_cq is no longer needed. Remove it
along with the ib_umem_get_cq_tmp() helper that populated it and its
handling in both the error and destroy paths.

Link: https://patch.msgid.link/r/20260529134312.2836341-12-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko c0a94fecec RDMA/mlx4: Use ib_umem_get_cq_buf() for user CQ buffer
Pin the user CQ buffer with ib_umem_get_cq_buf() and take
ownership of the umem in the driver; fall back to
ib_umem_get_va() for the legacy UHW VA path. Apply the same
ownership pattern to the resize path.

Link: https://patch.msgid.link/r/20260529134312.2836341-11-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:59 -03:00
Jiri Pirko c1837879e4 RDMA/bnxt_re: Use ib_umem_get_cq_buf_or_va() for user CQ buffer
Pin the user CQ buffer with ib_umem_get_cq_buf_or_va() and take
ownership of the umem in the driver. Apply the same ownership
pattern to the resize path.

Link: https://patch.msgid.link/r/20260529134312.2836341-10-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:58 -03:00
Jiri Pirko ffdc91c993 RDMA/mlx5: Use ib_umem_get_cq_buf_or_va() for user CQ buffer
Pin the user CQ buffer with ib_umem_get_cq_buf_or_va() and take
ownership of the umem in the driver. Apply the same ownership
pattern to the resize path.

Link: https://patch.msgid.link/r/20260529134312.2836341-9-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:58 -03:00
Jiri Pirko e3433474db RDMA/efa: Use ib_umem_get_cq_buf() for user CQ buffer
Pin the user CQ buffer with ib_umem_get_cq_buf() and take
ownership of the umem in the driver. Fall back to the
existing kernel-DMA path on NULL.

Link: https://patch.msgid.link/r/20260529134312.2836341-8-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:58 -03:00
Jiri Pirko f6d2e53ca5 RDMA/uverbs: Add CQ buffer UMEM attribute and driver helpers
Add UVERBS_ATTR_CREATE_CQ_BUF_UMEM and two driver-facing
wrappers, ib_umem_get_cq_buf() and ib_umem_get_cq_buf_or_va(),
that pin a CQ buffer umem from it. The wrappers reuse the
existing legacy CQ buffer-attr filler.

Link: https://patch.msgid.link/r/20260529134312.2836341-7-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:58 -03:00
Jiri Pirko 057bb70f53 RDMA/uverbs: Push out CQ buffer umem processing into a helper
Extract the UVERBS_ATTR_CREATE_CQ_BUFFER_* parser from the CQ
create handler into uverbs_create_cq_get_buffer_desc(), and wrap
it in ib_umem_get_cq_tmp(), the umem-producing helper the cq_create
handler now calls.

ib_umem_get_cq_tmp() is temporary; subsequent patches replace it
with driver-owned ib_umem_get_cq_buf*() wrappers built on the
same parser, and remove it once all CQ drivers have switched.

Link: https://patch.msgid.link/r/20260529134312.2836341-6-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:58 -03:00
Jiri Pirko df67f15c44 RDMA/umem: Route ib_umem_get_va() through ib_umem_get_attr_or_va()
ib_umem_get_va() is now redundant: ib_umem_get_attr_or_va() with
attrs=NULL and attr_id=0 covers the exact same path. Make it a static
inline wrapper instead of a separately exported symbol.

Link: https://patch.msgid.link/r/20260529134312.2836341-5-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:57 -03:00
Jiri Pirko 3cfdff484d RDMA/core: Introduce generic buffer descriptor infrastructure for umem
Introduce a per-attribute UVERBS_ATTR_UMEM model so each uverbs
command's umem set is explicit in its UAPI definition. Add
driver-facing wrapper helpers that pin a umem on demand from an
attribute or a VA addr; the driver owns the returned umem and
releases it from its destroy/error paths.

Link: https://patch.msgid.link/r/20260529134312.2836341-4-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:57 -03:00
Jiri Pirko 28fb701c64 RDMA/umem: Split ib_umem_get_va() into a thin wrapper around __ib_umem_get_va()
The follow-up patch introduces ib_umem_get_desc(), the canonical
desc-to-umem helper. It needs to pin a userspace VA without going through
the exported ib_umem_get_va() helper, so that ib_umem_get_va() can later
use the ib_umem_get_desc() flow too.

Move the existing ib_umem_get_va() to a static __ib_umem_get_va()
and have ib_umem_get_va() as a thin wrapper that calls it.

Link: https://patch.msgid.link/r/20260529134312.2836341-3-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:57 -03:00
Jiri Pirko af84105686 RDMA/umem: Rename ib_umem_get() to ib_umem_get_va()
The new umem getter family being introduced in follow-up patches
needs a fitting name for the central all-source helper that resolves
attributes, legacy fillers and a UHW VA fallback.

Rename the existing VA-pinning helper ib_umem_get() to ib_umem_get_va()
so the name is freed up. The new name is consistent with the names of the
rest of the helpers that are about to be introduced.

Link: https://patch.msgid.link/r/20260529134312.2836341-2-jiri@resnulli.us
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 20:19:57 -03:00
Purushothaman Ramalingam ded0abacdc RDMA/rxe: Fix typos in comments
Fix typos found by codespell in driver comments:

  rxe.c:       s/mangagement/management/
  rxe_param.h: s/interations/iterations/
  rxe_resp.c:  s/recive/receive/

No functional change.

Link: https://patch.msgid.link/r/20260527104527.3222-1-purush.ramalingam@gmail.com
Signed-off-by: Purushothaman Ramalingam <purush.ramalingam@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 10:39:10 -03:00
Dave Hansen 57d8790620 MAINTAINERS: Remove bouncing Intel RDMA ethernet protocol maintainer
The email for Krzysztof Czurylo is bouncing. Remove the entry.

Link: https://patch.msgid.link/r/20260526205140.32714-1-dave.hansen@linux.intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-29 10:38:06 -03:00
Jason Gunthorpe 9733e9f580 RDMA/core: Move flow related functions to ib_uverbs_support.ko
mlx5 uses these as part of the driver implementation, move them to the
support module instead.

Link: https://patch.msgid.link/r/6-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:11:43 -03:00
Jason Gunthorpe 1a76adc9b3 RDMA/core: Move ucaps into ib_uverbs_support.ko
mlx5 uses these, so move them from ib_uverbs.ko into the support module.

Link: https://patch.msgid.link/r/5-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:11:43 -03:00
Jason Gunthorpe 2274d8cb49 RDMA/core: Make a new module for the uverbs components needed by drivers
To maintain the split where drivers should not depend on ib_uverbs.ko,
add a new module, ib_uverbs_support.ko, which contains the driver-called
functions that are too large or too rare to be placed in
ib_uverbs_core.ko.

Start by moving most of rdma_core.c into this module, making some
adjustments to split it from the actual uverbs FD code.

This was not done originally because we lacked EXPORT_SYMBOL_NS and I had
a fear that drivers would abuse this interface surface.

Link: https://patch.msgid.link/r/4-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:09:37 -03:00
Jason Gunthorpe 2c296df0ad RDMA/core: Remove uverbs_async_event_release()
Instead of having an alternative fops release always use the standard
uverbs_uobject_fd_release() and route the special async behavior back up
through uverbs_obj_fd_type ops pointer.

This removes a dependency where the technically lower level rdma_core.c is
referring to a symbol from uverbs_std_types_async_fd.c.

Link: https://patch.msgid.link/r/3-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:09:37 -03:00
Jason Gunthorpe ab2d2b2872 RDMA/core: Move many of the little EXPORTs from uverbs_ioctl into ib_core_uverbs
Not as many drivers need these functions but it does free efa from the
ib_uverbs.ko dependency and follows the general design better.

Link: https://patch.msgid.link/r/2-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:07:26 -03:00
Jason Gunthorpe 33d1117eb5 RDMA/core: Do not compile ib_core_uverbs without USER_ACCESS
Remove the entire ib_core_uverbs.c from the build if
CONFIG_INFINIBAND_USER_ACCESS is not set. These functions are only used to
support uverbs and are never callable even if they happen to get linked
in.

Provide inlines for the missing ones to return errors to further push code
elimination in drivers.
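
The stub pattern mentioned above might look roughly like this. CONFIG_DEMO_USER_ACCESS, demo_copy_response() and demo_driver_op() are hypothetical names chosen for the sketch; they are not the symbols touched by this commit.

```c
#include <errno.h>
#include <stddef.h>

#ifdef CONFIG_DEMO_USER_ACCESS
/* Real implementation, built only when the option is enabled. */
int demo_copy_response(void *dst, const void *src, size_t len);
#else
/* Stub: always fails, so the compiler can drop the caller's success path. */
static inline int demo_copy_response(void *dst, const void *src, size_t len)
{
	(void)dst;
	(void)src;
	(void)len;
	return -EOPNOTSUPP;
}
#endif

/* A driver-side caller compiles unchanged in both configurations. */
static int demo_driver_op(void *resp)
{
	int ret = demo_copy_response(resp, "ok", 3);

	if (ret)
		return ret;
	return 0;
}
```

Because the stub returns a constant, code that only runs after a successful call becomes unreachable and can be eliminated at compile time.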

Link: https://patch.msgid.link/r/1-v3-43aba1969751+1988-ib_uverbs_support_ko_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-26 10:07:25 -03:00
Jason Gunthorpe e312f0ff9e Merge tag 'v7.1-rc5' into rdma.git for-next
For dependencies in the following patches

Resolve conflicts, use the goto labels from the rc tag.

* tag 'v7.1-rc5': (1526 commits)

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 13:48:00 -03:00
Tao Cui b86fd95805 RDMA/counter: Fix incorrect port index in rdma_counter_init() error cleanup
The error cleanup loop in rdma_counter_init() iterates with variable
'i' but accesses dev->port_data[port] instead of dev->port_data[i].
This causes the failed port's hstats to be freed multiple times while
leaking hstats of previously initialized ports.
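
The index mix-up is easy to see in a reduced userspace version of the unwind loop. This is a sketch with made-up names; fail_at is a hypothetical knob that forces one allocation to fail.

```c
#include <errno.h>
#include <stdlib.h>

#define NUM_PORTS 4

struct port_data {
	int *hstats;
};

/* Allocate per-port stats; fail_at selects a port whose allocation is
 * made to fail, or -1 for none. */
static int ports_init(struct port_data *ports, int fail_at)
{
	int port, i;

	for (port = 0; port < NUM_PORTS; port++) {
		ports[port].hstats = (port == fail_at) ?
				     NULL : calloc(1, sizeof(int));
		if (!ports[port].hstats)
			goto fail;
	}
	return 0;

fail:
	/* Unwind the ports set up so far. Indexing with 'port' here would
	 * free the failed slot repeatedly and leak the earlier ones. */
	for (i = port - 1; i >= 0; i--) {
		free(ports[i].hstats);
		ports[i].hstats = NULL;
	}
	return -ENOMEM;
}
```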

Fixes: 56594ae1d2 ("RDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked")
Link: https://patch.msgid.link/r/20260520104546.1776253-3-cuitao@kylinos.cn
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 12:40:08 -03:00
Tao Cui 4fbc823000 RDMA/counter: Fix num_counters leak on bind_qp failure in alloc_and_bind()
When __rdma_counter_bind_qp() fails in alloc_and_bind(), the error path
jumps to err_mode which frees the counter without decrementing
port_counter->num_counters. The only place that decrements is
rdma_counter_free(), which is unreachable since the counter was never
successfully bound.

This leak accumulates across repeated failures, permanently preventing
the port from switching to AUTO mode (-EBUSY in __counter_set_mode())
and blocking the MANUAL→NONE auto-revert in rdma_counter_free(). When
the mode was NONE before the call, the MANUAL mode set by
__counter_set_mode() also leaks since the revert logic is never
reached.

Add an err_bind label between the num_counters increment and the
existing err_mode label. It decrements num_counters and mirrors the
MANUAL→NONE revert from rdma_counter_free(), ensuring the port state
is fully restored on bind failure.
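
The unwinding described above can be modelled in a few lines. The sketch below is an assumption-laden simplification: the types, the bind_fails parameter and the error values are invented, and only the shape of the err_bind label follows the description.

```c
#include <errno.h>
#include <stdbool.h>

enum counter_mode { MODE_NONE, MODE_MANUAL, MODE_AUTO };

struct port_counter {
	enum counter_mode mode;
	int num_counters;
};

/* bind_fails is a hypothetical stand-in for a failing bind step. */
static int alloc_and_bind(struct port_counter *pc, bool bind_fails)
{
	int ret;

	if (pc->mode == MODE_AUTO)
		return -EINVAL;
	pc->mode = MODE_MANUAL;
	pc->num_counters++;

	ret = bind_fails ? -EIO : 0;
	if (ret)
		goto err_bind;
	return 0;

err_bind:
	/* Undo the increment and, if this was the only counter, revert
	 * MANUAL to NONE so the port state matches what it was before. */
	pc->num_counters--;
	if (pc->num_counters == 0 && pc->mode == MODE_MANUAL)
		pc->mode = MODE_NONE;
	return ret;
}
```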

Link: https://patch.msgid.link/r/20260520104546.1776253-2-cuitao@kylinos.cn
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 12:40:08 -03:00
Lianfa Weng bbd97d71e5 RDMA/hns: Fix log flood after cmd_mbox failure
hns_roce_cmd_mbox() is the command interface between driver and
hardware. When the hardware is abnormal, the unlimited error prints
after a hns_roce_cmd_mbox() failure cause a log flood and can even
crash the system.

Replace ibdev_err() and ibdev_warn() with their ratelimited versions
in the error handling path after hns_roce_cmd_mbox() (and its wrappers
hns_roce_create_hw_ctx/hns_roce_destroy_hw_ctx) fails.
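
To show what rate limiting buys here, the following sketch implements a simple interval-and-burst limiter in userspace. It is not the kernel's ratelimit code; the structure, field names and the explicit now argument are assumptions made so the example is deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

struct ratelimit {
	uint64_t interval;	/* window length, in arbitrary ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;
	unsigned int printed;
	unsigned int missed;	/* messages suppressed so far */
};

/* Returns true when the caller may print at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With a burst of 10, a storm of 100 failures inside one window produces 10 log lines instead of 100.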

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://patch.msgid.link/r/20260520055759.2354037-4-huangjunxian6@hisilicon.com
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 11:39:04 -03:00
Lianfa Weng 3f19c2a385 RDMA/hns: Fix warning in poll cq direct mode
CQs allocated by ib_alloc_cq() always have a comp_handler. Though
in direct mode this handler is never expected to be called, it
is still called when the driver is reset, triggering the following
WARN_ONCE():

Call trace:
ib_cq_completion_direct+0x38/0x60
hns_roce_cq_completion+0x54/0x90 [hns_roce_hw_v2]
hns_roce_handle_device_err+0x1c8/0x340 [hns_roce_hw_v2]
hns_roce_hw_v2_uninit_instance.constprop.0+0x34/0x70 [hns_roce_hw_v2]
hns_roce_hw_v2_reset_notify+0xc4/0xe0 [hns_roce_hw_v2]
hclge_notify_roce_client+0x60/0xbc [hclge]
hclge_reset_rebuild+0x48/0x34c [hclge]
hclge_reset_subtask+0xcc/0xec [hclge]
hclge_reset_service_task+0x80/0x160 [hclge]
hclge_service_task+0x50/0x80 [hclge]
process_one_work+0x1cc/0x4d0
worker_thread+0x154/0x414
kthread+0x104/0x144
ret_from_fork+0x10/0x18

Fixes: f295e4cece ("RDMA/hns: Delete unnecessary callback functions for cq")
Link: https://patch.msgid.link/r/20260520055759.2354037-3-huangjunxian6@hisilicon.com
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 11:39:04 -03:00
Guangshuo Li 9a8826fdfb IB/mlx4: Fix refcount leak in add_port() error path
After kobject_init_and_add(), the lifetime of the embedded struct
kobject is expected to be managed through the kobject core reference
counting.

In add_port(), failure paths after kobject_init_and_add() must not free
struct mlx4_port directly, because the embedded kobject is then managed
by the kobject core. Freeing it directly leaves the kobject reference
counting unbalanced and can lead to incorrect lifetime handling.

Allocate the pkey and gid attribute arrays before kobject_init_and_add(),
so failures before kobject initialization can be handled by directly
freeing the allocated memory. Once kobject_init_and_add() has been
called, unwind later failures by removing any successfully created sysfs
groups, calling kobject_del(), and then releasing the embedded kobject
with kobject_put().
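
The ownership rule can be modelled with a tiny reference-counted object. This is a userspace sketch under assumptions: struct kobj, add_port() and the fail_point knob are invented, and sysfs group handling is left out.

```c
#include <stdlib.h>

struct kobj {
	int refcount;
	void (*release)(struct kobj *);
};

struct port {
	struct kobj kobj;	/* first member, so a kobj pointer is the port */
	int *attrs;
};

static int release_calls;	/* counts release() runs, for the demo only */

static void port_release(struct kobj *k)
{
	struct port *p = (struct port *)k;

	free(p->attrs);
	free(p);
	release_calls++;
}

static void kobj_init(struct kobj *k, void (*release)(struct kobj *))
{
	k->refcount = 1;
	k->release = release;
}

static void kobj_put(struct kobj *k)
{
	if (--k->refcount == 0)
		k->release(k);
}

enum fail_point { FAIL_NONE, FAIL_BEFORE_INIT, FAIL_AFTER_INIT };

static struct port *add_port(enum fail_point fail)
{
	struct port *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->attrs = calloc(4, sizeof(*p->attrs));
	if (!p->attrs || fail == FAIL_BEFORE_INIT) {
		/* Not yet reference counted: plain free is correct. */
		free(p->attrs);
		free(p);
		return NULL;
	}
	kobj_init(&p->kobj, port_release);
	if (fail == FAIL_AFTER_INIT) {
		/* Reference counted now: drop the reference instead. */
		kobj_put(&p->kobj);
		return NULL;
	}
	return p;
}
```

Allocating everything that can fail before the object becomes reference counted keeps the direct-free path simple, which matches the ordering the commit describes.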

Fixes: c1e7e46612 ("IB/mlx4: Add iov directory in sysfs under the ib device")
Link: https://patch.msgid.link/r/20260518021910.972900-1-lgs201920130244@gmail.com
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 11:25:17 -03:00
Zhu Yanjun 35744ab3d0 RDMA/rxe: Fix a use-after-free problem in rxe_mmap
rxe_mmap() removes a rxe_mmap_info struct from the pending_mmaps list
and releases pending_lock while the struct's kref is still at 1:

   list_del_init(&ip->pending_mmaps);
   spin_unlock_bh(&rxe->pending_lock);   /* ref == 1, no lock held */
   ret = remap_vmalloc_range(vma, ip->obj, 0);  /* walks PTEs */
   [...]
   rxe_vma_open(vma);                    /* kref_get, ref → 2 */
   remap_vmalloc_range_partial() walks PTEs without any lock.

A concurrent DESTROY_CQ ioctl on another CPU calls:

    kref_put(&q->ip->ref, rxe_mmap_release)   /* ref 1→0 */
    vfree(ip->obj)   /* clears vmalloc PTEs mid-walk */
    kfree(ip)        /* frees rxe_mmap_info */

This yields:

   1. Kernel crash, vmalloc_to_page() returns NULL when vfree wins the
   per-PTE race -> vm_insert_page(NULL) → GPF in validate_page_before_insert

   2. Page UAF, vmalloc_to_page() reads a stale PTE before vfree clears
   it. The user VMA holds a PTE to a freed page, which might eventually be
   reallocated by vmalloc, which allows the attacker to get a clean
   page-level UAF.

   It is worth noting that even though a page-level UAF is possible given
   the strong primitive, it is statistically very difficult to achieve
   given the very short time window (after the last insert_page and before
   the kref_get).

The call trace is as follows:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
  CPU: 0 UID: 1000 PID: 413 Comm: poc Not tainted 7.0.0-rc5-dirty #28 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:validate_page_before_insert+0x32/0x300
  Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 e8 93 b5 a3 ff 48 8d 7b 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 7b 02 00 00 4c 8b 63 08 31 ff 4d 89 e5 41 83 e5
  RSP: 0018:ffff88811b15f2f0 EFLAGS: 00000202
  RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
  RBP: ffff88811b15f318 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881181eee00
  R13: 0000000000000000 R14: ffff8881181eee00 R15: ffff8881181eee20
  FS:  00007b1e000f76c0(0000) GS:ffff8884268e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007b1e00a24ac0 CR3: 0000000116eb3000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   insert_page+0x8f/0x190
   ? __pfx_insert_page+0x10/0x10
   ? kasan_save_alloc_info+0x38/0x60
   vm_insert_page+0x2e7/0x400
   remap_vmalloc_range_partial+0x212/0x3e0
   remap_vmalloc_range+0x6e/0xb0
   ? __kasan_check_write+0x14/0x30
   rxe_mmap+0x2e9/0x5d0
   ib_uverbs_mmap+0x1ad/0x2c0
   __mmap_region+0x12c2/0x2ad0
   ? __pfx___mmap_region+0x10/0x10
   ? __sanitizer_cov_trace_switch+0x58/0xb0
   ? mas_prev_slot+0x360/0x39c0
   ? __sanitizer_cov_trace_switch+0x58/0xb0
   ? mas_next_slot+0x1e5b/0x2f40
   ? __sanitizer_cov_trace_cmp8+0x18/0x30
   ? unmapped_area_topdown+0x4dd/0x610
   ? kfree+0x1b1/0x440
   ? free_cpumask_var+0x16/0x30
   ? __kasan_slab_free+0x7d/0xa0
   ? __sanitizer_cov_trace_cmp8+0x18/0x30
   mmap_region+0x2e6/0x3c0
   do_mmap+0xa3e/0x12a0
   ? __pfx_do_mmap+0x10/0x10
   ? __kasan_check_write+0x14/0x30
   ? down_write_killable+0xba/0x160
   ? __pfx_down_write_killable+0x10/0x10
   ? __sanitizer_cov_trace_cmp4+0x16/0x30
   vm_mmap_pgoff+0x2d4/0x4a0
   ? __pfx_vm_mmap_pgoff+0x10/0x10
   ? fget+0x1bf/0x270
   ksys_mmap_pgoff+0x40c/0x690
   ? __sanitizer_cov_trace_const_cmp4+0x16/0x30
   ? __pfx_ksys_mmap_pgoff+0x10/0x10
   ? __kasan_check_write+0x14/0x30
   ? _raw_spin_trylock+0xbb/0x130
   ? __pfx__raw_spin_trylock+0x10/0x10
   __x64_sys_mmap+0x135/0x1e0
   x64_sys_call+0x1c14/0x2790
   do_syscall_64+0xd2/0x1050
   ? rcu_core+0x352/0x7d0
   ? rcu_core_si+0xe/0x20
   ? handle_softirqs+0x1aa/0x650
   ? __sanitizer_cov_trace_cmp4+0x16/0x30
   ? fpregs_assert_state_consistent+0xe1/0x160
   ? irqentry_exit+0xb1/0x670
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
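
The register dump above can be cross-checked by hand. RBX (a copy of the second argument, presumably the page pointer) is 0, RDI = RBX + 8 = 8, and the faulting address 0xdffffc0000000001 is RAX + RDX, which is the KASAN shadow byte for address 8. A minimal sketch of that arithmetic follows. It assumes the generic x86-64 KASAN layout (shadow offset 0xdffffc0000000000, one shadow byte per 8 bytes of memory), and the helper names are illustrative, not kernel functions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 generic KASAN constants; they match RAX and the shift by 3 in the dump. */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Shadow byte address that is checked before an access to `addr`. */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* First byte of the 8-byte granule described by a given shadow byte. */
static uint64_t kasan_granule_start(uint64_t shadow)
{
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

Feeding 0x8 (RDI) into kasan_shadow_addr() gives the non-canonical address in the Oops line, and mapping that shadow byte back gives the start of the range [0x8-0xf] that KASAN reports.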

Link: https://patch.msgid.link/r/20260515002537.6209-1-yanjun.zhu@linux.dev
Reported-and-tested-by: nasm <n4sm@protonmail.com>
Suggested-by: nasm <n4sm@protonmail.com>
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 11:25:16 -03:00
Jacob Moroni 5ebb3ed757 RDMA/irdma: Fix out-of-bounds write in irdma_copy_user_pgaddrs
The irdma_copy_user_pgaddrs function loops through all of the umem DMA
blocks to populate the PBLEs and will stop when either the last DMA
block is reached or palloc->total_cnt is reached. The issue is that
the logic for checking palloc->total_cnt would only work for non-zero
values.

When irdma_setup_pbles is called with lvl==0, it calls
irdma_copy_user_pgaddrs with palloc->total_cnt==0, so the only way to
break out of the loop is to reach the last umem DMA block. The loop can
therefore write beyond the fixed-size, four-entry iwmr->pgaddrmem array
that is used in the lvl==0 case.

In the case of QP/CQ/SRQ rings, the value of lvl is determined by a
separate input (for example, req.cq_pages in the case of a CQ). So,
we must perform explicit checking to ensure we don't overflow the
pgaddrmem array if the user provides a umem that consists of more
blocks than their provided req.cq_pages.
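
The kind of explicit check described above can be sketched as follows. This is not the actual irdma change: copy_block_addrs() and PGADDR_MEM_ENTRIES are hypothetical stand-ins, and the sketch only assumes a fixed four-entry destination array as in the lvl==0 case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the fixed array used in the lvl==0 case. */
#define PGADDR_MEM_ENTRIES 4

/*
 * Hypothetical stand-in for the copy loop: copy `nblocks` block addresses
 * into `dst`, which holds `capacity` entries.
 * Returns 0 on success, -1 if the source has more blocks than fit.
 */
static int copy_block_addrs(uint64_t *dst, size_t capacity,
			    const uint64_t *blocks, size_t nblocks)
{
	size_t i;

	/* Reject up front instead of relying on a count that may be zero. */
	if (nblocks > capacity)
		return -1;

	for (i = 0; i < nblocks; i++)
		dst[i] = blocks[i];

	return 0;
}
```

Checking the block count against the destination capacity, rather than against a caller-supplied count that may be zero, keeps the bound independent of the user-provided page count.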

Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://patch.msgid.link/r/20260512183852.614045-1-jmoroni@google.com
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-25 10:50:42 -03:00
Linus Torvalds e7ae89a0c9 Linux 7.1-rc5 v7.1-rc5 2026-05-24 13:48:06 -07:00
Shiraz Saleem d28654518c RDMA/mana_ib: Use ib_get_eth_speed for reporting port speed
Replace hardcoded IB_WIDTH_4X/IB_SPEED_EDR with ib_get_eth_speed()
to report the actual link speed in mana_ib_query_port().

Fixes: 4bda1d5332 ("RDMA/mana_ib: Implement port parameters")
Link: https://patch.msgid.link/r/20260512094056.264827-1-kotaranov@linux.microsoft.com
Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-24 17:24:24 -03:00
Rosen Penev 992ad0c012 RDMA/rtrs: Use flexible array for client path stats
Store the client path statistics in the RTRS client path allocation
instead of allocating them separately.

This ties the stats lifetime directly to the path and removes a separate
allocation failure path. Keep freeing the per-CPU stats data separately,
but do not free the embedded stats object from error paths or the stats
kobject release handler.
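
As a rough illustration of the pattern (not the RTRS code itself), the sketch below embeds a stats record in its owner through a flexible array member, so one allocation and one free cover both. All struct, field, and function names here are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stats record. */
struct path_stats {
	unsigned long reconnects;
};

/* Hypothetical path object; the stats live in the same allocation. */
struct clt_path {
	int id;
	struct path_stats stats[]; /* flexible array member */
};

/* Allocate the path and `nstats` zeroed stats records in one call. */
static struct clt_path *clt_path_alloc(int id, size_t nstats)
{
	struct clt_path *p;

	p = calloc(1, sizeof(*p) + nstats * sizeof(p->stats[0]));
	if (!p)
		return NULL; /* single failure path for both objects */

	p->id = id;
	return p;
}
```

A single free() of the path then releases the embedded stats as well, which is why error paths must not free the stats object on its own.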

Link: https://patch.msgid.link/r/20260511041812.378030-1-rosenp@gmail.com
Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-05-24 17:17:10 -03:00
Linus Torvalds 6a97c4d526 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "arm64:

   - Fix ITS EventID sanitisation when restoring an interrupt
     translation table.

   - Fix PPI memory leak when failing to initialise a vcpu.

   - Correctly return an error when the validation of a hypervisor trace
     descriptor fails, and limit this validation to protected mode only.

  RISC-V:

   - Fix invalid HVA warning in steal-time recording

   - Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and
     pmu_snapshot_set_shmem()

   - Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler

   - Fix sign extension of value for MMIO loads

  s390:

   - Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by
     the page table rewrite.

  x86:

   - Apply erratum #1235 workaround (disable AVIC IPI virtualization) on
     Hygon Family 18h, just like on AMD Family 17h.

   - When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM,
     return the VM's configured APIC bus frequency instead of the
     default. This is less confusing (read: not wrong) and makes it
     easier to fill in CPUID information that communicates the APIC bus
     frequency to the guest.

  Selftests:

   - Do not include glibc-internal <bits/endian.h>; it worked by chance
     and broke building KVM selftests with musl"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
  KVM: selftests: Verify that KVM returns the configured APIC cycle length
  KVM: x86: Return the VM's configured APIC bus frequency when queried
  KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h>
  KVM: s390: Properly reset zero bit in PGSTE
  KVM: s390: vsie: Fix redundant rmap entries
  KVM: s390: vsie: Fix unshadowing logic
  KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors
  KVM: s390: vsie: Fix memory leak when unshadowing
  KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc
  KVM: arm64: vgic: Free private_irqs when init fails after allocation
  KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits
  RISC-V: KVM: Fix sign extension for MMIO loads
  RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
  riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM
  riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM
  RISC-V: KVM: Fix invalid HVA warning in steal-time recording
2026-05-24 12:50:36 -07:00
Linus Torvalds 3526d74623 Merge tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:

 - On SEV guests, handle set_memory_{encrypted,decrypted}() failures
   more conservatively by assuming that all affected pages are
   unencrypted (Carlos López)

 - Disable broadcast TLB flush when PCID is disabled (Tom Lendacky)

 - Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra)

 - Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a
   KVM x2apic fix (Peter Zijlstra)

 - Fix incorrect munmap() size on map_vdso() failure (Guilherme Giacomo
   Simoes)

* tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  virt: sev-guest: Explicitly leak pages in unknown state
  x86/mm: Disable broadcast TLB flush when PCID is disabled
  x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()
  x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core
  x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
2026-05-24 11:00:45 -07:00
Linus Torvalds a674bf74b3 Merge tag 'irq-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irqchip driver fixes from Ingo Molnar:

 - Fix the hardware probing error path of the renesas-rzt2h
   irqchip driver

 - Fix the exynos-combiner irqchip driver on -rt kernels
   by turning the IRQ controller spinlock into a raw spinlock

* tag 'irq-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/renesas-rzt2h: Use pm_runtime_put_sync() in probe error path
  irqchip/exynos-combiner: Switch to raw_spinlock
2026-05-24 10:55:21 -07:00
Linus Torvalds ee651da6d3 Merge tag 'core-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fix from Ingo Molnar:

 - Fix debugobjects regression on -rt kernels: don't fill the pool
   (which uses a coarse lock) if ->pi_blocked_on, because that messes up
   the priority inheritance of callers

* tag 'core-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Do not fill_pool() if pi_blocked_on
2026-05-24 10:48:55 -07:00