Commit Graph

2903 Commits

Author SHA1 Message Date
Linus Torvalds cc9f0629ca Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix for a regression in this series for loop and read/write iterator
   handling

 - zone append block update tweak

 - remove a broken IO priority test

 - NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter
2025-05-09 10:34:50 -07:00
Daniel Wagner 650415fca0 nvme: unblock ctrl state transition for firmware update
The original nvme subsystem design didn't have a CONNECTING state; the
state machine allowed transitions from RESETTING to LIVE directly.

With the introduction of nvme fabrics the CONNECTING state was
introduce. Over time the nvme-pci started to use the CONNECTING state as
well.

Eventually, a bug fix for the nvme-fc started to depend that the only
valid transition to LIVE was from CONNECTING. Though this change didn't
update the firmware update handler which was still depending on
RESETTING to LIVE transition.

The simplest way to address it for the time being is to switch into
CONNECTING state before going to LIVE state.

Fixes: d2fe192348 ("nvme: only allow entering LIVE from CONNECTING state")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Closes: https://lore.kernel.org/all/0134ea15-8d5f-41f7-9e9a-d7e6d82accaa@roeck-us.net
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
2025-05-07 09:01:20 +02:00
Linus Torvalds e205ff48fa Merge tag 'block-6.15-20250502' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - fix queue unquiesce check on PCI slot_reset (Keith Busch)
     - fix premature queue removal and I/O failover in nvme-tcp (Michael
       Liang)
     - don't restore null sk_state_change (Alistair Francis)
     - select CONFIG_TLS where needed (Alistair Francis)
     - always free derived key data (Hannes Reinecke)
     - more quirks (Wentao Guan)

 - ublk zero copy fix

 - ublk selftest fix for UBLK_F_NEED_GET_DATA

* tag 'block-6.15-20250502' of git://git.kernel.dk/linux:
  nvmet-auth: always free derived key data
  nvmet-tcp: don't restore null sk_state_change
  nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
  nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
  nvme-tcp: fix premature queue removal and I/O failover
  nvme-pci: add quirks for WDC Blue SN550 15b7:5009
  nvme-pci: add quirks for device 126f:1001
  nvme-pci: fix queue unquiesce check on slot_reset
  ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
  ublk: enhance check for register/unregister io buffer command
  ublk: decouple zero copy from user copy
  selftests: ublk: fix UBLK_F_NEED_GET_DATA
2025-05-02 10:24:37 -07:00
Alistair Francis 521987940a nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.

Fixes: be8e82caa6 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30 08:08:49 -05:00
Michael Liang 77e40bbce9 nvme-tcp: fix premature queue removal and I/O failover
This patch addresses a data corruption issue observed in nvme-tcp during
testing.

In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.

We can reproduce this data corruption by injecting delay on one path and
triggering i/o timeout.

To prevent this issue, this change ensures that all inflight
transmissions are fully completed from host's perspective before
returning from queue stop. To handle concurrent I/O timeout from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.

This aligns with the behavior of queue stopping in other NVMe fabric
transports.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30 08:03:04 -05:00
Wentao Guan ab35ad950d nvme-pci: add quirks for WDC Blue SN550 15b7:5009
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:

 - NVME_QUIRK_NO_DEEPEST_PS:
	liaozw talked to me the problem and solved with
	nvme_core.default_ps_max_latency_us=0, so add the quirk.
	I also found some reports in the following link.

 - NVME_QUIRK_BROKEN_MSI:
	after get the lspci from Jack Rio.
	I think that the disk also have NVME_QUIRK_BROKEN_MSI.
	described in commit d5887dc6b6 ("nvme-pci: Add quirk for broken MSIs")
	as sean said in link which match the MSI 1/32 and MSI-X 17.

Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
        Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
        Memory at fe800000 (64-bit, non-prefetchable) [size=16K]
        Memory at fe804000 (64-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
        Capabilities: [c0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [1b8] Latency Tolerance Reporting
        Capabilities: [300] Secondary PCI Express
        Capabilities: [900] L1 PM Substates
        Kernel driver in use: nvme
dmesg | grep nvme
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[    0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[    0.542430] nvme nvme0: pci function 0000:03:00.0
[    0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[    0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[    0.567764]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[    6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[    6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[    7.125419] Adding 16777212k swap on /dev/nvme0n1p8.  Priority:-2 extents:1 across:16777212k SS
[    7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[    7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[    8.036932] nvme nvme0: using unchecked data buffer
[    8.096023] block nvme0n1: No UUID available providing old NGUID

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6
Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/
Reported-by: liaozw <hedgehog-002@163.com>
Closes: https://bbs.deepin.org.cn/post/286300
Reported-by: rugk <rugk+github@posteo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29 08:06:39 -05:00
Wentao Guan 5b960f92ac nvme-pci: add quirks for device 126f:1001
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:1001].

It is similar to commit e89086c43f ("drivers/nvme: Add quirks for
device 126f:2262")

Diff is according the dmesg, use NVME_QUIRK_IGNORE_DEV_SUBNQN.

dmesg | grep -i nvme0:
  nvme nvme0: pci function 0000:01:00.0
  nvme nvme0: missing or invalid SUBNQN field.
  nvme nvme0: 12/0/0 default/read/poll queues

Link:https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e89086c43f0500bc7c4ce225495b73b8ce234c1f
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29 08:00:34 -05:00
Keith Busch a75401227e nvme-pci: fix queue unquiesce check on slot_reset
A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.

Fixes: 71a5bb153b ("nvme: ensure disabling pairs with unquiesce")
Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29 08:00:34 -05:00
Linus Torvalds f7c2ca2584 Merge tag 'block-6.15-20250417' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - MD pull via Yu:
      - fix raid10 missing discard IO accounting (Yu Kuai)
      - fix bitmap stats for bitmap file (Zheng Qixing)
      - fix oops while reading all member disks failed during
        check/repair (Meir Elisha)

 - NVMe pull via Christoph:
      - fix scan failure for non-ANA multipath controllers (Hannes
        Reinecke)
      - fix multipath sysfs links creation for some cases (Hannes
        Reinecke)
      - PCIe endpoint fixes (Damien Le Moal)
      - use NULL instead of 0 in the auth code (Damien Le Moal)

 - Various ublk fixes:
      - Slew of selftest additions
      - Improvements and fixes for IO cancelation
      - Tweak to Kconfig verbiage

 - Fix for page dirtying for blk integrity mapped pages

 - loop fixes:
      - buffered IO fix
      - uevent fixes
      - request priority inheritance fix

 - Various little fixes

* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
  selftests: ublk: add generic_06 for covering fault inject
  ublk: simplify aborting ublk request
  ublk: remove __ublk_quiesce_dev()
  ublk: improve detection and handling of ublk server exit
  ublk: move device reset into ublk_ch_release()
  ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  ublk: add ublk_force_abort_dev()
  ublk: properly serialize all FETCH_REQs
  selftests: ublk: move creating UBLK_TMP into _prep_test()
  selftests: ublk: add test_stress_05.sh
  selftests: ublk: support user recovery
  selftests: ublk: support target specific command line
  selftests: ublk: increase max nr_queues and queue depth
  selftests: ublk: set queue pthread's cpu affinity
  selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
  selftests: ublk: add two stress tests for zero copy feature
  selftests: ublk: run stress tests in parallel
  selftests: ublk: make sure _add_ublk_dev can return in sub-shell
  selftests: ublk: cleanup backfile automatically
  selftests: ublk: add io_uring uapi header
  ...
2025-04-18 09:21:14 -07:00
Hannes Reinecke 08937bcd4c nvme-multipath: sysfs links may not be created for devices
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be
called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK
had already been set, so when checking this device a second time we will fail
to create the link.

Fix this by exchanging the order of the block device check and the
NVME_NS_SYSFS_ATTR_LINK bit check.

Fixes: 4dbd2b2ebe ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>**
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16 07:37:36 +02:00
Hannes Reinecke 26d7fb4fd4 nvme: fixup scan failure for non-ANA multipath controllers
Commit 62baf70c32 caused the ANA log page to be re-read, even on
controllers that do not support ANA.  While this should generally
harmless, some controllers hang on the unsupported log page and
never finish probing.

Fixes: 62baf70c32 ("nvme: re-read ANA log page after ns scan completes")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Srikanth Aithal <sraithal@amd.com>
[hch: more detailed commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2025-04-16 07:37:36 +02:00
Linus Torvalds ff88562529 Merge tag 'block-6.15-20250411' of git://git.kernel.dk/linux
Pull more block fixes from Jens Axboe:
 "Apparently my internal clock was off, or perhaps it was just wishful
  thinking, but I sent out block fixes yesterday as my brain assumed it
  was Friday. Subsequently, that missed the NVMe fixes that should go
  into this weeks release as well. Hence, here's a followup with those,
  and another simple fix.

   - NVMe pull request via Christoph:
       - nvmet fc/fcloop refcounting fixes (Daniel Wagner)
       - fix missed namespace/ANA scans (Hannes Reinecke)
       - fix a use after free in the new TCP netns support (Kuniyuki
         Iwashima)
       - fix a NULL instead of false review in multipath (Uday Shankar)

   - Use strscpy() for null_blk disk name copy"

* tag 'block-6.15-20250411' of git://git.kernel.dk/linux:
  null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev()
  nvmet-fc: put ref when assoc->del_work is already scheduled
  nvmet-fc: take tgtport reference only once
  nvmet-fc: update tgtport ref per assoc
  nvmet-fc: inline nvmet_fc_free_hostport
  nvmet-fc: inline nvmet_fc_delete_assoc
  nvmet-fcloop: add ref counting to lport
  nvmet-fcloop: replace kref with refcount
  nvmet-fcloop: swap list_add_tail arguments
  nvme-tcp: fix use-after-free of netns by kernel TCP socket.
  nvme: multipath: fix return value of nvme_available_path
  nvme: re-read ANA log page after ns scan completes
  nvme: requeue namespace scan on missed AENs
2025-04-11 08:29:35 -07:00
Kuniyuki Iwashima b013b817f3 nvme-tcp: fix use-after-free of netns by kernel TCP socket.
Commit 1be52169c3 ("nvme-tcp: fix selinux denied when calling
sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue()
to sock_create_kern().

sock_create_kern() creates a kernel socket, which does not hold
a reference to netns.  If the code does not manage the netns
lifetime properly, use-after-free could happen.

Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak
problem: it remains FIN_WAIT_1 if it misses FIN after close()
because tcp_close() stops all timers.

To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade().

We had the same issue in CIFS, SMC, etc, and applied the same
solution, see commit ef7134c7fc ("smb: client: Fix use-after-free
of network namespace.") and commit 9744d2bf19 ("smc: Fix
use-after-free in tcp_write_timer_handler().").

Fixes: 1be52169c3 ("nvme-tcp: fix selinux denied when calling sock_sendmsg")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09 10:43:34 +02:00
Uday Shankar e3105f54a5 nvme: multipath: fix return value of nvme_available_path
The function returns bool so we should return false, not NULL. No
functional changes are expected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08 07:07:18 +02:00
Hannes Reinecke 62baf70c32 nvme: re-read ANA log page after ns scan completes
When scanning for new namespaces we might have missed an ANA AEN.

The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous
Event Information - Notice': Asymmetric Namespace Access Change) states:

  A controller shall not send this even if an Attached Namespace
  Attribute Changed asynchronous event [...] is sent for the same event.

so we need to re-read the ANA log page after we rescanned the namespace
list to update the ANA states of the new namespaces.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08 07:07:14 +02:00
Hannes Reinecke 9546ad1a9b nvme: requeue namespace scan on missed AENs
Scanning for namespaces can take some time, so if the target is
reconfigured while the scan is running we may miss a Attached Namespace
Attribute Changed AEN.

Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has
finished, and requeue scanning to pick up any missed change.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-07 16:26:50 +02:00
Thomas Gleixner 8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Linus Torvalds 949dd321de Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - PCI endpoint target cleanup (Damien)
      - Early import for uring_cmd fixed buffer (Caleb)
      - Multipath documentation and notification improvements (John)
      - Invalid pci sq doorbell write fix (Maurizio)

 - Queue init locking fix

 - Remove dead nsegs parameter from blk_mq_get_new_requests()

* tag 'block-6.15-20250403' of git://git.kernel.dk/linux:
  block: don't grab elevator lock during queue initialization
  nvme-pci: skip nvme_write_sq_db on empty rqlist
  nvme-multipath: change the NVME_MULTIPATH config option
  nvme: update the multipath warning in nvme_init_ns_head
  nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
  nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
  nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
  nvmet: pci-epf: Keep completion queues mapped
  block: remove unused nseg parameter
2025-04-03 16:04:38 -07:00
Linus Torvalds d6b02199cd Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Maurizio Lombardi 288ff0d10b nvme-pci: skip nvme_write_sq_db on empty rqlist
nvme_submit_cmds() should check the rqlist before calling
nvme_write_sq_db(); if the list is empty, it must return immediately.

Fixes: beadf00885 ("nvme-pci: reverse request order in nvme_queue_rqs")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:06 -07:00
John Meneghini 32c928142c nvme-multipath: change the NVME_MULTIPATH config option
Fix up the NVME_MULTIPATH config description so that
it accurately describes what it does.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:05 -07:00
John Meneghini 47f8cc9e32 nvme: update the multipath warning in nvme_init_ns_head
The new NVME_MULTIPATH_PARAM config option requires updates
to the warning message in nvme_init_ns_head().

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:05 -07:00
Caleb Sander Mateos 38808af53c nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
nvme_map_user_request() is called from both nvme_submit_user_cmd() and
nvme_uring_cmd_io(). But the ioucmd branch is only applicable to
nvme_uring_cmd_io(). Move it to nvme_uring_cmd_io() and just pass the
resulting iov_iter to nvme_map_user_request().

For NVMe passthru operations with fixed buffers, the fixed buffer lookup
happens in io_uring_cmd_import_fixed(). But nvme_uring_cmd_io() can
return -EAGAIN first from nvme_alloc_user_request() if all tags in the
tag set are in use. This ordering difference is observable when using
UBLK_U_IO_{,UN}REGISTER_IO_BUF SQEs to modify the fixed buffer table. If
the NVMe passthru operation is followed by UBLK_U_IO_UNREGISTER_IO_BUF
to unregister the fixed buffer and the NVMe passthru goes async, the
fixed buffer lookup will fail because it happens after the unregister.

Userspace should not depend on the order in which io_uring issues SQEs
submitted in parallel, but it may try submitting the SQEs together and
fall back on a slow path if the fixed buffer lookup fails. To make the
fast path more likely, do the import before nvme_alloc_user_request().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:05 -07:00
Caleb Sander Mateos cd683de63e nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
The callers of nvme_map_user_request() (nvme_submit_user_cmd() and
nvme_uring_cmd_io()) allocate the request, so have them free it if
nvme_map_user_request() fails.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:05 -07:00
Caleb Sander Mateos eada75467f nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
The vectorized io_uring NVMe passthru opcodes don't yet support fixed
buffers. But since userspace can trigger this condition based on the
io_uring SQE parameters, it shouldn't cause a kernel warning.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 23fd22e55b ("nvme: wire up fixed buffer support for nvme passthrough")
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01 09:01:05 -07:00
Linus Torvalds a9fc230497 Merge tag 'soc-drivers-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
 "These are the updates for SoC specific drivers and related subsystems:

   - Firmware driver updates for SCMI, FF-A and SMCCC firmware
     interfaces, adding support for additional firmware features
     including SoC identification and FF-A SRI callbacks as well as
     various bugfixes

   - Memory controller updates for Nvidia and Mediatek

   - Reset controller support for microchip sam9x7 and imx8qxp/imx8qm

   - New hardware support for multiple Mediatek, Renesas and Samsung
     Exynos chips

   - Minor updates on Zynq, Qualcomm, Amlogic, TI, Samsung, Nvidia and
     Apple chips

  There will be a follow up with a few more driver updates that are
  still causing build regressions at the moment"

* tag 'soc-drivers-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (97 commits)
  irqchip: Add support for Amlogic A4 and A5 SoCs
  dt-bindings: interrupt-controller: Add support for Amlogic A4 and A5 SoCs
  reset: imx: fix incorrect module device table
  dt-bindings: power: qcom,kpss-acc-v2: add qcom,msm8916-acc compatible
  bus: qcom-ssc-block-bus: Fix the error handling path of qcom_ssc_block_bus_probe()
  bus: qcom-ssc-block-bus: Remove some duplicated iounmap() calls
  soc: qcom: pd-mapper: Add support for SDM630/636
  reset: imx: Add SCU reset driver for i.MX8QXP and i.MX8QM
  dt-bindings: firmware: imx: add property reset-controller
  dt-bindings: reset: atmel,at91sam9260-reset: add sam9x7
  memory: mtk-smi: Add ostd setting for mt8192
  dt-bindings: soc: samsung: exynos-usi: Drop unnecessary status from example
  firmware: tegra: bpmp: Fix typo in bpmp-abi.h
  soc/tegra: pmc: Use str_enable_disable-like helpers
  soc: samsung: include linux/array_size.h where needed
  firmware: arm_scmi: use ioread64() instead of ioread64_hi_lo()
  soc: mediatek: mtk-socinfo: Add extra entry for MT8395AV/ZA Genio 1200
  soc: mediatek: mt8188-mmsys: Add support for DSC on VDO0
  soc: mediatek: mmsys: Migrate all tables to MMSYS_ROUTE() macro
  soc: mediatek: mt8365-mmsys: Fix routing table masks and values
  ...
2025-03-27 09:05:55 -07:00
Linus Torvalds 9b960d8cd6 Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - Fixes for integrity handling

 - NVMe pull request via Keith:
      - Secure concatenation for TCP transport (Hannes)
      - Multipath sysfs visibility (Nilay)
      - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
      - Correct use of 64-bit BARs for pci-epf target (Niklas)
      - Socket fix for selinux when used in containers (Peijie)

 - MD pull request via Yu:
      - fix recovery can preempt resync (Li Nan)
      - fix md-bitmap IO limit (Su Yue)
      - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
      - fix raid1 memory leak (Zheng Qixing)
      - fix mddev uaf (Yu Kuai)
      - fix raid1,raid10 IO flags (Yu Kuai)
      - some refactor and cleanup (Yu Kuai)

 - Series cleaning up and fixing bugs in the bad block handling code

 - Improve support for write failure simulation in null_blk

 - Various lock ordering fixes

 - Fixes for locking for debugfs attributes

 - Various ublk related fixes and improvements

 - Cleanups for blk-rq-qos wait handling

 - blk-throttle fixes

 - Fixes for loop dio and sync handling

 - Fixes and cleanups for the auto-PI code

 - Block side support for hardware encryption keys in blk-crypto

 - Various cleanups and fixes

* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
  nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
  nvme-tcp: fix selinux denied when calling sock_sendmsg
  nvmet: pci-epf: Always configure BAR0 as 64-bit
  nvmet: Remove duplicate uuid_copy
  nvme: zns: Simplify nvme_zone_parse_entry()
  nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
  nvmet-fc: Remove unused functions
  nvme-pci: remove stale comment
  nvme-fc: Utilise min3() to simplify queue count calculation
  nvme-multipath: Add visibility for queue-depth io-policy
  nvme-multipath: Add visibility for numa io-policy
  nvme-multipath: Add visibility for round-robin io-policy
  nvmet: add tls_concat and tls_key debugfs entries
  nvmet-tcp: support secure channel concatenation
  nvmet: Add 'sq' argument to alloc_ctrl_args
  nvme-fabrics: reset admin connection for secure concatenation
  nvme-tcp: request secure channel concatenation
  nvme-keyring: add nvme_tls_psk_refresh()
  nvme: add nvme_auth_derive_tls_psk()
  nvme: add nvme_auth_generate_digest()
  ...
2025-03-26 18:08:55 -07:00
Linus Torvalds 91928e0d3c Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge
  window, there will be others once the net tree has gone in. This
  contains:

   - Cleanup and unification of cancelation handling across various
     request types.

   - Improvement for bundles, supporting them both for incrementally
     consumed buffers, and for non-multishot requests.

   - Enable toggling of using iowait while waiting on io_uring events or
     not. Unfortunately this is still tied with CPU frequency boosting
     on short waits, as the scheduler side has not been very receptive
     to splitting the (useless) iowait stat from the cpufreq implied
     boost.

   - Add support for kbuf nodes, enabling zero-copy support for the ublk
     block driver.

   - Various cleanups for resource node handling.

   - Series greatly cleaning up the legacy provided (non-ring based)
     buffers. For years, we've been pushing the ring provided buffers as
     the way to go, and that is what people have been using. Reduce the
     complexity and code associated with legacy provided buffers.

   - Series cleaning up the compat handling.

   - Series improving and cleaning up the recvmsg/sendmsg iovec and msg
     handling.

   - Series of cleanups for io-wq.

   - Start adding a bunch of selftests. The liburing repository
     generally carries feature and regression tests for everything, but
     at least for ublk initially, we'll try and go the route of having
     it in selftests as well. We'll see how this goes, might decide to
     migrate more tests this way in the future.

   - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-26 17:56:00 -07:00
Peijie Shao 1be52169c3 nvme-tcp: fix selinux denied when calling sock_sendmsg
In a SELinux enabled kernel, socket_create() initializes the security
label of the socket using the security label of the calling process,
this typically works well.

However, in a containerized environment like Kubernetes, problem arises
when a privileged container(domain spc_t) connects to an NVMe target and
mounts the NVMe as persistent storage for unprivileged containers(domain
container_t).

This is because the container_t domain cannot access resources labeled
with spc_t, resulting in socket_sendmsg returning -EACCES.

The solution is to use socket_create_kern() instead of socket_create(),
which labels the socket context to kernel_t.  Access control will then
be handled by the VFS layer rather than the socket itself.

Signed-off-by: Peijie Shao <shaopeijie@cestc.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:56 -07:00
Damien Le Moal 945e82633e nvme: zns: Simplify nvme_zone_parse_entry()
Instead of passing a pointer to a struct nvme_ctrl and a pointer to a
struct nvme_ns_head as the first two arguments of
nvme_zone_parse_entry(), pass only a pointer to a struct nvme_ns as both
the controller structure and ns head structure can be infered from the
namespace structure.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:55 -07:00
Baruch Siach f1b47aed53 nvme-pci: remove stale comment
The ns variable has been removed in commit 62451a2b2e ("nvme: separate
command prep and issue"). Drop reference to ns in comment.

Fixes: 62451a2b2e ("nvme: separate command prep and issue")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:55 -07:00
Qasim Ijaz 978540050a nvme-fc: Utilise min3() to simplify queue count calculation
Refactor nvme_fc_create_io_queues() and nvme_fc_recreate_io_queues() to
use the min3() macro to find the minimum between 3 values instead of
multiple min()'s. This shortens the code and makes it easier to read.

Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:55 -07:00
Nilay Shroff 7cbafa3ff0 nvme-multipath: Add visibility for queue-depth io-policy
This patch helps add nvme native multipath visibility for queue-depth
io-policy. It adds a new attribute file named "queue_depth" under
namespace device path node which would print the number of active/
in-flight I/O requests currently queued for the given path.

For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head block node of the shared namespace
would show the following output:

$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1

In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For queue-depth io-policy we can then refer the "queue_depth"
attribute file created under each namespace path:

$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth
518

$cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth
504

>From the above output, we can infer that I/O workload targeted at nvme1n1
uses two paths nvme1c1n1 and nvme1c3n1 and the current queue depth of each
path is 518 and 504 respectively. Reading "queue_depth" file when
configured io-policy is anything but queue-depth would show no output.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:55 -07:00
Nilay Shroff 6546cc4a56 nvme-multipath: Add visibility for numa io-policy
This patch helps add nvme native multipath visibility for numa io-policy.
It adds a new attribute file named "numa_nodes" under namespace gendisk
device path node which prints the list of numa nodes preferred by the
given namespace path. The numa nodes value is comma delimited list of
nodes or A-B range of nodes.

For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head node of the shared namespace would
show the following output:

$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1

In the above example, nvme1n1 is head gendisk node created for a shared
namespace and this namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For numa io-policy we can then refer the "numa_nodes" attribute
file created under each namespace path:

$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
0-1

$ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes
2-3

>From the above output, we infer that I/O workload targeted at nvme1n1
and running on numa nodes 0 and 1 would prefer using path nvme1c1n1.
Similarly, I/O workload running on numa nodes 2 and 3 would prefer
using path nvme1c3n1. Reading "numa_nodes" file when configured
io-policy is anything but numa would show no output.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:54 -07:00
Nilay Shroff 4dbd2b2ebe nvme-multipath: Add visibility for round-robin io-policy
This patch helps add nvme native multipath visibility for round-robin
io-policy. It creates a "multipath" sysfs directory under head gendisk
device node directory and then from "multipath" directory it adds a link
to each namespace path device the head node refers.

For instance, if we have a shared namespace accessible from two different
controllers/paths then we create a soft link to each path device from head
disk node as shown below:

$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1

In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths.

For round-robin I/O policy, we could easily infer from the above output
that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
and nvme1c3n1.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:54 -07:00
Hannes Reinecke 104d0e2f62 nvme-fabrics: reset admin connection for secure concatenation
When secure concatenation is requested the connection needs to be
reset to enable TLS encryption on the new cnnection.
That implies that the original connection used for the DH-CHAP
negotiation really shouldn't be used, and we should reset as soon
as the DH-CHAP negotiation has succeeded on the admin queue.

Based on an idea from Sagi.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:54 -07:00
Hannes Reinecke e88a7595b5 nvme-tcp: request secure channel concatenation
Add a fabrics option 'concat' to request secure channel concatenation as
specified the NVME Base Specification v2.1, section 8.3.4.3: Secure Channel
Concatenation.
When secure channel concatenation is enabled a 'generated PSK' is inserted
into the keyring such that it's available after reset.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:54 -07:00
Hannes Reinecke 62eb89323c nvme-keyring: add nvme_tls_psk_refresh()
Add a function to refresh a generated PSK in the specified keyring.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 16:53:54 -07:00
Easwar Hariharan 6cf828d2fb nvme: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)

Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-11-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 12:16:58 -07:00
Jens Axboe a9381351dd Merge tag 'nvme-6.14-2025-03-13' of git://git.infradead.org/nvme into block-6.14
Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.14

 - Concurrent pci error and hotplug handling fix (Keith)
 - Endpoint function fixes (Damien)"

* tag 'nvme-6.14-2025-03-13' of git://git.infradead.org/nvme:
  nvmet: pci-epf: Do not add an IRQ vector if not needed
  nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created
  nvme-pci: fix stuck reset on concurrent DPC and HP
2025-03-13 09:41:57 -06:00
Anuj Gupta 75618ac6e9 block: remove unused parameter 'q' parameter in __blk_rq_map_sg()
request_queue param is no longer used by blk_rq_map_sg and
__blk_rq_map_sg. Remove it.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13 05:46:19 -06:00
Shin'ichiro Kawasaki 9bce6b5f89 block: change blk_mq_add_to_batch() third argument type to bool
Commit 1f47ed294a ("block: cleanup and fix batch completion adding
conditions") modified the evaluation criteria for the third argument,
'ioerror', in the blk_mq_add_to_batch() function. Initially, the
function had checked if 'ioerror' equals zero. Following the commit, it
started checking for negative error values, with the presumption that
such values, for instance -EIO, would be passed in.

However, blk_mq_add_to_batch() callers do not pass negative error
values. Instead, they pass status codes defined in various ways:

- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t

These codes are either zero or positive, therefore the revised check
fails to function as intended. Specifically, with the NVMe PCI driver,
this modification led to the failure of the blktests test case nvme/039.
In this test scenario, errors are artificially injected to the NVMe
driver, resulting in positive NVMe status codes passed to
blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a
batch. Hence the failure.

To correct the ioerror check within blk_mq_add_to_batch(), make all
callers to uniformly pass the argument as boolean. Modify the callers to
check their specific status codes and pass the boolean value 'is_error'.
Also describe the arguments of blK_mq_add_to_batch as kerneldoc.

Fixes: 1f47ed294a ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com
[axboe: fold in documentation update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-12 08:26:36 -06:00
Shin'ichiro Kawasaki e5c2bcc0cd nvme: move error logging from nvme_end_req() to __nvme_end_req()
Before the Commit 1f47ed294a ("block: cleanup and fix batch completion
adding conditions"), blk_mq_add_to_batch() did not add failed
passthrough requests to batch, and returned false. After the commit,
blk_mq_add_to_batch() always adds passthrough requests to batch
regardless of whether the request failed or not, and returns true. This
affected error logging feature in the NVME driver.

Before the commit, the call chain of failed passthrough request was as
follows:

nvme_handle_cqe()
 blk_mq_add_to_batch() .. false is returned, then call nvme_pci_complete_rq()
 nvme_pci_complete_rq()
  nvme_complete_rq()
   nvme_end_req()
    nvme_log_err_passthru() .. error logging
    __nvme_end_req()        .. end of the rqeuest

After the commit, the call chain is as follows:

nvme_handle_cqe()
 blk_mq_add_to_batch() .. true is returned, then set nvme_pci_complete_batch()
 ..
 nvme_pci_complete_batch()
  nvme_complete_batch()
   nvme_complete_batch_req()
    __nvme_end_req() .. end of the request, without error logging

To make the error logging feature work again for passthrough requests, move the
nvme_log_err_passthru() call from nvme_end_req() to __nvme_end_req().

While at it, move nvme_log_error() call for non-passthrough requests together
with nvme_log_err_passthru(). Even though the trigger commit does not affect
non-passthrough requests, move it together for code simplicity.

Fixes: 1f47ed294a ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250311104359.1767728-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-11 07:49:35 -06:00
Keith Busch 3f674e7b67 nvme-pci: fix stuck reset on concurrent DPC and HP
The PCIe error handling has the nvme driver quiesce the device, attempt
to restart it, then wait for that restart to complete.

A PCIe DPC event also toggles the PCIe link. If the slot doesn't have
out-of-band presence detection, this will trigger a pciehp
re-enumeration.

The error handling that calls nvme_error_resume is holding the device
lock while this happens. This lock blocks pciehp's request to disconnect
the driver from proceeding.

Meanwhile the nvme's reset can't make forward progress because its
device isn't there anymore with outstanding IO, and the timeout handler
won't do anything to fix it because the device is undergoing error
handling.

End result: deadlocked.

Fix this by having the timeout handler short cut the disabling for a
disconnected PCIe device. The downside is that we're relying on an IO
timeout to clean up this mess, which could be a minute by default.

Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-10 09:15:48 -07:00
Arnd Bergmann c339956727 Merge tag 'asahi-soc-rtkit-6.15' of https://github.com/AsahiLinux/linux into soc/drivers
Apple SoC RTKit IPC library updates for 6.15:

- Additional logging for errors
- A few minor improvements and bugfixes required for drivers that are
  yet to be upstreamed

* tag 'asahi-soc-rtkit-6.15' of https://github.com/AsahiLinux/linux:
  soc: apple: rtkit: Cut syslog messages after the first '\0'
  soc: apple: rtkit: Use high prio work queue
  soc: apple: rtkit: Implement OSLog buffers properly
  soc: apple: rtkit: Add and use PWR_STATE_INIT instead of _ON
  soc: apple: rtkit: Fix use-after-free in apple_rtkit_crashlog_rx()
  soc: apple: rtkit: Pass the crashlog to the crashed() callback
  soc: apple: rtkit: Check & log more failures

Link: https://lore.kernel.org/r/20250302113842.58092-1-sven@svenpeter.dev
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-03-06 17:15:03 +01:00
Dan Carpenter 528361c499 nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
The kernel_recvmsg() function returns an int which could be either
negative error codes or the number of bytes received.  The problem is
that the condition:

        if (ret < sizeof(*icresp)) {

is type promoted to type unsigned long and negative values are treated
as high positive values which is success, when they should be treated as
failure.  Handle invalid positive returns separately from negative
error codes to avoid this problem.

Fixes: 578539e096 ("nvme-tcp: fix connect failure on receiving partial ICResp PDU")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-05 10:37:01 -08:00
Xinyu Zhang 99fde895ff nvme: map uring_cmd data even if address is 0
When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than userspace address.
Therefore it can be 0.

We can skip checking whether the address is NULL before mapping
uring_cmd data. Bad userspace address will be handled properly later when
the user buffer is imported.

With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:31 -07:00
Maurizio Lombardi ad95bab0cd nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme_tcp_recv_pdu() doesn't check the validity of the header length.
When header digests are enabled, a target might send a packet with an
invalid header length (e.g. 255), causing nvme_tcp_verify_hdgst()
to access memory outside the allocated area and cause memory corruptions
by overwriting it with the calculated digest.

Fix this by rejecting packets with an unexpected header length.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 18:19:08 -08:00
Maurizio Lombardi afb41b08c4 nvme-tcp: Fix a C2HTermReq error message
In H2CTermReq, a FES with value 0x05 means "R2T Limit Exceeded"; but
in C2HTermReq the same value has a different meaning (Data Transfer Limit
Exceeded).

Fixes: 84e009042d ("nvme-tcp: add basic support for the C2HTermReq PDU")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 15:13:23 -08:00
Pavel Begunkov 69d483d5f4 io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
io_uring_cmd_import_fixed() will need to know the io_uring execution
state in following commits, for now just pass issue_flags into it
without actually using.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00