Commit Graph

1444597 Commits

Author SHA1 Message Date
Bryam Vargas 2dc0bfd2fe partitions: aix: bound the pp_count scan to the ppe array
aix_partition() reads the physical volume descriptor into a fixed-size
struct pvd and then scans its physical-partition-extent array:

	int numpps = be16_to_cpu(pvd->pp_count);
	...
	for (i = 0; i < numpps; i += 1) {
		struct ppe *p = pvd->ppe + i;
		...
		lp_ix = be16_to_cpu(p->lp_ix);

pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a
fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the
on-disk pp_count.  pp_count is an unvalidated __be16 read straight from
the descriptor, so a crafted AIX image with pp_count larger than 1016
drives the loop to read pvd->ppe[i] past the end of the allocation (up
to 65535 entries, ~2 MB out of bounds).

The partition scan runs without mounting anything, when a block device
with a crafted AIX/IBM partition table appears (an attacker-supplied
image attached with losetup -P, or a device auto-scanned by udev), via
msdos_partition() -> aix_partition().

Clamp the scan to the number of entries the ppe[] array can hold.
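
A minimal user-space sketch of the clamp described above, not the
actual patch. The struct layouts and the names pvd_sketch, ppe_sketch,
ppe_scan_limit() and count_mapped() are hypothetical; only the 1016-entry
capacity and the untrusted 16-bit pp_count come from the message.

```c
#include <stddef.h>
#include <stdint.h>

#define PPE_SLOTS 1016	/* capacity of ppe[] as stated in the message */

/* Hypothetical stand-ins for the on-disk layout; field sizes are illustrative. */
struct ppe_sketch {
	uint16_t lv_ix;
	uint16_t lp_ix;
};

struct pvd_sketch {
	uint16_t pp_count;	/* untrusted; assumed already in CPU byte order */
	struct ppe_sketch ppe[PPE_SLOTS];
};

/* Return how many ppe[] entries may be scanned without leaving the array. */
static size_t ppe_scan_limit(const struct pvd_sketch *pvd)
{
	size_t slots = sizeof(pvd->ppe) / sizeof(pvd->ppe[0]);
	size_t claimed = pvd->pp_count;

	return claimed < slots ? claimed : slots;
}

/* Count entries with a non-zero lp_ix, bounded by the clamped limit. */
static size_t count_mapped(const struct pvd_sketch *pvd)
{
	size_t limit = ppe_scan_limit(pvd);
	size_t mapped = 0;

	for (size_t i = 0; i < limit; i++)
		if (pvd->ppe[i].lp_ix != 0)
			mapped++;
	return mapped;
}
```

With pp_count set to 65535 the sketch scans only the 1016 entries that
exist, so the loop index never leaves the allocation.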

Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-08 07:41:21 -06:00
Bart Van Assche 5f0777166e block: Enable lock context analysis
Now that all block/*.c files have been annotated, enable lock context
analysis for all these source files.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/e248ca3aeead238bbc489cf3afdafcbff9e41faf.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche f10b2de2af block/mq-deadline: Make the lock context annotations compatible with Clang
While sparse ignores the __acquires() and __releases() arguments, Clang
verifies these. Make the arguments of __acquires() and __releases()
acceptable for Clang.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/3b6e336ced91e27213608ffce205ccd24f4ba285.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche b4591b9152 block/Kyber: Make the lock context annotations compatible with Clang
While sparse ignores the __acquires() and __releases() arguments, Clang
verifies these. Make the arguments of __acquires() and __releases()
acceptable for Clang.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/91cb8c790fc8b26b8aa742569fbf8c2c1d099dac.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 131f14125a block/blk-mq-debugfs: Improve lock context annotations
Make the existing lock context annotations compatible with Clang. Add
the lock context annotations that are missing.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/f58fe220ff98f9dfddfed4573f40005c773b7fb7.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche a255026594 block/blk-iocost: Inline iocg_lock() and iocg_unlock()
Both iocg_lock() and iocg_unlock() use conditional locking. Fold these
functions into their callers such that unlocking becomes unconditional.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/f8c9867788957d2e40a32e23c6d9b866e480ad9d.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 1ff85a3879 block/blk-iocost: Split ioc_rqos_throttle()
Prepare for inlining iocg_lock() and iocg_unlock() by moving the code
between these two calls into a new function. No functionality has been
changed.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/a6d3ed953cef6669d23a80923bf46600733cbdae.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 73bb2480e3 block/crypto: Annotate the crypto functions
Add the lock context annotations required for Clang's thread-safety
analysis.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/297b40e43a7f9b7d20e91a6c44b41a69d01f5c63.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 998cda78d4 block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen()
The blkg_conf_open_bdev_frozen() calling convention is not compatible
with lock context annotations. Fold both blkg_conf_open_bdev_frozen()
and blkg_conf_close_bdev_frozen() into their only caller. This patch
prepares for enabling lock context analysis.

The type of 'memflags' has been changed from unsigned long into unsigned
int to match the type of current->flags. See also <linux/sched.h>.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 6a7717a2df block/blk-iocost: Combine two error paths in ioc_qos_write()
Reduce code duplication by combining two error paths. No functionality
has been changed.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/80d4fc1ecd5eaf187c0a31c63a1033a7326d4c7e.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 9865e41664 block/cgroup: Improve lock context annotations
Add lock context annotations where these are missing. Move the
blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible
to all blkg_conf_prep() callers.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/58ddd6e2b960bdfa03d0007984386bc0ba351391.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche c574c3cc36 block/cgroup: Split blkg_conf_exit()
Split blkg_conf_exit() into blkg_conf_unprep() and blkg_conf_close_bdev()
because blkg_conf_exit() is not compatible with the Clang thread-safety
annotations. Remove blkg_conf_exit(). Rename blkg_conf_exit_frozen() into
blkg_conf_close_bdev_frozen(). Add thread-safety annotations to the new
functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/c1ec1f1c4b675bc5f187f77b3e6436234c6b244c.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche ea4f575e72 block/cgroup: Split blkg_conf_prep()
Move the blkg_conf_open_bdev() call out of blkg_conf_prep() to make it
possible to add lock context annotations to blkg_conf_prep(). Change an
if-statement in blkg_conf_open_bdev() into a WARN_ON_ONCE() call. Export
blkg_conf_open_bdev() because it is called by the BFQ I/O scheduler and
the BFQ I/O scheduler may be built as a kernel module.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/e6ea0387f413217c8561a0ca54ce7b846aa5c7c5.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 3033c86fa1 block/bdev: Annotate the blk_holder_ops callback functions
The four callback functions in blk_holder_ops all release the
bd_holder_lock. Annotate these functions accordingly.

Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/be51cf81110f691ebd5868ac2f15ceb847805bc8.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Bart Van Assche 08d912bc44 block: Annotate the queue limits functions
Let the thread-safety checker verify whether every start of a queue
limits update is followed by a call to a function that finishes a queue
limits update.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/8f71062b6d0fcf2b80bc8cda701c453224755439.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 13:41:11 -06:00
Marco Crivellari 7e712f292e block: Add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05 11:21:39 -06:00
Jens Axboe ed60c09f29 Merge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into for-7.2/block
Pull NVMe updates from Keith:

"- Per-controller timeouts
 - Multipath telemetry
 - Namespace format validation
 - Various other fixes"

* tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits)
  nvme: export controller reconnect event count via sysfs
  nvme: export controller reset event count via sysfs
  nvme: export I/O failure count when no path is available via sysfs
  nvme: export I/O requeue count when no path is usable via sysfs
  nvme: export command error counters via sysfs
  nvme: export multipath failover count via sysfs
  nvme: export command retry count via sysfs
  nvme: add diag attribute group under sysfs
  nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
  nvme-tcp: move nvme_tcp_reclassify_socket()
  nvme: validate FDP configuration descriptor sizes
  nvmet-auth: validate reply message payload bounds against transfer length
  nvme: refresh multipath head zoned limits from path limits
  nvme: fix FDP fdpcidx bounds check
  nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
  nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
  nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  nvme-multipath: require exact iopolicy names for module parameter
  nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
  nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools
  ...
2026-06-05 05:18:58 -06:00
Nilay Shroff 3c8c284dfc nvme: export controller reconnect event count via sysfs
When an NVMe-oF link goes down, the driver attempts to recover the
connection by repeatedly reconnecting to the remote controller at
configured intervals. A maximum number of reconnect attempts is also
configured, after which recovery stops and the controller is removed
if the connection cannot be re-established.

The driver maintains a counter, nr_reconnects, which is incremented on
each reconnect attempt. However, when a reconnect succeeds, this counter
is reset to zero. Moreover, the counter is currently only reported via
kernel log messages and is not exposed to userspace. Since dmesg is a
circular buffer, this information may be lost over time.

So introduce a new accumulator which accumulates the reconnect attempts,
and expose this accumulator per fabrics controller via a new sysfs
attribute, reconnect_count, under the diag attribute group to provide
persistent visibility into the number of reconnect attempts made by the
host. This information can help users diagnose unstable links or
connectivity issues. Furthermore, this sysfs attribute is also writable,
so the user may reset it to zero if needed.
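
The difference between the per-episode counter and the accumulator can
be sketched in user space as follows. This is an illustration under
assumptions, not the driver code: struct ctrl_sketch and the three helper
functions are hypothetical names, and only nr_reconnects and
reconnect_count are taken from the message.

```c
/* Hypothetical controller state holding both counters. */
struct ctrl_sketch {
	unsigned int nr_reconnects;	/* resets after a successful reconnect */
	unsigned long reconnect_count;	/* accumulates until explicitly reset */
};

/* Called for every reconnect attempt: both counters advance. */
static void reconnect_attempt(struct ctrl_sketch *c)
{
	c->nr_reconnects++;
	c->reconnect_count++;
}

/* Called when a reconnect succeeds: only the per-episode counter resets. */
static void reconnect_succeeded(struct ctrl_sketch *c)
{
	c->nr_reconnects = 0;
}

/* Models a userspace write that clears the accumulator. */
static void reconnect_count_reset(struct ctrl_sketch *c)
{
	c->reconnect_count = 0;
}
```

After two failed attempts, one success, and one further attempt, the
sketch leaves nr_reconnects at 1 while reconnect_count reads 3.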

The reconnect_count can also be consumed by monitoring tools such as
nvme-top to improve controller-level observability.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:40 -07:00
Nilay Shroff 29aafaaf58 nvme: export controller reset event count via sysfs
The NVMe controller transitions into the RESETTING state during error
recovery, link instability, firmware activation, or when a reset is
explicitly triggered by the user.

Expose a per-ctrl sysfs attribute, reset_count, under the diag attribute
group to provide visibility into these RESETTING state transitions.
Observing the frequency of reset events can help users identify issues
such as PCIe errors or unstable fabric links. This counter is also
writable, allowing the user to reset its value if needed.

This counter can also be consumed by monitoring tools such as nvme-top
to improve controller-level observability.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:36 -07:00
Nilay Shroff a8e434cb03 nvme: export I/O failure count when no path is available via sysfs
When I/O is submitted to the NVMe namespace head and no available path
can handle the request, the driver fails the I/O immediately. Currently,
such failures are only reported via kernel log messages, which may be
lost over time since dmesg is a circular buffer.

Add a new ns-head sysfs counter, io_fail_no_available_path_count, under
the diag attribute group to expose the number of I/Os that failed due to
the absence of an available path. This provides persistent visibility
into path-related I/O failures and can help users diagnose the cause of
I/O errors. This counter is also writable, so the user may reset its
value if needed.

This counter can also be consumed by monitoring tools such as nvme-top.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:32 -07:00
Nilay Shroff 76b5e1591e nvme: export I/O requeue count when no path is usable via sysfs
When the NVMe namespace head determines that there is no currently
available path to handle I/O (for example, while a controller is
resetting/connecting or due to a transient link failure), incoming
I/Os are added to the requeue list.

Currently, there is no visibility into how many I/Os have been requeued
in this situation. Add a new ns-head sysfs counter,
io_requeue_no_usable_path_count, under the diag attribute group to expose
the number of I/Os that were requeued due to the absence of an available
path. This counter is also writable, allowing the user to reset it if
needed.

This statistic can help users understand I/O slowdowns or stalls caused
by temporary path unavailability, and can be consumed by monitoring
tools such as nvme-top for real-time observability.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:28 -07:00
Nilay Shroff 30ab37a128 nvme: export command error counters via sysfs
When an NVMe command completes with an error status, the driver
logs the error to the kernel log. However, these messages may be
lost or overwritten over time since dmesg is a circular buffer.

Expose a per-path and per-ctrl sysfs attribute, command_error_count,
under the diag attribute group to provide persistent visibility into
error occurrences. This allows users to observe the total number of
commands that have failed on a given path over time, which can be useful
for diagnosing path health and stability.

This attribute is both readable and writable, allowing the user to reset
these counters. These counters can also be consumed by observability
tools such as nvme-top to provide additional insight into NVMe error
behavior.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:25 -07:00
Nilay Shroff 66ee95b3d4 nvme: export multipath failover count via sysfs
When an NVMe command completes with a path-specific error, the NVMe
driver may retry the command on an alternate controller or path if one
is available. These failover events indicate that I/O was redirected
away from the original path.

Currently, the number of times requests are failed over to another
available path is not visible to userspace. Exposing this information
can be useful for diagnosing path health and stability.

Export a per-path sysfs attribute, "multipath_failover_count", under the
diag attribute group. This attribute is both readable and writable,
allowing the user to reset the counter. This counter can be consumed by
monitoring tools such as nvme-top to help identify paths that
consistently trigger failovers under load.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:19 -07:00
Nilay Shroff ab5af2903b nvme: export command retry count via sysfs
When Advanced Command Retry Enable (ACRE) is configured, a controller
may interrupt command execution and return a completion status
indicating command interrupted with the DNR bit cleared. In this case,
the driver retries the command based on the Command Retry Delay (CRD)
value provided in the completion status.

Currently, these command retries are handled entirely within the NVMe
driver and are not visible to userspace. As a result, there is no
observability into retry behavior, which can be a useful diagnostic
signal.

Expose a per-namespace sysfs attribute command_retries_count, under
diag attribute group to provide visibility into retry activity. This
information can help identify controller-side congestion under load
and enables comparison across paths in multipath setups (for example,
detecting cases where one path experiences significantly more retries
than another under identical workloads).

This exported metric is intended for diagnostics and monitoring tools
such as nvme-top, and does not change command retry behavior. A new
sysfs attribute named "command_retries_count" is added for this purpose.
This attribute is both readable and writable, so the user can reset this
counter if needed.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:13 -07:00
Nilay Shroff 37afebc79a nvme: add diag attribute group under sysfs
Add a new diag attribute group under:
/sys/class/nvme/<ctrl>/
/sys/block/<nvme-path-dev>/
/sys/block/<ns-head-dev>/

This new sysfs attribute group will be used to organize NVMe diagnostic
and telemetry-related counters under it.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:57:06 -07:00
Shin'ichiro Kawasaki 19bdb70c77 nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
When NVMe-TCP controller setup and teardown are repeated with lockdep
enabled, lockdep reports a false-positive WARN for the following locks:

  1) &q->elevator_lock        : IO scheduler change context
  2) &q->q_usage_counter(io)  : SCSI disk probe context
  3) fs_reclaim               : CPU hotplug bring-up context
  4) cpu_hotplug_lock         : socket establishment context
  5) sk_lock-AF_INET-NVME     : MQ sched dispatch context for the socket
  6) set->srcu                : NVMe controller delete context

The lockdep WARN was observed by running blktests test case nvme/005 for
tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for
the details of the WARN.

This is a false positive because lockdep confuses lock 4) (socket
establishment) with lock 5) (socket in use) for different socket
instances. The locks belong to different sockets, but lockdep treats
them as the same due to shared static lockdep keys.

Fix this by using dynamically allocated lockdep keys per socket instance
instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add
nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue
and pass them to sock_lock_init_class_and_name() for proper lockdep
tracking. Change the argument of nvme_tcp_reclassify_socket() from
'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket
and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue()
and nvme_tcp_free_queue() to register and unregister the dynamic keys.
Additionally, move nvme_tcp_reclassify_socket() inside these guards since
it's only needed when lockdep is enabled.

Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-04 01:19:08 -07:00
Shin'ichiro Kawasaki 2caaa52c1a nvme-tcp: move nvme_tcp_reclassify_socket()
Move nvme_tcp_reclassify_socket() in tcp.c after the struct
nvme_tcp_queue definition. This is preparation for adding a reference
to struct nvme_tcp_queue in the function, which would otherwise cause a
compile failure due to the struct being defined after the function.

Move the entire CONFIG_DEBUG_LOCK_ALLOC block along with the function
to maintain the code organization.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-03 02:43:54 -07:00
liuxixin 0ef4daa653 nvme: validate FDP configuration descriptor sizes
Validate descriptor sizes while walking the FDP configurations log so
dsze == 0 or a descriptor past the log end cannot cause unbounded
iteration or reads past the buffer.
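
The two failure modes named above (a zero size that never advances, and
a size that points past the log) can be illustrated with a small
user-space walker. This is a sketch under assumptions: the layout in
which each descriptor begins with a 16-bit little-endian size, and the
names DESC_HDR_LEN, read_le16() and find_desc(), are hypothetical and
are not the driver's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: every descriptor starts with a 16-bit LE size field. */
#define DESC_HDR_LEN 2

static size_t read_le16(const uint8_t *p)
{
	return (size_t)p[0] | ((size_t)p[1] << 8);
}

/*
 * Return the offset of descriptor number idx inside log[0..len), or -1 if
 * the log is malformed: a size field smaller than the header (including
 * zero), or a descriptor that would extend past the end of the buffer.
 */
static long find_desc(const uint8_t *log, size_t len, unsigned int idx)
{
	size_t off = 0;

	for (unsigned int i = 0; ; i++) {
		size_t dsze;

		if (len - off < DESC_HDR_LEN)
			return -1;	/* no room left for a size field */
		dsze = read_le16(log + off);
		if (dsze < DESC_HDR_LEN || dsze > len - off)
			return -1;	/* would never advance, or would overrun */
		if (i == idx)
			return (long)off;
		off += dsze;	/* off <= len holds because dsze <= len - off */
	}
}
```

Because every accepted descriptor advances the offset by at least the
header length and never beyond the buffer, the walk always terminates.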

Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-03 02:42:07 -07:00
Tianchu Chen 3a413ece25 nvmet-auth: validate reply message payload bounds against transfer length
nvmet_auth_reply() accesses the variable-length rval[] array using
attacker-controlled hl (hash length) and dhvlen (DH value length) fields
without verifying they fit within the allocated buffer of tl bytes.

A malicious NVMe-oF initiator can craft a DHCHAP_REPLY message with a
small transfer length but large hl/dhvlen values, causing out-of-bounds
heap reads when the target processes the DH public key (rval + 2*hl) or
performs the host response memcmp.

With DH authentication configured, the OOB pointer is passed directly to
sg_init_one() and read by crypto_kpp_compute_shared_secret(), reaching
up to 526 bytes past the buffer. This is exploitable pre-authentication.

Add bounds validation ensuring sizeof(*data) + 2*hl + dhvlen <= tl before
any access to the variable-length fields.
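
The inequality above can be sketched as a standalone check. The header
struct reply_hdr_sketch and the function reply_fits() are hypothetical
stand-ins, and the real message header is larger; only the relation
sizeof(*data) + 2*hl + dhvlen <= tl is taken from the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed header preceding the variable-length payload. */
struct reply_hdr_sketch {
	uint8_t  hl;		/* hash length, peer-controlled */
	uint8_t  reserved;
	uint16_t dhvlen;	/* DH value length, peer-controlled */
};

/*
 * True when the header plus 2*hl + dhvlen payload bytes fit inside a
 * transfer of tl bytes. The sum is computed in size_t, so the narrow
 * input types cannot make it wrap.
 */
static bool reply_fits(size_t tl, uint8_t hl, uint16_t dhvlen)
{
	size_t need = sizeof(struct reply_hdr_sketch) + 2 * (size_t)hl + dhvlen;

	return need <= tl;
}
```

A caller would run such a check once, before computing any pointer into
the variable-length area, and reject the message when it fails.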

Discovered by Atuin - Automated Vulnerability Discovery Engine.

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Tianchu Chen <flynnnchen@tencent.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-03 02:40:33 -07:00
Thorsten Blum 3f1eccd372 n64cart: use strscpy in n64cart_probe
strcpy() has been deprecated [1] because it performs no bounds checking
on the destination buffer, which can lead to buffer overflows. While the
current code works correctly, replace strcpy() with the safer strscpy()
to follow secure coding best practices.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260517172617.3954-2-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02 17:43:32 -06:00
Thorsten Blum aa528cd12c block/partitions/acorn: use min in {riscix,linux}_partition
Use min() to replace the open-coded implementations and to simplify
riscix_partition() and linux_partition().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260602160757.973736-3-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02 11:14:52 -06:00
Yu Kuai 6636e16e60 block, bfq: release cgroup stats with bfq_group
BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
but the old free path destroys them from bfq_pd_free(), which is tied
to blkg policy-data teardown.

That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
while bfq_queue entities refer to it, so bfq_pd_free() can drop the
policy-data reference while other bfq_group references still exist. The
following blkcg change also defers policy-data release through RCU and
leaves BFQ to run the final bfqg_put() from an RCU callback. For that
conversion, stats teardown must belong to the last bfq_group put, not to
policy-data teardown.

Move stats teardown to bfqg_put() so the embedded counters are destroyed
exactly when the last bfq_group reference is released, before kfree(bfqg).

Without this preparatory change, the RCU-delayed policy-data free
conversion reproduced the following KASAN report:

  BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
  Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535

  CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT  ea13f83d4b74a12510d20db4a7d9a0fe8275f05c
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x54/0x70
   print_address_description+0x77/0x200
   ? percpu_counter_destroy_many+0xf1/0x2e0
   print_report+0x64/0x70
   kasan_report+0x118/0x150
   ? percpu_counter_destroy_many+0xf1/0x2e0
   percpu_counter_destroy_many+0xf1/0x2e0
   __mmdrop+0x1d8/0x350
   finish_task_switch+0x3f5/0x570
   __schedule+0xe8e/0x18a0
   schedule+0xfe/0x1c0
   schedule_timeout+0x7f/0x1d0
   __wait_for_common+0x26c/0x3f0
   wait_for_completion_state+0x21/0x40
   call_usermodehelper_exec+0x271/0x2c0
   __request_module+0x296/0x410
   elv_iosched_store+0x1bc/0x2c0
   queue_attr_store+0x152/0x1c0
   kernfs_fop_write_iter+0x1d7/0x280
   vfs_write+0x580/0x630
   ksys_write+0xec/0x190
   do_syscall_64+0x156/0x490
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Allocated by task 535:
   kasan_save_track+0x3e/0x80
   __kasan_kmalloc+0x72/0x90
   bfq_pd_alloc+0x60/0x100 [bfq]
   blkg_create+0x3bb/0xbe0
   blkg_lookup_create+0x3a2/0x460
   blkg_conf_start+0x24a/0x2d0
   bfq_io_set_weight+0x17f/0x430 [bfq]
   cgroup_file_write+0x1c5/0x4b0
   kernfs_fop_write_iter+0x1d7/0x280
   vfs_write+0x580/0x630
   ksys_write+0xec/0x190
   do_syscall_64+0x156/0x490
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Freed by task 0:
   kasan_save_track+0x3e/0x80
   kasan_save_free_info+0x46/0x50
   __kasan_slab_free+0x3a/0x60
   kfree+0x14e/0x4f0
   rcu_core+0x6f3/0xcd0
   handle_softirqs+0x1a0/0x550
   __irq_exit_rcu+0x8c/0x150
   irq_exit_rcu+0xe/0x20
   sysvec_apic_timer_interrupt+0x6e/0x80
   asm_sysvec_apic_timer_interrupt+0x1a/0x20

  Last potentially related work creation:
   kasan_save_stack+0x3e/0x60
   kasan_record_aux_stack+0x99/0xb0
   call_rcu+0x55/0x5c0
   blkg_free_workfn+0x130/0x220
   process_scheduled_works+0x655/0xb60
   worker_thread+0x446/0x600
   kthread+0x1f4/0x230
   ret_from_fork+0x259/0x420
   ret_from_fork_asm+0x1a/0x30

Signed-off-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-02 07:24:58 -06:00
Yao Sang 59c0517123 nvme: refresh multipath head zoned limits from path limits
queue_limits_stack_bdev() updates the multipath head limits from the
path queue, but it does not propagate max_open_zones or
max_active_zones. As a result, a zoned multipath namespace head can
keep stale 0/0 values even after a ready path reports finite zoned
resource limits.

When refreshing the head limits in nvme_update_ns_info(), stack the
zoned resource limits directly after stacking the path queue limits.
Use min_not_zero() so the block layer's 0 value keeps its "no limit"
meaning while finite limits are combined conservatively.
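
A user-space sketch of that combining rule follows. min_not_zero_uint()
is a plain-function analogue of the kernel's min_not_zero() macro, and
struct zoned_limits_sketch and stack_zoned_limits() are hypothetical
names used only for illustration.

```c
/* Analogue of min_not_zero(): 0 means "no limit", so it never wins. */
static unsigned int min_not_zero_uint(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Hypothetical subset of the queue limits relevant to zoned devices. */
struct zoned_limits_sketch {
	unsigned int max_open_zones;
	unsigned int max_active_zones;
};

/* Fold one path's zoned limits into the multipath head's limits. */
static void stack_zoned_limits(struct zoned_limits_sketch *head,
			       const struct zoned_limits_sketch *path)
{
	head->max_open_zones = min_not_zero_uint(head->max_open_zones,
						 path->max_open_zones);
	head->max_active_zones = min_not_zero_uint(head->max_active_zones,
						   path->max_active_zones);
}
```

A head that starts at 0/0 picks up the first path's finite limits, and a
later path with a smaller finite limit tightens them further.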

This avoids advertising "no limit" on the multipath head while keeping
the zoned-limit handling local to the NVMe multipath update path.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 05:30:59 -07:00
liuxixin 0967074f68 nvme: fix FDP fdpcidx bounds check
The fdpcidx bounds check sets n = NUMFDPC + 1 but uses > instead of >=,
incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1).
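
The off-by-one can be shown with two small predicates. Both function
names are hypothetical and the code is an illustration of the comparison
only, assuming valid indices run from 0 to n - 1 with n = NUMFDPC + 1 as
stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old comparison: rejects only fdp_idx > n, so fdp_idx == n slips through. */
static bool fdp_idx_accepted_old(uint16_t numfdpc, unsigned int fdp_idx)
{
	unsigned int n = (unsigned int)numfdpc + 1;

	return !(fdp_idx > n);
}

/* Fixed comparison: rejects fdp_idx >= n, leaving 0 .. n - 1 as valid. */
static bool fdp_idx_accepted_fixed(uint16_t numfdpc, unsigned int fdp_idx)
{
	unsigned int n = (unsigned int)numfdpc + 1;

	return !(fdp_idx >= n);
}
```

With NUMFDPC equal to 3, the old predicate accepts an index of 4 while
the fixed one stops at 3.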

Fixes: 30b5f20bb2 ("nvme: register fdp parameters with the block layer")
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 05:20:57 -07:00
Kuniyuki Iwashima 8757fd9500 nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
Since commit 21c05ca88a ("workqueue: Add warnings and ensure
one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly
set WQ_PERCPU or WQ_UNBOUND when creating a workqueue.

nvme_tcp_init_module() sets WQ_UNBOUND when the module param
wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering
the warning below:

  workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU.
  WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1

Let's set WQ_PERCPU if wq_unbound is false.

Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 03:46:27 -07:00
Bryam Vargas 53cd102a7a nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvmet_execute_disc_get_log_page() validates only the dword alignment
of the host-supplied Log Page Offset (lpo).  The 64-bit offset is then
added to a small kzalloc'd buffer that holds the discovery log page
and the result is passed straight to nvmet_copy_to_sgl(), which
memcpy()s data_len bytes out to the host with no source-side bound
check:

    u64 offset      = nvmet_get_log_page_offset(req->cmd);  /* 64-bit host */
    size_t data_len = nvmet_get_log_page_len(req->cmd);     /* 32-bit host */
    ...
    if (offset & 0x3) { ... }                               /* only check */
    ...
    alloc_len = sizeof(*hdr) + entry_size * discovery_log_entries(req);
    buffer = kzalloc(alloc_len, GFP_KERNEL);
    ...
    status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len);

The Discovery controller is unauthenticated -- nvmet_host_allowed()
returns true unconditionally for the discovery subsystem -- so the call
is reachable pre-authentication by any TCP/RDMA/FC peer that can reach
the nvmet target.  With a discovery log page of ~1 KiB, an attacker
requesting up to 4 KiB starting at offset == alloc_len reads the next
slab page out and gets its content returned over the fabric (an
empirical run on a default nvmet-tcp loopback target leaked 81
canonical kernel pointers in one Get Log Page response).  Pointing the
offset at unmapped kernel memory faults the in-kernel memcpy and
crashes (or panics, on panic_on_oops=1) the target host instead.

The attacker-controlled source-side offset pattern
"nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique
to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every
other Get Log Page handler in admin-cmd.c either ignores lpo (and
silently starts every response at offset 0) or tracks a local
destination offset with a fixed source pointer.

Validate the host-supplied offset against the log page size, cap the
copy length to what is actually available, and zero-fill any remainder
of the host transfer buffer.  The zero-fill matches the existing
short-response pattern in nvmet_execute_get_log_changed_ns()
(admin-cmd.c) and prevents leaking transport SGL contents when the
host asks for more bytes than the log page contains.
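
The three steps (validate the offset, cap the copy, zero-fill the rest)
can be sketched in userspace as below.  This is not the nvmet code:
copy_log_page() and its parameters are hypothetical, a flat destination
buffer stands in for the SGL, and rejecting offset > log_len (rather than
>=) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy data_len bytes of a log page of log_len bytes, starting at a
 * host-supplied offset, into dst.  Returns 0 on success, -1 if the
 * offset is misaligned or lies beyond the log page.
 */
static int copy_log_page(uint8_t *dst, size_t data_len,
			 const uint8_t *log, size_t log_len, uint64_t offset)
{
	size_t avail, copy_len;

	if (offset & 0x3)		/* existing dword-alignment check */
		return -1;
	if (offset > log_len)		/* new: bound the source offset */
		return -1;

	avail = log_len - (size_t)offset;
	copy_len = data_len < avail ? data_len : avail;

	memcpy(dst, log + offset, copy_len);
	/* Zero-fill the tail so no stale buffer contents reach the host. */
	memset(dst + copy_len, 0, data_len - copy_len);
	return 0;
}
```

Because the offset is checked before it is added to the source pointer,
no read can start or end outside the log buffer, whatever length the
host requests.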

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 03:43:27 -07:00
Achkinazi, Igor 88bac2c1a7 nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a
per-path namespace, bio_set_dev() clears BIO_REMAPPED.  The remapped bio
is then resubmitted through submit_bio_noacct() which calls
bio_check_eod() because BIO_REMAPPED is not set.

This races with nvme_ns_remove() which zeroes the per-path capacity
before synchronize_srcu():

  CPU 0 (IO submission)
  ---------------------
  srcu_read_lock()
  nvme_find_path() -> ns
    [NVME_NS_READY is set]

  CPU 1 (namespace removal)
  -------------------------
  clear_bit(NVME_NS_READY)
  set_capacity(ns->disk, 0)
  synchronize_srcu()  <- blocks

  CPU 0 (IO submission)
  ---------------------
  bio_set_dev(bio, ns->disk->part0)
    [clears BIO_REMAPPED]
  submit_bio_noacct(bio)
    -> bio_check_eod() sees capacity=0
    -> bio fails with IO error

The SRCU read lock prevents synchronize_srcu() from completing, but does
not prevent set_capacity(0) from executing.  The bio fails the EOD check
before it reaches the NVMe driver, so nvme_failover_req() never gets a
chance to redirect it to another path of the multipath device.  IO errors are
reported to the application despite another path being available.

On older kernels (before commit 0b64682e78 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().

Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio().  This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.

NVMe per-path namespace devices are always whole disks (bd_partno=0), so
the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op.
The flag does not persist across failover and cannot go stale if the
namespace geometry changes between attempts: nvme_failover_req() calls
bio_set_dev() to redirect the bio back to the multipath head, which
clears BIO_REMAPPED.  When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.

Same approach as commit 3a905c37c3 ("block: skip bio_check_eod for
partition-remapped bios").

Fixes: a7c7f7b2b6 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 03:23:05 -07:00
liyouhong 4cf06977bd nvme-multipath: require exact iopolicy names for module parameter
The iopolicy module parameter uses strncmp prefix matching, so values
like "numax" are accepted as "numa".  The per-subsystem sysfs attribute
already requires an exact match via sysfs_streq().  Parse both through
a shared helper so invalid values are rejected consistently.
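
The difference between prefix and exact matching can be sketched in
userspace as follows.  The helpers and the policy table are hypothetical,
and plain strcmp() stands in for sysfs_streq(), which additionally
tolerates a trailing newline.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative policy table; the index doubles as the policy id. */
static const char *const iopolicy_names[] = {
	"numa", "round-robin", "queue-depth",
};

#define NR_IOPOLICIES (sizeof(iopolicy_names) / sizeof(iopolicy_names[0]))

/* Old behaviour: only the first strlen(name) bytes are compared. */
static int parse_iopolicy_prefix(const char *val)
{
	size_t i;

	for (i = 0; i < NR_IOPOLICIES; i++)
		if (!strncmp(val, iopolicy_names[i], strlen(iopolicy_names[i])))
			return (int)i;
	return -1;
}

/* New behaviour: the whole string must match. */
static int parse_iopolicy_exact(const char *val)
{
	size_t i;

	for (i = 0; i < NR_IOPOLICIES; i++)
		if (!strcmp(val, iopolicy_names[i]))
			return (int)i;
	return -1;
}
```

With these helpers, "numax" parses as "numa" under prefix matching and is
rejected under exact matching.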

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 03:16:26 -07:00
John Garry f078d1aa52 nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
nvme_mpath_revalidate_paths() is passed an NS pointer, uses it to look up
the NS head, and then reuses that same NS pointer as an iterator variable.

It makes more sense to pass the NS head and use a local variable for the NS
iterator.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02 03:15:08 -07:00
Jens Axboe b30288887c Merge tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.2/block
Pull MD updates and fixes from Yu Kuai:

"Bug Fixes:
 - Only requeue dm-raid bios when dm is suspending. (Benjamin Marzinski)
 - Reset raid10 read_slot when reusing r10bio for discard. (Chen Cheng)
 - Fix raid1/raid10 deadlock in read error recovery path.
   (Abd-Alrhman Masalkhi)
 - Fix raid1/raid10 error-path detection with md_cloned_bio().
   (Abd-Alrhman Masalkhi)
 - Fix raid1/raid10 bio accounting for split md cloned bios.
   (Abd-Alrhman Masalkhi)
 - Fix raid1 nr_pending leak in REQ_ATOMIC bad-block path.
   (Abd-Alrhman Masalkhi)

 Improvements:
 - Skip redundant raid_disks updates when the value is unchanged.
   (Abd-Alrhman Masalkhi)

 Cleanups:
 - Update MAINTAINERS email addresses. (Yu Kuai, Li Nan)
 - Clean up raid1 read error handling. (Christoph Hellwig)
 - Move the exceed_read_errors condition out of fix_read_error().
   (Christoph Hellwig)
 - Use str_plural() in raid0 dump_zones(). (Thorsten Blum)"

* tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid0: use str_plural helper in dump_zones
  raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path
  md/raid1: move the exceed_read_errors condition out of fix_read_error
  md/raid1: cleanup handle_read_error
  md/raid1,raid10: fix bio accounting for split md cloned bios
  md/raid1,raid10: fix error-path detection with md_cloned_bio()
  md/raid1,raid10: fix deadlock in read error recovery path
  md/raid10: reset read_slot when reusing r10bio for discard
  md: skip redundant raid_disks update when value is unchanged
  dm-raid: only requeue bios when dm is suspending
  MAINTAINERS: Update Li Nan's E-mail address
  MAINTAINERS: update Yu Kuai's email address
2026-06-01 12:52:20 -06:00
Christoph Böhmwalder 9310b955c8 MAINTAINERS: use new drbd-dev mailing list
We are migrating from our own infrastructure to lists.linux.dev, so
change the drbd-dev address to point to the new domain.

Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260513065557.36042-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-01 08:23:43 -06:00
Rosen Penev 2e1b3f4c51 rbd: check snap_count against RBD_MAX_SNAP_COUNT
snap_count is u32 but the comparison is against a SIZE_MAX-derived value
(~2^61 on 64-bit), which clang flags as always false with
-Wtautological-constant-out-of-range-compare.

The proper check here should be that snap_count does not go over
RBD_MAX_SNAP_COUNT.
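
A userspace sketch of the two checks is shown below.  The helper names
and SNAPC_HDR_SIZE are hypothetical, and the value 510 for
RBD_MAX_SNAP_COUNT is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RBD_MAX_SNAP_COUNT	510	/* assumed value for this sketch */
#define SNAPC_HDR_SIZE		16	/* hypothetical header size */

/*
 * Old-style check: compares a 32-bit count against a SIZE_MAX-derived
 * bound.  With a 64-bit size_t the bound exceeds UINT32_MAX, so this
 * can never reject anything.
 */
static bool snap_count_too_big_old(uint32_t snap_count)
{
	return (size_t)snap_count >
	       (SIZE_MAX - SNAPC_HDR_SIZE) / sizeof(uint64_t);
}

/* New-style check: compare against the driver's own maximum. */
static bool snap_count_too_big_new(uint32_t snap_count)
{
	return snap_count > RBD_MAX_SNAP_COUNT;
}
```

On a 64-bit build the old check accepts even UINT32_MAX snapshots, while
the new check rejects anything above the driver's limit.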

Assisted-by: Opencode:Big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Alex Elder <elder@riscstar.com>
Link: https://patch.msgid.link/20260530011255.52916-1-rosenp@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-31 19:48:53 -06:00
Haoze Xie 2957771379 rust: block: fix GenDisk cleanup paths
GenDiskBuilder::build() still has fallible work after
__blk_mq_alloc_disk(), but its error path only recovers the
foreign queue data. That leaks the temporary gendisk and
request_queue until later teardown. If the caller moved the last
Arc<TagSet<T>> into build(), the leaked queue can retain blk-mq
state after the tag set is dropped.

Fix the pre-registration failure path by dropping the temporary
gendisk reference with put_disk() before recovering queue_data,
so disk_release() can tear down the owned queue.

Also pair GenDisk::drop() with put_disk() after del_gendisk().
Once a Rust GenDisk has been added with device_add_disk(),
del_gendisk() only unregisters it; the final gendisk reference
still has to be dropped to complete the release path.

Fixes: 3253aba340 ("rust: block: introduce `kernel::block::mq` module")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Link: https://patch.msgid.link/b70aff9a920cc42110fe5cf454c3099561863519.1780063368.git.royenheart@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-31 19:45:06 -06:00
Thorsten Blum 717359a168 md/raid0: use str_plural helper in dump_zones
Replace the manual ternary "s" pluralization with str_plural() to
simplify the code.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260527141932.1243503-2-thorsten.blum@linux.dev
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:20 +08:00
Abd-Alrhman Masalkhi 909d9dc3b5 raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path
In raid1_write_request(), each per-mirror loop iteration begins by
incrementing rdev->nr_pending. If a REQ_ATOMIC write encounters a
badblock within the requested range, the code jumps to err_handle
without dropping the reference taken for the current mirror.

err_handle's cleanup loop only decrements nr_pending for k < i where
r1_bio->bios[k] is non-NULL. The current slot is therefore skipped,
leaving its nr_pending reference leaked permanently. The reference
prevents the rdev from ever being removed, since raid1_remove_conf()
refuses to remove an rdev with nr_pending > 0.

Fix this by calling rdev_dec_pending() before jumping to err_handle.
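
The leak can be reproduced with a reduced userspace model.  Everything
here is hypothetical (struct mirror, submit_write(), the fixed mirror
count); it only mirrors the control flow described above, with a flag
selecting the pre-fix or post-fix behaviour.

```c
#include <stdbool.h>

#define NMIRRORS 3

/* Hypothetical stand-in for an rdev: a reference count and a bad block. */
struct mirror {
	int nr_pending;
	bool has_badblock;
};

/*
 * Model of the per-mirror loop.  On success each mirror keeps its
 * reference (held by the in-flight bio).  On a bad block the function
 * unwinds; drop_current_ref selects the fixed behaviour.
 */
static int submit_write(struct mirror *m, bool drop_current_ref)
{
	bool bios[NMIRRORS] = { false };
	int i, k;

	for (i = 0; i < NMIRRORS; i++) {
		m[i].nr_pending++;		/* reference for this mirror */
		if (m[i].has_badblock) {
			if (drop_current_ref)
				m[i].nr_pending--;	/* the fix */
			goto err_handle;
		}
		bios[i] = true;
	}
	return 0;

err_handle:
	/* Only earlier slots with a bio are unwound; slot i is skipped. */
	for (k = 0; k < i; k++)
		if (bios[k])
			m[k].nr_pending--;
	return -1;
}
```

Without the extra decrement, the mirror that hit the bad block is left
with nr_pending == 1 after the failed write; with it, every count
returns to 0.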

Fixes: f2a38abf5f ("md/raid1: Atomic write support")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://patch.msgid.link/20260530151411.4119-1-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:19 +08:00
Christoph Hellwig 6e3b0b9133 md/raid1: move the exceed_read_errors condition out of fix_read_error
This condition fits much better in the only caller, and limits
fix_read_error to actually fixing up data devices after a read error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260529054308.2720300-3-hch@lst.de
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:19 +08:00
Christoph Hellwig fcba803132 md/raid1: cleanup handle_read_error
Unwind the main conditional with duplicate conditions and initialize
variables at declaration time where possible.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260529054308.2720300-2-hch@lst.de
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:19 +08:00
Abd-Alrhman Masalkhi ba976e3501 md/raid1,raid10: fix bio accounting for split md cloned bios
Use md_cloned_bio() to control bio accounting instead of relying
on r1bio_existed in raid1 or the io_accounting flag in raid10.

The previous logic does not reliably reflect whether a bio is an
md cloned bio. When a failed bio is split and resubmitted via
bio_submit_split_bioset() on the error path, this can lead to either
double accounting for md cloned bios, or missing accounting for bios
returned from bio_submit_split_bioset().

Fix this by using md_cloned_bio() to detect md cloned bios and
skip accounting accordingly.

Fixes: bb2a9acefa ("md/raid1: switch to use md_account_bio() for io accounting")
Fixes: 8204552383 ("md/raid10: switch to use md_account_bio() for io accounting")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Link: https://patch.msgid.link/20260501114652.590037-4-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:18 +08:00
Abd-Alrhman Masalkhi 811545e092 md/raid1,raid10: fix error-path detection with md_cloned_bio()
Detect the error path using md_cloned_bio() instead of relying
on r1_bio in raid1 or r10_bio->read_slot in raid10, which may be
NULL or -1 after splitting and resubmitting a failed bio.

As a result, the error path may not be recognized and memory
allocations can incorrectly use GFP_NOIO instead of
(GFP_NOIO | __GFP_HIGH), which can lead to a deadlock under
memory pressure.

Fixes: 689389a06c ("md/raid1: simplify handle_read_error().")
Fixes: 545250f248 ("md/raid10: simplify handle_read_error()")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Link: https://patch.msgid.link/20260501114652.590037-3-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:18 +08:00
Abd-Alrhman Masalkhi 7b15c24f80 md/raid1,raid10: fix deadlock in read error recovery path
raid1d and raid10d may resubmit a split md cloned bio while handling
a read error. In this case, resubmitting the bio can lead to a deadlock
if the array is suspended before md_handle_request() acquires an
active_io reference via percpu_ref_tryget_live().

Since the cloned bio already holds an active_io reference,
trying to acquire another reference via percpu_ref_tryget_live()
can lead to a deadlock while the array is suspended.

Fix this by using percpu_ref_get() for md cloned bios.

Fixes: bb2a9acefa ("md/raid1: switch to use md_account_bio() for io accounting")
Fixes: 8204552383 ("md/raid10: switch to use md_account_bio() for io accounting")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Reviewed-by: Xiao Ni <xiao@kernel.org>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/20260501114652.590037-2-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fygo.io>
2026-05-31 19:09:18 +08:00