Commit Graph

1385352 Commits

Author SHA1 Message Date
Heiner Kallweit e211c463b7 net: phy: stop exporting phy_driver_unregister
After 42e2a9e11a ("net: phy: dp83640: improve phydev and driver
removal handling") we can stop exporting also phy_driver_unregister().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/2bab950e-4b70-4030-b997-03f48379586f@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 13:17:16 +02:00
Dragos Tatulea a1b501a8c6 page_pool: Clamp pool size to max 16K pages
page_pool_init() returns E2BIG when the page_pool size goes above 32K
pages. As some drivers are configuring the page_pool size according to
the MTU and ring size, there are cases where this limit is exceeded and
the queue creation fails.

The page_pool size doesn't have to cover a full queue, especially for
larger ring size. So clamp the size instead of returning an error. Do
this in the core to avoid having each driver do the clamping.

The current limit was deemed to high [1] so it was reduced to 16K to avoid
page waste.

[1] https://lore.kernel.org/all/1758532715-820422-3-git-send-email-tariqt@nvidia.com/

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250926131605.2276734-2-dtatulea@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 12:16:23 +02:00
Dmitry Antipov 2ade91705b tipc: adjust tipc_nodeid2string() to return string length
Since the value returned by 'tipc_nodeid2string()' is not used, the
function may be adjusted to return the length of the result, which
is helpful to drop a few calls to 'strlen()' in 'tipc_link_create()'
and 'tipc_link_bc_create()'. Compile tested only.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250926074113.914399-1-dmantipov@yandex.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 11:22:39 +02:00
Qingfang Deng 38b04ed707 6pack: drop redundant locking and refcounting
The TTY layer already serializes line discipline operations with
tty->ldisc_sem, so the extra disc_data_lock and refcnt in 6pack
are unnecessary.

Removing them simplifies the code and also resolves a lockdep warning
reported by syzbot. The warning did not indicate a real deadlock, since
the write-side lock was only taken in process context with hardirqs
disabled.

Reported-by: syzbot+5fd749c74105b0e1b302@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68c858b0.050a0220.3c6139.0d1c.GAE@google.com/
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/20250925051059.26876-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 10:10:59 +02:00
Oleksij Rempel 7bd80ed89d Documentation: net: add flow control guide and document ethtool API
Introduce a new document, flow_control.rst, to provide a comprehensive
guide on Ethernet Flow Control in Linux. The guide explains how flow
control works, how autonegotiation resolves pause capabilities, and how
to configure it using ethtool and Netlink.

In parallel, document the pause and pause-stat attributes in the
ethtool.yaml netlink spec. This enables the ynl tool to generate
kernel-doc comments for the corresponding enums in the UAPI header,
making the C interface self-documenting.

Finally, replace the legacy flow control section in phy.rst with a
reference to the new document and add pointers in the relevant C source
files.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250924120241.724850-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 09:48:31 +02:00
Jakub Kicinski c5cb31c992 Merge branch 'dpll-add-phase-offset-averaging-factor'
Ivan Vecera says:

====================
dpll: add phase offset averaging factor

For some hardware, the phase shift may result from averaging previous values
and the newly measured value. In this case, the averaging is controlled by
a configurable averaging factor.

Add new device level attribute phase-offset-avg-factor, appropriate
callbacks and implement them in zl3073x driver.
====================

Link: https://patch.msgid.link/20250927084912.2343597-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:57:43 -07:00
Ivan Vecera 9363b48376 dpll: zl3073x: Allow to configure phase offset averaging factor
The DPLL phase measurement block uses an exponential moving average with
a configurable averaging factor. Measurements are taken at approximately
40 Hz or at the reference frequency, whichever is lower.

Currently, factor=2 is used to prioritize fast response for dynamic
phase changes. For applications needing a stable, precise average phase
offset where rapid changes are unlikely, a higher factor is recommended.

Implement the .phase_offset_avg_factor_get/set callbacks to allow a user
to adjust this factor.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250927084912.2343597-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:57:41 -07:00
Ivan Vecera e28d5a68b6 dpll: add phase_offset_avg_factor_get/set callback ops
Add new callback operations for a dpll device:
- phase_offset_avg_factor_get(...) - to obtain current phase offset
  averaging factor from dpll device,
- phase_offset_avg_factor_set(...) - to set phase offset averaging factor

Obtain the factor value using the get callback and provide it to the user
if the device driver implement this callback. Execute the set callback upon
user requests, if the driver implement it.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
v2:
* do not require 'set' callback to retrieve current value
* always call 'set' callback regardless of current value
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250927084912.2343597-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:57:41 -07:00
Ivan Vecera a680581f6a dpll: add phase-offset-avg-factor device attribute to netlink spec
Add dpll device level attribute DPLL_A_PHASE_OFFSET_AVG_FACTOR to allow
control over a calculation of reported phase offset value. Attribute is
present, if the driver provides such capability, otherwise attribute
shall not be present.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250927084912.2343597-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:57:41 -07:00
Jakub Kicinski 377ea33128 Merge tag 'mlx5-next-lag' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:

====================
mlx5-next updates 2025-09-28

* tag 'mlx5-next-lag' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: IFC add balance ID and LAG per MP group bits
  net/mlx5: Add IFC bit for TIR/SQ order capability
====================

Link: https://patch.msgid.link/1759093989-841873-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:49:59 -07:00
Russell King (Oracle) 6d3728d424 net: stmmac: remove stmmac_hw_setup() excess documentation parameter
The kernel build bot reports:

Warning: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:3438 Excess function parameter 'ptp_register' description in 'stmmac_hw_setup'

Fix it.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 98d8ea566b ("net: stmmac: move timestamping/ptp init to stmmac_hw_setup() caller")
Closes: https://lore.kernel.org/oe-kbuild-all/202509290927.svDd6xuw-lkp@intel.com/
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1v38Y7-00000008UCQ-3w27@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:47:15 -07:00
Jakub Kicinski 4363d18219 Merge branch 'selftest-packetdrill-import-tfo-server-tests'
Kuniyuki Iwashima says:

====================
selftest: packetdrill: Import TFO server tests.

The series imports 15 TFO server tests from google/packetdrill and
adds 2 more tests.

The repository has two versions of tests for most scenarios; one uses
the non-experimental option (34), and the other uses the experimental
option (255) with 0xF989.

Basically, we only import the non-experimental version of tests, and
for the experimental option, tcp_fastopen_server_experimental_option.pkt
is added.

The following tests are not (yet) imported:

  * icmp-baseline.pkt
  * simple1.pkt / simple2.pkt / simple3.pkt

The former is completely covered by icmp-before-accept.pkt.

The later's delta is the src/dst IP pair to generate a different
cookie, but supporting dualstack requires churn in ksft_runner.sh,
so defered to future series.  Also, sockopt-fastopen-key.pkt covers
the same function.

The following tests have the experimental version only, so converted
to the non-experimental option:

  * client-ack-dropped-then-recovery-ms-timestamps.pkt
  * sockopt-fastopen-key.pkt

For the imported tests, these common changes are applied.

  * Add SPDX header
  * Adjust path to default.sh
  * Adjust sysctl w/ set_sysctls.py
  * Use TFO_COOKIE instead of a raw hex value
  * Use SOCK_NONBLOCK for socket() not to block accept()
  * Add assertions for TCP state if commented
  * Remove unnecessary delay (e.g. +0.1 setsockopt(SO_REUSEADDR), etc)

With this series, except for simple{1,2,3}.pkt, we can remove TFO server
tests in google/packetdrill.
====================

Link: https://patch.msgid.link/20250927213022.1850048-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:40 -07:00
Kuniyuki Iwashima 9b62d53cc8 selftest: packetdrill: Import client-ack-dropped-then-recovery-ms-timestamps.pkt
This also does not have the non-experimental version, so converted to FO.

The comment in .pkt explains the detailed scenario.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima 05b9f505fb selftest: packetdrill: Import sockopt-fastopen-key.pkt
sockopt-fastopen-key.pkt does not have the non-experimental
version, so the Experimental version is converted, FOEXP -> FO.

The test sets net.ipv4.tcp_fastopen_key=0-0-0-0 and instead
sets another key via setsockopt(TCP_FASTOPEN_KEY).

The first listener generates a valid cookie in response to TFO
option without cookie, and the second listner creates a TFO socket
using the valid cookie.

TCP_FASTOPEN_KEY is adjusted to use the common key in default.sh
so that we can use TFO_COOKIE and support dualstack.  Similarly,
TFO_COOKIE_ZERO for the 0-0-0-0 key is defined.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima be90c7b3d5 selftest: packetdrill: Refine tcp_fastopen_server_reset-after-disconnect.pkt.
These changes are applied to follow the imported packetdrill tests.

  * Call setsockopt(TCP_FASTOPEN)
  * Remove unnecessary accept() delay
  * Add assertion for TCP states
  * Rename to tcp_fastopen_server_trigger-rst-reconnect.pkt.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima 21f7fb31ae selftest: packetdrill: Import opt34/*-trigger-rst.pkt.
This imports the non-experimental version of opt34/*-trigger-rst.pkt.

                                     | accept() | SYN data |
  -----------------------------------+----------+----------+
  listener-closed-trigger-rst.pkt    |    no    |  unread  |
  unread-data-closed-trigger-rst.pkt |   yes    |  unread  |

Both files test that close()ing a SYN_RECV socket with unread SYN data
triggers RST.

The files are renamed to have the common prefix, trigger-rst.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima 5920f154e1 selftest: packetdrill: Import opt34/reset-* tests.
This imports the non-experimental version of opt34/reset-*.pkt.

                                   |  Child  |              RST              | sk_err  |
  ---------------------------------+---------+-------------------------------+---------+
  reset-after-accept.pkt           |   TFO   |   after accept(), SYN_RECV    |  read() |
  reset-close-with-unread-data.pkt |   TFO   |   after accept(), SYN_RECV    | write() |
  reset-before-accept.pkt          |   TFO   |  before accept(), SYN_RECV    |  read() |
  reset-non-tfo-socket.pkt         | non-TFO |  before accept(), ESTABLISHED | write() |

The first 3 files test scenarios where a SYN_RECV socket receives RST
before/after accept() and data in SYN must be read() without error,
but the following read() or fist write() will return ECONNRESET.

The last test is similar but with non-TFO socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima a8b1750e68 selftest: packetdrill: Import opt34/icmp-before-accept.pkt.
This imports the non-experimental version of icmp-before-accept.pkt.

This file tests the scenario where an ICMP unreachable packet for a
not-yet-accept()ed socket changes its state to TCP_CLOSE, but the
SYN data must be read without error, and the following read() returns
EHOSTUNREACH.

Note that this test support only IPv4 as icmp is used.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima 5ed080f85a selftest: packetdrill: Import opt34/fin-close-socket.pkt.
This imports the non-experimental version of fin-close-socket.pkt.

This file tests the scenario where a TFO child socket's state
transitions from SYN_RECV to CLOSE_WAIT before accept()ed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima e57b3933ab selftest: packetdrill: Add test for experimental option.
The only difference between non-experimental vs experimental TFO
option handling is SYN+ACK generation.

When tcp_parse_fastopen_option() parses a TFO option, it sets
tcp_fastopen_cookie.exp to false if the option number is 34,
and true if 255.

The value is carried to tcp_options_write() to generate a TFO option
with the same option number.

Other than that, all the TFO handling is the same and the kernel must
generate the same cookie regardless of the option number.

Let's add a test for the handling so that we can consolidate
fastopen/server/ tests and fastopen/server/opt34 tests.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:39 -07:00
Kuniyuki Iwashima 399e0a7ed9 selftest: packetdrill: Add test for TFO_SERVER_WO_SOCKOPT1.
TFO_SERVER_WO_SOCKOPT1 is no longer enabled by default, and
each server test requires setsockopt(TCP_FASTOPEN).

Let's add a basic test for TFO_SERVER_WO_SOCKOPT1.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:38 -07:00
Kuniyuki Iwashima 0b8f164eb2 selftest: packetdrill: Import TFO server basic tests.
This imports basic TFO server tests from google/packetdrill.

The repository has two versions of tests for most scenarios; one uses
the non-experimental option (34), and the other uses the experimental
option (255) with 0xF989.

This only imports the following tests of the non-experimental version
placed in [0].  I will add a specific test for the experimental option
handling later.

                             | TFO | Cookie | Payload |
  ---------------------------+-----+--------+---------+
  basic-rw.pkt               | yes |  yes   |   yes   |
  basic-zero-payload.pkt     | yes |  yes   |    no   |
  basic-cookie-not-reqd.pkt  | yes |   no   |   yes   |
  basic-non-tfo-listener.pkt |  no |  yes   |   yes   |
  pure-syn-data.pkt          | yes |   no   |   yes   |

The original pure-syn-data.pkt missed setsockopt(TCP_FASTOPEN) and did
not test TFO server in some scenarios unintentionally, so setsockopt()
is added where needed.  In addition, non-TFO scenario is stripped as
it is covered by basic-non-tfo-listener.pkt.  Also, I added basic- prefix.

Link: https://github.com/google/packetdrill/tree/bfc96251310f/gtests/net/tcp/fastopen/server/opt34 #[0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:38 -07:00
Kuniyuki Iwashima 97b3b8306f selftest: packetdrill: Define common TCP Fast Open cookie.
TCP Fast Open cookie is generated in __tcp_fastopen_cookie_gen_cipher().

The cookie value is generated from src/dst IPs and a key configured by
setsockopt(TCP_FASTOPEN_KEY) or net.ipv4.tcp_fastopen_key.

The default.sh sets net.ipv4.tcp_fastopen_key, and the original packetdrill
defines the corresponding cookie as TFO_COOKIE in run_all.py. [0]

Then, each test does not need to care about the value, and we can easily
update TFO_COOKIE in case __tcp_fastopen_cookie_gen_cipher() changes the
algorithm.

However, some tests use the bare hex value for specific IPv4 addresses
and do not support IPv6.

Let's define the same TFO_COOKIE in ksft_runner.sh.

We will replace such bare hex values with TFO_COOKIE except for a single
test for setsockopt(TCP_FASTOPEN_KEY).

Link: https://github.com/google/packetdrill/blob/7230b3990f94/gtests/net/packetdrill/run_all.py#L65 #[0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:19 -07:00
Kuniyuki Iwashima 261cb8b123 selftest: packetdrill: Require explicit setsockopt(TCP_FASTOPEN).
To enable TCP Fast Open on a server, net.ipv4.tcp_fastopen must
have 0x2 (TFO_SERVER_ENABLE), and we need to do either

  1. Call setsockopt(TCP_FASTOPEN) for the socket
  2. Set 0x400 (TFO_SERVER_WO_SOCKOPT1) additionally to net.ipv4.tcp_fastopen

The default.sh sets 0x70403 so that each test does not need setsockopt().
(0x1 is TFO_CLIENT_ENABLE, and 0x70000 is ...???)

However, some tests overwrite net.ipv4.tcp_fastopen without
TFO_SERVER_WO_SOCKOPT1 and forgot setsockopt(TCP_FASTOPEN).

For example, pure-syn-data.pkt [0] tests non-TFO servers unintentionally,
except in the first scenario.

To prevent such an accident, let's require explicit setsockopt().

TFO_CLIENT_ENABLE is necessary for
tcp_syscall_bad_arg_fastopen-invalid-buf-ptr.pkt.

Link: https://github.com/google/packetdrill/blob/bfc96251310f/gtests/net/tcp/fastopen/server/opt34/pure-syn-data.pkt #[0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:07 -07:00
Kuniyuki Iwashima 70dd4775db selftest: packetdrill: Set ktap_set_plan properly for single protocol test.
The cited commit forgot to update the ktap_set_plan call.

ktap_set_plan sets the number of tests (KSFT_NUM_TESTS), which must
match the number of executed tests (KTAP_CNT_PASS + KTAP_CNT_SKIP +
KTAP_CNT_XFAIL) in ktap_finished.

Otherwise, the selftest exit()s with 1.

Let's adjust KSFT_NUM_TESTS based on supported protocols.

While at it, misalignment is fixed up.

Fixes: a5c10aa3d1 ("selftests/net: packetdrill: Support single protocol test.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250927213022.1850048-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:41:06 -07:00
Alok Tiwari 4ed9db2dc5 net: rtnetlink: fix typo in rtnl_unregister_all() comment
Corrected "rtnl_unregster()" -> "rtnl_unregister()" in the
  documentation comment of "rtnl_unregister_all()"

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250929085418.49200-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:31:08 -07:00
Eric Dumazet 7d452516b6 Revert "net: group sk_backlog and sk_receive_queue"
This reverts commit 4effb335b5.

This was a benefit for UDP flood case, which was later greatly improved
with commits 6471658dc6 ("udp: use skb_attempt_defer_free()")
and b650bf0977 ("udp: remove busylock and add per NUMA queues").

Apparently blamed commit added a regression for RAW sockets, possibly
because they do not use the dual RX queue strategy that UDP has.

sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_buf
and sk_rmem_alloc changes, and them being in the same
cache line reduce performance.

Fixes: 4effb335b5 ("net: group sk_backlog and sk_receive_queue")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:30:32 -07:00
Furong Xu 9dd4e022bf net: stmmac: Convert open-coded register polling to helper macro
Drop the open-coded register polling routines.
Use readl_poll_timeout_atomic() in atomic state.

Also adjust the delay time to 10us which seems more reasonable.

Tested on NXP i.MX8MP and ROCKCHIP RK3588 boards,
the break condition was met right after the first polling,
no delay involved at all.
So the 10us delay should be long enough for most cases.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20250927081036.10611-1-0x1207@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:27:45 -07:00
Jakub Kicinski 74f7c5233e Merge branch 'mptcp-receive-path-improvement'
Matthieu Baerts says:

====================
mptcp: receive path improvement

This series includes several changes to the MPTCP RX path. The main
goals are improving the RX performances, and increase the long term
maintainability.

Some changes reflects recent(ish) improvements introduced in the TCP
stack: patch 1, 2 and 3 are the MPTCP counter part of SKB deferral free
and auto-tuning improvements. Note that patch 3 could possibly fix
additional issues, and overall such patch should protect from similar
issues to arise in the future.

Patches 4-7 are aimed at introducing the socket backlog usage which will
be done in a later series to process the packets received by the
different subflows while the msk socket is owned.

Patch 8 is not related to the RX path, but it contains additional tests
for new features recently introduced in net-next.
====================

Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-0-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:37 -07:00
Matthieu Baerts (NGI0) c912f935a5 selftests: mptcp: join: validate new laminar endp
Here are a few sub-tests for mptcp_join.sh, validating the new 'laminar'
endpoint type.

In a setup where subflows created using the routing rules would be
rejected by the listener, and where the latter announces one IP address,
some cases are verified:

- Without any 'laminar' endpoints: no new subflows are created.

- With one 'laminar' endpoint: a second subflow is created.

- With multiple 'laminar' endpoints: 2 IPv4 subflows are created.

- With one 'laminar' endpoint, but the server announcing a second IP
  address, only one subflow is created.

- With one 'laminar' + 'subflow' endpoint, the same endpoint is only
  used once.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-8-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:36 -07:00
Paolo Abeni 59701b1870 mptcp: minor move_skbs_to_msk() cleanup
Such function is called only by __mptcp_data_ready(), which in turn
is always invoked when msk is not owned by the user: we can drop the
redundant, related check.

Additionally mptcp needs to propagate the socket error only for
current subflow.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-7-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:36 -07:00
Paolo Abeni 68c7af988b mptcp: factor out a basic skb coalesce helper
The upcoming patch will introduced backlog processing for MPTCP
socket, and we want to leverage coalescing in such data path.

Factor out the relevant bits not touching memory accounting to
deal with such use-case.

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-6-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Paolo Abeni c4ebc4ee4e mptcp: remove unneeded mptcp_move_skb()
Since commit b7535cfed2 ("mptcp: drop legacy code around RX EOF"),
sk_shutdown can't change during the main recvmsg loop, we can drop
the related race breaker.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-5-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Paolo Abeni 9a0afe0db4 mptcp: introduce the mptcp_init_skb helper
Factor out all the skb initialization step in a new helper and
use it. Note that this change moves the MPTCP CB initialization
earlier: we can do such step as soon as the skb leaves the
subflow socket receive queues.

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-4-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Paolo Abeni e118cdc34d mptcp: rcvbuf auto-tuning improvement
Apply to the MPTCP auto-tuning the same improvements introduced for the
TCP protocol by the merge commit 2da35e4b4d ("Merge branch
'tcp-receive-side-improvements'").

The main difference is that TCP subflow and the main MPTCP socket need
to account separately for OoO: MPTCP does not care for TCP-level OoO
and vice versa, as a consequence do not reflect MPTCP-level rcvbuf
increase due to OoO packets at the subflow level.

This refeactor additionally allow dropping the msk receive buffer update
at receive time, as the latter only intended to cope with subflow receive
buffer increase due to OoO packets.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/487
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/559
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-3-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Paolo Abeni a755677974 tcp: make tcp_rcvbuf_grow() accessible to mptcp code
To leverage the auto-tuning improvements brought by commit 2da35e4b4d
("Merge branch 'tcp-receive-side-improvements'"), the MPTCP stack need
to access the mentioned helper.

Acked-by: Geliang Tang <geliang@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-2-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Paolo Abeni 9aa59323f2 mptcp: leverage skb deferral free
Usage of the skb deferral API is straight-forward; with multiple
subflows actives this allow moving part of the received application
load into multiple CPUs.

Also fix a typo in the related comment.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-1-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:34 -07:00
Eric Dumazet f017c1f768 tcp: use skb->len instead of skb->truesize in tcp_can_ingest()
Some applications are stuck to the 20th century and still use
small SO_RCVBUF values.

After the blamed commit, we can drop packets especially
when using LRO/hw-gro enabled NIC and small MSS (1500) values.

LRO/hw-gro NIC pack multiple segments into pages, allowing
tp->scaling_ratio to be set to a high value.

Whenever the receive queue gets full, we can receive a small packet
filling RWIN, but with a high skb->truesize, because most NIC use 4K page
plus sk_buff metadata even when receiving less than 1500 bytes of payload.

Even if we refine how tp->scaling_ratio is estimated,
we could have an issue at the start of the flow, because
the first round of packets (IW10) will be sent based on
the initial tp->scaling_ratio (1/2)

Relax tcp_can_ingest() to use skb->len instead of skb->truesize,
allowing the peer to use final RWIN, assuming a 'perfect'
scaling_ratio of 1.

Fixes: 1d2fbaad7c ("tcp: stronger sk_rcvbuf checks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250927092827.2707901-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:20:35 -07:00
Jakub Kicinski d210ee58da Merge tag 'for-net-next-2025-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:

 - MAINTAINERS: add a sub-entry for the Qualcomm bluetooth driver
 - Avoid a couple dozen -Wflex-array-member-not-at-end warnings
 - bcsp: receive data only if registered
 - HCI: Fix using LE/ACL buffers for ISO packets
 - hci_core: Detect if an ISO link has stalled
 - ISO: Don't initiate CIS connections if there are no buffers
 - ISO: Use sk_sndtimeo as conn_timeout

drivers:

 - btusb: Check for unexpected bytes when defragmenting HCI frames
 - btusb: Add new VID/PID 13d3/3627 for MT7925
 - btusb: Add new VID/PID 13d3/3633 for MT7922
 - btusb: Add USB ID 2001:332a for D-Link AX9U rev. A1
 - btintel: Add support for BlazarIW core
 - btintel_pcie: Add support for _suspend() / _resume()
 - btintel_pcie: Define hdev->wakeup() callback
 - btintel_pcie: Add Bluetooth core/platform as comments
 - btintel_pcie: Add id of Scorpious, Panther Lake-H484
 - btintel_pcie: Refactor Device Coredump

* tag 'for-net-next-2025-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits)
  Bluetooth: Avoid a couple dozen -Wflex-array-member-not-at-end warnings
  Bluetooth: hci_sync: Fix using random address for BIG/PA advertisements
  Bluetooth: ISO: don't leak skb in ISO_CONT RX
  Bluetooth: ISO: free rx_skb if not consumed
  Bluetooth: ISO: Fix possible UAF on iso_conn_free
  Bluetooth: SCO: Fix UAF on sco_conn_free
  Bluetooth: bcsp: receive data only if registered
  Bluetooth: btusb: Add new VID/PID 13d3/3633 for MT7922
  Bluetooth: btusb: Add new VID/PID 13d3/3627 for MT7925
  Bluetooth: remove duplicate h4_recv_buf() in header
  Bluetooth: btusb: Check for unexpected bytes when defragmenting HCI frames
  Bluetooth: hci_core: Print information of hcon on hci_low_sent
  Bluetooth: hci_core: Print number of packets in conn->data_q
  Bluetooth: Add function and line information to bt_dbg
  Bluetooth: MGMT: Fix not exposing debug UUID on MGMT_OP_READ_EXP_FEATURES_INFO
  Bluetooth: hci_core: Detect if an ISO link has stalled
  Bluetooth: ISO: Use sk_sndtimeo as conn_timeout
  Bluetooth: HCI: Fix using LE/ACL buffers for ISO packets
  Bluetooth: ISO: Don't initiate CIS connections if there are no buffers
  MAINTAINERS: add a sub-entry for the Qualcomm bluetooth driver
  ...
====================

Link: https://patch.msgid.link/20250927154616.1032839-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:13:51 -07:00
Michael S. Tsirkin c39d6d4d93 ptr_ring: __ptr_ring_zero_tail micro optimization
__ptr_ring_zero_tail currently does the - 1 operation twice:
- during initialization of head
- at each loop iteration

Let's just do it in one place, all we need to do
is adjust the loop condition. this is better:
- a slightly clearer logic with less duplication
- uses prefix -- we don't need to save the old value
- one less - 1 operation - for example, when ring is empty
  we now don't do - 1 at all, existing code does it once

Text size shrinks from 15081 to 15050 bytes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/bcd630c7edc628e20d4f8e037341f26c90ab4365.1758976026.git.mst@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:13:10 -07:00
Jakub Kicinski e8c4840d0c Merge branch 'net-wangxun-support-to-configure-rss'
Jiawen Wu says:

====================
net: wangxun: support to configure RSS

Implement ethtool ops for RSS configuration, and support multiple RSS
for multiple pools.
====================

Link: https://patch.msgid.link/20250926023843.34340-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:11:18 -07:00
Jiawen Wu 2a251b85ce net: libwx: restrict change user-set RSS configuration
Enable/disable SR-IOV will change the number of rings, thereby changing
the RSS configuration that the user has set.

So reject these attempts if netif_is_rxfh_configured() returns true. And
remind the user to reset the RSS configuration.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250926023843.34340-5-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:11:16 -07:00
Jiawen Wu 2556f80a6a net: wangxun: add RSS reta and rxfh fields support
Add ethtool ops for Rx flow hashing, query and set RSS indirection table
and hash key. Disable UDP RSS by default, and support to configure L4
header fields with TCP/UDP/SCTP for flow hasing.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250926023843.34340-4-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:11:16 -07:00
Jiawen Wu 58f244b256 net: libwx: move rss_field to struct wx
For global RSS and multiple RSS scheme, the RSS type fields are defined
identically in the registers. So they can be defined as the macros
WX_RSS_FIELD_* to cleanup the codes. And to prepare for the RXFH support
in the next patch, move the rss_field to struct wx.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250926023843.34340-3-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:11:16 -07:00
Jiawen Wu 1be6db0497 net: libwx: support separate RSS configuration for every pool
For those devices which support 64 pools, they also support PF and VF
(i.e. different pools) to configure different RSS key and hash table.
Enable multiple RSS, use up to 64 RSS configurations and each pool has a
specific configuration.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250926023843.34340-2-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:11:16 -07:00
Eric Dumazet 1fb0e47161 net: remove one stac/clac pair from move_addr_to_user()
Convert the get_user() and __put_user() code to the
fast masked_user_access_begin()/unsafe_{get|put}_user()
variant.

This patch increases the performance of an UDP recvfrom()
receiver (netserver) on 120 bytes messages by 7 %
on an AMD EPYC 7B12 64-Core Processor platform.

Presence of audit_sockaddr() makes difficult
to avoid the stac/clac pair in the copy_to_user() call,
this is left for a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250925230929.3727873-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:04:37 -07:00
Eric Dumazet 2b235765e9 scm: use masked_user_access_begin() in put_cmsg()
Use the greatest and latest uaccess construct to get an optimal code.

Before :

	lea    (%r9,%rcx,1),%r10
	movabs $<USER_PTR_MAX>,%r11
	mov    $0xfffffff2,%eax
	cmp    %rcx,%r10
	jb     ffffffff81cdc312 <put_cmsg+0x152>
	cmp    %r11,%r10
	ja     ffffffff81cdc312 <put_cmsg+0x152>
	stac
	lfence
	mov    %r9,(%rcx)

After:

	movabs $<USER_PTR_MAX>,%r9
	cmp    %r9,%rax
	cmova  %r9,%rax
	stac
	mov    %rcx,(%rax)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250925224914.3590290-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:03:42 -07:00
Jakub Kicinski 3806446f60 Merge branch 'net-stmmac-drop-frames-causing-hlbs-error'
Rohan G Thomas says:

====================
net: stmmac: Drop frames causing HLBS error

This patchset consists of following patchset to avoid netdev watchdog
reset due to Head-of-Line Blocking due to EST scheduling error.
 1. Drop those frames causing HLBS error
 2. Add HLBS frame drops to taprio stats

v2: https://lore.kernel.org/r/20250915-hlbs_2-v2-1-27266b2afdd9@altera.com
====================

Link: https://patch.msgid.link/20250925-hlbs_2-v3-0-3b39472776c2@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 17:49:36 -07:00
Rohan G Thomas de17376cad net: stmmac: tc: Add HLBS drop count to taprio stats
Add the count of the frames dropped by Head-Of-Line Blocking due to
Scheduling(HLBS) error to taprio window drop count stats.

Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Reviewed-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20250925-hlbs_2-v3-2-3b39472776c2@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 17:49:34 -07:00
Rohan G Thomas 7ce48d4974 net: stmmac: est: Drop frames causing HLBS error
Drop those frames causing Head-of-Line Blocking due to Scheduling
(HLBS) error to avoid HLBS interrupt flooding and netdev watchdog
timeouts due to blocked packets. Tx queues can be configured to drop
those blocked packets by setting Drop Frames causing Scheduling Error
(DFBS) bit of EST_CONTROL register.

Also, add per queue HLBS drop count.

Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Reviewed-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/20250925-hlbs_2-v3-1-3b39472776c2@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 17:49:34 -07:00