When the user asks to show device resources, devlink first queries the
device's dpipe tables so that it will be able to show the association
between resources and dpipe tables.
In this flow, 'ctx->resources' is always NULL as resources have yet to
be retrieved. As a result, the dpipe tables are not associated with a
resource identifier and the resource show command does not show any
dpipe tables:
$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
name kvd size 258048 unit entry dpipe_tables none
resources:
name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none
resources:
name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none
name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none
name span_agents size 3 occ 0 unit entry dpipe_tables none
name counters size 32766 occ 4 unit entry dpipe_tables none
resources:
name rif size 8192 occ 0 unit entry dpipe_tables none
name flow size 24574 occ 4 unit entry dpipe_tables none
name global_policers size 1000 unit entry dpipe_tables none
resources:
name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
name rifs size 1000 occ 1 unit entry dpipe_tables none
name port_range_registers size 16 occ 0 unit entry dpipe_tables none
name physical_ports size 64 occ 32 unit entry dpipe_tables none
Fix by moving the check against 'ctx->resources' to the place where it
is actually used. Output after the fix:
$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
name kvd size 258048 unit entry dpipe_tables none
resources:
name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128
dpipe_tables:
table_name mlxsw_adj
resources:
name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128
dpipe_tables:
table_name mlxsw_host6
name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128
dpipe_tables:
table_name mlxsw_host4
name span_agents size 3 occ 0 unit entry dpipe_tables none
name counters size 32766 occ 4 unit entry dpipe_tables none
resources:
name rif size 8192 occ 0 unit entry dpipe_tables none
name flow size 24574 occ 4 unit entry dpipe_tables none
name global_policers size 1000 unit entry dpipe_tables none
resources:
name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
name rifs size 1000 occ 1 unit entry dpipe_tables none
name port_range_registers size 16 occ 0 unit entry dpipe_tables none
name physical_ports size 64 occ 32 unit entry dpipe_tables none
Fixes: 0e7e1819453c ("devlink: relax dpipe table show dependency on resources")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Add str_to_bool() helper function to lib/utils.c that uses
parse_one_of() to parse boolean values. Update devlink to
use this common implementation.
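A minimal, self-contained sketch of how such a boolean parser could look. It does not call the real parse_one_of(); match_one_of() below is a hypothetical stand-in, and both the accepted spellings and the name str_to_bool_sketch() are assumptions for illustration only:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a "match one keyword from a list" helper.
 * Returns the index of the matching entry, or -1 if nothing matches. */
static int match_one_of(const char *val, const char *const *list, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (strcmp(val, list[i]) == 0)
			return (int)i;
	return -1;
}

/* Sketch of a boolean parser: even indices mean true, odd indices false.
 * The list of accepted spellings is an assumption. */
static int str_to_bool_sketch(const char *val, bool *out)
{
	static const char *const words[] = {
		"true", "false", "on", "off", "enable", "disable", "1", "0",
	};
	int idx = match_one_of(val, words, sizeof(words) / sizeof(words[0]));

	if (idx < 0)
		return -1;	/* unknown spelling */
	*out = (idx % 2) == 0;
	return 0;
}
```

Keeping the spellings in one table means every tool that shares the helper accepts the same set of boolean keywords.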
Signed-off-by: Petr Oros <poros@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Move mnlg.c to lib/ and mnlg.h to include/ to allow code reuse
across multiple tools.
Signed-off-by: Petr Oros <poros@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add support for the new inactive switchdev mode [1].
A user can start the eswitch in switchdev or switchdev_inactive mode.
Active: Traffic is enabled on this eswitch FDB.
Inactive: Traffic is ignored/dropped on this eswitch FDB.
An example use case:
$ devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive
Setup FDB pipeline and netdev representors
...
Once ready to start receiving traffic
$ devlink dev eswitch set pci/0000:08:00.1 mode switchdev
[1] https://lore.kernel.org/all/20251107000831.157375-1-saeed@kernel.org/
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Kernel commit c0ef144695910 ("devlink: Add support for u64 parameters")
added support for 64-bit devlink parameters. Add support for them to the
devlink userspace utility as well.
Tested on Microchip EDS2 development board...
Prior to the patch:
root@eds2:~# devlink dev param set i2c/1-0070 name clock_id value 1234 cmode driverinit
Value type not supported
root@eds2:~#
After patch:
root@eds2:~# devlink dev param set i2c/1-0070 name clock_id value 1234 cmode driverinit
root@eds2:~#
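As an illustration of what parsing a 64-bit parameter value involves, here is a hedged sketch built on the standard strtoull(). The function name parse_u64_sketch() and its error codes are assumptions and do not reflect the utility's actual helper:

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal, hex (0x...) or octal (0...) string into a u64.
 * Returns 0 on success, -EINVAL for malformed input, -ERANGE on overflow. */
static int parse_u64_sketch(const char *str, uint64_t *out)
{
	char *end;
	unsigned long long v;

	/* Require a leading digit: strtoull() would otherwise accept
	 * whitespace and a sign, and silently wrap negative values. */
	if (!str || !isdigit((unsigned char)*str))
		return -EINVAL;

	errno = 0;
	v = strtoull(str, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage after the number */

	*out = (uint64_t)v;
	return 0;
}
```

A value such as 4294967296 does not fit into 32 bits, which is why a dedicated 64-bit path is needed rather than reusing a u32 parser.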
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Currently, devlink silently exits when a non-existent device is specified
for flashing or when the user lacks sufficient permissions. This makes it
hard to diagnose the problem.
Print an appropriate error message in these cases to improve user feedback.
Prior:
$ devlink dev flash foo/bar file test
$ sudo devlink dev flash foo/bar file test
$
After patch:
$ devlink/devlink dev flash foo/bar file test
devlink answers: Operation not permitted
$ sudo devlink/devlink dev flash foo/bar file test
devlink answers: No such device
Fixes: 9b13cddfe268 ("devlink: implement flash status monitoring")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Add a new devlink health set option to configure the health
reporter’s burst period. The burst period defines a time window
during which recovery attempts for reported errors are allowed.
Once this period expires, the configured grace period begins.
This feature addresses cases where multiple errors occur
simultaneously due to a common root cause. Without a burst period,
the grace period starts immediately after the first error recovery
attempt finishes. This means that only the first error might be
recovered, while subsequent errors are blocked during the grace period.
With the burst period, the reporter initiates a recovery attempt for
every error reported within this time window before the grace period
starts.
Example:
$ devlink health set pci/0000:00:09.0 reporter tx burst_period 500
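The interaction between the burst period and the grace period can be modelled with a few lines of C. This is a simplified sketch of the behaviour described above, not the kernel implementation; the struct, its field names, and the abstract time units are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

struct reporter_model {
	uint64_t burst;		/* window in which every error is recovered */
	uint64_t grace;		/* quiet time that follows the burst window */
	uint64_t window_start;	/* time of the error that opened the window */
	bool active;		/* true once a window has been opened */
};

/* Returns true if an error reported at time 'now' gets a recovery attempt. */
static bool recovery_allowed(struct reporter_model *r, uint64_t now)
{
	if (r->active) {
		uint64_t burst_end = r->window_start + r->burst;

		if (now <= burst_end)
			return true;	/* still inside the burst window */
		if (now < burst_end + r->grace)
			return false;	/* blocked by the grace period */
	}
	/* No window yet, or the grace period is over: open a new window. */
	r->window_start = now;
	r->active = true;
	return true;
}
```

With burst set to 0, only the error that opens the window is recovered before the grace period starts, which corresponds to the behaviour without a burst period.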
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Kernel commit 1bbdb81a9836 ("devlink: Fix excessive stack usage in rate TC bandwidth parsing")
introduced a dedicated attribute set (DEVLINK_RATE_TC_ATTR_*) for entries nested
under DEVLINK_ATTR_RATE_TC_BWS.
Update the parser to reflect this change by validating the nested
attributes and sync the UAPI header to include the changes.
Fixes: c83d1477f8b2 ("Add support for 'tc-bw' attribute in devlink-rate")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Introduce a new attribute 'tc-bw' to devlink-rate, allowing users to
set the bandwidth allocation per traffic class. The new attribute
enables fine-grained QoS configuration by assigning a relative bandwidth
share to each traffic class, which allows more precise traffic shaping
across traffic streams.
Add support for configuring 'tc-bw' via the devlink userspace utility
and parse the 'tc-bw' arguments for accurate bandwidth assignment per
traffic class.
This feature supports 8 traffic classes as defined by the IEEE 802.1Qaz
standard.
Example commands:
- devlink port function rate add pci/0000:08:00.0/group \
tx_share 10Gbit tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0
- devlink port function rate set pci/0000:08:00.0/group \
tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0
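The "<tc>:<share>" tokens in the commands above could be parsed along these lines. This is a sketch under stated assumptions: parse_tc_bw_sketch() is a hypothetical name, and it only checks token syntax, the class range, and duplicates, since the text does not say which further constraints (for example on the sum of the shares) the utility enforces:

```c
#include <stdio.h>
#include <string.h>

#define TC_COUNT 8	/* eight traffic classes, per the text above */

/* Parse tokens of the form "<tc>:<share>" into bw[]. Returns 0 on success,
 * -1 on a malformed token, an out-of-range class, or a duplicate class. */
static int parse_tc_bw_sketch(int argc, const char *const *argv,
			      unsigned int bw[TC_COUNT])
{
	int seen[TC_COUNT] = { 0 };

	for (int i = 0; i < argc; i++) {
		unsigned int tc, share;
		char trailing;

		/* Only digits and ':' are allowed, which rules out signs. */
		if (strspn(argv[i], "0123456789:") != strlen(argv[i]))
			return -1;
		/* Field widths bound the values; the extra %c detects
		 * anything left over after the share. */
		if (sscanf(argv[i], "%1u:%9u%c", &tc, &share, &trailing) != 2)
			return -1;
		if (tc >= TC_COUNT || seen[tc])
			return -1;
		seen[tc] = 1;
		bw[tc] = share;
	}
	return 0;
}
```

Tracking the classes already seen lets the parser reject a command line that names the same traffic class twice instead of silently keeping the last value.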
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The port param show command's argument parser used the devlink dev flag
instead of the port flag, so the port device argument was not
recognized, causing the following error:
$ devlink port param show eth0 name link_type
Wrong identification string format.
Devlink identification ("bus_name/dev_name") expected
Use the correct devlink handle flag.
Fixes: 70faecdca8f5 ("devlink: implement dump selector for devlink objects show commands")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
When parsing with a selector, there is a list of extended handles
(devname/busname/x) which require special treatment.
DL_OPT_HANDLEP is one of them. The code tries to parse the devname/busname
handle and, if that succeeds, goes the "dump" way. However, if it does
not, regular parsing is done directly. That is wrong, as the options may
still be incomplete. Break in that case instead, allowing the dry parse
to run and possibly go the "dump" way if the option list is not
complete.
Fixes: 70faecdca8f5 ("devlink: implement dump selector for devlink objects show commands")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
When the return value of rtnl_talk() is greater than
or equal to 0, 'answer' will be allocated.
'answer' should be freed after use,
otherwise memory is leaked.
Signed-off-by: Minhong He <heminhong@kylinos.cn>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Print all of the missing parameters, also in the presence of unknown ones.
Take for example a correct command:
$ devlink resource set pci/0000:01:00.0 path /kvd/linear size 98304
And remove the "size" keyword:
$ devlink resource set pci/0000:01:00.0 path /kvd/linear 98304
That yields output:
Resource size expected.
Unknown option "98304"
Prior to the patch, only the last line of output was present. If the user
also forgot the "path" keyword, there would be an additional line:
Resource path expected.
on stderr.
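The "report everything that is missing" behaviour can be sketched as a loop over a table of required options. The option bits and the function name report_missing() are hypothetical; only the two message strings are taken from the output shown above:

```c
#include <stdint.h>
#include <stdio.h>

#define OPT_PATH	(1u << 0)	/* hypothetical option bits */
#define OPT_SIZE	(1u << 1)

struct opt_msg {
	uint32_t bit;
	const char *msg;
};

static const struct opt_msg opt_msgs[] = {
	{ OPT_PATH, "Resource path expected." },
	{ OPT_SIZE, "Resource size expected." },
};

/* Report every required option that was not found, instead of stopping at
 * the first one. Returns the number of missing options. */
static int report_missing(uint32_t required, uint32_t found, FILE *err)
{
	int missing = 0;

	for (size_t i = 0; i < sizeof(opt_msgs) / sizeof(opt_msgs[0]); i++) {
		if ((required & opt_msgs[i].bit) && !(found & opt_msgs[i].bit)) {
			fprintf(err, "%s\n", opt_msgs[i].msg);
			missing++;
		}
	}
	return missing;
}
```

Because the loop never returns early, a command line that lacks both keywords produces both messages in a single run.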
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The dl_opts_put() function did not consider the IO eqs option flag.
As a result, the max_io_eqs setting was applied only when it
was combined with other attributes such as roce/hw_addr.
When max_io_eqs was the only attribute set, the attribute
was not applied.
Fix it by adding the missing flag.
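The class of bug described here can be illustrated with a bitmask guard. This is only a sketch: the flag names are hypothetical (the real ones are not shown in this message), and it assumes the attributes are emitted only when a guard mask matches one of the present options:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option bits; the real flag names are not shown in the text. */
#define OPT_FN_HW_ADDR		(1u << 0)
#define OPT_FN_ROCE		(1u << 1)
#define OPT_FN_MAX_IO_EQS	(1u << 2)

/* Before the fix: the guard mask omits the max_io_eqs bit. */
static bool needs_function_attrs_buggy(uint32_t present)
{
	return present & (OPT_FN_HW_ADDR | OPT_FN_ROCE);
}

/* After the fix: the max_io_eqs bit is part of the guard mask. */
static bool needs_function_attrs_fixed(uint32_t present)
{
	return present & (OPT_FN_HW_ADDR | OPT_FN_ROCE | OPT_FN_MAX_IO_EQS);
}
```

With the buggy guard, a command that sets only max_io_eqs never reaches the code that emits the attribute, while combining it with roce or hw_addr happens to pass the check.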
Fixes: e8add23c59b7 ("devlink: Support setting max_io_eqs")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Devices send event notifications for the IO queues,
such as tx and rx queues, through event queues.
Enable a privileged owner, such as a hypervisor PF, to set the number
of IO event queues for the VF and SF during the provisioning stage.
Example:
Get maximum IO event queues of the VF device:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10
Set maximum IO event queues of the VF device:
$ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Devlink dev may contain one or more nested devlink instances.
Print them using previously introduced pr_out_nested_handle_obj()
helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
If port function contains nested handle attribute, print it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
A nested handle may contain the DEVLINK_ATTR_NETNS_ID attribute, which
indicates the network namespace where the nested devlink instance resides.
Convert it to a netns name if possible and print it to the user.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
For the existing pr_out_nested_handle() user (line card), the output stays
the same. For the new users, introduce __pr_out_nested_handle(),
which can print the devlink instance as an object so that it can carry
attributes (like netns).
Note that as __pr_out_handle_start() and pr_out_handle_end() are newly
used, the function is moved below their definitions.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Instead of printing a new line unconditionally, use __pr_out_newline()
to print it only when needed, avoiding double prints.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Use snprintf instead of sprintf to ensure only valid memory is printed
to and the output string is properly terminated.
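A short sketch of the pattern, using a "bus/dev" handle string as the payload. The helper name format_handle() is hypothetical; snprintf() and its return-value semantics are standard C:

```c
#include <stdio.h>

/* Format "bus/dev" into buf. Returns 0 on success, -1 if the result did not
 * fit (buf then holds a truncated but still NUL-terminated string). */
static int format_handle(char *buf, size_t len, const char *bus, const char *dev)
{
	int n = snprintf(buf, len, "%s/%s", bus, dev);

	/* snprintf() returns the length the full string would have had. */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

Unlike sprintf(), the call never writes past len bytes, and comparing the return value against len lets the caller detect truncation.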
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Support port function commands to enable / disable IPsec packet
offloads. This is used to control the port IPsec device capabilities.
When IPsec packet capability is disabled for a function of the port
(default), the function cannot offload IPsec operations. When enabled,
IPsec operations can be offloaded by the function of the port.
Enabling IPsec packet offloads lets the kernel delegate
encrypt/decrypt operations, as well as encapsulation and SA/policy and
state to the device hardware.
Example of a PCI VF port which supports IPsec packet offloads:
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable ipsec_packet disable
$ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable ipsec_packet enable
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Support port function commands to enable / disable IPsec crypto
offloads. This is used to control the port IPsec device capabilities.
When IPsec crypto capability is disabled for a function of the port
(default), the function cannot offload IPsec operations. When enabled,
IPsec operations can be offloaded by the function of the port.
Enabling IPsec crypto offloads lets the kernel delegate XFRM state
processing and encrypt/decrypt operations to the device hardware.
Example of a PCI VF port which supports IPsec crypto offloads:
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable
$ devlink port function set pci/0000:06:00.0/1 ipsec_crypto enable
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto enable
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Introduce a new helper dl_argv_parse_with_selector() to be used
by show() functions instead of dl_argv().
Implement it to check whether all options needed for get commands are
specified. If they are not, ask the kernel for a dump, passing only
the options (attributes) that are present. These form a sort of partial
key that instructs the kernel to do a partial dump.
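The decision between a "get" and a partial "dump" can be sketched with option bitmasks. This only models the idea; the type and function names are hypothetical, and the real helper works on the parsed command line rather than on bare masks:

```c
#include <stdint.h>

enum req_kind {
	REQ_GET,	/* full key given: ask for exactly one object */
	REQ_DUMP,	/* partial key: ask for a filtered dump */
};

struct request_plan {
	enum req_kind kind;
	uint32_t attrs;	/* option bits to send along with the request */
};

/* Decide between a "get" and a "dump", given which options the command
 * line supplied (found) and which ones a "get" needs (required). */
static struct request_plan plan_request(uint32_t required, uint32_t found)
{
	struct request_plan plan;

	plan.kind = ((found & required) == required) ? REQ_GET : REQ_DUMP;
	plan.attrs = found;	/* only the options that are present */
	return plan;
}
```

When no options are given at all, the plan degenerates to an unfiltered dump, which matches the behaviour of a plain show command.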
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
In preparation for the follow-up dump selector patch, make sure that the
command line argument parsing function returns -ENOENT when an
option is missing, so the caller can distinguish this case.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
In preparation for the follow-up dump selector patch, introduce the function
dl_argv_dry_parse(), which allows dry parsing of command line
arguments without printing any error messages to the user.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Currently, handle parsing is destructive as the "\0" string ends are
being put in certain positions during parsing. That prevents it from
being used repeatedly. This is problematic with the follow-up patch
implementing dry-parsing. Fix by making a copy of handle argv during
parsing.
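A sketch of non-destructive handle parsing: the split is done on a private copy, so the caller's argv string can be parsed again later. The function name parse_handle_copy() and its ownership convention are assumptions for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Split "bus/dev" without touching the caller's string. On success, *bus
 * and *dev point into one heap copy; free(*bus) releases both parts. */
static int parse_handle_copy(const char *arg, char **bus, char **dev)
{
	size_t len = strlen(arg) + 1;
	char *copy = malloc(len);
	char *slash;

	if (!copy)
		return -1;
	memcpy(copy, arg, len);

	slash = strchr(copy, '/');
	if (!slash || slash == copy || slash[1] == '\0') {
		free(copy);
		return -1;	/* not of the form "bus/dev" */
	}
	*slash = '\0';		/* terminate inside the copy, not in argv */
	*bus = copy;
	*dev = slash + 1;
	return 0;
}
```

Since the original string still contains the '/' separator afterwards, a dry parse followed by a real parse sees identical input.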
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
This is basically a cosmetic change. The SB index is not required to be
passed by the user; index 0 is used implicitly. This is ensured by
special handling at the end of dl_argv_parse(). Move this option from
optional to required options.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Be in-sync with port help and port man page and spell out the possible
states instead of "STATE".
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
It is common for all iproute2 apps to have command line option
names matching the show command outputs. However, that is not true
in the case of trap and trap group devlink objects.
The correct approach would be to have "trap" and "group" in the outputs,
but it is not possible to change that now. Instead, accept "name" as an
alternative to the "trap" and "group" options.
Examples:
$ devlink trap show netdevsim/netdevsim1
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
name vlan_tag_mismatch type drop generic true action drop group l2_drops
name ingress_vlan_filter type drop generic true action drop group l2_drops
name ingress_spanning_tree_filter type drop generic true action drop group l2_drops
name port_list_is_empty type drop generic true action drop group l2_drops
name port_loopback_filter type drop generic true action drop group l2_drops
name fid_miss type exception generic false action trap group l2_drops
name blackhole_route type drop generic true action drop group l3_drops
name ttl_value_is_too_small type exception generic true action trap group l3_exceptions
name tail_drop type drop generic true action drop group buffer_drops
name ingress_flow_action_drop type drop generic true action drop group acl_drops
name egress_flow_action_drop type drop generic true action drop group acl_drops
name igmp_query type control generic true action mirror group mc_snooping
name igmp_v1_report type control generic true action trap group mc_snooping
$ devlink trap show netdevsim/netdevsim1 trap source_mac_is_multicast
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
$ devlink trap show netdevsim/netdevsim1 name source_mac_is_multicast
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
$ devlink trap group
netdevsim/netdevsim1:
name l2_drops generic true
name l3_drops generic true policer 1
name l3_exceptions generic true policer 1
name buffer_drops generic true policer 2
name acl_drops generic true policer 3
name mc_snooping generic true policer 3
$ devlink trap group show netdevsim/netdevsim1 group l2_drops
netdevsim/netdevsim1:
name l2_drops generic true
$ devlink trap group show netdevsim/netdevsim1 name l2_drops
name l2_drops generic true
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The devlink utility stores an interface map that can be used to map an
interface name to a devlink port and vice versa. The map is populated by
issuing a devlink port dump via 'DEVLINK_CMD_PORT_GET' command.
Cited commits started to populate the map only when it is actually
needed. One such case is when a dump (e.g., shared buffer dump) only
returns devlink port handles. When pretty printing is required, the
utility will consult the map to translate the devlink port handles to
the corresponding interface names.
The above is problematic as it means that the port dump response(s) will
be queued to the same receive buffer as the response(s) of the dump that
triggered the port dump, resulting in a failed dump [1].
Fix by using a different netlink socket for the population of the
interface map.
[1]
$ devlink sb tc bind show
kernel answers: Device or resource busy
Failed to create index map
//0:
sb 0 tc 4 type egress pool 4 threshold 9
kernel answers: Device or resource busy
[...]
$ echo $?
1
Fixes: 5cddbb274eab ("devlink: load port-ifname map on demand")
Fixes: 63d84b1fc98d ("devlink: load ifname map on demand from ifname_map_rev_lookup() as well")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The json footer is missing for trap-policer output in "devlink mon".
Add it to fix the json output.
Fixes: a66af5569337 ("devlink: Add devlink trap policer set and show commands")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Support port function commands to enable / disable migratable
capability. This is used to set the port function as migratable.
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.
In order for a VM to be able to perform LM, all the VM components must
be able to perform migration, i.e., be migratable.
In order for VF to be migratable, VF must be bound to VFIO driver with
migration support.
When migratable capability is enabled for a function of the port, the
device makes the necessary preparations for the function to be
migratable, which might include disabling features which cannot be
migrated.
Example of LM with migratable function configuration:
Set migratable of the VF's port function.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable disable
$ devlink port function set pci/0000:06:00.0/2 migratable enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable enable
Bind VF to VFIO driver with migration support:
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
Attach VF to the VM.
Start the VM.
Perform LM.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Support port function commands to enable / disable RoCE. This is used to
control the port RoCE device capabilities.
When RoCE is disabled for a function of the port, the function cannot
create any RoCE specific resources (e.g. GID table).
It also reduces system memory utilization. For example, disabling RoCE
enables a VF/SF to save 1 Mbyte of system memory per function.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce enabled
$ devlink port function set pci/0000:06:00.0/2 roce disable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce disabled
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Recent kernels send a PORT_NEW message when the ifname changes,
so benefit from that by keeping ifnames updated.
Whenever there is a message containing DEVLINK_ATTR_PORT_NETDEV_NAME
attribute, use it to update ifname map.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
There is common code in pr_out_port_handle_start() and
pr_out_port_handle_start_arr(). As the next patch is going to extend it
even more, push the code into a common helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Currently, when user specifies ifname as a handle on command line of
devlink, the related devlink port is looked-up in previously taken dump
of all devlink ports on the system. There are 3 problems with that:
1) The dump iterates over all devlink instances in kernel and takes a
devlink instance lock for each.
2) Dumping all devlink ports would not scale.
3) Alternative ifnames are not exposed by devlink netlink interface.
Instead, benefit from RTNL get link command extension and get the
devlink port handle info from IFLA_DEVLINK_PORT attribute, if supported.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add a couple of helpers to alloc/free a map object along with list
addition/removal.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The kernel has gained support for reading from regions without needing to
create a snapshot. To use this support, the DEVLINK_ATTR_REGION_DIRECT
attribute must be added to the command.
For the "read" command, if the user did not specify a snapshot, add the new
attribute to request a direct read. The "dump" command will still require a
snapshot. While technically a dump could be performed without a snapshot it
is not guaranteed to be atomic unless the region size is no larger than
256 bytes.
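The rule described above, direct read for "read" without a snapshot and a mandatory snapshot for "dump", can be sketched as a small decision function. The enum and function names are hypothetical and only model the choice, not the netlink message construction:

```c
#include <stdbool.h>

enum region_cmd { REGION_CMD_READ, REGION_CMD_DUMP };

enum region_mode {
	REGION_MODE_SNAPSHOT,	/* operate on the snapshot the user named */
	REGION_MODE_DIRECT,	/* read live contents, no snapshot needed */
	REGION_MODE_INVALID,	/* command cannot run without a snapshot */
};

static enum region_mode region_pick_mode(enum region_cmd cmd, bool has_snapshot)
{
	if (has_snapshot)
		return REGION_MODE_SNAPSHOT;
	if (cmd == REGION_CMD_READ)
		return REGION_MODE_DIRECT;	/* would add the direct attribute */
	return REGION_MODE_INVALID;		/* "dump" still requires a snapshot */
}
```

Keeping "dump" on the snapshot path preserves the atomicity guarantee that a direct read of a large region cannot give.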
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Setting a parent during creation of the node doesn't work, despite
documentation [1] clearly saying that it should.
[1] man/man8/devlink-rate.8
Example:
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
Unknown option "parent"
Fix this by passing DL_OPT_PORT_FN_RATE_PARENT as an argument to
dl_argv_parse() when it gets called from cmd_port_fn_rate_add().
Fixes: 6c70aca76ef2 ("devlink: Add port func rate support")
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
To fully utilize the hierarchical QoS algorithm, a new attribute
'tx_weight' needs to be introduced. The weight attribute allows the use
of the Weighted Fair Queuing arbitration scheme among siblings. This
arbitration scheme can be used simultaneously with strict priority.
Introduce the ability to configure tx_weight from the devlink userspace
utility. Make the new attribute optional.
Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_weight 50 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 20
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>