The following sections provide details of new features and updates introduced in each release.

Kubex

This topic summarizes both new and updated features introduced in Kubex.
The following changes have been made to Kubex in this release:
  • New Kubex AI Agent Experience
    • Natural-Language Interaction: Users can now simply ask questions in plain English and get accurate, contextual answers directly from the AI agent.
    • Guided Navigation: The agent can take users to the right features or pages in the product—no need to search menus or remember where things live.
    • Action Assistance: Beyond answering questions, the agent can help surface relevant insights and streamline everyday tasks.
    • Enterprise-Grade Data Privacy: No customer data is used for model training. The underlying LLM does not learn from or use customer data to train or improve itself.

    Figure: Kubex AI Agent

    Enhanced GPU Analysis
    • The Kubex GPU analysis recommends the optimal GPU model and configuration for maximum potential savings. This includes alternative GPU models, provided they are compatible with the workload.
    • The newly introduced GPU Catalog Map provides a visual comparison of how the current workload scores against all GPU models and configurations.

    Figure: GPU Catalog Map

The following changes have been made to Kubex in this release:
  • Policy Editing
    • Version 3.8 includes a new and frequently requested capability to edit analysis policies via the UI.
    • The “policy_admin” role enables specific users to edit both cloud and container analysis policy settings directly from the Kubex UI.
    • Additionally, the date of the last update, and who made the update, are now exposed in the UI.
    • The policy descriptions have been updated to make them easier to understand, and tooltips now include detailed information about each setting along with its default value.

    Figure: Policy Editing

    Cloud Optimization Trends Improvements
    • The loading and rendering performance of the optimization trends charts on the cloud Summary and Optimization Trends tabs has been significantly improved in this release.
    Kubex Data Collection for GKE Autopilot Clusters
    • Kubex now supports data collection and analysis of containers running in GKE Autopilot clusters, providing visibility into container-level risk and waste, along with remediation recommendations for optimizing container resource allocations.
The following changes have been made to Kubex in this release:
  • CPU Throttling Risks
    • The CPU Throttling risks due to undersized CPU Limits are highlighted in the Histogram.
    • Clicking on the risk item (as shown below) will take you to the Analysis Details page showing all the containers with CPU Throttling risks, and the Kubex recommendation to address it.

    Figure: Analysis Details - CPU Throttling Risks

  • Optimization Trends
    • The Optimization Trends reports visualize historical trends across key operational metrics, including instance counts, costs, savings, and CPU/Memory utilization. This helps users better understand performance patterns and optimization impact over time.
    • Dashboards can be personalized — users can select which reports to display and organize them in their preferred layout. Each report can be resized to small, medium, or large for optimal viewing.
    • Users can save dashboards as public (others in their organization can view but not edit them) or private (only they can view and edit them).

    Figure: Optimization Trends on the Summary Tab

    Figure: Optimization Trends Dashboard

The following changes have been made to Kubex in this release:
  • GPU MIG Recommendation—Supports NVIDIA MIG (Multi-Instance GPU) technology, recommending whether workloads should use a full GPU or a fraction (MIG slice).
    • The Analysis Details table will highlight the suggested MIG profile (and the GPU fraction it represents).

    Figure: Analysis Details Table - Recommended MIG Device

    • MIG recommendation info is also available in the GPU tab for an individual container manifest

    Figure: GPU Tab - Optimization Summary with MIG Recommendation

  • Updated Container Summary Tab
    • “Waste” links now take you directly to the Analysis Details tab with the appropriate filter preselected
    • An “Info” tooltip has been added to the Spend Breakdown modal. It highlights the total unspecified CPU/Memory request, so the numbers here and in the Analysis Details table match up.
  • Memory Limit Events (Last Day) added to Analysis Details table—This is a hidden column by default.
  • Node/Node Group Analysis—Improved quality of recommendations by switching to “Working Set Memory” when available. Working set memory is a better metric to analyze the memory requirements for Nodes.
  • Other Improvements & Updates
    • GPU and GPU Memory metrics viewer chart: added “Current Request” line
    • Added a new view on the Analysis Details table: “GPU Request Surplus”. This highlights the container manifests that have surplus GPU and can be optimized to reduce waste.
    • The Container API now includes an “analyzedOn” date, which allows automation to consider only fresh recommendations.
    • Removed the word “AI” from “AI Analysis Details”
  • Policies—View policies for Cloud environments in a new modal, accessible from the Policies tab and Analysis Details table.

    Figure: Policy Modal

    • Policies Tab

    Figure: Policies Tab

    • Analysis Details Table

    Figure: Analysis Details Table - Policy

  • Updated Azure Catalog Map—Azure Catalog Map has been updated to show all supported VM types and sizes. These are displayed on the map when “Commonly Used” is unselected.

    Figure: Updated Azure Catalog Map

  • Added columns to Analysis Details table
    • Comment
    • Recommendation Reason
    • RDS Cluster ID
    • Resource Tags
  • SSO with Okta—Removed requirement for Okta’s CAS (Central Authentication Service).
  • Password Reset—Password reset emails now link to Kubex instead of Kubex Console.
The following changes have been made to Kubex in this release:
  • Updated Container Summary Tab—The container summary page has been updated with a new look and feel to better highlight current spend, waste, and risk.

Figure: Container Summary Tab

  • Container Out of Memory Kills—Container out-of-memory kill data is now visible in the OOM Kills (Last Day) column on the Analysis Details page:
    • Badges display on the Summary and Histogram pages to indicate if there are OOM Kills in the environment
    • ML Model and historical data available in the Utilization Charts carousel and Metrics Viewer
  • CPU Throttling (%)—CPU Throttling data is now visible in the CPU Throttling (%) column on the Analysis Details page:
    • ML Model and historical data available in Utilization Charts carousel and Metrics Viewer
  • Automation Tab—An Automation tab has been added to highlight request and limit changes made by the Kubex Automation Controller. An overview of optimized manifests is also available on the Summary page.
  • Business & Operational Attributes—Various business and operations-related attributes (e.g., Application, Business Unit, Cost Center, Operational Environment) have been added to the tree view/filter options and Analysis Details columns
  • GPU Optimization for Container Resource Requests—Containers with GPUs allocated now receive recommendations for downsize opportunities. The GPU Overview page has been updated to highlight this information:
    • Added system views to Analysis Details (Immediate GPU Savings & Additional GPU Waste)
The following changes have been made to Public Cloud optimization in this release:
  • Effort Popup—The Effort column on the Analysis Details page is now clickable, showing a popup with additional details on the effort required to change to the recommended instance type.
    • Overall effort is the highest effort of the individual rule items. Possible values are None, Low, Medium, and High
  • Business & Operational Attributes—Various business and operations-related attributes (e.g., Application, Business Unit, Cost Center, Operational Environment) have been added to the tree view/filter options and the Analysis Details page columns.
The following changes have been made to Kubex in this release:
  • Updated Container Summary Tab—The container summary page has been updated to better highlight both your current spend and the details of possible savings.
  • Updated Node Metrics—The Node Metrics Viewer has been updated to show the following GPU metrics:
    • GPU Utilization %
    • GPU Memory Utilization (%)
    • GPU Power Usage (Watts)
  • Updated Utilization Chart—The node chart showing the number of pods running on a node has been updated to show the limit for the selected node.
  • Customizable Data Aggregation—You now have the ability to edit how grouped data in a table is aggregated (e.g., sum, average, max) and save your customization in a table view. This feature is available for containers, node groups, nodes, and cloud instances.
Refer to the Kubex documentation for details.
The following changes have been made to the Public Cloud optimization in this release:
  • Cloud Connections—In this release you can now create and edit the connections to AWS and Azure. You can also view the data collection status for each of the connections. Kubex uses these connections to collect data from each account daily.
  • Accessing the Catalog Map—You can now access the catalog map directly from the Analysis Details tab. The catalog map has been added to the utilization charts as a modal view, so you no longer need to leave the Analysis Details tab to use it.
  • Summary Tab—The Cloud Summary tab has been updated to improve usability.
Refer to the Kubex documentation for details.
The following changes have been made to improve how Kubex determines the cost of public cloud instances, using customer-specific discounts:
  • Azure Hybrid Benefit Status—As part of enhancing the cost data model, Kubex needs to identify systems that are BYOL (also called Azure Hybrid Benefit or AHB) to accurately represent the cost of the system. The license model is available in the existing data collection configuration, so no changes are required to the minimum permissions for BYOL VMs.
  • Azure Instance OS Name—Kubex now determines the type of Linux OS. In this release RHEL and SUSE are supported. The OS name is used to accurately determine instance costs.
  • Updated Cost Model—The cloud cost data model has been updated to apply discounts based on the following details:
    • OS Type—The type of Linux OS installed on a cloud instance. This value is stored in a table but is not an attribute, so it cannot be edited or cleared like an attribute.
    • License Model (Attr_key is aws:licensemodel)—The licensing model that has been configured for a cloud instance.
    • Life Cycle (Attr_key is aws:lifecycle)—This attribute is not currently used but is collected for future development. It stores whether or not the instance is a spot instance. Possible values are “spot” or “normal”.
Refer to the Kubex documentation for details.
The following changes have been made to Kubex in this release:
  • Node Modal—A new modal opens when you click a node name on the Node Details page.
  • Performance Improvements— The Kubex UI no longer retrieves data from the Kubex Reporting Database (RDB). All UI component queries are directed exclusively to Kubex-specific tables in MS SQL.
Refer to the Kubex documentation for details.
The following changes have been made to Kubex in this release:
  • GPU Metrics for Containers—In this release Kubex introduces reporting of GPU configuration and utilization data for containers, node groups and nodes. This data enables you to identify GPU-enabled containers and nodes in your environment and see how effectively you are utilizing your GPU resources. You must have data forwarder v4.2.1 deployed to collect the GPU data from your environments. There are updates on many pages to highlight GPU resource configuration and utilization:
    • Updated Analysis Details—A number of data points allow you to identify containers with GPU requests and to determine where the GPU and GPU memory request values are higher than peak utilization over a given period. The analysis populates these attribute values based on the date range and data selection options that are defined in your container optimization policy. The Analysis Details page has been updated to display the following GPU data.
      • GPU Model
      • GPU Request
      • Total GPU Request
      • GPU Average
      • %GPU Average
      • GPU Sustained
      • GPU Min
      • GPU Memory Request
      • Total GPU Memory Request
      • GPU Memory Average
      • % GPU Memory Average
      • GPU Memory Sustained
      • GPU Memory Min
      • GPU Power Usage Peak
      • GPU Power Usage Average
    • New Views—Four new views have been added to the Analysis Details page. These views allow you to focus on your GPU data:
      • GPU Inventory
      • GPU Cost
      • GPU Low Utilization
      • GPU Memory Low Utilization
      The default view has been updated to include GPU data.
    • Updated Summary—New GPU-specific cards have been added to the containers Summary page. A note includes a link that opens a modal containing GPU pricing details. The top two cards on the Summary page have also been updated to better highlight both your current spend and the details of possible savings.
    • New GPU Tab—A new GPU tab has been added when reviewing a single container. This new page provides details of the GPU utilization on the selected container.
    • Updated Metrics Viewer—The following GPU metrics are available in the metrics viewer:
      • GPU Utilization in GPUs - Average Container
      • GPU Memory Utilization (MB) - Average Container
      • GPU Memory Utilization (%) - Average Container
      • GPU Power Usage (Watts) - Average Container
  • GPU Metrics for Node Groups—GPU metrics have also been added for Node Groups and individual nodes.
    • Allocatable GPU Memory (GB)
    • Allocatable GPUs
    • Allocatable Memory (GB)
    • Average CPU Utilization (%)
    • Average GPU Memory Utilization (%)
    • Average GPU Utilization (%)
    • Average Memory Utilization (%)
    • GPU - Node Balance Ratio
    • GPU Capacity
    • GPU Memory - Node Balance Ratio
    • GPU Memory Capacity (GB)
    • GPU Request (% of Allocatable)
    • No. of Nodes with Underused GPU
    • No. of Nodes with Underused GPU Memory
    • Nodes with Underused GPU Memory (%)
    • Nodes with Underused GPU (%)
    • Peak GPU Memory Utilization (%)
    • Peak GPU Utilization (%)
    • Primary Node GPU Model
    • Primary Node Type GPU (GPUs)
    • Primary Node Type GPU Memory (GB)
    • Unallocated GPUs
  • Updated Node Overview Tab—This tab has been updated to add GPU details and to improve usability.
  • New Data in Analysis Details—Six new data points allow you to find containers where the CPU and memory request values are higher than peak utilization over a given period. The analysis populates these values based on the date range and data selection options that are defined in your container optimization policy.
    • CPU Peak—Highest peak value in mCores, on the busiest container.
    • CPU Sustained—Highest sustained value in mCores, on the average container.
    • CPU Min—Lowest value in mCores, on the average container.
    • Memory Peak—Highest peak value in MB, on the busiest container.
    • Memory Sustained—Highest sustained value in MB, on the average container.
    • Memory Min—Lowest value in MB, on the average container.
  • Timezone—The utilization charts for containers, node groups, and nodes have been updated to standardize the timeline on UTC. The X-axis label has been updated to indicate the timezone designation for all charts as ‘UTC’.
  • Updated Node Overview—The node overview page has been updated to improve usability and add details of GPU utilization.
Refer to the Kubex documentation for details.
The following changes have been made to Public Cloud optimization in this release:
  • Timezone—The X-axis label has been updated to indicate the timezone designation for all historical utilization charts as ‘UTC’. This change does not currently apply to the ML charts. The ML charts on the Analysis Details page use the Kubex server’s time, so this will be either “Eastern Time” or “Central Europe Time”.
Refer to the Kubex documentation for details.
The following changes have been made in this release:
  • Users no longer need to clear their browser cache after an upgrade. When using the Kubex Console, however, users must still clear their browser cache after an upgrade.
  • Analysis Details Export—When the content of the Analysis Details is exported, the Container Name is provided as a hyperlink that takes you to the Overview page for the selected container. When you open the exported .XLS file, the content initially appears as plain text; when you click on the cell it becomes a hyperlink.
  • Node Group Overview—The cards on the Node Group Overview page have been updated to better report on the status of a single node group. Additionally, utilization charts have been added to this page so you can review detailed metrics for a selected node group.
  • New Filter—A new public filter, “With automation enabled”, has been added to the container tree viewer. When selected this filter displays all containers with automation enabled. i.e. Automation Enabled = true.
  • Improved Performance—By focusing on nodes that were running in the last day rather than the past 7 days, you will notice improved performance when working with nodes and node groups.
  • Access to the Online Help—A new help icon provides direct access to the online help.
  • Renamed Value—The value “No. of Containers” has been renamed to “Avg No. of Containers”. The new name more appropriately describes the value. The following pages have been updated: Summary, Overview, Histogram and Analysis Details.
Refer to the Kubex documentation for details.

In this release Kubex introduces Public Cloud optimization features. In this new console, Kubex’s patented analytics determine optimal resource settings for your public cloud environments and display the results in a new user interface.

Use the new console to review resource utilization across your AWS, Azure and Google Cloud environments.

You can access the new console from your existing instance using:
https://<customer name>.kubex.ai
See the new Kubex website to start a free trial. Use your existing Kubex credentials to access the console. You can either close the browser window or navigate back to the Kubex Console to log out. The new console uses the latest MSSQL and Postgres databases.
The following changes have been made to Kubex in this release:
  • Automation Tab—A new automation tab has been added to report on the status of the Kubex Container Automation solution.
  • Updated Connections Table —The following changes have been made to this table:
    • Three new columns have been added: Cluster Version, Kubex Collection Stack Version and Prometheus Version
    • A link that provides secure Kubex credentials has been added. The credentials are provided in a code snippet that can be copied directly into your values.edit.yaml file. This new method makes it easier for you to set up container data collection, removing the previous manual step of requesting these credentials from Kubex.
  • Updated Optimization Breakdown—Hovering over any bar in this chart displays a description of the recommended optimization. Badges have also been added to indicate significant information, such as memory events or unspecified resource settings.
  • Summary Page—The Summary page has been updated for improved readability.
  • Cost Modeling—Kubex now has improved algorithms that use existing utilization data to estimate the cost for containers with unspecified CPU/memory request values. The Summary, Analysis Details, and Overview pages have all been updated to show the cost estimates for these containers. The following updates have been made to show these values:
    • Two columns have been added to the Analysis Details tab:
      • “Total Surplus CPU Request from Unspecified (mCores)”
      • “Total Surplus Memory Request from Unspecified (MB)”
    • Two system views have been added to the Analysis Details tab:
      • “CPU Request Shortfall (from Unspecified)”
      • “Memory Request Shortfall (from Unspecified)”
    • The Overview tab and single instance modal page have been updated to add details of unspecified CPU and memory request settings. These additional cards are only displayed if the selected container does not have specified CPU and/or memory request settings.
Refer to the Kubex documentation for details.
This release provides enhancements to the Kubernetes automation framework, including an updated version of the Mutating Admission Controller (MAC) that sends configuration and modification event status information to your Kubex instance. A new Automation status page has been added to Kubex that reports on the status of containers that have been enabled for automation. You can enable/disable containers for automation using a right-click context menu from the Kubex container tree viewer. Refer to the Kubex documentation for details. Refer to Mutating Admission Controller for details on deploying and using the Kubex Container Automation solution.
The following endpoints have been added to the Kubex API:
  • GET /kubernetes/clusters
  • GET /kubernetes/clusters/<clusterName>
  • GET /kubernetes/clusters/<clusterName>/containers
  • GET /kubernetes/clusters/<clusterName>/containers?details=true
  • GET /kubernetes/clusters/<clusterName>/automation
Two new attributes have been added to support the automation features.
  • Kubex Automation—Indicates whether the container is eligible for automation within Kubex. The attribute is set once, is static, and is not inherited from the cluster, node group, or any higher level.
  • Kubex Policy—Indicates what policy to use when automation is enabled.
See Kubex API New Features
The token expiry has been increased from 5 to 60 minutes. Re-authorization is no longer required for longer-running pipeline operations that previously exceeded the 5-minute limit. The JSON Web Token (JWT) secures the Kubex API so users can only make authorized API requests. See Authorize
In this release, singleton scale groups, both ASGs and VM Scale Sets, will now be analyzed as scale groups rather than as EC2 or VM instances. You will observe the following changes:
  • Kubex Console:
    • Singleton ASGs/VM Scale Sets will now only appear in the corresponding ASG or VM Scale Set tabs.
    • If you have a large number of these singleton scale groups, you may notice a decrease in the number of EC2 and Azure VMs reported, while the ASG and VM Scale Set counts will increase accordingly.
    • The Catalog Map will no longer report on ASGs and VM Scale Sets with Max Size = 1.
    • There is no change to the Impact Analysis and Recommendation Report for ASGs, as this report already recognized singleton ASG instances and reported on them accordingly.
  • Scaling Recommendations:
    • Singleton scale groups will now have scaling recommendations, as required. Previously, there were no scaling recommendations, only instance type change recommendations.
    • If you need to control the group size and prevent scaling, use the “Group Max Size Override” and “Group Min Size Override” attributes. See Overriding Cloud Recommendations. Contact [email protected] for details on configuring these attributes.
  • API Results:
  • Licensing:
    • Currently, singleton scale groups are licensed as individual EC2/VMs.
    • With this update, ASG licensing can be based on 1 ASG = 4 licenses OR 1 license per instance, based on the average number of instances in the scale group. VM Scale Sets are licensed using 1 license per instance, based on the average number of instances in the group. See License Compliance Report.
    • The current container licensing model is based on the last known number of containers running on the scale groups. With this change, container licensing will be calculated based on the average number of containers used over the data range, as defined in the policy.
The public cloud metadata catalogs have been updated to ensure Kubex recommendations are based on the latest vendor pricing:
  • AWS Metadata Updates:
    • The pricing has been updated and is correct as of April 7, 2025.
    • The following new instance type has been added:
      • i8g.48xlarge

  • Azure Metadata Updates:
    • The pricing has been updated and is correct as of April 7, 2025.
    • The following new instance families have been added:
      • Dsv6
      • Ddsv6
      • Dasv6
      • Dadsv6
      • Dlsv6
      • Dldsv6
      • Dalsv6
      • Daldsv6
      • Esv6
      • Edsv6
      • Easv6
      • Eadsv6
      • Fasv6
      • Falsv6
      • Famsv6
    • Retired the following instances:
      • NC instances
      • NC-Promo instances
      • NCv2 instances
  • GCP Metadata Updates:
    • The pricing has been updated and is correct as of April 14, 2025.
    • The following new instance families have been added:
      • c4a-standard-lssd
      • c4a-highmem-lssd

Container Data Forwarder

This section lists new features and updates to the Container Data Forwarder. A Helm chart bundles all of the components required for container data collection and automates the process. See Kubex Collection Stack for details of a single-cluster configuration. Refer to the GitHub repository for samples and configuration files for multi-cluster configurations. When deploying the Container Data Forwarder, ensure that the same version is deployed for all of your clusters. See Data Collection for Containers.
This release adds support for collecting data from Google Managed Prometheus (GMP), which is the only supported method for data collection from GKE Autopilot clusters.
  • GKE Autopilot Compatibility—GMP is now fully supported, enabling data collection from GKE Autopilot clusters where traditional Prometheus deployments are not possible.
  • Managed Service Integration—Leverages Google’s managed Prometheus service for seamless data collection without requiring in-cluster Prometheus installations.
  • backoffLimit Support—The data forwarder now supports configuring backoffLimit for both Job and CronJob resources, providing better control over retry behavior when data collection encounters transient failures.
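As a minimal sketch of where this retry setting lives, the snippet below shows backoffLimit on a standard Kubernetes CronJob job template; the resource name and schedule are placeholders, and the exact key exposed by the Helm chart values may differ.

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: kubex-data-forwarder          # placeholder name
    spec:
      schedule: "0 * * * *"               # placeholder schedule
      jobTemplate:
        spec:
          backoffLimit: 2                 # retries before the Job is marked as failed
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: data-forwarder
                  image: densify/container-optimization-data-forwarder:4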
This update collects additional GPU metrics.
  • The following GPU metrics are now collected for Containers:
    • GPU Utilization in GPUs - Average Container*
    • GPU Utilization in Percent - Average Container*
    • GPU Memory Utilization in MB - Average Container*
    • GPU Memory Utilization (%) - Average Container*
    • GPU Power Usage (W) - Average Container*
    • GPU Utilization in GPUs - Busiest Container
    • GPU Utilization in Percent - Busiest Container
    • GPU Memory Utilization in MB - Busiest Container
    • GPU Memory Utilization (%) - Busiest Container
    • GPU Power Usage (W) - Busiest Container
  • The following GPU metrics are now collected for nodes:
    • GPU Utilization (GPUs)
    • GPU Utilization (%)*
    • GPU Memory Utilization (MB)
    • GPU Memory Utilization (%)*
    • GPU Power Usage (W)*
    • GPU Requests (GPU)
    • GPU Limit (GPU)
    Only those metrics indicated with an asterisk (*) are displayed in the current Kubex user interface. The remaining metrics will be covered in a future release.
The Helm charts have been updated as follows:
  • All-In-One Kubex Collection Stack has been updated to version 0.9.8;
  • The chart containing only the data forwarder has been updated to version 4.0.6.
These updates address the following issues:
  • The metrics that are retained in Prometheus are now limited to only the metrics that Kubex requested. This change reduces the resources consumed by Prometheus in large clusters.
  • Both the pod security context and container security context are now set for all of the AIO chart components, including the Kubex data forwarder, Prometheus, Node Exporter, etc. This includes using the runtime default seccomp profile, running as a non-root user, disallowing privilege escalation, and mounting the root file system as read-only. A sketch of these settings follows this list.
  • A sizing option for Prometheus resources has been added. The size is based on the cluster size.
  • The Prometheus subchart has been updated to the latest version.
  • The Prometheus scrape configuration has been updated to use the Kubernetes EndpointSlice API. The Endpoints API that the data forwarder was using was deprecated in the K8s v1.33 API and will be removed in a future release. EndpointSlice has been available since K8s v1.21.
  • Installation instructions for offline (or air-gapped) mode are now provided.
See Kubex Collection Stack for details. Ensure that your helm chart has been upgraded to the latest version, using the helm upgrade command.
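The following is a minimal sketch of the kind of pod- and container-level security context described in the list above; it is illustrative only, and the actual values set by the AIO chart may differ.

    # Fragment of a pod spec (illustrative)
    securityContext:                       # pod-level security context
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
    containers:
      - name: data-forwarder
        image: densify/container-optimization-data-forwarder:4
        securityContext:                   # container-level security context
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - ALL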
This update upgrades Go to 1.24 for all package dependencies. This upgrade eliminates security scan warnings that were preventing some customers from upgrading their data forwarder to the latest version.
This update adds support for NVIDIA GPU metrics and HPA:
NVIDIA GPU Metrics—With this update, the Kubex data forwarder now collects GPU data from your Kubernetes clusters. The analysis and display of the collected metrics will be provided in a future release of Kubex. Note the following additional prerequisites to collect the GPU data:
  • NVIDIA-device-plugin—This plugin allows containers to access the NVIDIA GPUs. It must be installed on all your Kubernetes clusters to allocate NVIDIA GPU resources to workloads and to provide the GPU data. A workload requests GPUs as shown in the sketch after this list.
  • dcgm-exporter—This Prometheus exporter exposes GPU metrics from the Data Center GPU Manager (DCGM). It is required to collect GPU data such as utilization, memory usage, and power usage from NVIDIA GPUs. The dcgm-exporter can be deployed as a DaemonSet, where each node with an NVIDIA GPU runs a pod that exposes these metrics in a format that Prometheus can scrape and the Kubex data forwarder then collects.
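As an illustration of how the device plugin exposes GPUs to workloads, the following container spec requests one NVIDIA GPU via the standard nvidia.com/gpu extended resource; the pod name and image are placeholders.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-workload-example          # placeholder name
    spec:
      containers:
        - name: app
          image: nvidia/cuda:12.3.1-base-ubuntu22.04   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1           # GPU allocated by the NVIDIA device plugin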
The GPU data collection is currently supported on the following platforms:
  • AKS
  • EKS
  • GKE
HPA Scales Based on Multiple Metrics—With this update, Kubex recommendations are not implemented for any containers enabled for HPA that scale based on single or multiple metrics and that are also enabled for automation using the Kubex Mutating Admission Controller (MAC). Previously, Kubex did not support HPA based on multiple metrics.
The Kubex Collection Stack has been updated to collect data from the NVIDIA DCGM exporter. You must ensure that your helm chart has been upgraded to the latest version (0.9.6) to get GPU data. You will get the updates automatically if you have the following settings in the data forwarder cronJob specification:
  • image: densify/container-optimization-data-forwarder:4
  • imagePullPolicy: Always
If the settings are not configured as above, then you need to update the image manually.
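For illustration, the snippet below shows where these two settings sit in the CronJob's container spec; the container name is a placeholder and your chart-generated manifest may differ.

    spec:
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: data-forwarder                                    # placeholder name
                  image: densify/container-optimization-data-forwarder:4  # major tag "4" picks up 4.x updates
                  imagePullPolicy: Always                                 # always pull, so new images are used automatically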
This update resolves an issue with reporting a container’s parent node and node groups.
This update collects new container and node-level metrics to provide more accurate recommendations. Upgrade to data forwarder v4.1.2 after upgrading to Kubex v2.10.0. New database columns that are required by this version of the data forwarder for the new metrics and attributes have been added in v2.10.0. When updating the data forwarder, you need to ensure that the same version is deployed for all of your clusters.
The data forwarder has been updated to collect the following:
  • Horizontal Pod Autoscaler (HPA) target metrics—Three new attributes have been added.
    • hpa_target_metric_name
    • hpa_target_metric_type
    • hpa_target_metric_value
  • Node taints—The configuration attribute and new metrics will be collected:
    • Added the multi-value attribute, “Node Taints” (attr_NodeTaints)
  • QoS class—The configuration attribute and new metrics will be collected.
    • Added the attribute, “Quality of Service Class” (attr_QOSClass)
  • Node Working Set Memory Metrics—Working set memory is a process (or container) metric, rather than a node metric. It has been added to align with what is already shown in the AKS console. Kubex provides the following additional node memory metrics:
    • working set memory (in bytes)
    • working set memory utilization (percent)
    • memory utilization (percent) (based on memory_bytes metric)
    • memory actual utilization (percent) (based on memory_actual_workload metric)
    • total node memory (in bytes) (configuration attribute)
Kubex’s container analyses will be updated in future releases to utilize this HPA data when generating recommendations.
To improve the identification of a container’s parent node, the provider_id has been introduced as a third identification component. Currently, nodes are identified using cluster_name and node_name. The provider_id is optional and will be used only if required. In Kubernetes clusters without a provider_id, node identification will continue to rely on cluster_name and node_name, ensuring that node IDs remain unchanged. Specifically, for EKS and OKE clusters, existing node IDs will change and, as a result, node history will be lost since these nodes will receive new IDs.
The data forwarder’s configmap.yaml file has been updated. The node_group_list setting has been updated to add label_karpenter_sh_nodepool. When enabled, Karpenter NodeGroups are discovered and created in Kubex.
This update resolves an issue with reporting CPU and memory reservation percent for nodes, node groups and clusters.
Version 4.1.0 collects new container and node-level metrics to provide more accurate container sizing recommendations.
Node-level metrics and some new container metrics are collected, but will not be available in the Kubex Console until the release of Kubex v2.6.
The following container metrics are now collected:
  • max_cpu_throttling_percent—The Linux kernel allocates “CPU periods” (default = 100 milliseconds) to both processes and containers. The percentage is the number of throttled periods out of the total number of 100-ms periods in which CPU was actually requested, which provides a more accurate indication of resource limitations. For example, if the pod was throttled for 8 seconds out of 5 minutes, this would be 2.7% of elapsed time; but if the CPU was not actually requested for the full 5 minutes, then 2.7% is not an accurate representation of the state of the container. If the pod was throttled for 8 seconds and its requested periods covered less than the full 300 seconds, the throttled percentage is higher.
  • avg_cpu_throttling_percent—Indicates the average percentage of 100-ms periods in which a container’s CPU usage was throttled. Both the average and maximum values are collected because the container metrics are aggregated at the highest level of the pod owner. For a deployment of one pod with one container, the average and maximum will be the same. For a deployment of 10 pods, the average and maximum will not be the same.
  • sum_cpu_throttling_seconds—Aggregates the total time during which throttling has been applied.
  • Container Events—These are not metrics that are collected at 5-minute intervals, but are individual events and the time that each event occurred or was detected. In this version of the data forwarder, “process exit” is now collected. The exit code, and whether or not this is the main process of the container (i.e., PID #1, using a true/false flag), are collected and stored. In this version, process exit will be false only when PID #1 exits with code 137, which corresponds to an OOM kill. In all other cases, process exit will be true.
CPU and memory request and limit values at the top of the hour are already collected and stored as attributes, once per day. In this release the data forwarder collects the following metrics at 5-minute intervals:
  • CPU Limit—The defined CPU allocation limit for the container.
  • Memory Limit—The defined memory allocation limit for the container.
  • CPU Request—The defined CPU allocation requested for the container.
  • Memory Request—The defined memory allocation requested for the container.
You must have Kubex v2.5 to see these new metrics in the Container Details report and metrics viewer. See Optimizing Your Containers - Details Tab and Using the Metrics Viewer for Containers.
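For reference, these request and limit values correspond to the standard container resources block shown below; the names and values are placeholders.

    containers:
      - name: app                      # placeholder name
        image: example/app:latest      # placeholder image
        resources:
          requests:
            cpu: 250m                  # CPU Request, now collected at 5-minute intervals
            memory: 256Mi              # Memory Request
          limits:
            cpu: 500m                  # CPU Limit
            memory: 512Mi              # Memory Limit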
Kubex currently sets configuration attributes with both the node-level and container-level CPU and memory request and limit values from the top of the hour. This provides a point-in-time data point that allows you to see how a node’s request and limit values move during the day. In this release the data forwarder collects the following metrics at 5-minute intervals:
  • CPU Limit—The defined CPU allocation limit for the node.
  • Memory Limit—The defined memory allocation limit for the node.
  • CPU Request—The defined CPU allocation requested for the node.
  • Memory Request—The defined memory allocation requested for the node.
  • Pod Count—The number of pods running on the selected node.
If values have not been specified, this will be indicated. The following node events are collected at 5-minute intervals:
  • oom_kill_events—The number of OOM kill events that happened on the node in a 5-minute interval;
  • cpu_throttling_events—The number of CPU throttling events that happened on the node in a 5-minute interval;
The following metrics are calculated from the raw data:
  • cpu_reservation_percent—The percentage of the node’s total CPU resources that are reserved or guaranteed for workloads, containers, or virtual machines.
  • memory_reservation_percent—The percentage of the node’s total memory resources that are reserved or guaranteed for workloads, containers, or virtual machines.
The following configuration details are now collected and stored in attributes:
  • provider_ID—The provider_id is available as a Prometheus label of the kube_node_info metric, which is extracted to its own column in the k8s_node_v0 postgres table. The relevant node data is then used to facilitate linking the Kubernetes node to a cloud instance. This linking will be done in postgres, as will the determination of the relationship between cloud instances and ASGs or VM Scale Sets.
  • k8s_version—The Kubernetes version of the node is collected and stored in the attribute, k8s_node. This value is collected for each node and for the cluster. This is currently informational only and is not used in the analysis.
The features allowing you to view node details will be provided in Kubex v2.6.
The Kubernetes version of the cluster is collected and stored in the attribute, k8s_cluster. This is the version of the control plane/API server. It can be higher than those of the nodes in cases where the control plane was upgraded and the nodes have not yet been updated. This is used when determining OOM kills for Kubernetes versions older than 1.28.
Containers will now be linked to their node group using a provider ID that is populated by each public cloud provider’s Kubernetes offerings (i.e., EKS, AKS and GKE). The provider_id attribute is passed to the kubelet. The provider_id is available as a Prometheus label of the kube_node_info metric, which is extracted to its own column in the k8s_node_v0 postgres table. The relevant node data is then used to facilitate linking the Kubernetes node to a cloud instance. This will be done in postgres, as will the relationship between cloud instances and ASGs or VM Scale Sets.
Version 4 introduces significant updates to Kubex’s container data collection solution. This version resides in a new GitHub repository. While the core PromQL queries remain the same, with the exception of some improvements and bug fixes, the data forwarder has been re-architected to support multi-cluster data collection. You can upgrade from version 2.x or 3.x to version 4.0. You need to update the image tag to “4”:
  • image: densify/container-optimization-data-forwarder:4
  • imagePullPolicy: Always
When updating the data forwarder, you need to ensure that the same version is deployed for all of your clusters. See Data Collection for Containers for details of configuring the connection.
In addition to Prometheus, third-party observability platforms are supported. Commercially-available observability platforms do not need to reside within a cluster and can be used to monitor multi-cluster environments. The observability platform must support the Prometheus API and one of the supported authentication mechanisms. If you are using Amazon Managed Service for Prometheus (AMP), an AWS role is required to associate the Kubernetes service account with Kubex for the purpose of container data collection. The additional required annotations can be added to your Helm charts. The following Prometheus dependencies are applicable:
  • AWS Managed Prometheus data ingestion requires Prometheus v2.26.0 or higher.
  • Starting with version 16.0, the Prometheus chart requires Helm 3.7 or higher, to install successfully.
The following commercial observability platforms are supported:
The following self-hosted OSS solutions are supported:
Version 4 collects data from external Prometheus servers. In this use case Prometheus runs outside the Kubernetes cluster, in which the data collection runs.
Since Prometheus can now be run externally, you can now run a single job to collect data for multiple Kubernetes clusters from a single Prometheus server/observability platform. Using Prometheus labels, the job determines which data belongs to which cluster.
You now have the option to select “container_memory_working_set_bytes” in addition to “container_memory_usage_bytes” or “container_memory_rss” for container data collection. This new metric allows you to determine how aggressively you want to optimize your container memory allocations when analyzing your Kubernetes environments. Using the metrics “container_memory_rss” and “container_memory_working_set_bytes” provides more aggressive recommendations that allow you to reclaim more memory. Using “container_memory_usage_bytes” and “container_memory_rss” provides recommendations that will not result in downsizing recommendations that may lead to out-of-memory (OOM) issues. The corresponding workload charts were added to the Containers - Details Tab in v2.2.0. See Optimizing Your Containers - Details Tab. Contact [email protected] for details on selecting memory utilization for your container data collection.
The following updates and bug fixes were made in this release:
  • HTTP retries have been added to the calls to the Prometheus API. This handles observability platform rate limiting.
  • The data forwarder now handles the relabel configs of Node Exporter.
  • Outdated Node Exporter metrics have been addressed.
  • Utilizes improved Horizontal Pod Autoscaler metrics (autoscaling v2).
  • The data forwarder has been upgraded to Go 1.22.
  • Updated examples for both single and multiple cluster configurations are provided in the new Github repository.
Kubex will discontinue support of Container Data Forwarder versions 3.x.x on June 30, 2024. Contact [email protected] for details on migrating to Kubex’s current version of the Container Data Forwarder.

Kubex Automation Controller

This section lists new features and updates to the Kubex Automation Controller, which automatically implements Kubex’s optimization recommendations for Kubernetes workloads. A Helm chart bundles all components required for automated container optimization. See Kubex Automation Controller for deployment details and configuration options.
Streamlined deployment by changing the default certificate method, making cert-manager optional rather than required.
  • Easier Installation—Reduced deployment complexity and external dependencies by no longer requiring cert-manager by default.
  • Flexible Options—Continue to choose from self-signed certificates, cert-manager integration, or bring-your-own-certificate based on your environment requirements.
Optimized default settings for Kubernetes 1.33+ clusters to take advantage of in-place resizing capabilities, significantly reducing time-to-optimization.
In-place resizing operations are now visible in the Kubex UI, providing better insight into automation activities and outcomes.
Added support for overriding the wait-for-Valkey init container image to accommodate environments with restricted access to public container registries.
  • Private Registry Support—Organizations that cannot pull images from public repositories can now configure the automation controller to use images hosted in their private registries.
  • Image Location Override—The waitForValkeyImage configuration option allows specifying a custom image location (default: busybox:latest), enabling deployment in air-gapped or restricted network environments. See the sketch after this list.
  • Flexible Deployment—This enhancement ensures the automation controller can be deployed in security-hardened environments where all container images must be sourced from approved internal registries.
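A minimal sketch of this override in the Helm values, assuming waitForValkeyImage is exposed as a simple value (the registry path is a placeholder and the actual key placement in the chart may differ):

    # Helm values override (illustrative; key placement may differ)
    waitForValkeyImage: registry.example.internal/library/busybox:1.36   # default: busybox:latest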
The Kubex Automation Controller now supports Kubernetes In-Place Pod Resizing for eligible workloads running on Kubernetes 1.33 or higher. This capability enables resource adjustments without pod restarts, eliminating downtime during optimization.
  • Zero-Downtime Resizing—Containers are resized in-place without eviction when supported by the cluster, ensuring continuous availability for critical workloads.
  • Automatic Fallback—For clusters that don’t support in-place resizing or workloads that require pod recreation, the controller automatically falls back to the traditional pod eviction method.
  • Kubernetes 1.33+ Required—In-place resizing requires Kubernetes 1.33 or later with the InPlacePodVerticalScaling feature gate enabled.
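For context, in-place resizing in upstream Kubernetes is governed per container by the standard resizePolicy field; the sketch below is generic Kubernetes configuration, not a Kubex-specific setting, and the names and values are placeholders.

    containers:
      - name: app                        # placeholder name
        image: example/app:latest        # placeholder image
        resizePolicy:
          - resourceName: cpu
            restartPolicy: NotRequired   # CPU can be resized without restarting the container
          - resourceName: memory
            restartPolicy: NotRequired   # memory can also be resized in place where supported
        resources:
          requests:
            cpu: 250m
            memory: 256Mi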
Pod scan logs have been significantly improved to provide clear summaries of the actions taken by the automation controller. Each log entry now clearly indicates the action taken and the reason:
  • RESIZED—Pod was successfully resized using in-place resizing (zero downtime).
  • EVICTED—Pod was evicted and recreated with new resource specifications (traditional method).
  • BLOCKED—Pod could not be resized, with detailed blocking reasons explaining why (e.g., hpa_conflict_cpu, limit_range_violation, resource_quota_exceeded, node_size_insufficient, manual_pause_infinite).
  • SKIPPED—Pod is already sized at the recommended specification, no action needed.
These enhanced logs make it easier to monitor automation activity, troubleshoot issues, and understand why specific pods were not optimized.
Administrators can now use Kubernetes annotations to control automation behavior at the pod or workload level using the rightsizing.kubex.ai/pause-until: "<RFC3339 timestamp | infinite>" annotation:
  • Learning Periods—Pause automation temporarily after application changes or deployments to allow the system to gather sufficient metrics before optimizing. Specify an RFC3339 timestamp to pause until a specific date.
  • Permanent Exclusions—Permanently exclude specific workloads from automation by setting the annotation value to infinite, providing granular control over which resources are managed by Kubex.
Additionally, the controller now respects the cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation, ensuring that pods marked as unsafe to evict are never evicted by the automation controller, even when optimization recommendations are available. This annotation-based approach integrates seamlessly with GitOps workflows and provides declarative control over automation scope. See Pausing Automation for Specific Pods for detailed usage examples.
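As a brief illustration of the annotation format, the sketch below applies it to a Deployment's pod template; the workload name, labels, image, and date are placeholders.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: checkout-service                                           # placeholder name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: checkout-service
      template:
        metadata:
          labels:
            app: checkout-service
          annotations:
            rightsizing.kubex.ai/pause-until: "2025-07-01T00:00:00Z"   # pause until this RFC3339 timestamp
            # rightsizing.kubex.ai/pause-until: "infinite"             # or exclude the workload permanently
        spec:
          containers:
            - name: app
              image: example/app:latest                                # placeholder image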
The podLabels field in the Helm values scope configuration is now optional, simplifying deployment for users who want to enable automation across entire namespaces without label-based filtering.
  • Previously, podLabels was mandatory and required explicit configuration even when not needed.
  • Now, you can define scope using only namespace for namespace-wide automation, or combine namespace with optional podLabels for more granular control.
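A minimal sketch of the two scope styles in the Helm values, assuming scope entries carry a namespace and an optional podLabels map (the exact values layout may differ from the chart's schema):

    scope:
      - namespace: payments              # namespace-wide automation, no label filtering
      - namespace: web
        podLabels:                       # optional: only pods matching these labels are automated
          app.kubernetes.io/name: storefront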
The controller’s pod scanning behavior has been enhanced to balance responsiveness with efficiency:
  • Quick Initial Scan—The first scan now executes 2 minutes after controller startup (default), allowing rapid initialization and faster time-to-value.
  • Configurable Regular Interval—Subsequent scans run at the configured interval, providing predictable automation cadence.
  • Reduced Startup Delay—Previously, users had to wait for the full scan interval before the first automation actions occurred.
These improvements ensure new deployments are evaluated quickly while maintaining efficient resource utilization for ongoing operations. See Pod Scan Configuration for guidance on configuring scan intervals based on automation scope and cluster size.
  • Increased API Timeout—API timeout values have been increased to handle slower Kubex API responses during peak loads, preventing unnecessary automation failures.
  • Updated cert-manager Dependency—The cert-manager version has been updated to the latest stable release, ensuring compatibility with modern Kubernetes clusters and improving certificate management reliability.
Optimized the mutating webhook configuration by removing the UPDATE operation, ensuring the webhook only triggers on pod creation events.
  • Reduced Webhook Invocations—The mutating admission controller now only intercepts pod creation events, eliminating unnecessary processing during pod updates.
  • Improved Cluster Performance—Reducing webhook triggers decreases API server load and improves overall cluster responsiveness.
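For illustration, this change corresponds to a webhook rule that lists only the CREATE operation; the sketch below uses the standard MutatingWebhookConfiguration schema with placeholder names.

    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: kubex-automation-controller          # placeholder name
    webhooks:
      - name: rightsizing.kubex.ai               # placeholder webhook name
        rules:
          - apiGroups: [""]
            apiVersions: ["v1"]
            resources: ["pods"]
            operations: ["CREATE"]               # UPDATE removed; webhook fires only on pod creation
        clientConfig:
          service:
            name: kubex-automation-controller    # placeholder service name
            namespace: kubex                     # placeholder namespace
            path: /mutate
        admissionReviewVersions: ["v1"]
        sideEffects: None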
Added Kubernetes RBAC (Role-Based Access Control) permissions required for in-place pod resizing functionality. This update prepares the controller for the in-place resizing feature, which was still in development at the time of this release.
  • Future Feature Preparation—RBAC rules have been added to grant the controller the necessary permissions to perform in-place pod resource updates when the feature becomes generally available.
  • No Functional Changes—This release focuses on infrastructure readiness; the in-place resizing feature itself was introduced in version 1.0.7.
Added support for external secret management tools, providing greater flexibility and security for enterprise deployments.
  • External Secrets Integration—The Helm chart now supports working with secrets created and managed by external secret management tools (e.g., External Secrets Operator, Sealed Secrets), rather than requiring the chart to create secrets internally.
  • Enhanced Security Posture—Organizations can now leverage their existing secret management workflows and tools, ensuring secrets are handled according to enterprise security policies and compliance requirements.
  • Customer-Driven Enhancement—This feature was developed based on customer feedback to support real-world enterprise deployment scenarios where centralized secret management is required.
Removed namespace restrictions, allowing the Kubex Automation Controller to be deployed in any namespace.
  • Arbitrary Namespace Support—The Helm chart can now be deployed in any namespace, not just predefined ones, providing greater flexibility for multi-tenant environments and organizational policies.
  • Improved Secret Management—Namespace flexibility complements external secret management by allowing secrets to reside in different namespaces according to security boundaries and access control requirements.
  • Simplified Multi-Cluster Deployments—Organizations with standardized namespace naming conventions can now deploy the controller consistently across multiple clusters without chart modifications.
Added release namespace information to relevant environment variables, improving namespace-aware operations.
  • Namespace Context—Environment variables now include the release namespace, enabling better multi-namespace deployments and troubleshooting.
  • Improved Configuration—Simplifies configuration in multi-tenant environments where multiple instances of the controller run in different namespaces.
Replaced the persistent volume claim (PVC) requirement with Valkey (Redis-compatible) for storing optimization recommendations, simplifying deployment and improving scalability.
  • Simplified Deployment—Eliminates the need for persistent storage configuration, reducing deployment complexity and storage management overhead.
  • Improved Performance—Valkey provides fast in-memory data access for recommendation retrieval, improving controller response times.
  • Enhanced Scalability—In-memory storage enables better horizontal scaling capabilities for large-scale deployments.
This release introduces the core Kubex Automation Controller deployment, which automatically manages container resource optimization through intelligent pod scanning and eviction.
  • Automated Pod Scanning—The controller continuously scans all pods within the configured scope, evaluating them for optimization opportunities based on Kubex recommendations.
  • Pre-Eviction Safety Checks—Before resizing any pod, the controller runs a comprehensive series of checks to ensure safe eviction, including:
    • Validation of workload owner types and controller compatibility
    • HPA (Horizontal Pod Autoscaler) conflict detection
    • Resource quota and limit range compliance
    • Pod disruption budget (PDB) verification
    • Policy specification compliance to verify automation is allowed
  • Intelligent Pod Eviction—When a pod passes all safety checks, the controller performs controlled eviction, allowing the mutating admission controller to automatically apply optimized resource specifications when the pod is recreated.
  • Scope-Based Control—Administrators define automation scope through Helm values, specifying namespaces and optional pod labels to control which workloads are managed by the automation controller.
This release focuses on security enhancements and infrastructure improvements for the Kubex Automation Controller deployment.
  • Gateway Resource Management—Added resource requests and limits for the gateway container, ensuring predictable resource allocation and preventing resource contention in production environments.
  • Enhanced Secret Security—Improved secret handling by switching from environment variables to volume mounts, reducing the risk of secret exposure through process listings and container inspection.
  • Automated CA Bundle Management—The webhook configuration now automatically extracts the CA bundle from TLS secrets, eliminating the need for manual CA bundle configuration and reducing deployment complexity.
  • Simplified Configuration—Removed the manual caBundle dependency from values-edit.yaml, streamlining the Helm chart configuration process.
  • Improved Documentation—Enhanced certificate generation documentation with detailed examples for multiple certificate creation methods, including OpenSSL, CFSSL, and Bring Your Own Certificate (BYOC) scenarios.
This inaugural release introduces the Kubex Automation Controller, rebranded from the Mutating Admission Controller to reflect broader automation capabilities.
  • Helm Chart Deployment—Packaged as a Helm release for simplified installation and management.
  • Automated Setup Script—Quick deployment script automates setup, including cert-manager installation.
  • Split Architecture—Redesigned with separate containers for core webhook logic and secure API communication.
  • Enhanced Security—Encrypted credentials, non-root container execution, and seccomp profiles enabled by default.
  • Selective Automation—Ability to include or exclude specific workload types (e.g., Deployment, StatefulSet) from automation.
  • Offline Resilience—Mutations are cached locally during connectivity issues and synchronized when connection is restored.