Product Manager, Health Automation and Resilience job opportunity at NVIDIA.



Date: More Than 30 Days Ago
NVIDIA Product Manager, Health Automation and Resilience
Experience: 8+ years
Pattern: full-time

Health Automation and Resilience

Degree: General
Location: US, CA, Santa Clara, United States of America

NVIDIA DGX Cloud is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role is responsible for developing products for fault detection, failure classification, automated repair workflows, and resilience tooling that enables consistent GPU fleet performance. You will build the next generation of health automation capabilities including detection pipelines, classification mechanisms, repair automation, and distributed resilience methods.

The position lies at the crossroads of distributed systems, observability, GPU hardware, and cloud operations. You will collaborate with engineering teams to transform signals, telemetry, and operational lessons into automation infrastructure that improves cloud provider efficiency and end-user experience at scale. If you are motivated by building foundational systems that enable large AI clusters to operate dependably and efficiently, we would love to hear from you.

What You Will Be Doing:
- Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
- Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
- Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
- Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
- Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
- Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
- Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
- Lead product technical reviews, customer conversations, and planning sessions.
What we need to see:
- Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
- 8+ years of relevant experience including demonstrated experience leading technical products within cloud infrastructure, distributed systems, reliability engineering, or related fields.
- Track record defining multi-quarter strategy and leading execution with multiple engineering teams.
- Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
- Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
- Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
- Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
- Experience working with open-source technologies or products for software developers.
- Excellent communication skills across engineering, customers, and executives.

Ways to Stand Out from the Crowd:
- Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters.
- Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation.
- Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing.
- Contributions to infrastructure or reliability open-source communities.
- Experience writing detailed build documents for software agents, distributed services, or platform-level components.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
The base salary range is 168,000 USD - 258,750 USD for Level 4, and 208,000 USD - 327,750 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until January 13, 2026. This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
