Principal Software Engineer, AIOps and Observability job opportunity at NVIDIA.



Date: More Than 30 Days Ago
NVIDIA Principal Software Engineer, AIOps and Observability
Experience: Highly Experienced
Pattern: full-time

AIOps and Observability

Degree: General
Location: US, CA, Santa Clara, United States of America

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, generative AI, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

We are looking for a highly skilled Principal Software Engineer to design and develop AIOps & Observability platforms at NVIDIA. The platforms are used by internal teams to monitor, diagnose, and optimize products and millions of assets and services in cloud, on-prem, data centers, supply chain, and edge. You will work with a team of engineers, product managers, and partners to define the observability strategy, roadmap, and standard methodologies for NVIDIA. You will also mentor and coach other engineers on observability, machine learning, tools, and techniques.

What you will be doing:

- Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
- Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.
- Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.
- Establish and implement observability standards, guidelines, and processes across NVIDIA.
- Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience.
- Provide peer reviews to other engineers, including feedback on performance, scalability, security, and correctness.
- Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.
- Handle large volumes of data and ensure data quality, security, and compliance.
- Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.
- Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently.

What we need to see:

- Bachelor’s degree in computer science and engineering, or a related field, or equivalent experience.
- 15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment.
- Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, OpenTelemetry, etc.
- Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, Datadog, etc.
- Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as experience with streaming services to ingest billions of events using NATS, Kafka, etc.
- Proficiency in one or more programming languages, such as Go, Python, Java, C#, etc.
- Passion for observability and for delivering high-quality internal platforms.
- Experience with developing observability solutions to monitor on-prem and public cloud environments.
- Experience with running large observability platforms on bare-metal infrastructure.
- Ability to establish scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics.

Ways to stand out from the crowd:

- Deep understanding of implementing observability solutions for large-scale on-prem infrastructure and networking.
- Hands-on experience with managing large-scale observability platforms with LLMs and ML models, and with building custom services to ingest billions of metrics and logs from a wide range of assets.
- Experience developing a unified cloud observability platform to monitor network, compute, power, storage, operating systems, security, applications, and SaaS platforms.
- Demonstrated experience and expertise in using machine learning and generative AI to develop solutions such as predictive monitoring, incident diagnosis, summarization, and correlation.
- Demonstrated proficiency in AI/ML systems, generative AI, or agentic AI frameworks.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you’re creative, self-motivated, and enjoy having fun, then what are you waiting for? Apply today!

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 248,000 USD - 391,000 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until January 13, 2026. This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
