Site Reliability Engineer - Kubernetes Platform job opportunity at xAI.



xAI Site Reliability Engineer - Kubernetes Platform
Experience: 5+ years
Pattern: full-time
Salary: $180,000 - $440,000 USD per year

Engineering

Degree: General
Location: Palo Alto, CA, United States of America

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Site Reliability Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI’s infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities

- Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
- Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
- Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
- Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
- Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
- Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
- Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.

This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications

- 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
- Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
- Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
- Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
- Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

Preferred Qualifications

- Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
- Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
- Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.
- Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
- Passion for problem-solving and a proactive drive to deliver impactful results.
- A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
For details on data processing, view our Recruitment Privacy Notice.
