Senior Site Reliability Engineer, AI Factory job opportunity at Jobgether.



Jobgether Senior Site Reliability Engineer, AI Factory
Experience: 10 Years
Pattern: Remote

Security & IT, IT

Degree: Bachelor's (B.Sc.)
Location: United States of America

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, AI Factory in the United States.

This role focuses on designing, operating, and optimizing next-generation GPU-accelerated data centers at scale, ensuring performance, reliability, and efficiency for AI workloads. You will lead the end-to-end lifecycle of critical infrastructure, from provisioning and commissioning to day-to-day operations, while collaborating across hardware, software, and operational teams. Success in this position requires deep technical expertise, hands-on problem solving, and a passion for open-source solutions and automation. You will help define operational standards for large-scale AI facilities, drive continuous improvement, and implement processes that maintain uptime while enabling cutting-edge innovation. This role offers the opportunity to impact global AI infrastructure and work in a high-performance, collaborative environment with engineers tackling unique telemetry, orchestration, and reliability challenges.

Accountabilities:
- Architect, commission, and provision GPU systems at large scale, ensuring supported firmware and component versions are maintained across operations.
- Lead Day-2 operations, monitoring cluster hardware, identifying bottlenecks, and optimizing efficiency, performance, and availability.
- Triage hardware break-fix issues, develop automated solutions, and continuously improve operational workflows.
- Collaborate with hardware, software, and technical teams to define repeatable procedures and operational strategies aligned with SLAs.
- Develop and enforce quality control procedures to minimize downtime and maintain high reliability for mission-critical AI infrastructure.
- Provide documentation and operational guidance to support global AI data center deployments and internal teams.
- Feed hardware and software requirements into engineering pipelines and coordinate with remote hands and field teams.

Requirements:
- Bachelor’s or Master’s degree in Computer Engineering, Computer Science, or a related field, or equivalent experience.
- 10+ years of experience in data center operations, site reliability, or critical infrastructure management.
- Proven experience managing GPU fleets and large-scale computing environments.
- Expertise in BMS, power management, and commissioning/provisioning processes.
- Hands-on experience with configuration management, Packer, QCOW2 images, and Datacenter Inventory Management Systems (Netbox, Nautilus, or similar).
- Strong track record of cross-team collaboration to deliver operational excellence and reliability improvements.
- Knowledge of automated break-fix solutions, message bus systems, workflow engines, and Zero Touch Provisioning is highly desirable.
- Excellent problem-solving skills, attention to detail, and the ability to implement robust processes for uptime and performance optimization.

Benefits:
- Competitive base salary: $176,000–$276,000 (Level 4) or $208,000–$333,500 (Level 5), based on experience and location.
- Equity participation and bonus eligibility.
- Comprehensive medical, dental, and vision coverage.
- Paid leave, holidays, and flexible work arrangements.
- Professional development opportunities and access to learning platforms.
- Retirement plans and financial wellness programs.
- Collaborative environment with exposure to cutting-edge AI and open-source data center technologies.

Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
