LLM Training Resilience Engineer job opportunity at Together AI.



Together AI LLM Training Resilience Engineer
Experience: 5 years
Employment type: Full-time

Engineering

Degree: Vocational
Location: San Francisco, United States of America

Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads. Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.

Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters). Optimize recovery time and throughput in the face of hardware or software failures.

Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns. Leverage telemetry data to improve incident response and automate mitigation strategies.

Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows. Enhance debugging and diagnosis frameworks for distributed training jobs.

Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements. Document and communicate best practices for fault-tolerant AI training.

Other AI Matches

Sr. Recruiter, Physical Infrastructure: Applicants are expected to have solid experience in handling Infrastructure Recruitment-related tasks.
Senior Backend Engineer, Inference Platform: Applicants are expected to have solid experience in handling Backend Engineering-related tasks.
Senior Software Engineer - Together Cloud Infrastructure: Applicants are expected to have solid experience in handling Software Engineer-related tasks.
Infrastructure Engineer, Data Platform: Applicants are expected to have solid experience in handling Engineering-related tasks.
Account Executive Europe (Net New Logo): Applicants are expected to have solid experience in handling Sales-related tasks.
GPU Cluster Resource Scheduling and Optimization Engineer: Applicants are expected to have solid experience in handling Engineering-related tasks.
Senior Systems Administrator: Applicants are expected to have solid experience in handling System Administrator-related tasks.
Staff Brand Marketing Manager: Applicants are expected to have solid experience in handling Brand Manager-related tasks.
Customer Support Engineer, India: Applicants are expected to have solid experience in handling Customer Service-related tasks.
AI Researcher, Core ML: Applicants are expected to have solid experience in handling AI Researcher-related tasks.
Senior Developer Productivity Engineer: Applicants are expected to have solid experience in handling Productivity Engineer-related tasks.
Senior Software Engineer, Observability: Applicants are expected to have solid experience in handling Software Engineer-related tasks.
Product Marketing Director: Applicants are expected to have solid experience in handling Marketing Director-related tasks.
Senior Manager, Data Center Strategy & Compute Supply: Applicants are expected to have solid experience in handling Data Manager-related tasks.
Project Manager, Compute & Business Operations: Applicants are expected to have solid experience in handling Project Manager-related tasks.
Senior Software Engineer - Together Cloud Platform: Applicants are expected to have solid experience in handling Software Engineer-related tasks.
Research Scientist, Large-Scale Learning: Applicants are expected to have solid experience in handling Research Scientist-related tasks.
Machine Learning Engineer: Applicants are expected to have solid experience in handling Engineering-related tasks.
Lead DX Engineer - Documentation (SF / NYC): Applicants are expected to have solid experience in handling Production-related tasks.
Senior Strategic Sourcing & Procurement Lead, Compute: Applicants are expected to have solid experience in handling Procurement-related tasks.
Rust Systems Engineer - Inference: Applicants are expected to have solid experience in handling System Engineer-related tasks.
Senior Director, Capital Markets & Corporate Development: Applicants are expected to have solid experience in handling Corporate Finance-related tasks.