Machine Learning Engineer / Senior Machine Learning Engineer job opportunity at C3 AI.



C3 AI Machine Learning Engineer / Senior Machine Learning Engineer
Experience: 6 years
Pattern: full-time

Engineering

Degree: Master's (M.Eng.)
Location: Redwood City, California, United States of America

The C3 AI Data Science team is dedicated to pushing the boundaries of what is possible with large-scale AI. We are seeking a hands-on Machine Learning Engineer to design, build, and operate a bespoke, next-generation research platform dedicated to training novel, large-scale foundation models far beyond conventional LLM recipes.

This is a critical systems role. You will create the orchestration, secure data pathways, and frictionless developer experience that empower our researchers to move fast, experiment securely, and scale complex training jobs on heterogeneous GPU clusters.

Responsibilities:
We are looking for an expert who can solve infrastructure problems where off-the-shelf cloud tools are insufficient.
Design and manage the core research compute cluster, including node layouts, queues/partitions, preemption/fair-share policies, and multi-tenant isolation.
Implement secure access controls for all users and services across the cluster using Kubernetes and/or SLURM.
Build robust branch-to-experiment CI/CD workflows, encompassing templated job creation, config promotion, and integrated version control.
Implement an experiment and metrics tracking system (runs, configs, checkpoints, logs) with searchable lineage to enable frictionless cross-team collaboration and sharing.
Design and integrate auto-checkpointing, artifact retention, and necessary rollout/rollback mechanisms.
Stand up robust dataset registries, ensuring data lineage, versioning, and secure access.
Implement sharding, streaming, and prefetch mechanisms to support efficient access to TB-scale data corpora and long-term archival with reproducible rehydration.
Profile NCCL and I/O hotspots and optimize training throughput (mixed precision/AMP, ZeRO/FSDP, kernel fusion, caching).
Harden training pipelines for scale and resilience, including checkpoint recovery and tolerance for spot/preemptible instances.
Build opinionated templates, job specifications, and guardrails to ensure researchers can focus on modifying custom training code and recipes without fighting infrastructure bottlenecks.

Qualifications:
BS/MS in Computer Science/Electrical Engineering or equivalent deep, practical experience.
5+ years of work experience (8+ years for Senior Machine Learning Engineer).
Proven track record building custom ML/HPC platforms for specialized research (e.g., novel model architectures, time-series, multimodal AI) where commercial cloud tools were insufficient.
Deep expertise with Kubernetes and/or SLURM on GPU clusters, including proficiency with containers, images, and multi-node scheduling.
Strong software development skills in Python and one of Go, C++, or Rust.
Comfortable developing controllers/operators, high-performance services, and CLI tooling on Linux.
Practical, hands-on knowledge of distributed ML frameworks (PyTorch DDP/FSDP/ZeRO, DeepSpeed, or JAX/TPU) and performance profiling (NCCL, CUDA basics, I/O performance).
Experience with object stores, Parquet format, dataset version control, streaming/sharding techniques, and efficient artifact management for checkpoints and logs.
Practical experience with observability (Prometheus/Grafana/OpenTelemetry) and infra-as-code (Terraform/Helm/Ansible).

Preferred Qualifications:
Experience with high-speed networking and storage, including InfiniBand/RDMA, GPUDirect-RDMA, NVLink topology, and high-throughput file/object systems.
Direct experience modifying or working with K8s device plugins, custom schedulers/quotas, or SLURM internals (fair-share/preemption).
Expertise in implementing true reproducibility at scale: seeding, deterministic builds, environment capture, and building robust dataset and experiment lineage that guarantees re-runnability months later.
Experience with advanced performance work such as kernel fusion, custom CUDA operations, and fine-tuning complex FSDP/ZeRO configurations.
A pragmatic, product-focused approach to researcher ergonomics, demonstrated by platforms you have shipped that materially increased experiment throughput and velocity.
