Research Engineer (LLM Training and Performance) job opportunity at JetBrains.



JetBrains Research Engineer (LLM Training and Performance)
Experience: General
Pattern: full-time

JCP Core Machine Learning

Degree: High School (S.S.C.E)
Location: Amsterdam, Netherlands; Berlin, Germany; Limassol, Cyprus; London, United Kingdom; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Warsaw, Poland; Yerevan, Armenia

At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.

We’re looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family. Your job is easier said than done: make training faster, cheaper, and more stable at a large scale. You’ll profile, design, and implement changes to the training pipeline, from architecture to custom GPU kernels, as needed.

As part of our team, you will:

- Be responsible for improving end-to-end performance for multi-node LLM pre-training and post-training pipelines.
- Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
- Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
- Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
- Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
- Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
- Keep the data path fast using streaming and sharded data loaders and tokenizer pipelines, as well as improve overall throughput and cache efficiency.
- Define the right metrics, build dashboards, and deliver steady improvements.
- Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.

We’ll be happy to bring you on board if you have:

- Strong PyTorch and PyTorch Distributed experience, having run multi-node jobs with tens to hundreds of GPUs.
- Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or serious FSDP/ZeRO expertise.
- Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows.
- GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels.
- A solid understanding of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and how they show up in traces.

Our ideal candidate would have experience with:

- FlashAttention-2 and 3, CUTLASS and CuTe, TransformerEngine and FP8, Inductor, AOTAutograd, and torch.compile.
- MoE at scale (expert parallel, router losses, capacity management) and long-context tricks (ALiBi/YaRN/NTK scaling).
- Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
- Web-scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
- Safety and post-training methods, such as DPO, ORPO, GRPO, and reward models.
- Inference ecosystems such as vLLM and paged KV.

We process the data provided in your job application in accordance with the Recruitment Privacy Policy.