GPU Cloud Platform Engineer

Yotta Labs

Reposted 7 Days Ago

In-Office or Remote

Hiring Remotely in Canada

Senior level

The GPU Cloud Platform Engineer designs and operates multi-cluster GPU infrastructures for AI workloads, ensuring performance and efficiency across cloud environments.

Location: Remote (Global)

Type: Full-time

Company: Yotta Labs

Apply: [email protected]

🧠 About Yotta Labs

Yotta Labs is building the next-generation multi-silicon AI cloud and runtime platform to power the world’s most demanding AI workloads. We enable training and inference across NVIDIA GPUs, AMD GPUs, and AWS Trainium, helping AI companies achieve the best performance and economics across heterogeneous hardware. Our mission is to provide high-performance AI computing and Model API services, enabling AI companies, research labs, and enterprises to train, deploy, and integrate cutting-edge models at scale.

🛠️ Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will ensure the high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

🎯 Responsibilities

Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot production issues.
Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems. Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alerting.
Coordinate with IDC (internet data center) providers to plan and deploy large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

✅ Qualifications

Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or a related field; 3+ years of experience in systems engineering or DevOps.
5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.
Strong communication skills, self-motivation, and team collaboration.

🌟 Preferred Experience

Experience in developing and operating Model-as-a-Service (MaaS) platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the NVIDIA H100.
Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks such as Triton, vLLM, and SGLang; ability to extend and optimize open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

🌐 Why Join Yotta Labs?

Be part of a visionary team aiming to redefine AI infrastructure.
Work on cutting-edge technologies that bridge AI and decentralized computing.
Collaborate with experts from leading institutions and tech companies.
Enjoy a flexible, remote work environment that values innovation and autonomy.

📩 How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to [email protected]. Please include links to any relevant projects or contributions.
