Customer Reliability Engineer

Andromeda (andromeda.ai)

Reposted 9 Days Ago

In-Office or Remote

Hiring Remotely in Canada

Senior level

Customer Reliability Engineer

Location: Remote/SF-Hybrid · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

Our customers run large AI training and inference workloads on GPU clusters we source from providers worldwide. When a node goes dark or a job dies eight hours into a run, the Customer Reliability Engineer is who they hear from, and who gets it sorted.

The job has three parts. You triage incoming issues and debug them at the Linux and Kubernetes layer. You work provider-side to figure out whose fault something actually is and push external providers to fix it. And you build the monitoring and scripts that catch problems before a customer has to tell us.

You need to be comfortable in a Linux shell and know how Kubernetes works. You don't need GPU or HPC experience. Most people pick that up here.

What You’ll Do

Triage and fix customer issues

Own issues start to finish: reproduce, diagnose, fix or escalate, close the loop
Debug at the Linux layer: processes, networking, storage, kernel logs, resource contention, systemd, journald
Dig into Kubernetes problems like pods stuck pending or crash-looping, node conditions, scheduling failures, resource limits
Work GPU failures: driver and device-plugin issues, XID errors, thermal throttling, nodes that need cordoning or draining, jobs failing across multiple nodes
Escalate when you're past your depth, with the evidence already gathered

Handle incidents

Take part in a 24/7 on-call rotation
Provide first response on alerts and customer-reported outages: assess impact, set severity, pull in the right people
Keep customers updated during incidents. Clear status, honest unknowns, no silence
Write up what happened, then turn it into a runbook, an alert, or a fix so it costs less next time

Push providers to resolution

Work out whether a fault is provider-side, ours, or the customer's before it gets handed anywhere
Open tickets with compute providers and chase them down rather than waiting
Track recurring provider failures and flag the patterns to the people making sourcing decisions

Build the tooling

Write Python or Bash to automate the checks you'd otherwise run by hand
Build and improve monitoring: cluster and node health checks, GPU telemetry, dashboards, alerts that fire on real problems
Keep runbooks and customer docs current as you go

What We’re Looking For

Real Linux troubleshooting ability from the command line. You can work a problem through logs, processes, networking, and disk without a script to follow
Working knowledge of Kubernetes: pods, nodes, deployments, services, scheduling, and how to investigate when one of those breaks
Ability to write a script in Python or Bash to automate something repetitive
Strong writing. You can explain a technical problem to a frustrated customer clearly and without condescension
Good judgment under pressure. You know what to check first, when to escalate, and how to keep people informed while you're still working it out
Willing to join a 24/7 on-call rotation

Strong Candidates May Have

Hands-on time with NVIDIA GPUs in production: drivers, CUDA, DCGM, the Kubernetes device plugin
Experience with high-performance networking (InfiniBand, RoCE) or NCCL
Experience with HPC or batch schedulers like Slurm
A previous customer-facing technical role: support engineering, TAM, solutions, professional services
Knowledge of Prometheus, Grafana, Datadog, or similar
Infrastructure as code (IaC): Terraform, Ansible, or Helm
Genuine interest in AI infrastructure and how big training jobs behave

Why You’ll Love It Here

High-growth environment: Get in early at a company at the center of the AI infrastructure boom
Competitive compensation plus meaningful equity
Comprehensive benefits: for you and your dependents, including healthcare, dental, and vision coverage, 401(k), and unlimited PTO

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
