Site Reliability Engineer

Tecsys

Reposted 14 Days Ago

In-Office or Remote

Hiring Remotely in Montréal, QC, CAN

Senior level

The Site Reliability Engineer will maintain and optimize the reliability of cloud infrastructure, focusing on automation, observability, and incident management in SaaS environments.

The summary above was generated by AI

Having recognized the advantages of remote work, including improved employee morale and productivity and the benefits of reduced commuting for employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we have invested provide a strong foundation to this end. Our digital-first work environment, together with our conveniently located offices and collaborative workspaces, gives our team the freedom and flexibility to work in the way that makes them most productive.

About us

Tecsys is a fast-growing innovator offering supply chain solutions to organizations ranging from industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. We work with industry leaders to transform their supply chains through technology. If you thrive on tackling interesting challenges with continuous learning opportunities, then Tecsys could be a good fit for you!

About the Role

We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain, optimize, and ensure the reliability and performance of the systems that power our cloud infrastructure across AWS and Kubernetes, with a strong focus on automation, observability, and continuous improvement. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.

Your responsibilities

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Act as an agent orchestrator using Amazon Kiro: run multiple activities in parallel by leveraging AI agents to accelerate execution, while personally validating results and completing selected tasks manually when needed.
Be on-call.
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
Implement monitoring, logging, alerting, and SLA reporting.
Create and maintain technical documentation.
Implement, maintain and mature SRE best practices.
Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.

Requirements

Tools used

AWS (multi-account, VPC, EC2, EKS)

Kubernetes

Datadog

Terraform

GitLab CI/CD (Jenkins acceptable)

Amazon Kiro (licenses provided) - expected to be used proactively and heavily in day-to-day engineering tasks, with human validation of outputs.

Python, Bash, Java or equivalent for automation and diagnostics.

Qualifications

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.

Experience designing and deploying large scale systems, multi-vendor platforms and globally distributed infrastructure.

Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.

Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).

Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).

Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.

Experience with incident management, on-call participation, escalation, and structured postmortems.

Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.

Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.

Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.

Basic knowledge of Java- or .NET-based development is required.

Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.

Additional requirements:

Escalation on-call rotation
Occasional travel (quarterly offsites, conferences – less than 10%)

We understand that experience comes in many forms and that careers are not always linear. If you don't meet every requirement in this posting, we still encourage you to apply.

At Tecsys, we are committed to fostering a diverse and inclusive workplace where all employees feel valued, respected, and empowered. We believe that diversity drives innovation and strengthens our ability to deliver exceptional solutions. We welcome and encourage applicants from all backgrounds, experiences, and perspectives to join our team.

Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.

NB: To apply for this position, you must be a Canadian citizen or a permanent resident of Canada, or hold a valid Canadian work permit.

***

A Note on Our Hiring Process: We do not use AI to automatically screen or reject candidates. However, we do use specific screening questions to prioritize the most relevant applications for human review.
At Tecsys, we welcome the thoughtful use of AI tools to help you prepare your application, for example, to improve clarity, organize your resume, or practice interview responses. However, we ask that all information you provide reflects your real experience, and that any assessments or written submissions represent your own work and thinking.

During interviews, we expect candidates to engage without the use of AI tools, scripts, or real-time assistance. Authentic, direct conversation helps us get to know how you think, collaborate, and communicate. AI can support your preparation, but it shouldn’t speak or act on your behalf. We genuinely want to meet you.

Laval, Quebec, Canada

Montréal, Quebec, Canada
