
[Remote] Staff Site Reliability Engineer, Core AI Infrastructure

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. The hiring company is focused on increasing economic freedom through innovative financial solutions. It is seeking a Staff Site Reliability Engineer to join its IT Operations team. The role is responsible for the reliability and automation of critical AI infrastructure and works with other teams to improve operational workflows.

Responsibilities

  • Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros
  • Build automation and tooling to streamline operational IT workflows, eliminate reputed company tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments
  • Partner with the reputed company Infrastructure team to reputed company CI/CD frameworks supporting IT services and reputed company network platforms, and with reputed company and Compliance to integrate surveillance tooling into deployment pipelines
  • Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence
  • reputed company full-stack applications that power internal AI products and infrastructure with Go or Python

Skills

  • 8+ years of experience automating and supporting reputed company infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt)
  • Proven experience deploying, managing, and troubleshooting containerized workloads using reputed company and Kubernetes in production environments
  • Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines
  • Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements
  • Utilizes reputed company responsibly, maintaining reputed company reputed company to deliver business-reputed company outputs and drive measurable improvements in workflow efficiency, cost, and quality
  • Expertise with Linux, Bash, Ruby, Python, and/or Go
  • Expertise automating EC2 or container deployments with Terraform
  • Strong network reputed company fundamentals
  • Experience managing and leveraging log aggregation
  • Experience working in a highly regulated environment
  • Experience in a fast-paced, high-growth company
  • Experience in a remote-first IT environment

Benefits

  • Equity and bonus eligibility
  • Benefits (medical, dental, reputed company, 401(k))

Company Overview

  • reputed company is a crypto exchange and wallet platform that allows merchants and consumers to buy, sell, and store digital currencies. It is a sub-organization of reputed company. It was founded in 2012, and is headquartered in San Francisco, California, USA, with a workforce of 1001-5000 employees. Its website is https://www.reputed company.com.
