[Remote] Executive Director, AI Infrastructure & Platform Engineering

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. CVS Health is dedicated to shaping a more connected and compassionate health experience. The company is seeking an Executive Director for AI Infrastructure & Platform Engineering to lead the development and operational excellence of its AI compute platform, ensuring high availability and reliability for frontier AI workloads.

Responsibilities

Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success
Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps
Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement
Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives
Own the physical layer of the AI compute environment: GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability
Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement
Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation
Establish and enforce operational baselines across every layer of the stack (hardware, fabric, platform, and workload), with deviations detected, escalated, and resolved within defined SLAs
Direct Innovation POD strategy to develop self-healing and autonomous capabilities that prevent service degradation before it impacts availability
Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention
Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
Lead change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
Lead GPU FinOps governance (utilization optimization, tenant quota enforcement, and cost reduction) in partnership with the Finance organization
Partner with the SRE organization to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance with frameworks including HIPAA and NIST AI RMF
Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment
Lead the operational transition from program-launch staffing to permanent CVS-owned operations, governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations
Establish and sustain the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program completion
Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack
Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance

Skills

10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale, including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution)
Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires
Experience with high-speed cluster fabrics (RoCE v2, InfiniBand, EVPN-VXLAN, or enterprise-grade equivalent) and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management)
5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes
Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments
Hardware lifecycle, vendor accountability, and facility coordination experience, including capacity planning, RMA management, and multi-vendor escalation
Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables
Executive-level stakeholder communication, vendor negotiation, and budget ownership
Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field
Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage
Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads
NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience
Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR)
Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns
Background in innovation programs, POD structures, or centers of excellence

Benefits

This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above.
This position also includes an award target in the company’s equity award program.
This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families.
The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Company Overview

CVS Health is a health solutions company that provides integrated healthcare services to its members. It was founded in 1963 and is headquartered in Woonsocket, Rhode Island, USA, with a workforce of more than 10,000 employees. Its website is https://www.cvshealth.com/.
