[Remote] Executive Director, AI Infrastructure & Platform Engineering
Note: This is a remote position open to candidates in the USA. The company is dedicated to shaping a more connected and compassionate health experience. It is seeking an Executive Director for AI Infrastructure & Platform Engineering to lead the development and operational excellence of its AI compute platform, ensuring high availability and reliability for frontier AI workloads.
Responsibilities
- Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success
- Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, 24/7 operations, change and release management, and FinOps
- Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement
- Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives
- Own the physical layer of the AI compute environment: GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability
- Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement
- Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation
- Establish and enforce operational baselines across every layer of the stack (hardware, fabric, platform, and workload), with deviations detected, escalated, and resolved within defined SLAs
- Direct Innovation POD strategy to deliver self-healing and autonomous capabilities that proactively prevent service degradation before it impacts availability
- Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention
- Drive end-to-end observability across the physical and platform layers, with feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
- Lead change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
- Lead GPU FinOps governance (utilization optimization, tenant quota enforcement, and cost reduction) in partnership with the Finance organization
- Partner with the SRE organization to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance with frameworks including HIPAA and NIST AI RMF
- Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment
- Lead the operational transition from program-launch staffing to permanent CVS-owned operations, governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations
- Establish and maintain the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
- Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack
- Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance
Skills
- 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale, including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution)
- Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires
- Expertise with high-speed cluster fabrics (RoCE v2, InfiniBand, EVPN-VXLAN, or an equivalent) and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management)
- 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes
- Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments
- Hardware lifecycle, vendor accountability, and facility coordination experience, including capacity planning, RMA management, and multi-vendor escalation
- Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables
- Executive-level stakeholder communication, vendor negotiation, and budget ownership
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field
- Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage
- Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads
- NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience
- Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR)
- Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns
- Background in innovation programs, POD structures, or centers of excellence
Benefits
- This position is eligible for a company bonus, commission, or short-term incentive program in addition to the base pay range listed above.
- This position also includes an award target in the company’s equity award program.
- This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families.
- The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.
Company Overview