Senior MLOps Engineer - SRE | DevOps

Work from home | Full-time role | Hiring

THE ROLE

We're looking for a Senior MLOps Engineer who can set the standard for how we build, ship, and operate ML and AI systems at scale. You sit at the intersection of ML infrastructure and SRE: you'll own the path from model and pipeline to reliable production service, and you'll bring DevOps rigor to systems that are historically under-engineered.

This is not a ticket-processing role, and it's not a research role. You'll tackle hard problems (model serving reliability, inference cost and latency, reproducible pipelines, agentic workload operations) and have the scope to solve them properly. Senior engineers here identify problems before they're asked to, and raise the ceiling on what the platform can do.

WHAT YOU'LL WORK ON

Build and operate model and inference serving infrastructure: managing latency, throughput, autoscaling, and reliability for real-time and batch inference across multiple tenants.

Own the ML deployment lifecycle: model registry, versioning, promotion workflows, rollout strategies (canary, shadow, A/B), and safe rollback.

Operate agentic and LLM workloads in production: managing inference providers and gateways, quota and throttling behavior (TPS/TUPS limits), guardrails, prompt/version management, and graceful degradation under load.

Build reproducible, automated ML pipelines: training, evaluation, and deployment pipelines as code, with lineage and reproducibility built in.

Extend infrastructure-as-code to ML systems: Terraform patterns and multi-account design that bring ML infrastructure under the same standards as the rest of the platform.

Operate GitOps for ML workloads: ArgoCD configuration and promotion workflows across environments and tenants.

Run ML and AI workloads on multi-tenant Kubernetes (AWS EKS): managing GPU/accelerator scheduling, workload placement, tenant isolation, and cost-aware capacity.
Own ML reliability and observability: SLOs for inference services, model and data drift detection, performance regression monitoring, alert quality, on-call ergonomics, and runbook culture.

Drive ML cost efficiency: right-sizing accelerators, managing reserved/spot capacity, and attributing inference cost across tenants and workloads.

Use agentic coding tools for infrastructure and pipeline work: scaffolding environments, generating and reviewing IaC and pipeline code, and accelerating automation.

MUST HAVE

5+ years in platform engineering, SRE, MLOps, or infrastructure, with meaningful time operating production systems at scale.

Hands-on experience deploying and operating ML or AI workloads in production: serving, inference, or training infrastructure that real users depended on.

Strong SRE/DevOps foundation: you've owned reliability for production services, defined and measured SLOs, run post-mortems, and driven measurable improvements.

Deep IaC expertise: you actively manage complex Terraform state and multi-account configurations in production.

Strong GitOps background: you understand declarative infrastructure management at depth and have opinions on how to do it well.

Deep Kubernetes knowledge: you've operated clusters in production, dealt with real failure modes, and understand the system at the control plane level.

Strong AWS background: networking, compute, IAM, storage, multi-account design.

Hands-on experience building and operating CI/CD pipelines (GitHub Actions, CircleCI, GitLab CI, or equivalent) and an understanding of how ML pipelines differ from standard application CI/CD.

Automation-first thinking at a senior level: you implement systems that eliminate entire categories of manual work.

Active user of agentic coding tools: you know how to direct them effectively, review their output critically, and use them to multiply your output.
Strong communicator: you can articulate operational decisions, model performance trade-offs, and incident summaries clearly to engineers and leadership alike.

NICE TO HAVE

Experience with GPU/accelerator scheduling and node lifecycle management in production (e.g., Karpenter).

Experience operating LLM inference at scale: managing provider quotas/throttling (TPS/TUPS), gateways, caching, and guardrails (e.g., AWS Bedrock or equivalent).

Experience with ML pipeline and orchestration tooling: Argo Workflows, Kubeflow, Airflow, SageMaker Pipelines, or equivalent.

Experience with model registries, feature stores, and experiment tracking (e.g., MLflow, Feast, or equivalent).

Familiarity with model and data drift monitoring and ML-specific observability.

Background in FinOps: inference cost attribution and reserved capacity planning.

Familiarity with data infrastructure: object storage, CDC pipelines, or lakehouse patterns.

Experience with multi-tenant infrastructure: isolation patterns, noisy neighbor mitigation, and tenant lifecycle management.

Prior experience scaling ML or platform infrastructure at a startup moving toward enterprise-grade requirements.

WHAT YOU WON'T FIND HERE

A platform team that maintains the status quo. We're actively building: new scale requirements, new architectural domains, and an ML/AI footprint that's growing fast. Senior engineers here shape how the platform evolves, and the tools available to do it are better than they've ever been.

Type: Full-time, remote

Work hours: aligned with EST or PST

