Carbon3.ai

DevOps Engineer

Reposted 20 Days Ago

Remote

Hiring Remotely in Office, Machaze, Manica

Mid level

Design, operate, and improve Kubernetes-based AI infrastructure, manage GPU environments, ensure reliability, implement automation, and enhance customer experience.

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. Converting legacy industrial and energy sites into modern data-centre facilities, Era4 is combining brownfield regeneration opportunities with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

Role Summary:

We are seeking a DevOps Engineer to design, operate, and continuously improve our Kubernetes-based AI infrastructure. This role focuses on cloud-native platform engineering, GPU-accelerated workloads, reliability, automation, and customer enablement.

You will play a key role in delivering a production-grade AI platform that enables ML engineers, data scientists, and enterprise customers to build and run AI workloads at scale.

You will be responsible for the reliability, scalability, and performance of our Kubernetes-based GPU platforms. You will ensure our AI platform operates securely and efficiently while delivering an exceptional customer experience. This is a hands-on platform engineering position focused on systems reliability, automation, and continuous improvement.

Key Responsibilities:

Kubernetes Platform Operations:

Operate and evolve a production Kubernetes environment supporting GPU-accelerated AI workloads.
Manage cluster lifecycle (deployment, upgrades, scaling, resilience, multi-node operations).
Implement high availability, failover, and maintenance strategies to minimize disruption.
Enable as-a-Service (aaS) capabilities and segmentation for multi-tenant workloads.
Manage infrastructure-as-code tooling and its lifecycle.
Operate network overlays and block, file, and object storage.
Work with Ansible, YAML, Terraform, Python, Jenkins, and GitOps.

GPU & AI Infrastructure Engineering:

Manage NVIDIA GPU infrastructure within Kubernetes (device plugins, drivers, CUDA compatibility).
Implement GPU partitioning and workload isolation strategies (e.g., MIG, quotas, namespaces).
Monitor and optimize GPU utilization, workload efficiency, and cluster capacity.
Support AI/ML training and inference workloads with performance tuning and best practices.

Reliability, Monitoring & Automation:

Design and maintain observability frameworks (metrics, logs, tracing).
Implement proactive monitoring, alerting, and capacity planning.
Lead incident response for platform-level events and drive root cause analysis.
Automate operational workflows and infrastructure provisioning (IaC, configuration management).
Contribute to platform reliability engineering practices (SLOs, SLAs, error budgets).

Security & Governance:

Implement RBAC, network policies, and security hardening.
Ensure secure multi-tenant workload isolation.
Maintain compliance, data protection, and access governance standards.

Customer & Platform Enablement:

Support the customer lifecycle across onboarding, provisioning, and operations.
Provide guidance on workload configuration, scaling strategies, and best practices.
Collaborate with engineering and vendor teams to resolve complex platform issues.
Produce high-quality technical documentation and operational playbooks.

Required Experience & Skills:

Strong hands-on experience operating production Kubernetes clusters.
Experience with GPU-enabled Kubernetes environments.
Solid Linux system administration, networking, storage and security skills.
Experience with Infrastructure as Code and automation.
Strong understanding of distributed systems, APIs, and cloud-native architectures.
Experience implementing monitoring and observability solutions (e.g., Prometheus, Grafana).
Proven incident management and root cause analysis experience.
Strong communication skills and ability to work cross-functionally.

Desirable Experience:

Experience operating AI/HPC infrastructure.
Deep understanding of Kubernetes scheduling, networking, and storage.
Experience with high-performance datacentre networking and tuning.
Background in DevOps or Site Reliability Engineering (SRE).

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Top Skills

Ansible

GitOps

GPU

Grafana

Jenkins

Kubernetes

Prometheus

Python

Terraform

YAML
