Carbon3.ai

DevOps Engineer

Posted Yesterday
Remote
Hiring Remotely in United Kingdom
Mid level
Design, operate, and improve Kubernetes-based AI infrastructure, manage GPU environments, ensure reliability, implement automation, and enhance customer experience.

Carbon3 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Carbon3 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


Role Summary:

We are seeking a DevOps Engineer to design, operate, and continuously improve our Kubernetes-based AI infrastructure. This role focuses on cloud-native platform engineering, GPU-accelerated workloads, reliability, automation, and customer enablement.

You will play a key role in delivering a production-grade AI platform that enables ML engineers, data scientists, and enterprise customers to build and run AI workloads at scale.

 

You will be responsible for the reliability, scalability, and performance of our Kubernetes-based GPU platforms. You will ensure our AI platform operates securely and efficiently while delivering an exceptional customer experience. This is a hands-on platform engineering position focused on systems reliability, automation, and continuous improvement.


Key Responsibilities:

 

Kubernetes Platform Operations:

  • Operate and evolve a production Kubernetes environment supporting GPU-accelerated AI workloads.
  • Manage cluster lifecycle (deployment, upgrades, scaling, resilience, multi-node operations).
  • Implement high availability, failover, and maintenance strategies to minimise disruption.
  • Enable as-a-service (aaS) capabilities and segmentation for multi-tenant workloads.
  • Manage Infrastructure as Code tooling and its lifecycle.
  • Work with network overlays and with block, file and object storage.
  • Use Ansible, YAML, Terraform, Python, Jenkins and GitOps.

 

GPU & AI Infrastructure Engineering:

  • Manage NVIDIA GPU infrastructure within Kubernetes (device plugins, drivers, CUDA compatibility).
  • Implement GPU partitioning and workload isolation strategies (e.g., MIG, quotas, namespaces).
  • Monitor and optimize GPU utilization, workload efficiency, and cluster capacity.
  • Support AI/ML training and inference workloads with performance tuning and best practices.

 

Reliability, Monitoring & Automation:

  • Design and maintain observability frameworks (metrics, logs, tracing).
  • Implement proactive monitoring, alerting, and capacity planning.
  • Lead incident response for platform-level events and drive root cause analysis.
  • Automate operational workflows and infrastructure provisioning (IaC, configuration management).
  • Contribute to platform reliability engineering practices (SLOs, SLAs, error budgets).

 

Security & Governance:

  • Implement RBAC, network policies, and security hardening.
  • Ensure secure multi-tenant workload isolation.
  • Maintain compliance, data protection, and access governance standards.

 

Customer & Platform Enablement:

  • Support the customer lifecycle, from onboarding and provisioning through to operations.
  • Provide guidance on workload configuration, scaling strategies, and best practices.
  • Collaborate with engineering and vendor teams to resolve complex platform issues.
  • Produce high-quality technical documentation and operational playbooks. 


Required Experience & Skills:

  • Strong hands-on experience operating production Kubernetes clusters.
  • Experience with GPU-enabled Kubernetes environments.
  • Solid Linux system administration, networking, storage and security skills.
  • Experience with Infrastructure as Code and automation.
  • Strong understanding of distributed systems, APIs, and cloud-native architectures.
  • Experience implementing monitoring and observability solutions (e.g., Prometheus, Grafana).
  • Proven incident management and root cause analysis experience.
  • Strong communication skills and ability to work cross-functionally.

Desirable Experience:

  • Experience operating AI/HPC infrastructure.
  • Deep understanding of Kubernetes scheduling, networking, and storage.
  • Experience with high-performance datacentre networking and tuning.
  • Background in DevOps or Site Reliability Engineering (SRE).

 

Why Join Carbon3.ai:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion:

Carbon3.ai is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

 

Top Skills

Ansible
GitOps
GPU
Grafana
Jenkins
Kubernetes
Prometheus
Python
Terraform
YAML
