Design and implement resilient system architectures, develop automation tools to improve efficiency, track SLOs and SLIs, analyze incidents, troubleshoot performance issues, and maintain documentation.
- Design and implement resilient system architectures that support high availability and scalability.
- Develop automation tools and scripts to enhance operational efficiency and reduce manual effort.
- Define, track, and analyze SLOs and SLIs to ensure reliability and performance meet business needs.
- Conduct thorough post-mortem analyses following incidents, driving continuous improvement through root cause identification and solution implementation.
- Collaborate with development and operations teams to establish best practices in system reliability and incident management.
- Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures, including diagnosing problems at the underlying platform level (e.g., Kubernetes, virtual machines).
- Ensure that issues are resolved within the stipulated Service Level Agreements (SLAs), maintaining high standards of service delivery.
- Identify and troubleshoot performance bottlenecks across systems, providing actionable recommendations for enhancements.
- Maintain detailed documentation of processes and incident responses to support knowledge sharing and compliance.
- Proficiency in a programming language such as Python, Go, or Java, with a focus on operational efficiency.
- Demonstrated experience in system architecture and design, prioritizing reliability and scalability.
- Strong understanding of SRE principles, including SLOs, SLIs, toil reduction, and incident post-mortems.
- Experience with cloud environments (e.g., AWS, Azure, Google Cloud) and their operational management.
- Strong expertise in Linux system administration.
- Proven experience in troubleshooting application support issues with a focus on performance and connectivity.
- Familiarity with networking concepts and effective troubleshooting techniques.
- Excellent problem-solving abilities and a proactive approach to operational challenges.
- Ability to work independently while effectively collaborating within a team environment.
Top Skills
Go
Java
Python
What you need to know about the Manchester Tech Scene
Home to a £5 billion digital ecosystem, including MediaCity, which hosts major players like the BBC, ITV and Ericsson, Manchester is one of the U.K.'s top digital tech hubs, at the forefront of advancements in film, television and emerging sectors such as e-sports, while also fostering a community of professionals dedicated to pushing creative and technological boundaries.