Stacklok

Senior Site Reliability Engineer (SRE)

Posted 3 Days Ago

London, Greater London, England

Senior level

Security

The Role

The Senior Site Reliability Engineer at Stacklok will focus on automation, system monitoring, configuration management, and incident response, ensuring service performance and reliability. They will collaborate within the OSS insights product team and contribute to unifying platform automation and reliability practices across products.

Summary Generated by Built In

Stacklok is an innovative software supply chain security startup founded by Kubernetes co-founder Craig McLuckie and Sigstore founder Luke Hinds. Our mission is to make it easier to securely develop software. With our deep expertise in open source technologies and commitment to enhancing software security, we are seeking highly skilled and motivated individuals to join our team. This is a rare opportunity to join a startup at an early stage, and to be part of a team that is committed to building something truly innovative and impactful. Learn more about Stacklok’s mission, virtues, and leadership, HERE.

Location

This is a hybrid role that requires on-site work at our London office three (3) days a week. Our office is conveniently located in WeWork at 1 Mark Square, London, EC2A 4EG.

Elevator Pitch

Stacklok Cloud is a comprehensive security platform that combines open source package intelligence with a policy platform built on the open source project Minder. It allows developers to securely consume open source software while enabling security teams to manage and maintain a robust security posture across the entire software supply chain.
We are seeking a Senior Site Reliability Engineer (SRE) to support Trusty, our package intelligence service that empowers developers to make safer open source dependency choices. (Trusty Demo)

Embedded within the OSS insights product team, this role focuses on driving essential initiatives in automation, system monitoring, configuration management, continuous delivery improvements, and incident response to ensure exceptional service performance and reliability.

In addition, this role will be part of a company-wide guild dedicated to unifying platform automation, observability, and reliability practices across all product lines, building a cohesive, high-performance SaaS platform with seamless observability and reliability throughout the Stacklok ecosystem.

If site reliability engineering is your passion and you’re ready to make a lasting impact on the future of open source security, we want to hear from you!

Role Success: 6-12 Month Expectations

Acclimatize to the Team: Familiarize yourself with our engineering processes. Build connections with team members, immerse yourself in our company culture, understand our virtues, and learn the way we work and collaborate.
Solid Understanding of Our Products and Services: Gain a solid understanding of Stacklok Cloud products and services, our vision of the platform as well as short and long-term goals to align your contributions to our objectives and lead improvement initiatives.
Deep Dive Into Stacklok Cloud Architecture: Become comfortable with the current infrastructure-as-code environment using Terraform to deploy SaaS software to Kubernetes on AWS.
Build Proficiency in Go: Develop strong proficiency in Go, alongside our primary Python programming language, focusing on best practices, idiomatic design patterns, effective error handling, and unit testing.
Hybrid Contribution: Be an integral part of our product engineering team, contributing time to accelerating feature development and delivery, improving platform functionality, and driving automation efforts that enhance the efficiency and reliability of our systems.
Technical Guidance and Documentation: Document and maintain production infrastructure solutions such as playbooks and architecture diagrams. Mentor team members in production processes and on-call procedures. Foster a culture of operational excellence.
Mentor and Empower Your Peers: Guide and nurture junior engineers, cultivating a culture rooted in empathy, curiosity, and psychological safety. Share your infrastructure know-how with the team, helping them build their knowledge and skills. Through thoughtful code reviews, sharing technical insights, and engaging in the hiring process, your leadership will be pivotal in fostering professional growth and expanding our team’s capabilities.
On-Call Rotation Responsibilities: You will be responsible for on-call duties every 5-6 weeks, with each rotation lasting 2 weeks. During each rotation, you will alternate with the other on-call engineer between primary and secondary roles. The primary role involves leading incident resolution and communication, while the secondary role provides support with troubleshooting and monitoring.

In This Role You Will Have The Opportunity To

Shape The Future of Stacklok Cloud: As a senior site reliability engineer, you’ll be instrumental in developing innovative solutions that enhance our platform’s reliability and performance. Your focus will include regular platform upgrades and the instrumentation of production systems to ensure active reliability and performance monitoring. By bringing fresh ideas, challenging assumptions, and working collaboratively, you’ll help advance our platform and shape strategies for the future of software supply chain security.
Embrace an Automate Everything Mindset: Champion a culture of automation across all operational tasks. You’ll lead initiatives for environment automation and incident management tooling to streamline response improvements and enhance operational efficiency. Implementing application autoscaling and recovery processes, you’ll ensure our systems are resilient and adaptable to changing demands. Working alongside a skilled team, you’ll drive the automation of playbooks, continuous delivery pipelines, and GitHub Terraform processes to elevate our service delivery and incident response capabilities.
Monitor and Improve Service Performance: Take charge of end-to-end service KPI monitoring to drive continuous improvements and ensure optimal performance. You’ll conduct thorough reviews of logs and performance metrics, leveraging shared tooling and incident response automations to enhance our systems. Your analytical mindset will be key to identifying areas for KPI improvements, ensuring that we consistently meet and exceed our performance goals.
Uphold Standards of Excellence: Champion the reliability and quality of our systems by establishing clear Service Level Objectives and advocating for robust monitoring and incident management strategies. You’ll empower the team to maintain high standards, delivering consistent and reliable user experiences that reflect our dedication to excellence.

We understand that not everyone will meet every requirement listed, and that’s perfectly okay! We encourage you to apply regardless of your self-assessment. We value a diverse range of skills and experiences and believe that your unique attributes can make a significant impact. We want to hear from you!

Desired Skills & Experience

Strong background in site reliability engineering, with a robust understanding of observability and distributed tracing tools such as Jaeger, Prometheus, and Grafana to ensure high availability and optimal performance.
Proficient in programming languages, particularly Python (Go experience is a big plus), demonstrating the ability to write clean, efficient, and maintainable code.
Comprehensive knowledge of Infrastructure as Code (IaC) principles, with proficiency in automation tools like Terraform for environment provisioning and configuration management.
Experience with at least one major cloud provider (AWS, Azure, Google Cloud), preferably AWS.
In-depth understanding of cloud-native application deployment and management using technologies like Docker and Kubernetes, including application autoscaling and recovery strategies.
Extensive experience in automating incident response processes using platforms such as PagerDuty to improve response times and incident management efficiency.
Proficient in log aggregation and analysis tools such as AWS Athena and CloudWatch, enabling thorough performance reviews and proactive issue identification.
Experience in defining and implementing Service Level Objectives (SLOs) and key performance indicators (KPIs) to drive service quality and operational excellence.
Knowledge of security best practices in site reliability, with an emphasis on operational security measures and maintaining a secure software supply chain.
Impact-Driven and Collaborative: Track record of delivering solutions that drive business outcomes; excellent written and verbal communication skills for engaging diverse stakeholders. Committed to fostering growth and continuous improvement within teams.
Versatile and Self-Starting: Adaptable in dynamic, startup environments, comfortable in varied roles—from individual contributor to conference presenter—and skilled at making technical topics accessible to broad audiences.

Why Join Us?

At Stacklok, you will be part of a culture that values open communication, collaboration, and innovation. We offer a competitive salary package and flexible work hours. If you’re a self-motivated and results-driven individual with a passion for designing and building secure, scalable, distributed systems, and you want to be part of the most exciting startup in the secure supply chain space, come and join us!

Stacklok Inc. is proud to be an equal opportunity employer. We are committed to providing equal employment opportunities for all people and place great value on both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.