Shield is a global startup with offices in Tel Aviv, New York, London, and Lisbon.
We’re growing and looking for another important piece of the puzzle.
Is it you?
Let’s get down to business:
What you will do
Key Responsibilities:
Establish and nurture a culture of excellence within the SRE team, promoting best practices, effective work processes, and methodologies. Lead by example and mentor the team to foster a collaborative and high-performing environment.
Set clear team goals and priorities in alignment with organizational objectives. Ensure resources are available and allocated efficiently to meet project timelines and deliverables.
Recruit, train, and develop team members, providing guidance and support to enhance their skills and career progression. Encourage continuous learning and adaptability to new technologies and methodologies.
Design, implement, and maintain scalable and reliable infrastructure solutions.
Develop and deploy monitoring, alerting, and logging systems to proactively identify and mitigate operational issues.
Review and refine existing alerts, working closely with developers to automate responses and enable self-healing.
Develop and maintain monitoring dashboards that provide clear and actionable insights into application reliability and system performance.
Conduct capacity planning and performance tuning to optimize system performance and resource utilization.
Automate repetitive tasks and processes to streamline operations and improve efficiency.
Lead incident response and resolution, including rapid troubleshooting, coordinating cross-functional teams, root cause analysis, and post-mortem reviews.
Develop and maintain incident response procedures and runbooks to ensure efficient and effective handling of incidents.
Communicate effectively with stakeholders during incidents, providing timely updates and managing expectations.
Continuously evaluate and adopt new technologies and methodologies to enhance our infrastructure and operations.
Oversee and optimize our cloud infrastructure on AWS, ensuring scalability, reliability, and cost-effectiveness.
Regularly analyze cloud service usage and expenses, implementing strategies to optimize costs.
Minimum Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or a related field.
6+ years of experience as a site reliability or platform engineer, preferably in a fast-scaling environment.
At least 2 years in a leadership role, demonstrating effective team management, mentorship, and strategic planning.
Hands-on experience with Terraform and Terragrunt.
Extensive knowledge of Kubernetes and containerization technologies.
Hands-on experience with the Prometheus stack.
Ability to design and develop code using Python or Go.
Strong inclination toward automating manual tasks and processes to improve operational efficiency.
Excellent troubleshooting abilities with a methodical approach to diagnosing and resolving issues.
In-depth knowledge of cloud services, particularly AWS, including best practices in security and compliance.
Excellent communication abilities to coordinate effectively with both technical and non-technical stakeholders.
SRE Team Lead, Portugal
Modified May 7, 2025