

Staff Cloud Ops Engineer
Job Description
As a Staff Cloud Ops Engineer at Classy, you will lead the Platform Infrastructure and Operations team and play a pivotal role in architecting, building, and maintaining our advanced cloud infrastructure. This infrastructure is crucial to our online fundraising platform, which supports nonprofits globally. Your leadership will ensure that it consistently achieves 99.999% availability and meets the demands of our sophisticated global payments platform.
Responsibilities
Architect and implement scalable, fault-tolerant cloud solutions that process billions of dollars annually, ensuring operational excellence and security.
Lead and mentor a team of cloud engineers; promote a culture of continuous improvement, innovation, and learning.
Make strategic decisions on cloud architecture and spearhead the adoption of cutting-edge practices and technologies.
Drive enhancements in system performance through advanced observability and reliability practices.
Oversee sophisticated testing and validation processes to ensure the robustness of the cloud infrastructure.
Develop and refine real-time monitoring and logging systems, setting industry benchmarks in operational excellence.
Implement and report on DORA (DevOps Research and Assessment) metrics to measure and enhance the effectiveness of development processes and practices across the team.
Design and manage CI/CD pipelines to ensure rapid, reliable, and repeatable deployment of our cloud-based applications.
Job Requirements
Bachelor’s degree in Computer Science or a related field, or 12+ years of equivalent practical experience.
8+ years of experience designing and managing scalable cloud-based infrastructure, with a preference for SaaS environments.
Demonstrated leadership in managing engineering teams and projects.
Expert knowledge of AWS, and proficiency in container technologies such as Docker and Kubernetes and in Infrastructure as Code (IaC) practices.
Advanced understanding of software architecture, including asynchronous event-driven architecture and microservices.
Extensive experience in performance and reliability testing using advanced tools such as k6 and Artillery.
Expertise in application performance management (APM) with tools like New Relic, Datadog, and Splunk.
Strong programming and scripting skills in Bash, PHP, and Node.js.
Experience managing distributed data systems and troubleshooting complex issues under high pressure and load.
In-depth knowledge of high-volume transaction systems and familiarity with compliance standards and regulations such as PCI, SOC 2, and GDPR.
Exceptional leadership and collaborative skills, with a track record of leading initiatives and mentoring teams.
Ability to excel in a dynamic, fast-paced startup environment.
Superb communication skills, with the ability to collaborate effectively with, and influence, teams across diverse functions and cultural backgrounds.