This role is responsible for ensuring the performance, scalability, and reliability of cloud-native systems handling billions of daily transactions. You will lead the design of robust, high-throughput architectures and establish performance benchmarks in collaboration with product and engineering leadership.
Key Responsibilities
- Lead the analysis of complex distributed systems to identify performance bottlenecks and architectural constraints
- Design and implement cloud-native solutions built to scale efficiently under extreme load
- Facilitate service-level objective (SLO) and service-level indicator (SLI) planning sessions with stakeholders
- Develop comprehensive test strategies for scalability, stress, availability, and system longevity
- Collaborate with developers, product teams, and quality engineers to align performance goals with business requirements
- Define and communicate testing timelines, risks, and mitigation plans to ensure project delivery
- Use tools such as JMeter, Locust, and Gremlin to simulate real-world traffic and failure scenarios
- Review test automation frameworks and promote best practices in performance engineering
- Analyze test results and support capacity planning for upcoming releases
- Provide technical guidance to production support teams on performance tuning, patching, and system sizing
- Deliver training and technical sessions on capacity planning methodologies
- Report on performance trends and progress toward key project milestones
Qualifications
- Degree in Computer Science or related field, or equivalent professional experience
- Proven expertise in Java or .NET, SQL and NoSQL databases, and distributed systems
- Deep understanding of event-driven architectures and asynchronous processing
- Hands-on experience with performance testing tools including JMeter, LoadRunner, and custom frameworks
- Familiarity with observability platforms such as Splunk, New Relic, Prometheus, and Grafana
- Strong grasp of HTTP, REST, JSON, AJAX, and web application performance fundamentals
- Experience tuning databases such as Oracle and DB2, using diagnostic tools such as Oracle AWR and Statspack
- Working knowledge of operating systems including Linux and Solaris, and performance profiling tools like OProfile and VTune
- Ability to interpret system metrics, including throughput, latency, and CPU and memory utilization
- Understanding of JVM internals, garbage collection, multi-threading, and caching strategies
- Excellent written and verbal communication skills, with the ability to document technical findings clearly
- Strong analytical and problem-solving abilities, with attention to detail
