TCGplayer is hiring a Frontline Engineer to serve as the first line of defense for incident response and problem management, ensuring the reliability and performance of our platform. In this role, you will lead real-time incident response, collaborate with engineering teams for root cause analysis, and drive operational improvements.
What You'll Do
- Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents.
- Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services and Kubernetes clusters.
- Collaborate closely with engineering teams post-incident, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.
- Drive operational excellence by measuring and analyzing critical metrics to identify improvement opportunities and implement impactful solutions.
- Continuously refine and update operational runbooks and procedures, ensuring alignment with evolving technologies and business needs.
- Proactively contribute to long-term strategic initiatives to improve incident management practices.
What We're Looking For
- A Bachelor’s degree in a technical field or 5+ years of equivalent experience in system administration, infrastructure engineering, or related roles.
- Direct experience as an incident commander, including managing live incident calls, coordinating triage efforts, and driving communications during high-pressure situations.
- Strong communication skills with the ability to clearly articulate technical details and strategies to both technical and non-technical stakeholders.
- Excellent problem-solving capabilities, able to stay composed and decisive under pressure during high-impact incidents.
- Hands-on operational experience with AWS in a production environment, specifically executing runbooks, restarting EC2 instances, checking alarms, and pulling logs from CloudWatch.
- Proficiency with Kubernetes, including troubleshooting containerized workloads, understanding pod health, managing deployments, and interacting directly with Kubernetes clusters.
- Experience with scripting (Python, PowerShell, or Bash) to automate operational tasks or assist in incident resolution workflows.
Nice to Have
- Relevant certifications.
Technical Stack
- AWS, EC2, CloudWatch, IAM, Kubernetes, Python, PowerShell, Bash
Team & Environment
You will report to the Incident and Problem Management leader and collaborate closely with Cloud Operations, Site Reliability Engineering (SRE), Engineering teams, and Product stakeholders.
Benefits & Compensation
- Compensation range: $103,200 to $178,400, plus equity in the form of Restricted Stock Units (as applicable).
- Comprehensive medical, financial, and other benefits.
- 401(k) eligibility.
- Various paid time off benefits, including PTO and parental leave.
Work Mode
This role is fully remote.
eBay is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, disability, or other legally protected status.