Dearborn, Michigan, United States (On-site)

Ford is hiring an SRE Engineer

Responsibilities

  • Engage in a 24/7 on-call schedule to deliver immediate responses during critical system outages and maintain uptime for the North American eCommerce platform.
  • Serve as a key contributor in incident triage, diagnosing and resolving complex production issues to reduce recovery time.
  • Maintain and enhance operational runbooks and standard procedures to ensure consistent and effective incident handling.
  • Lead and take part in blameless post-incident reviews and root cause analyses to uncover systemic issues and prevent recurrence.
  • Work closely with development and platform teams to design scalable reliability improvements based on incident learnings.
  • Establish and monitor service level indicators and objectives to quantify system performance and availability.
  • Collaborate with product leadership to define service expectations and manage error budgets that balance delivery speed with system resilience.
  • Analyze monthly release cycles for potential risks to system health and compliance with service level targets.
  • Utilize and refine observability tools such as Dynatrace and GCP Logging to monitor system behavior and detect issues early.
  • Identify gaps in monitoring coverage and implement technical enhancements for full system visibility.
  • Develop, manage, and improve metrics, dashboards, and alerts using Terraform in alignment with organizational standards.
  • Create effective alerting strategies with thresholds based on service level violations and error budget consumption.
  • Drive automation by building scripts and tools to eliminate repetitive manual operations.
  • Build self-healing systems that automatically detect and correct common failures, minimizing human intervention.
  • Deploy and oversee AI-powered observability platforms to enable predictive monitoring and maintenance.
  • Collaborate with engineering teams to address performance bottlenecks and improve operational workflows.
  • Produce clear, data-backed reports on system reliability, incident patterns, and SRE program progress for leadership review.

Work Arrangement

On-site

Other

  • Grade 7 or 8.

Required Skills

  • GCP
  • Terraform
  • Java
  • Node.js
  • Python
  • Go
  • DevOps
  • Infrastructure as Code

About Ford

Ford Motor Company is an established global automotive manufacturer building a better world through innovative, exciting, and sustainable products and services. The company advances technologies in autonomy, electrification, and smart mobility.

Job Details

  • Department: Engineering
  • Category: Infrastructure
  • Posted: 4 months ago