Remote (Country), Full-time

Articul8 AI is hiring a Senior Site Reliability Engineer (SRE) - (Brazil)

About the Role

Articul8 AI is hiring a Senior Site Reliability Engineer (SRE) in Brazil to ensure the reliability, performance, and scalability of our GenAI SaaS platform. You will bridge the gap between development and operations, implementing automation and best practices to maintain service reliability objectives while supporting rapid innovation.

What You'll Do

  • Architect and maintain scalable, highly available infrastructure for our GenAI platform.
  • Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
  • Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
  • Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
  • Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
  • Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
  • Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
  • Implement and enforce security best practices across all systems and environments.
  • Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

What We're Looking For

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience.
  • 5+ years of experience in DevOps, SRE, or similar roles.
  • Strong experience with cloud platforms (AWS, GCP, or Azure).
  • Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.).
  • Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.).
  • Solid background in containerization technologies (Docker, Kubernetes).
  • Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.).
  • Strong understanding of CI/CD pipelines and automation.
  • Exceptional problem-solving skills and the ability to troubleshoot complex systems.

Nice to Have

  • Experience supporting AI/ML systems in production.
  • Knowledge of GPU infrastructure management and optimization.
  • Familiarity with distributed systems and high-performance computing.
  • Experience with database systems (SQL and NoSQL).
  • Certifications in cloud platforms (AWS, GCP, Azure).
  • Experience with chaos engineering and resilience testing.
  • Knowledge of security best practices and compliance requirements.

Technical Stack

  • Cloud: AWS, GCP, Azure
  • Languages: Python, Go, Bash
  • Infrastructure as Code: Terraform, CloudFormation
  • Containerization: Docker, Kubernetes
  • Monitoring/Observability: Prometheus, Grafana, ELK stack

Work Mode

This position follows a local-country work mode and is based in Brazil.

Articul8 AI is an equal opportunity employer.

Required Skills
AWS, GCP, Azure, Python, Go, Bash, Terraform, CloudFormation, Docker, Kubernetes, Site Reliability Engineering, Infrastructure as Code, Cloud Infrastructure, Monitoring, Incident Response
About company
Articul8 AI

Articul8 AI creates exceptional AI products that exceed customer expectations.

Job Details
Category: infrastructure
Posted: 8 months ago