Responsibilities
- Lead deployment strategies and CI/CD pipelines across multiple environments
- Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
- Own infrastructure as code using Terraform to automate provisioning and configuration
- Build comprehensive observability systems: monitoring, metrics, logging, and alerting
- Implement security controls, compliance frameworks, and data governance policies
- Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
- Ensure system reliability, performance, and scalability
- Drive incident response, postmortems, and continuous improvement
- Troubleshoot infrastructure and application issues across multiple environments
- Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
- Own the full deployment lifecycle: build, test, release, and rollback strategies
- Implement blue-green deployments, canary releases, and progressive rollouts
- Build automated deployment tooling and workflows
- Ensure zero-downtime deployments with reliable rollback capabilities
- Optimize build and deployment performance
- Manage artifact repositories and container registries
- Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
- Architect and deploy on-premise solutions for enterprise customers (Linux-based)
- Manage Kubernetes clusters, container orchestration, and networking
- Implement disaster recovery, backup strategies, and business continuity
- Optimize cloud costs and resource utilization
- Define and track SLIs, SLOs, and error budgets for critical services
- Write and maintain Terraform modules for infrastructure provisioning
- Implement GitOps workflows for infrastructure changes
- Automate infrastructure scaling, updates, and operations
- Ensure reproducible and version-controlled infrastructure
- Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
- Build dashboards for system health, performance, and business metrics
- Implement distributed tracing for microservices
- Conduct capacity planning and performance analysis
- Drive reliability improvements through data-driven insights
- Implement security best practices: identity management, secrets management, network policies
- Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
- Conduct security audits and vulnerability remediation
- Implement data governance policies for AI pipelines and user data
- Ensure compliance with data privacy regulations (GDPR, CCPA)
- Write automation scripts and tools in Python for operational tasks
- Build internal tooling for deployments, monitoring, and incident response
- Develop runbooks, automation, and self-healing systems
- Create APIs for infrastructure operations when needed
- Maintain high code quality and testing standards for tooling
- Participate in on-call rotation and lead incident response
- Conduct blameless postmortems and drive action items
- Build and maintain incident response playbooks
- Improve system resilience and the handling of failure modes
- Partner with engineering teams on deployment strategies and architecture
- Work with the security team on compliance and governance
- Mentor engineers on operational best practices
- Document systems, procedures, and runbooks
Requirements
- 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
- Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
- On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
- Terraform expert: Deep experience writing and maintaining infrastructure as code
- Observability systems: Proven track record building monitoring, alerting, and metrics platforms
- Security mindset: Experience implementing security controls and best practices
- Data governance: Understanding of data privacy, residency requirements, and governance frameworks
- Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
- Kubernetes and container orchestration in production
- Strong Linux/Unix administration and scripting (Bash, Python)
- CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
- Version control and GitOps practices
- Strong problem-solving and debugging skills
- Fluent in English (Spanish is a plus)
Nice to Have
- Python proficiency for automation and internal tooling
- Experience with cloud AI platforms (Vertex AI, Azure ML, AWS SageMaker)
- Service mesh experience (Istio, Linkerd) or API gateways
- Experience with GPU workloads and ML infrastructure
- FinOps and cloud cost optimization
- Compliance frameworks experience (SOC 2, ISO 27001, HIPAA, FedRAMP)
- Database operations: PostgreSQL, Redis administration
- Experience with FastAPI or similar frameworks for internal tools
- Contributions to open-source infrastructure projects
- Background in hardware or semiconductor industries
Work Arrangement
Hybrid — Boston, US