Boston, United States of America · Hybrid · Full-time

Axiomatic AI is hiring a Senior Platform Engineer

Responsibilities

  • Lead deployment strategies and CI/CD pipelines across multiple environments
  • Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
  • Own infrastructure as code using Terraform to automate provisioning and configuration
  • Build comprehensive observability systems: monitoring, metrics, logging, and alerting
  • Implement security controls, compliance frameworks, and data governance policies
  • Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
  • Ensure system reliability, performance, and scalability
  • Drive incident response, postmortems, and continuous improvement
  • Troubleshoot infrastructure and application issues across multiple environments
  • Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
  • Own the full deployment lifecycle: build, test, release, and rollback strategies
  • Implement blue-green deployments, canary releases, and progressive rollouts
  • Build automated deployment tooling and workflows
  • Ensure zero-downtime deployments and rollback capabilities
  • Optimize build and deployment performance
  • Manage artifact repositories and container registries
  • Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
  • Architect and deploy on-premise solutions for enterprise customers (Linux-based)
  • Manage Kubernetes clusters, container orchestration, and networking
  • Implement disaster recovery, backup strategies, and business continuity
  • Optimize cloud costs and resource utilization
  • Define and track SLIs, SLOs, and error budgets for critical services
  • Write and maintain Terraform modules for infrastructure provisioning
  • Implement GitOps workflows for infrastructure changes
  • Automate infrastructure scaling, updates, and operations
  • Ensure reproducible and version-controlled infrastructure
  • Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
  • Build dashboards for system health, performance, and business metrics
  • Implement distributed tracing for microservices
  • Conduct capacity planning and performance analysis
  • Drive reliability improvements through data-driven insights
  • Implement security best practices: identity management, secrets management, network policies
  • Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
  • Conduct security audits and vulnerability remediation
  • Implement data governance policies for AI pipelines and user data
  • Ensure compliance with data privacy regulations (GDPR, CCPA)
  • Write automation scripts and tools in Python for operational tasks
  • Build internal tooling for deployments, monitoring, and incident response
  • Develop runbooks, automation, and self-healing systems
  • Create APIs for infrastructure operations when needed
  • Maintain high code quality and testing standards for tooling
  • Participate in on-call rotation and lead incident response
  • Conduct blameless postmortems and drive action items
  • Build and maintain incident response playbooks
  • Improve system resilience and the handling of failure modes
  • Partner with engineering teams on deployment strategies and architecture
  • Work with the security team on compliance and governance
  • Mentor engineers on operational best practices
  • Document systems, procedures, and runbooks

Requirements

  • 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
  • Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
  • Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
  • On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
  • Terraform expert: Deep experience writing and maintaining infrastructure as code
  • Observability systems: Proven track record building monitoring, alerting, and metrics platforms
  • Security mindset: Experience implementing security controls and best practices
  • Data governance: Understanding of data privacy, residency requirements, and governance frameworks
  • Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
  • Kubernetes and container orchestration in production
  • Strong Linux/Unix administration and scripting (Bash, Python)
  • CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
  • Version control and GitOps practices
  • Strong problem-solving and debugging skills
  • Fluent in English (Spanish is a plus)

Nice to Have

  • Python proficiency for automation and internal tooling
  • Experience with cloud AI platforms (Vertex AI, Azure ML, AWS SageMaker)
  • Service mesh experience (Istio, Linkerd) or API gateways
  • Experience with GPU workloads and ML infrastructure
  • FinOps and cloud cost optimization
  • Compliance frameworks experience (SOC 2, ISO 27001, HIPAA, FedRAMP)
  • Database operations: PostgreSQL, Redis administration
  • Experience with FastAPI or similar frameworks for internal tools
  • Contributions to open-source infrastructure projects
  • Background in hardware or semiconductor industries

Work Arrangement

Hybrid — Boston, US

Required Skills

DevOps, Azure, FastAPI

About the company
Axiomatic AI
Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. By combining deep learning with formal logic and physics-based modeling, we create verifiable, interpretable AI systems that collaborate with and support human researchers in high-stakes scientific and engineering workflows. Our mission, 30×30, is to deliver a 30× improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030.
Job Details
Category: infrastructure
Posted a day ago