LineVision is looking for a Site Reliability Engineer to establish our dedicated SRE practice. You will ensure our grid intelligence platform delivers exceptional reliability for utility customers by owning the development of critical systems observability, deployment processes, and incident response protocols.
What You'll Do
- Establish and maintain Service Level Objectives and observability frameworks for critical services supporting utility grid operations.
- Implement CI/CD guardrails including canary deployments, automated rollbacks, and pre-production validation to improve deployment reliability.
- Develop comprehensive incident response procedures with documented runbooks, escalation paths, and blameless post-incident review processes.
- Partner with platform, engineering, and customer support teams to instrument systems and build reliability capabilities.
- Design and implement monitoring dashboards tracking SLA compliance, reliability metrics, and error budgets.
- Complete a comprehensive assessment of current infrastructure, identifying critical services requiring immediate observability improvements.
- Establish baseline SLOs for top-priority services and implement initial monitoring dashboards.
- Document current deployment processes and incident response procedures, identifying gaps and quick-win improvements.
- Deploy a production-ready observability framework covering all critical customer-facing services, with alerts configured for key reliability signals.
- Implement CI/CD improvements including automated testing gates, canary deployments, and rollback capabilities for core platform services.
- Lead at least three blameless post-incident reviews, establishing templates and processes that become standard practice.
- Achieve measurable improvements in deployment success rates and mean time to recovery through implemented SRE practices.
- Build strong cross-functional partnerships resulting in proactive reliability improvements identified through error budget monitoring.
- Establish LineVision's SRE practice as a recognized capability, with documentation, runbooks, and processes that can scale with company growth.
What We're Looking For
- Strong experience with core AWS services including EC2, RDS, Lambda, and networking/VPC configuration for production environments.
- Hands-on proficiency with observability tools like Datadog, Prometheus, Grafana, or CloudWatch for instrumenting distributed systems.
- Experience with Infrastructure as Code tools like Terraform, CloudFormation, or Pulumi for managing and versioning infrastructure.
- Python and TypeScript experience for automation, tooling, and system instrumentation.
- Demonstrated experience establishing Service Level Objectives and tracking error budgets.
- Critical Thinking: Lead problem-solving efforts around complex reliability challenges, consistently applying critical thinking to identify root causes and prevent future incidents.
- Taking Ownership: Lead reliability projects with minimal supervision, taking full ownership of SRE practice development and system observability outcomes.
- Stakeholder Management: Manage relationships across engineering, platform, and support teams, providing clear updates on reliability metrics and using your influence to align teams on SRE priorities.
- Delivering Innovative Solutions: Lead implementation of modern SRE practices, inspiring teams to think creatively about reliability challenges in a utility infrastructure context.
Nice to Have
- Background in energy, utility, or critical infrastructure sectors where reliability directly impacts public services.
- AWS certifications demonstrating deep platform expertise.
- Experience with security compliance frameworks relevant to utility operations.
- Track record of building SRE practices from the ground up in fast-growing technical organizations.
Technical Stack
- AWS (EC2, RDS, Lambda, VPC)
- Datadog, Prometheus, Grafana, CloudWatch
- Terraform, CloudFormation, Pulumi
- Python, TypeScript
Team & Environment
You will partner with platform, engineering, and customer support teams. You will work in a communicative, collaborative environment with high autonomy and trust.
Benefits & Compensation
- Impactful work accelerating our mission of providing utilities with grid intelligence.
- Ownership with high autonomy and trust.
- Flexibility with trust-based PTO and a flexible work schedule.
- Real-world innovation working with patented technology.
Work Mode
This role operates on a hybrid work model based out of our Boston, MA headquarters.





