Reliability Engineer
Job description
As a Reliability Engineer reporting to the VP, Technical Operations, you will own the reliability, performance, and operational excellence of Proscia's installations at customer sites. Our platform powers high-resolution digital pathology and AI-assisted workflows and diagnostics in clinical and research environments, often running on customer-managed infrastructure. You'll ensure these deployments are stable, performant, secure, and continuously improving. AI tools are a natural part of how you diagnose, automate, and operate.
This is a hands-on role with a focus on container-based deployments, systems performance, and real-world operational problem-solving in environments where rigor matters.
What You'll Do
Working at a startup like Proscia means wearing many hats, but when you come to work you can expect to focus on the following:
- Deploy, configure, and support Proscia's container-based application stack in on-premises customer environments.
- Own system reliability across customer installations, including uptime, performance, backup/recovery, and upgrade workflows.
- Diagnose and resolve production incidents through deep root cause analysis across application, container, host, storage, and networking layers, using AI alongside traditional debugging to correlate signals and cut through noise.
- Optimize performance for large image datasets and AI workloads running on customer-managed compute infrastructure.
- Improve installation automation, configuration management, and repeatability across diverse environments, integrating agentic workflows into your day-to-day work to keep pace with demands from Engineering.
- Develop and refine monitoring, logging, and alerting patterns appropriate for customer-hosted deployments.
- Collaborate closely with Engineering, Customer Success, and Support to translate field learnings into product and operational improvements.
- Create operational playbooks, written with the clarity and structure that make them useful to teammates, customers, and the AI-augmented workflows the team relies on.
- Contribute to Proscia's technical presence, whether through internal demos, engineering blog posts, or operational knowledge sharing that raises the bar for how the team works.
Requirements
You think in systems: you reason about how applications, infrastructure, and customer environments interact, not just the layer you're working in. AI tools are part of how you operate: troubleshooting, writing automation, navigating the complexity of diverse on-premises environments.
- Deep hands-on experience deploying and operating containerized applications using container orchestration in production environments.
- Strong Linux systems expertise (process management, networking, storage, security hardening, performance tuning).
- Expert troubleshooting skills in distributed systems across application, container, and infrastructure layers.
- Experience with enterprise networking: you can troubleshoot and recommend corrections in customer infrastructure. Comfortable operating software in customer-managed and on-premises environments.
- Experience supporting data-intensive systems, ideally involving large image files or compute-heavy workloads.
- Working knowledge of observability practices (logs, metrics, tracing) and pragmatic monitoring approaches in non-cloud-native environments.
- Comfort working directly with customers or customer-facing teams to resolve high-impact issues.
- You already use AI tools in your operational work, whether for troubleshooting, writing automation, analyzing logs, or wherever else they fit your practice.
- A mindset aligned with Proscia's values: ownership, speed, simplification, and a willingness to challenge the status quo.
- Experience building with or on top of LLMs, AI agents, or agentic pipelines.
- Demonstrated fluency applying AI tools to real operational problems beyond basic code completion.
- Familiarity with prompt engineering, tool use patterns, and evaluation of AI systems: you know when AI output is production-ready and when it needs different guardrails.
Nice to Have
- Experience with healthcare or regulated environments.
- Exposure to Kubernetes (for hybrid or future-state deployments).
- Experience with infrastructure automation or configuration management tools.
- Familiarity with database performance tuning for large datasets.
- Experience supporting GPU-enabled workloads.
- Open-source contributions, side projects, or a portfolio that shows how you think and build.
- Background that spans multiple domains or disciplines.
- Active in technical communities, forums, or meetups.