Senior Software Engineer - Agentic Runtime Safety & Observability
Job description
As a Senior Engineer - Agentic Runtime Safety, Stability & Observability, you will design and own the runtime safety and reliability layer of Keysight's agentic orchestration platform.
Your mission is to ensure that AI-driven orchestration remains aligned with human intent, observable, auditable, and recoverable. You will architect guardrails, rollback mechanisms, and observability pipelines that allow autonomous systems to act powerfully without sacrificing trust, control, or predictability.
This role bridges AI systems, runtime engineering, and safety-critical design, working closely with AI architects, ML engineers, and simulation teams.
Runtime Safety & Execution Control
- Design runtime guardrails ensuring agent actions remain aligned with intent, policies, and system constraints.
- Implement intent validation, semantic checks, and execution contracts before orchestration runs.
- Define safety boundaries, escalation paths, and rollback conditions within agent workflows.
Fault Isolation, Rollback & Recovery
- Architect deterministic rollback, checkpointing, and recovery mechanisms for multi-agent systems.
- Design fault-isolation boundaries to prevent local failures from cascading system-wide.
- Build sandboxed execution environments for validating AI-generated orchestration logic.
Observability & Diagnostics
- Implement end-to-end observability capturing agent decisions, execution traces, and system health.
- Develop anomaly detection and confidence-based safety gating for runtime behavior.
- Build introspection APIs and dashboards exposing rationale, safety metrics, and performance signals.
Adaptive Governance
- Establish feedback loops that adjust orchestration behavior based on performance and safety signals.
- Contribute to continuous safety validation and runtime certification pipelines.
- Collaborate across teams to embed transparency and traceability into every orchestration cycle.
Requirements
- PhD or 5+ years of experience in systems engineering, runtime reliability, or safety-critical software.
- Strong proficiency in Python and C/C++.
- Proven experience designing fault-tolerant, observable, and recoverable systems.
- Hands-on experience with agentic orchestration frameworks (e.g., LangGraph, LangChain, or similar).
- Solid understanding of execution control, intent alignment, and policy enforcement in automated systems.
- Experience building telemetry, monitoring, or diagnostics pipelines in complex runtimes.
Desired Qualifications
- Background in safety-critical or regulated domains (e.g. aerospace, industrial systems, EDA, HPC).
- Experience with semantic validation, policy modeling, or goal disambiguation.
- Familiarity with rollback strategies, dynamic gating, or safety scoring in distributed systems.
- Experience with Python/C++ interoperability (e.g. PyBind11, gRPC, ZeroMQ).
- Exposure to simulation-driven systems or hybrid AI-physics environments.