Senior Software Engineer - Agentic Runtime Safety & Observability

Keysight Technologies
5 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

API
Artificial Intelligence
Systems Engineering
C++
Distributed Systems
Fault Tolerance
Interoperability
Python
ZeroMQ (Concurrent Programming Libraries)
Multi-Agent Systems
gRPC

Job description

As a Senior Engineer - Agentic Runtime Safety, Stability & Observability, you will design and own the runtime safety and reliability layer of Keysight's agentic orchestration platform.

Your mission is to ensure that AI-driven orchestration remains aligned with human intent, observable, auditable, and recoverable. You will architect guardrails, rollback mechanisms, and observability pipelines that allow autonomous systems to act powerfully without sacrificing trust, control, or predictability.

This role bridges AI systems, runtime engineering, and safety-critical design, working closely with AI architects, ML engineers, and simulation teams.

Runtime Safety & Execution Control

  • Design runtime guardrails ensuring agent actions remain aligned with intent, policies, and system constraints.

  • Implement intent validation, semantic checks, and execution contracts before orchestration runs.

  • Define safety boundaries, escalation paths, and rollback conditions within agent workflows.

Fault Isolation, Rollback & Recovery

  • Architect deterministic rollback, checkpointing, and recovery mechanisms for multi-agent systems.

  • Design fault-isolation boundaries to prevent local failures from cascading system-wide.

  • Build sandboxed execution environments for validating AI-generated orchestration logic.

Observability & Diagnostics

  • Implement end-to-end observability capturing agent decisions, execution traces, and system health.

  • Develop anomaly detection and confidence-based safety gating for runtime behavior.

  • Build introspection APIs and dashboards exposing rationale, safety metrics, and performance signals.

Adaptive Governance

  • Establish feedback loops that adjust orchestration behavior based on performance and safety signals.

  • Contribute to continuous safety validation and runtime certification pipelines.

  • Collaborate across teams to embed transparency and traceability into every orchestration cycle.

Requirements

  • PhD or 5+ years of experience in systems engineering, runtime reliability, or safety-critical software.

  • Strong proficiency in Python and C/C++.

  • Proven experience designing fault-tolerant, observable, and recoverable systems.

  • Hands-on experience with agentic orchestration frameworks (e.g., LangGraph, LangChain, or similar).

  • Solid understanding of execution control, intent alignment, and policy enforcement in automated systems.

  • Experience building telemetry, monitoring, or diagnostics pipelines in complex runtimes.

Desired Qualifications

  • Background in safety-critical or regulated domains (e.g., aerospace, industrial systems, EDA, HPC).

  • Experience with semantic validation, policy modeling, or goal disambiguation.

  • Familiarity with rollback strategies, dynamic gating, or safety scoring in distributed systems.

  • Experience with Python/C++ interoperability (e.g., PyBind11, gRPC, ZeroMQ).

  • Exposure to simulation-driven systems or hybrid AI-physics environments.

About the company

Keysight is at the forefront of technology innovation, delivering breakthroughs and trusted insights in electronic design, simulation, prototyping, test, manufacturing, and optimization. Our ~15, employees create world-class solutions in communications, 5G, automotive, energy, quantum, aerospace, defense, and semiconductor markets for customers in over countries.

Our award-winning culture embraces a bold vision of where technology can take us and a passion for tackling challenging problems with industry-first solutions. We believe that when people feel a sense of belonging, they can be more creative, innovative, and thrive at all points in their careers.

About the Team

Keysight's Applied AI Autonomy Initiative is building a next-generation agentic orchestration framework that enables AI agents to reason, adapt, and coordinate across complex engineering workflows. The platform combines LLM-based reasoning, reinforcement-inspired feedback loops, and simulation-driven validation to automate and optimize engineering decisions at scale. This role sits at the core of the initiative, defining how autonomy can be deployed safely, transparently, and predictably in high-assurance engineering environments.
