Technical Program Manager IRC294357
Job description
We are looking for a highly technical Lead Platform Engineer to architect the observability, cost governance, and security framework for our enterprise AI agent ecosystem. You will be responsible for ensuring that our agentic workflows, built on AWS Bedrock, AgentCore, and MCP servers, are scalable, observable, and cost-efficient. The ideal candidate bridges the gap between traditional DevOps and the emerging world of LLMOps, with a deep focus on distributed tracing for non-deterministic AI workloads.
Salary Verbiage: "GlobalLogic estimates the starting pay range for this role, to be performed hybrid in Reading, PA, to be $130K to $140K. This reflects base salary only and does not include additional performance-linked variable compensation, benefits, etc. that may apply to the role. This pay range is provided as a good faith estimate, and the amount offered may be higher or lower. GlobalLogic takes many factors into consideration in making an offer, including candidate qualifications, work experience, operational needs, travel and onsite requirements, internal peer equity, prevailing wage, responsibilities, and other market and business considerations."
- Assess CloudWatch, X-Ray, Bedrock logging, and AgentCore traces against agentic workflow requirements; produce gap analysis
- Set up observability in Dynatrace
- Design post-deployment validation pipeline for agents & MCP servers (deployment health + tool registration checks)
- Implement distributed tracing & structured logging: LLM decisions, tool selections, sub-agent calls, MCP interactions
- Evaluate LangFuse / LiteLLM proxy vs. AWS-native; deliver target-state observability architecture recommendation
- Cost Tracking & TCO
- Extend tagging taxonomy to cover agent runtimes, MCP servers, vector DBs, Bedrock token consumption per namespace
- Design cost visibility model: aggregate agent, MCP, vector DB, and Bedrock token costs per team/department
- Build CloudWatch (or equivalent) dashboards for per-team spend; configure AWS Budgets with alerting thresholds
- Automate cost reports delivered via email / Microsoft Teams; implement anomaly detection rules
- Monitoring & Alerting
- Define P1-P4 alerting rules: deployment failures, runtime errors, tool invocation failures, MCP connectivity issues
- Integrate alert notifications to Microsoft Teams channels and email; route by resource ownership tags
- Author runbooks linked to every alert; publish in Confluence for developer self-service resolution
- Evaluate AWS-native vs. third-party monitoring stack; deliver recommendation aligned to observability architecture
- Security & Access Control
- Assess current IAM + tagging approach for multi-team isolation; identify scalability gaps and risks
- Evaluate Cedar policy engine (AgentCore) for fine-grained tool access control; document enterprise-scale gaps
- Design scalable ABAC-based identity model for multi-team isolation without IAM policy sprawl; deliver Terraform modules
Requirements
- Experience: 8+ years in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
- Cloud Expertise: Deep proficiency in AWS (IAM, CloudWatch, Bedrock, Lambda).
- Observability Tools: Proven experience with Dynatrace, Jaeger, or Honeycomb, and distributed tracing standards.
- AI/LLM Interest: Familiarity with the LLM lifecycle, including prompt execution, token usage, and frameworks like LangChain or AgentCore.
- Automation: Advanced experience with Terraform and CI/CD pipeline design.
- Collaboration: Experience working in an Agile environment with integrated tools like Microsoft Teams and Confluence.