Senior Infrastructure Engineer

Graswald GmbH
3 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Remote

Tech stack

Agile Methodologies
Artificial Intelligence
Amazon Web Services (AWS)
Bash
Cloud Computing
Linux
DevOps
Identity and Access Management
Python
Reliability Engineering
Prometheus
Datadog
Grafana
Containerization
Kubernetes
Information Technology
Machine Learning Operations
Terraform
Docker
Go

Job description

The Senior Infrastructure Engineer designs and operates the systems that power Graswald's platform. This role focuses on reliability, scalability, security, and cost efficiency, while enabling product teams to move quickly and safely. As a senior member of the team, you'll own core infrastructure architecture, make long-term technical trade-offs, and mentor others through example.

  • Infrastructure Design & Development: Design, build, and maintain scalable, resilient, and secure infrastructure systems. Implement automation and Infrastructure-as-Code (IaC) practices to ensure consistency, reliability, and maintainability of environments.
  • Technical Contribution & Operational Excellence: Actively contribute to the architecture, deployment, and ongoing improvement of cloud infrastructure and platform services. Perform rigorous peer reviews of infrastructure code, CI/CD pipelines, and system configurations to uphold quality, efficiency, and adherence to best practices.
  • Reliability & Operations: Own the stability, performance, and observability of production systems. Lead incident response, root cause analysis, and long-term improvements to prevent recurrence. Help define a sustainable on-call culture.
  • Performance & Cost Optimization: Regularly review resource usage and optimize infrastructure for performance and cost efficiency. Propose architectural improvements where needed.
  • Collaboration & Enablement: Partner closely with product and engineering teams to design reliable infrastructure solutions, participate in architectural discussions and postmortems, and provide guidance on best practices for scalability, cost optimization and security.
  • Continuous Learning & AI-Driven Operations: Stay current with evolving cloud, DevOps, and infrastructure technologies. Explore and apply AI-driven capabilities in areas like monitoring, incident detection, and automated remediation to enhance operational excellence and productivity. Experiment with and champion modern practices to drive innovation within the infrastructure team.
  • Documentation & Knowledge Sharing: Create and maintain clear, comprehensive documentation for infrastructure designs, operational runbooks, and processes. Ensure that knowledge is easily accessible for current and future team members, reducing operational risk and onboarding time.
  • Security and Compliance: Implement and enforce security controls, access management policies, and compliance requirements across infrastructure environments.

Requirements

  • Several years of professional experience in infrastructure engineering, DevOps, or site reliability engineering (SRE) roles.
  • Experience operating within agile software development teams and modern DevOps practices.
  • Bachelor's degree in Computer Science, Engineering, or equivalent professional experience.
  • Technical Expertise:
    • Extensive hands-on experience with at least one of the cloud providers AWS or GCP.
    • Proven ability to design and implement Infrastructure-as-Code (IaC) using tools such as Terraform.
    • Proficiency in scripting and automation (e.g., Python, Bash, Go) to streamline operations and reduce manual tasks.
    • Solid understanding of Linux systems, containerization (Docker), and orchestration platforms (Kubernetes, ECS, or similar).
  • Nice to Have:
    • Experience operating ML inference or training infrastructure at scale.
    • Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo Workflows).
  • Operational Excellence:
    • Experience building and operating highly available, reliable, and scalable systems in production environments.
    • Strong background in monitoring, observability, and incident response, with tools such as Prometheus, Grafana, Datadog, ELK, or similar.
    • Knowledge of security best practices, including identity and access management, secrets management, compliance, and secure system design.
  • Collaboration & Leadership:
    • Demonstrated ability to work effectively in cross-functional teams, partnering with product engineers, security, and data teams.
    • Strong communication skills with the ability to explain complex technical concepts clearly to both technical and non-technical audiences.
  • Problem-Solving & Adaptability:
    • Track record of diagnosing and resolving complex infrastructure issues under pressure.
    • Ability to balance short-term fixes and long-term architectural improvements.
    • Proactive and curious mindset, with a drive for continuous improvement and innovation.

About the company

At Graswald AI, we're building the AI operating system for fashion brands and retailers, starting with AI image and content generation. Our engineering team tackles rapid scaling challenges, GPU-intensive workloads, and enterprise-grade infrastructure to deliver fast, pixel-perfect results for global brands. In the past year alone, we've signed 50 enterprise fashion clients who rely on us to reduce costs and accelerate creative timelines. Backed by leading VCs and strategic investors including Lakestar and Orendt Studios, and preparing for a Series A later this year, we're growing fast and building technology that is already reshaping how fashion content is produced.
