AI Research Engineer (Multi-Modal & Vision) - 100% Remote Worldwide
Job description
As a member of the AI model team, you will drive innovation in training and optimizing vision-language models with a focus on real-world deployment. Your work will span the full model development lifecycle, from data curation and training pipeline design to model evaluation and optimization, with the goal of building models that are both highly capable and practical to deploy at scale.
You will work across a wide spectrum of multimodal architectures integrating text and vision, applying state-of-the-art research to improve model quality, efficiency, and domain-specific performance. We expect you to bring a research-driven mindset combined with strong engineering discipline: someone who can identify the right technique for a given problem, implement it rigorously, and measure its impact clearly.
You will work closely with a small, high-caliber team where your contributions will have direct and meaningful impact. If you are passionate about pushing the boundaries of what multimodal AI can achieve in production environments, this is your opportunity.
Responsibilities
- Conduct end-to-end research and engineering on vision-language models, covering training, evaluation, and optimization across the full model development lifecycle.
- Design and implement post-training pipelines including supervised fine-tuning, knowledge distillation, and reinforcement learning from human feedback.
- Develop and maintain high-quality multimodal datasets, including data curation, filtering, and balancing for domain-specific tasks.
- Drive model efficiency and deployability, adapting models for resource-constrained environments using compression and optimization techniques.
- Design and implement evaluation frameworks and benchmarks to measure model performance, robustness, and real-world task success.
- Build and scale training workflows across distributed GPU infrastructure.
- Identify and resolve bottlenecks in training pipelines to achieve state-of-the-art model quality on target benchmarks.
- Contribute to and leverage open-source ecosystems including models, datasets, and tooling to accelerate development.
- Stay current with the latest research in multimodal learning and vision-language systems, translating relevant findings into practical improvements.
- Publish research findings in top-tier AI conferences and journals where applicable.
Recruitment scams have become increasingly common. To protect yourself, please keep the following in mind when applying for roles:
- Apply only through our official channels. We do not use third-party platforms or agencies for recruitment unless clearly stated. All open roles are listed on our official careers page: https://tether.recruitee.com/
- Verify the recruiter's identity. All our recruiters have verified LinkedIn profiles. If you're unsure, you can confirm their identity by checking their profile or contacting us through our website.
- Be cautious of unusual communication methods. We do not conduct interviews over WhatsApp, Telegram, or SMS. All communication is done through official company emails and platforms.
- Double-check email addresses. All communication from us will come from emails ending in @tether.to or @tether.io.
- We will never request payment or financial details. If someone asks for personal financial information or payment at any point during the hiring process, it is a scam. Please report it immediately.
If you have excellent English communication skills and are ready to contribute to the most innovative platform on the planet, Tether is the place for you.
Are you ready to be part of the future?
Requirements
- Degree in Computer Science, Machine Learning, or a related field; MS/PhD preferred.
- Strong experience with multimodal post-training workflows including supervised fine-tuning, knowledge distillation, and reinforcement learning from feedback.
- Hands-on experience with parameter-efficient fine-tuning and distributed training frameworks.
- Demonstrated ability to build and improve vision-language models with measurable results on standard benchmarks or real-world tasks.
- Experience adapting models for resource-constrained environments.
- Proven open-source contributions in multimodal AI on GitHub or Hugging Face.
- Publications at top AI conferences (NeurIPS, ICML, ICLR, CVPR, ECCV, etc.).