Data and AI

Institut Polytechnique de Paris
6 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Tech stack

Encodings
Computer Engineering
Large Language Models
Information Technology

Job description

  1. Embedding transformation: Given a prompt x, we first transform it into an embedding vector v = E(x) using a pretrained embedding model.
  2. Noise addition: Noise is added directly in the embedding space according to a Laplace-like distribution, consistent with metric privacy.
  3. Reconstruction: The noisy embedding is then decoded back into a sanitized prompt. This reconstruction process aims to preserve as much semantic content as possible while ensuring the targeted privacy guarantee.
  4. Privacy budget management and reuse: Sanitized prompts may be reused or combined with information from other sanitized prompts. By the post-processing property of metric privacy, these operations do not affect privacy guarantees. In addition, a user-facing monitoring component will indicate when a cumulative privacy threshold is approached, helping users manage their interactions while maintaining privacy.
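The four steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the multivariate Laplace-like mechanism (uniform direction on the sphere, Gamma-distributed magnitude) is the standard instantiation of metric privacy for the Euclidean distance, and the toy vocabulary with nearest-neighbour decoding stands in for a real pretrained embedding model E and its reconstruction decoder; all names here are illustrative.

```python
import numpy as np

def metric_dp_noise(dim, epsilon, rng):
    # Sample noise with density p(z) proportional to exp(-epsilon * ||z||_2):
    # uniform direction on the unit sphere, magnitude ~ Gamma(dim, 1/epsilon).
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return magnitude * direction

# Toy stand-in for the pretrained embedding model E (hypothetical vocabulary).
VOCAB = {"hello": np.array([1.0, 0.0]), "world": np.array([0.0, 1.0])}

def embed(word):
    return VOCAB[word]

def decode(noisy_embedding):
    # Nearest-neighbour reconstruction back into discrete word space.
    return min(VOCAB, key=lambda w: np.linalg.norm(VOCAB[w] - noisy_embedding))

def sanitize(word, epsilon, rng):
    v = embed(word)                                        # step 1: embedding
    noisy = v + metric_dp_noise(v.shape[0], epsilon, rng)  # step 2: noise
    return decode(noisy)                                   # step 3: reconstruction
```

Larger epsilon means less noise and a higher chance that the sanitized prompt decodes back to the original; the privacy guarantee degrades accordingly.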

Challenges

  1. Reconstruction from embeddings: While the pipeline transforms prompts to embeddings and back, only a few works have considered the reverse process from embedding space to discrete word space [7, 8]. Moreover, none have addressed reconstruction from a noisy embedding representation. It is therefore necessary to fine-tune the pretrained embedding model so that it can both encode prompts and reconstruct them from noisy embeddings while preserving as much semantic content as possible. Importantly, this fine-tuning should be performed on public data to avoid extra privacy costs.

  2. Choice of distance metric: A central component of metric privacy is the distance function, which determines how much noise is injected and how privacy guarantees propagate through the embedding space. The metric must meaningfully reflect semantic similarity while matching the geometry of embeddings. A poorly chosen metric may either destroy utility (too much noise) or weaken privacy (too little noise). Identifying and validating an appropriate distance is therefore a critical challenge.
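One concrete constraint on the metric choice: for L2-normalized embeddings, squared Euclidean distance and cosine distance are related by a fixed monotone transformation, so the two induce the same nearest-neighbour ordering even though they calibrate noise differently. A minimal check of this identity:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=128)
u /= np.linalg.norm(u)
v = rng.normal(size=128)
v /= np.linalg.norm(v)

# For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v)).
euclid_sq = float(np.sum((u - v) ** 2))
cosine_dist = 1.0 - float(u @ v)
assert np.isclose(euclid_sq, 2.0 * cosine_dist)
```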

  3. Utility improvement: Even though metric privacy allows for a better optimization of the privacy-utility tradeoff compared to classical differential privacy, a noticeable utility gap remains. Noisy embeddings inevitably introduce semantic drift, and the reconstructed prompts may still lose important information. Improving utility thus requires exploring better noise distributions, improved decoding strategies, structure-aware distances, or regularization schemes that make the embedding space more robust to perturbations. Designing such enhancements while preserving formal privacy guarantees is a significant and open problem.

  4. Privacy budget management: As users interact with the model, more sensitive information is shared, which increases the privacy budget and weakens privacy guarantees. A key challenge is how to optimally reuse sanitized prompts, either as they are or by combining information from multiple sanitized prompts, so as to maximize utility and reduce redundant noise injection, while ensuring that the cumulative privacy budget is not exceeded. Designing strategies for effective reuse and combination of sanitized prompts, and for alerting users when a threshold is approached, is critical for enabling more interactions with the LLM without compromising privacy.
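The budget-tracking behaviour described above can be sketched under basic sequential composition, where the epsilons of fresh sanitizations add up while post-processing of already-sanitized prompts is free. The class and method names below are hypothetical, chosen only to illustrate the monitoring component:

```python
class PrivacyBudgetMonitor:
    """Tracks cumulative privacy loss under basic sequential composition
    and flags when a user-defined warning threshold is approached."""

    def __init__(self, total_budget: float, warn_fraction: float = 0.8):
        self.total_budget = total_budget
        self.warn_fraction = warn_fraction
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        # A fresh sanitization consumes budget; refuse it if the total
        # would be exceeded. Returns True once the warning threshold is hit.
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("cumulative privacy budget would be exceeded")
        self.spent += epsilon
        return self.spent >= self.warn_fraction * self.total_budget

    def reuse_sanitized(self) -> bool:
        # Reusing or combining already-sanitized prompts is post-processing
        # and costs no additional budget; only report the warning state.
        return self.spent >= self.warn_fraction * self.total_budget
```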

Requirements

Master's degree in Computer Science or Computer Engineering, and fluency in English.
