ML · Hard · Design tradeoff · ~15m
LLM Context Window Memory Leak
Problem
Your production LLM chat system supports 128k token context windows and maintains conversation history. Users report that responses become increasingly irrelevant and repetitive after 20-30 exchanges, but individual response quality remains high when tested in isolation. Memory usage grows linearly with conversation length, and you're hitting GPU memory limits during peak hours. Your current approach truncates oldest messages when approaching the context limit. Design a solution that maintains conversation coherence while staying within computational constraints.
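For concreteness, here is a minimal sketch of the truncation baseline the problem describes, not the reference solution. The `Message` format and `count_tokens` helper are illustrative assumptions; a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: a whitespace split only approximates the
    # model's real (e.g. BPE) token count.
    return len(text.split())

def truncate_oldest(history: list[Message], budget: int) -> list[Message]:
    """Baseline from the problem statement: drop the oldest messages
    until the remaining history fits within the token budget.

    Failure mode: early context (user goals, constraints, established
    facts) is discarded wholesale, so later responses drift and repeat
    even though each individual response looks fine in isolation."""
    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest, keeping messages while budget allows.
    for msg in reversed(history):
        cost = count_tokens(msg.content)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Your design should explain what replaces this hard cutoff and how it bounds both context length and GPU memory.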
Reference solution
Reference solution available after you attempt the question.
Ready to solve it?
Start a session on Mockbit #95 (mockbit.io/q/95). You'll receive a grade and specific critique when you submit.