ML · Hard · Design tradeoff · ~15m
LLM Context Window Memory Leak
Problem
Your production LLM chat system supports 128k token context windows and maintains conversation history. Users report that responses become increasingly irrelevant and repetitive after 20-30 exchanges, but individual response quality remains high when tested in isolation. Memory usage grows linearly with conversation length, and you're hitting GPU memory limits during peak hours. Your current approach truncates oldest messages when approaching the context limit. Design a solution that maintains conversation coherence while staying within computational constraints.
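For concreteness, here is a minimal sketch of the truncation baseline the problem describes, not the reference solution. The `Message` format and `count_tokens` helper are illustrative assumptions; a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: a whitespace split only approximates the
    # model's real (e.g. BPE) token count.
    return len(text.split())

def truncate_oldest(history: list[Message], budget: int) -> list[Message]:
    """Baseline from the problem statement: drop the oldest messages
    until the remaining history fits within the token budget.

    Failure mode: early context (user goals, constraints, established
    facts) is discarded wholesale, so later responses drift and repeat
    even though each individual response looks fine in isolation."""
    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest, keeping messages while budget allows.
    for msg in reversed(history):
        cost = count_tokens(msg.content)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Your design should explain what replaces this hard cutoff and how it bounds both context length and GPU memory.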
Reference solution
Reference solution available after you attempt the question.
Ready to solve it?
Start a session on Mockbit #95 (mockbit.io/q/95). You'll receive a grade and specific critique when you submit.