view asyncio_threads/inference/README.md @ 64:a30944e5719e
Added a vibe-coded Markdown-to-HTML script since it is useful for me. Updated Dowa so that it can be compiled without dirent on Windows.
| author | June Park <parkjune1995@gmail.com> |
|---|---|
| date | Tue, 23 Dec 2025 15:18:46 -0800 |
| parents | 46daba6e3cf4 |
| children |
# Inference Questions

## Context

You are tasked with building a simplified inference engine component responsible for handling incoming user requests for a large language model (LLM). To optimize throughput and GPU utilization, the engine must batch multiple requests together, run the inference call once per batch, and then deconstruct the results to return token-level output to the individual users.

## Objective

Complete the provided Python class, `BatchInferenceEngine`, by implementing the methods necessary to:

1. Queue incoming user requests.
2. Process a batch when the queue reaches a defined batch size.
3. Simulate the token-level output from an LLM and correctly associate each generated token with its original request.

## Task Requirements

1. Implement the logic for `enqueue_request`.
2. Implement the logic for `_process_batch`.
3. Demonstrate the usage by creating 7 unique requests and enqueueing them one by one. Show the state of the queue and the processed tokens after each batch run.
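
The requirements above could be approached along the following lines. This is only a minimal sketch, not the provided class skeleton (which is not shown here). The `Request` dataclass, the `batch_size=3` default, the `processed_tokens` dictionary, and the `_simulate_llm` helper are all assumptions made for illustration. The simulated model simply echoes each prompt's words in upper case, interleaved across the batch one decoding step at a time, so that `_process_batch` has something to untangle.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request type; the real skeleton may use something else."""
    request_id: int
    prompt: str


class BatchInferenceEngine:
    def __init__(self, batch_size=3):  # batch size of 3 is an assumption
        self.batch_size = batch_size
        self.queue = deque()           # pending requests, oldest first
        self.processed_tokens = {}     # request_id -> list of generated tokens

    def enqueue_request(self, request):
        """Queue a request; run a batch as soon as the queue is full.

        Returns the batch results if a batch ran, otherwise None.
        """
        self.queue.append(request)
        if len(self.queue) >= self.batch_size:
            return self._process_batch()
        return None

    def _simulate_llm(self, prompts):
        """Stand-in for one batched model call (hypothetical helper).

        Returns a flat stream of (batch_index, token) pairs, interleaved
        step by step the way a batched decoder would emit them.
        """
        per_request = [prompt.split() for prompt in prompts]
        steps = max((len(tokens) for tokens in per_request), default=0)
        stream = []
        for step in range(steps):
            for idx, tokens in enumerate(per_request):
                if step < len(tokens):
                    stream.append((idx, tokens[step].upper()))
        return stream

    def _process_batch(self):
        """Pop up to batch_size requests, run one call, split the output."""
        size = min(self.batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(size)]
        if not batch:
            return {}
        stream = self._simulate_llm([request.prompt for request in batch])
        results = {request.request_id: [] for request in batch}
        for idx, token in stream:
            # batch_index maps each token back to the request that owns it
            results[batch[idx].request_id].append(token)
        self.processed_tokens.update(results)
        return results


# Demonstration: 7 requests, enqueued one by one.
engine = BatchInferenceEngine(batch_size=3)
for i in range(1, 8):
    results = engine.enqueue_request(Request(i, f"prompt number {i}"))
    if results is not None:
        print("batch processed:", results)
    print("queued ids:", [request.request_id for request in engine.queue])
```

With a batch size of 3, batches run after the third and sixth requests, and the seventh stays in the queue. Because `_process_batch` also accepts a partial batch in this sketch, calling it directly is one way to flush the remainder; a real engine would more likely use a timeout for that.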