docs/gen-ai/DynamicLoading.md
Dynamic loading is a technique for running inference on a very large model on a machine with limited GPU memory. The idea is to load only part of the model into GPU memory and run inference on that part. Once that part has finished, it is released from GPU memory and the next part is loaded. This process is repeated until the whole model has been processed.
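The load, run, release cycle can be sketched in plain Python. This is a minimal simulation, not the GenAI implementation: `Layer`, `SimulatedGpu`, and `run_with_dynamic_loading` are hypothetical names, the "GPU" is only a byte counter, and each layer is a toy affine function. It shows that a model whose total size exceeds the device budget can still be run as long as each part fits on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Toy layer computing y = scale * x + bias; stands in for a real model block."""
    name: str
    scale: float
    bias: float
    size_bytes: int

    def forward(self, x):
        return [self.scale * v + self.bias for v in x]


@dataclass
class SimulatedGpu:
    """Hypothetical stand-in for GPU memory; it only tracks a byte budget."""
    capacity_bytes: int
    used_bytes: int = 0
    peak_bytes: int = 0
    resident: dict = field(default_factory=dict)

    def load(self, layer):
        # Refuse to load a part that does not fit in the remaining budget.
        if self.used_bytes + layer.size_bytes > self.capacity_bytes:
            raise MemoryError(f"not enough memory for {layer.name}")
        self.resident[layer.name] = layer
        self.used_bytes += layer.size_bytes
        self.peak_bytes = max(self.peak_bytes, self.used_bytes)

    def release(self, layer):
        del self.resident[layer.name]
        self.used_bytes -= layer.size_bytes


def run_with_dynamic_loading(layers, gpu, x):
    """Load one part, run it, release it, then move on to the next part."""
    for layer in layers:
        gpu.load(layer)
        try:
            x = layer.forward(x)
        finally:
            # Always free the part, even if forward() raises.
            gpu.release(layer)
    return x


# Three 600-byte parts (1800 bytes in total) on a 1000-byte device.
layers = [
    Layer("block0", scale=2.0, bias=1.0, size_bytes=600),
    Layer("block1", scale=0.5, bias=0.0, size_bytes=600),
    Layer("block2", scale=1.0, bias=-1.0, size_bytes=600),
]
gpu = SimulatedGpu(capacity_bytes=1000)
output = run_with_dynamic_loading(layers, gpu, [1.0, 2.0])
print(output, gpu.peak_bytes)
```

In this sketch the peak usage equals the size of the largest single part rather than the size of the whole model. A real implementation also pays for the host-to-device transfer on every load, which the simulation does not model.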
The technique is available in both llama.cpp and Hugging Face Accelerate. The GenAI model package should support it as well.
An experiment with partial loading was carried out in PR #10. The main takeaways are
The results can be found in this report.