README - Machinelearning

This folder contains the design doc for the GenAI Model package.

Dynamic loading: load only part of the model onto the GPU when GPU memory is limited. We explore the result w/o dynamic loading in this report.
Improve loading speed: I noticed that loading the model from disk into memory is slower in TorchSharp than in Hugging Face. The cause needs to be investigated and the loading speed improved.
Quantization: quantize the model to reduce its size and improve inference speed.
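
To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is only an illustration of the general technique, not the scheme this package uses or plans to use; the function names quantize_int8 and dequantize_int8 and the sample weights are hypothetical. Storing each weight as an 8-bit integer instead of a 32-bit float cuts the storage for those weights to roughly a quarter, at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale.

    Hypothetical helper for illustration; real libraries usually quantize
    per channel or per block rather than per tensor.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when every weight is zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


# Made-up sample weights, not taken from any real model.
weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# The worst-case rounding error per weight is half of one quantization step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(quantized, scale, max_error)
```

The single shared scale keeps the sketch short; it also means one outlier weight widens the step size for every other weight, which is one reason finer-grained scales are common in practice.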