# Microsoft.ML.GenAI.Phi

Torchsharp implementation of Microsoft phi-series models for GenAI.
## Supported models

The phi series of models is supported and tested; the examples below use Phi-3-mini-4k-instruct.
## Getting started

### Download the model weights from Hugging Face

```bash
## make sure you have git lfs installed
git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
```
### Load the model, tokenizer and pipeline

```csharp
var weightFolder = "/path/to/Phi-3-mini-4k-instruct";
var configName = "config.json";
var config = JsonSerializer.Deserialize<Phi3Config>(File.ReadAllText(Path.Combine(weightFolder, configName)));
var model = new Phi3ForCausalLM(config);

// load the tokenizer
var tokenizerModelName = "tokenizer.model";
var tokenizer = Phi3TokenizerHelper.FromPretrained(Path.Combine(weightFolder, tokenizerModelName));

// load the model weights
model.LoadSafeTensors(weightFolder);

// initialize the device
var device = "cuda";
if (device == "cuda")
{
    torch.InitializeDeviceType(DeviceType.CUDA);
}

// create the causal language model pipeline
var pipeline = new CausalLMPipeline<Tokenizer, Phi3ForCausalLM>(tokenizer, model, device);
```
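You can also prompt the pipeline directly, without going through Semantic Kernel or an agent. A minimal sketch, assuming a `Generate(prompt, maxLen)` overload on `CausalLMPipeline` (the overload and its parameters are an assumption here; check the Microsoft.ML.GenAI.Core API surface) and the Phi-3 chat template:

```csharp
// hypothetical direct call; the Generate overload shown here is an assumption
var prompt = "<|user|>\nWhat is 1+1?<|end|>\n<|assistant|>";
var output = pipeline.Generate(prompt, maxLen: 256);
Console.WriteLine(output);
```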
### Add IChatCompletionService to Semantic Kernel

```csharp
var kernel = Kernel.CreateBuilder()
    .AddGenAIChatCompletion(pipeline)
    .Build();

var chatService = kernel.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddSystemMessage("you are a helpful assistant");
chatHistory.AddUserMessage("write a C# program to calculate the factorial of a number");

// stream the response as it is generated
await foreach (var response in chatService.GetStreamingChatMessageContentsAsync(chatHistory))
{
    Console.Write(response);
}
```
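If you don't need streaming, `IChatCompletionService` also supports a blocking call through the standard Semantic Kernel extension method:

```csharp
// wait for the full response instead of streaming it
var reply = await chatService.GetChatMessageContentAsync(chatHistory);
Console.WriteLine(reply);
```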
### Create a Phi3Agent from the pipeline

```csharp
var agent = new Phi3Agent(pipeline, name: "assistant")
    .RegisterPrintMessage();

var task = """
    write a C# program to calculate the factorial of a number
    """;

await agent.SendAsync(task);
```
## More examples

Please refer to Microsoft.ML.GenAI.Samples for more examples.
## Dynamic loading

It's recommended to run model inference on a GPU; Phi-3-mini-4k-instruct requires at least 8 GB of GPU memory when fully loaded. If you don't have that much GPU memory, you can dynamically load the model weights instead: layers that don't fit are kept in CPU memory or on disk and are only moved to the GPU while they are needed. Here is how to enable dynamic loading:
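To decide at runtime whether you need dynamic loading at all, you can first check whether a CUDA device is visible. A minimal sketch using TorchSharp's `torch.cuda.is_available()`:

```csharp
// fall back to CPU when no CUDA device is present
var device = torch.cuda.is_available() ? "cuda" : "cpu";
```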
### InferDeviceMapForEachLayer API

You can infer the placement of each layer with the InferDeviceMapForEachLayer API. It returns a key-value dictionary where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu") the layer will be placed on, given the memory budget of each device.
```csharp
// manually set up the available memory on each device
// (modelSizeOnCudaInGB, modelSizeOnMemoryInGB and modelSizeOnDiskInGB are
// placeholders for the budgets you want to allow on each device)
var deviceSizeMap = new Dictionary<string, long>
{
    ["cuda"] = modelSizeOnCudaInGB * 1L * 1024 * 1024 * 1024,
    ["cpu"] = modelSizeOnMemoryInGB * 1L * 1024 * 1024 * 1024,
    ["disk"] = modelSizeOnDiskInGB * 1L * 1024 * 1024 * 1024,
};

var deviceMap = model.InferDeviceMapForEachLayer(
    devices: ["cuda", "cpu", "disk"],
    deviceSizeMapInByte: deviceSizeMap);
```
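To sanity-check the placement before loading anything, you can print the computed map; this assumes only what the section above states, i.e. that `deviceMap` maps layer names to device names:

```csharp
// e.g. "model.layers.0 -> cuda", "model.layers.31 -> cpu"
foreach (var (layerName, deviceName) in deviceMap)
{
    Console.WriteLine($"{layerName} -> {deviceName}");
}
```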
### ToDynamicLoadingModel API

Once the deviceMap is calculated, you can pass it to the ToDynamicLoadingModel API to load the model weights according to that placement.
```csharp
model = model.ToDynamicLoadingModel(deviceMap, "cuda");
```
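After the conversion, the model plugs into `CausalLMPipeline` exactly as in the earlier example; during the forward pass, layers mapped to "cpu" or "disk" are moved to the GPU when they are needed. A sketch reusing the tokenizer from above:

```csharp
// build the pipeline on top of the dynamically loaded model
var pipeline = new CausalLMPipeline<Tokenizer, Phi3ForCausalLM>(tokenizer, model, "cuda");
```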