docs/neox/evaluation/llm_int8.html
This code evaluates GPT-NeoX using LLM.int8() quantization on a suite of tasks.
import torch
from torch import nn

from labml import monit
from labml_nn.neox.evaluation import run_eval_harness
from labml_nn.neox.model import LayerGenerator
def main():
Device
    device = torch.device('cuda:0')
Load the layers in float16 onto the CPU. We convert the layers to int8 later, because doing the conversion on the fly after loading the layers onto the GPU causes CUDA memory fragmentation (about 3GB of memory can be lost to fragmentation).
    layer_generator = LayerGenerator(is_clone_layers=True,
                                     dtype=torch.float16,
                                     device=torch.device('cpu'),
                                     )
Load layers
    layers = list(layer_generator.load())
This reduces CUDA memory fragmentation
    for layer in monit.iterate('Convert to int8', layers, is_children_silent=True):
        layer_generator.post_load_prepare(layer,
                                          device=device,
                                          is_llm_int8=True,
                                          llm_int8_threshold=6.0,
                                          )
        layer.to(device)
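The `llm_int8_threshold=6.0` argument controls the mixed-precision decomposition at the heart of LLM.int8(): input feature columns whose magnitudes exceed the threshold are treated as outliers and kept in higher precision, while everything else goes through an int8 matrix multiply. The sketch below illustrates that idea with plain PyTorch and absmax quantization; it is a simplified illustration, not the actual bitsandbytes kernels used by `post_load_prepare`.

```python
import torch


def int8_matmul_decompose(x: torch.Tensor, weight: torch.Tensor,
                          threshold: float = 6.0) -> torch.Tensor:
    """Simplified sketch of the LLM.int8() matmul decomposition."""
    # Input columns whose absolute maximum exceeds the threshold are
    # outliers; keep them (and the matching weight rows) in full precision
    outlier_mask = x.abs().amax(dim=0) > threshold
    x_out, w_out = x[:, outlier_mask], weight[outlier_mask, :]
    x_in, w_in = x[:, ~outlier_mask], weight[~outlier_mask, :]

    if x_in.shape[1] == 0:
        # Every column is an outlier; nothing to quantize
        return x_out @ w_out

    # Absmax quantization: scale so the largest magnitude maps to 127
    scale_x = x_in.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.
    scale_w = w_in.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.
    q_x = (x_in / scale_x).round().to(torch.int8)
    q_w = (w_in / scale_w).round().to(torch.int8)

    # Integer matmul accumulated in int32, then dequantized back to float
    int8_part = (q_x.to(torch.int32) @ q_w.to(torch.int32)).to(x.dtype) * scale_x * scale_w
    # Add the outlier contribution computed in the original precision
    return int8_part + x_out @ w_out
```

For well-behaved activations the result stays close to the full-precision `x @ weight`; raising the threshold pushes more columns through the int8 path at the cost of accuracy on outlier features.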
Create nn.Sequential model
    model = nn.Sequential(*layers)
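Wrapping the layer list in `nn.Sequential` works because each NeoX layer maps hidden states to hidden states, so the layers can simply be chained. A minimal sketch with hypothetical stand-in modules (far smaller than real transformer layers):

```python
import torch
from torch import nn

# Stand-in layers: nn.Sequential unpacks the list and runs the
# activations through each module in order, just like the NeoX stack
layers = [nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4)]
model = nn.Sequential(*layers)

out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])
```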
Run the evaluation harness and print the results
    print(run_eval_harness(model, 'half_precision', [], device))
if __name__ == '__main__':
    main()