docs/neox/evaluation/llm_int8.html
This code evaluates GPT-NeoX using LLM.int8() quantization on a suite of tasks.
import torch
from torch import nn

from labml import monit
from labml_nn.neox.evaluation import run_eval_harness
from labml_nn.neox.model import LayerGenerator
def main():
Device
    device = torch.device('cuda:0')
Load the layers in float16 onto the CPU. We convert the layers to int8 later, because doing the conversion on the fly after loading the layers onto the GPU causes CUDA memory fragmentation (about 3GB of memory can be lost to fragmentation).
    layer_generator = LayerGenerator(is_clone_layers=True,
                                     dtype=torch.float16,
                                     device=torch.device('cpu'),
                                     )
Load layers
    layers = list(layer_generator.load())
This reduces CUDA memory fragmentation
    for layer in monit.iterate('Convert to int8', layers, is_children_silent=True):
        layer_generator.post_load_prepare(layer,
                                          device=device,
                                          is_llm_int8=True,
                                          llm_int8_threshold=6.0,
                                          )
        layer.to(device)
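The `llm_int8_threshold=6.0` argument controls the mixed-precision decomposition at the heart of LLM.int8(): input feature columns whose magnitudes exceed the threshold are treated as outliers and kept in higher precision, while everything else goes through an int8 matrix multiply. The sketch below illustrates that idea with plain PyTorch and absmax quantization; it is a simplified illustration, not the actual bitsandbytes kernels used by `post_load_prepare`.

```python
import torch


def int8_matmul_decompose(x: torch.Tensor, weight: torch.Tensor,
                          threshold: float = 6.0) -> torch.Tensor:
    """Simplified sketch of the LLM.int8() matmul decomposition."""
    # Input columns whose absolute maximum exceeds the threshold are
    # outliers; keep them (and the matching weight rows) in full precision
    outlier_mask = x.abs().amax(dim=0) > threshold
    x_out, w_out = x[:, outlier_mask], weight[outlier_mask, :]
    x_in, w_in = x[:, ~outlier_mask], weight[~outlier_mask, :]

    if x_in.shape[1] == 0:
        # Every column is an outlier; nothing to quantize
        return x_out @ w_out

    # Absmax quantization: scale so the largest magnitude maps to 127
    scale_x = x_in.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.
    scale_w = w_in.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.
    q_x = (x_in / scale_x).round().to(torch.int8)
    q_w = (w_in / scale_w).round().to(torch.int8)

    # Integer matmul accumulated in int32, then dequantized back to float
    int8_part = (q_x.to(torch.int32) @ q_w.to(torch.int32)).to(x.dtype) * scale_x * scale_w
    # Add the outlier contribution computed in the original precision
    return int8_part + x_out @ w_out
```

For well-behaved activations the result stays close to the full-precision `x @ weight`; raising the threshold pushes more columns through the int8 path at the cost of accuracy on outlier features.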
Create nn.Sequential model
    model = nn.Sequential(*layers)
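Wrapping the layer list in `nn.Sequential` works because each NeoX layer maps hidden states to hidden states, so the layers can simply be chained. A minimal sketch with hypothetical stand-in modules (far smaller than real transformer layers):

```python
import torch
from torch import nn

# Stand-in layers: nn.Sequential unpacks the list and runs the
# activations through each module in order, just like the NeoX stack
layers = [nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4)]
model = nn.Sequential(*layers)

out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])
```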
Run the evaluation harness and print the results
    print(run_eval_harness(model, 'half_precision', [], device))
if __name__ == '__main__':
    main()