
Evaluate GPT-NeoX using LLM.int8() quantization on test suite

This code evaluates GPT-NeoX with LLM.int8() quantization on a suite of tasks.
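As background, LLM.int8() decomposes each matrix multiplication: feature columns that contain an activation outlier (magnitude above a threshold; 6.0 is used below) are multiplied in higher precision, while the remaining columns are quantized to int8 with absmax scaling. The following is a minimal sketch of that decomposition, not the actual bitsandbytes kernels; the function name and scaling details are illustrative:

```python
import torch

def llm_int8_matmul_sketch(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    """Illustrative LLM.int8() decomposition of x @ w.

    Outlier columns go through a float matmul; the rest are quantized
    to int8 with absmax scaling. Not the real bitsandbytes implementation.
    """
    # Columns of x holding at least one outlier activation stay in high precision
    outliers = (x.abs() > threshold).any(dim=0)
    hi = x[:, outliers] @ w[outliers, :]
    # Quantize the remaining columns with per-row / per-column absmax scales
    x_lo, w_lo = x[:, ~outliers], w[~outliers, :]
    x_scale = x_lo.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_scale = w_lo.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = (x_lo / x_scale).round().to(torch.int8)
    w_q = (w_lo / w_scale).round().to(torch.int8)
    # Integer matmul, then rescale back to float
    lo = (x_q.to(torch.int32) @ w_q.to(torch.int32)).float() * x_scale * w_scale
    return hi + lo
```

With the threshold at 6.0, only the rare outlier feature dimensions take the high-precision path, so almost all of the weight matrix can live on the GPU in int8.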

import torch
from torch import nn

from labml import monit
from labml_nn.neox.evaluation import run_eval_harness
from labml_nn.neox.model import LayerGenerator


def main():


Device

    device = torch.device('cuda:0')


Load layers in float16 onto the CPU. We convert the layers to int8 later, because doing that on the fly after loading them onto the GPU causes CUDA memory fragmentation (about 3GB of memory can be lost to fragmentation).

    layer_generator = LayerGenerator(is_clone_layers=True,
                                     dtype=torch.float16,
                                     device=torch.device('cpu'),
                                     )


Load layers

    layers = list(layer_generator.load())


Convert the layers to int8 and then move them to the GPU. Doing the conversion on the CPU first reduces CUDA memory fragmentation.

    for layer in monit.iterate('Convert to int8', layers, is_children_silent=True):
        layer_generator.post_load_prepare(layer,
                                          device=device,
                                          is_llm_int8=True,
                                          llm_int8_threshold=6.0,
                                          )
        layer.to(device)
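The `llm_int8_threshold=6.0` argument controls how many feature columns escape int8 quantization. A quick way to see its effect on a batch of activations (`outlier_fraction` is a hypothetical helper for illustration, not part of `labml_nn`):

```python
import torch

def outlier_fraction(x: torch.Tensor, threshold: float = 6.0) -> float:
    # Fraction of feature columns with at least one activation whose magnitude
    # exceeds the threshold; these columns stay in float16 under LLM.int8()
    outlier_cols = (x.abs() > threshold).any(dim=0)
    return outlier_cols.float().mean().item()

# Synthetic activations: standard-normal features plus 4 large-scale ones
torch.manual_seed(0)
x = torch.randn(32, 512)
x[:, :4] *= 20.0
print(outlier_fraction(x))  # roughly 4/512: only a few columns skip int8
```

Raising the threshold pushes more columns into int8 (saving memory, risking accuracy on outlier features); lowering it keeps more columns in float16.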


Create nn.Sequential model

    model = nn.Sequential(*layers)


Run evaluation harness

    print(run_eval_harness(model, 'half_precision', [], device))


if __name__ == '__main__':
    main()
