
Evaluate GPT-NeoX with half precision (float16) on test suite


This code evaluates GPT-NeoX, with its parameters in float16, on a suite of tasks.

```python
import argparse

import torch
from torch import nn

from labml_nn.neox.evaluation import run_eval_harness
from labml_nn.neox.model import LayerGenerator
```

```python
def main():
```

Argument parser

```python
    parser = argparse.ArgumentParser()

    parser.add_argument("--flash", action='store_true', help="whether to use Flash Attention")

    opt = parser.parse_args()
```
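The `--flash` flag uses argparse's `store_true` action, so it defaults to `False` and flips to `True` only when the flag is passed. A standalone sketch of that behaviour:

```python
import argparse

# Rebuild just the flag from the script above, in isolation.
parser = argparse.ArgumentParser()
parser.add_argument("--flash", action='store_true', help="whether to use Flash Attention")

# Passing explicit argument lists avoids reading sys.argv.
print(parser.parse_args([]).flash)           # False: flag omitted
print(parser.parse_args(["--flash"]).flash)  # True: flag given
```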

Device

```python
    device = torch.device('cuda:0')
```

Load layers

```python
    layers = list(LayerGenerator(is_clone_layers=True,
                                 filter_layers=None,
                                 dtype=torch.float16,
                                 device=device,
                                 is_flash_attention=opt.flash,
                                 ).load())
```
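`LayerGenerator.load()` yields the transformer layers one at a time, and (judging by the `is_clone_layers` argument) can clone an already-built template layer rather than constructing every block from scratch. A rough pure-Python sketch of that lazy pattern, with purely illustrative names that are not the `labml_nn` API:

```python
def layer_generator(n_layers, clone=True):
    """Yield n_layers layer objects lazily (hypothetical stand-in)."""
    template = {"weights": [0.0] * 4}  # stand-in for a constructed layer
    for i in range(n_layers):
        # Cloning a template skips repeating expensive construction work;
        # each clone still gets its own parameters loaded afterwards.
        layer = dict(template) if clone else {"weights": [0.0] * 4}
        layer["index"] = i
        yield layer

# Materialize the generator, as the script does with list(...).
layers = list(layer_generator(3))
print(len(layers))  # 3
```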

Create `nn.Sequential` model

```python
    model = nn.Sequential(*layers)
```
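Wrapping the layers in `nn.Sequential` works because each layer's output is the next layer's input; the whole model is then a single callable. Conceptually (a pure-Python sketch, no torch):

```python
class Sequential:
    """Minimal stand-in for nn.Sequential: call layers in order."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # feed each output into the next layer
        return x

model = Sequential(lambda x: x + 1, lambda x: x * 2)
print(model(3))  # (3 + 1) * 2 = 8
```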

Run evaluation harness

```python
    print(run_eval_harness(model, 'half_precision', ['lambada'], device))
```
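The LAMBADA task scores a model on predicting the final word of a passage, reported as plain accuracy. A toy sketch of that metric, where the stub model is hypothetical and stands in for GPT-NeoX:

```python
def last_word_accuracy(model, passages):
    """Fraction of passages whose final word the model predicts correctly."""
    correct = 0
    for context, target in passages:
        if model(context) == target:
            correct += 1
    return correct / len(passages)

# Stub model that always predicts "mat" (illustrative only).
stub = lambda context: "mat"
data = [("the cat sat on the", "mat"),
        ("he opened the", "door")]
print(last_word_accuracy(stub, data))  # 0.5
```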

```python
if __name__ == '__main__':
    main()
```

labml.ai