During inference the model outputs token by token. We use this simple cache to store the keys and values of the attention layers, so that we don't have to recompute them for previous tokens.
from typing import Any
This maintains a key-value cache along with named queues. Values pushed to a queue are popped in the same order (first in, first out). The queues are useful since we have multiple attention layers.
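Why first-in, first-out order suits multiple attention layers can be sketched as follows. This is a minimal illustration, not the class documented here: the module-level `queues` dict, the queue name `attn_kv`, the layer count, and the string placeholders standing in for tensors are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the queue part of the cache
queues: Dict[str, List[Any]] = {}


def push(name: str, value: Any) -> None:
    # Create the queue on first use, then append to its tail
    queues.setdefault(name, []).append(value)


def pop(name: str) -> Any:
    # Remove and return the oldest value (head of the queue)
    return queues[name].pop(0)


N_LAYERS = 3  # assumed number of attention layers

# Decoding step t: each layer pushes its (key, value) state in layer order
for layer in range(N_LAYERS):
    push('attn_kv', (f'k{layer}', f'v{layer}'))

# Decoding step t + 1: layers run in the same order,
# so each pop hands a layer back its own state
restored = [pop('attn_kv') for _ in range(N_LAYERS)]
```

Because every layer pushes once and pops once per step, a single shared queue name is enough in this sketch; no per-layer key is needed.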
class Cache:
    def __init__(self):
        self._cache = {}
    def clear_all(self):
        self._cache = {}
name is the name of the queue
value is the value to be pushed
    def push(self, name: str, value: Any):
Create an empty queue if it's not present
        if name not in self._cache:
            self._cache[name] = []
Push to the queue
        self._cache[name].append(value)
name is the name of the queue
Returns the size of the queue if it exists, else None
    def q_size(self, name):
        if name not in self._cache:
            return None

        if type(self._cache[name]) != list:
            return None

        return len(self._cache[name])
name is the name of the queue
Returns the oldest value in the queue and removes it
    def pop(self, name: str):
        return self._cache[name].pop(0)
key is the name of the value to be cached
value is the value
    def set(self, key: str, value: Any):
        self._cache[key] = value
key is the name used when caching
default is the value returned if nothing is cached under key
Returns the cached value
    def get(self, key: str, default: Any = None):
        return self._cache.get(key, default)
key is the name used when caching
    def clear(self, key: str):
        del self._cache[key]
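The methods above are thin wrappers over a dictionary of values and lists, so they inherit the failure modes of those built-ins. The sketch below reproduces the same operations on a plain dict to show which exception each misuse raises. The `store` dict and the `outcome` helper are hypothetical names introduced only for this illustration.

```python
from typing import Any, Callable, Dict

# Plays the role of the cache's internal dictionary in this sketch
store: Dict[str, Any] = {}


def outcome(fn: Callable[[], Any]) -> str:
    """Return the name of the exception raised by `fn`, or 'ok'."""
    try:
        fn()
    except Exception as e:
        return type(e).__name__
    return 'ok'


# Popping from a queue that was never created: the dict lookup fails
missing_queue = outcome(lambda: store['q'].pop(0))

# Popping from a queue that exists but is empty: list.pop fails
store['q'] = []
empty_queue = outcome(lambda: store['q'].pop(0))


# Clearing a key that is not cached: del fails
def delete_absent() -> None:
    del store['absent']


missing_key = outcome(delete_absent)

# A tolerant alternative for clearing (an assumption, not the original code):
# dict.pop with a default never raises for a missing key
store.pop('absent', None)
```

Callers that cannot guarantee a queue exists may want to check `q_size` first, since it returns `None` for a missing queue instead of raising.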
Singleton for cache
_INSTANCE = None
Returns the cache instance
def get_cache() -> Cache:
    global _INSTANCE

    if _INSTANCE is None:
        _INSTANCE = Cache()

    return _INSTANCE
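The practical effect of the singleton is that unrelated parts of the program share state without passing a cache object around. The following sketch repeats a reduced version of the class (only `set` and `get`) so that it runs on its own. The functions `configure` and `layer_forward` and the key `use_cache` are hypothetical, chosen only to illustrate the pattern.

```python
from typing import Any, Dict, Optional


class Cache:
    """Reduced stand-in with only set/get, enough to show the singleton."""

    def __init__(self):
        self._cache: Dict[str, Any] = {}

    def set(self, key: str, value: Any):
        self._cache[key] = value

    def get(self, key: str, default: Any = None):
        return self._cache.get(key, default)


_INSTANCE: Optional[Cache] = None


def get_cache() -> Cache:
    # Create the instance lazily on first call, then reuse it
    global _INSTANCE
    if _INSTANCE is None:
        _INSTANCE = Cache()
    return _INSTANCE


def configure() -> None:
    # Hypothetical setup code that switches caching on
    get_cache().set('use_cache', True)


def layer_forward() -> bool:
    # Hypothetical layer code that reads the flag, defaulting to False
    return get_cache().get('use_cache', False)


before = layer_forward()            # flag not set yet, so the default is used
configure()
after = layer_forward()             # same instance, so the flag is visible
same = get_cache() is get_cache()   # every call returns the one instance
```

The instance is created lazily, so importing the module has no side effects until `get_cache` is first called. This simple form is not guarded by a lock, which matters only if the first call can happen from several threads at once.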