2 Comments
Meng Li:

When running large-model inference, we need to focus on three aspects:

1. Memory (VRAM): The amount of VRAM required is determined primarily by the number of model parameters and their precision.

2. Bandwidth: Large-model inference is a memory-bound workload that repeatedly streams the weights from VRAM, so memory bandwidth largely determines inference speed.

3. Quantization: Many models now ship quantized versions alongside the standard FP16 weights. Lower-precision quantization saves VRAM and reduces memory traffic, which is a common technique for speeding up inference.

Cloud-based inference optimization also revolves around these three aspects (a rough back-of-the-envelope estimate is sketched below).
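A minimal sketch of these rules of thumb, assuming a memory-bound decode phase: weight memory is roughly parameter count times bytes per parameter, and an upper bound on decoding speed is memory bandwidth divided by the weight size. The model size and bandwidth figures below are hypothetical examples, not numbers from the article.

```python
# Back-of-the-envelope estimates for LLM inference (a sketch, not a benchmark).
# All inputs below are illustrative assumptions, not figures from the article.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed just for the model weights."""
    return num_params * bytes_per_param / 1e9

def max_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model:
    each generated token must stream all weights from VRAM once."""
    return bandwidth_gb_s / weight_gb

params = 7e9        # hypothetical 7B-parameter model
bandwidth = 1000.0  # hypothetical GPU memory bandwidth in GB/s

for name, bytes_pp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    mem = weight_memory_gb(params, bytes_pp)
    tps = max_tokens_per_sec(mem, bandwidth)
    print(f"{name}: ~{mem:.1f} GB of weights, <= ~{tps:.0f} tokens/s (bandwidth bound)")
```

The example also shows why quantization helps twice: INT4 weights take a quarter of the VRAM of FP16 and, since less data must be streamed per token, the bandwidth-bound token rate roughly quadruples.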

Ben Dickson:

Correct. But this was more about training the models for a 1M-token context.
