2 Comments
Jun 25

When running large-model inference, we need to focus on three aspects:

1. Memory (VRAM): The amount of memory required is determined mainly by the number of model parameters and the precision at which they are stored.

2. Bandwidth: Large-model inference is memory-bound; generating each token requires streaming the model weights from VRAM, so memory bandwidth largely determines inference speed.

3. Quantization: Many models now ship quantized versions alongside the standard FP16 weights. Lower-precision quantization saves VRAM and reduces the data that must be read per token, which is why it is such a common technique in model inference.

Cloud-based inference optimization also revolves around these three aspects; a rough back-of-the-envelope sketch of how they interact follows below.
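
To make the three aspects concrete, here is a minimal sketch of the usual back-of-the-envelope math. The model size (70B parameters) and GPU memory bandwidth (~2,000 GB/s) are hypothetical example numbers, not figures from the article, and the estimate ignores the KV cache and activations:

```python
# Rough estimate of weight VRAM footprint and the bandwidth-bound ceiling
# on decoding speed. All numbers below are illustrative assumptions.

def estimate(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float):
    """Return (weight VRAM in GB, rough upper bound on tokens/sec)."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes, expressed in GB
    # During autoregressive decoding, every new token must stream all weights
    # from VRAM at least once, so bandwidth / weight-size caps tokens/sec.
    tokens_per_sec = bandwidth_gb_s / weight_gb
    return weight_gb, tokens_per_sec

# Hypothetical 70B-parameter model on a GPU with ~2,000 GB/s memory bandwidth.
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    vram, tps = estimate(70, bytes_per_param, 2000)
    print(f"{label}: ~{vram:.0f} GB of weights, <= ~{tps:.0f} tokens/s")
```

The point of the sketch is that quantization improves all three axes at once: fewer bytes per parameter mean less VRAM and less data moved per token, which raises the bandwidth-bound throughput ceiling.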

Author

Correct. But this was more about training the models for a 1M-token context.
