When conducting large model inference, we need to focus on three aspects:
1. Memory (VRAM): The amount of memory required is determined by the size of the model parameters.
2. Bandwidth: Large model inference is a memory-bound workload that repeatedly streams weights from VRAM, so memory bandwidth largely determines inference speed.
3. Quantization: Many models now ship quantized versions alongside the standard FP16 weights. Lower-precision quantization reduces VRAM usage and the amount of data moved per token, making it a common inference optimization.
Cloud-based inference optimization also revolves around these three aspects.
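The interplay of these three factors can be sketched with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the model size (7B) and bandwidth figure (~900 GB/s) are hypothetical, and it ignores KV cache, activations, and framework overhead.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM needed just for the weights
    (ignores KV cache, activations, and runtime overhead)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

def peak_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound for single-stream decoding: each generated
    token must stream all weights from VRAM once, so
    speed <= bandwidth / weight size."""
    return bandwidth_gb_s / weight_gb

# Hypothetical 7B-parameter model on a GPU with ~900 GB/s bandwidth.
for bits in (16, 8, 4):
    mem = weight_memory_gb(7, bits)
    print(f"{bits:>2}-bit: {mem:5.1f} GB weights, "
          f"~{peak_tokens_per_sec(mem, 900):.0f} tok/s upper bound")
```

Note how halving the precision both halves the VRAM footprint and roughly doubles the decoding speed ceiling, which is why quantization attacks all three aspects at once.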
Correct. But this was more about training the models for 1M-token context.