When conducting large model inference, we need to focus on three aspects:
1. Memory (VRAM): The amount of memory required is determined by the size of the model parameters.
2. Bandwidth: Large model inference is a memory-bound workload that repeatedly streams weights from VRAM, so memory bandwidth largely determines inference speed.
3. Quantization: Many models now ship quantized versions alongside the standard FP16 weights. Lower-precision quantization reduces VRAM usage and the amount of data moved per token, making it a common inference optimization.
Cloud-based inference optimization also revolves around these three aspects.
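The interplay of these three factors can be sketched with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the model size (7B) and bandwidth figure (~900 GB/s) are hypothetical, and it ignores KV cache, activations, and framework overhead.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM needed just for the weights
    (ignores KV cache, activations, and runtime overhead)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

def peak_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound for single-stream decoding: each generated
    token must stream all weights from VRAM once, so
    speed <= bandwidth / weight size."""
    return bandwidth_gb_s / weight_gb

# Hypothetical 7B-parameter model on a GPU with ~900 GB/s bandwidth.
for bits in (16, 8, 4):
    mem = weight_memory_gb(7, bits)
    print(f"{bits:>2}-bit: {mem:5.1f} GB weights, "
          f"~{peak_tokens_per_sec(mem, 900):.0f} tok/s upper bound")
```

Note how halving the precision both halves the VRAM footprint and roughly doubles the decoding speed ceiling, which is why quantization attacks all three aspects at once.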
Correct. But this was more about training the models for 1M-token context.