Serving Llama3 at 3,500 tokens/s

Ethan Petersen, July 16, 2024

Getting Started

Before we jump in, ensure that you have the Crusoe CLI installed and configured to work with your account. We’ll use this tool to provision our resources and tear them down at the end.

Navigate to the root directory of this repository. Then, provision resources with:
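The exact invocation ships with this repository; a minimal sketch is shown below, assuming an L40S instance and the repo's startup script. The flag names and values are illustrative, so confirm them against the README and crusoe compute vms create --help.

```bash
# Illustrative only: VM type and flag names are assumptions; use the exact
# command from the repository README.
crusoe compute vms create \
  --name qserve-llama3 \
  --type l40s-48gb.1x \
  --keyfile ~/.ssh/id_ed25519.pub \
  --startup-script startup.sh
```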

The startup script takes care of creating a filesystem, mounting the disk, and installing dependencies. After creation has completed, ssh into the public IP address shown in the output of crusoe compute vms create.

Once in the VM, check on the startup script's status by running journalctl -u lifecycle-script.service -f. If you see Finished Run lifecycle scripts. at the bottom, then you're ready to proceed. Otherwise, wait until setup has completed. It can take ~10 minutes, as kernels are being compiled for the GPU and large model files are being downloaded.

Benchmarking

After setup has completed, let's run a quick benchmark! Navigate to /workspace/llama3-qserve/qserve and run the below commands:
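The benchmark is driven from QServe's benchmarking script. As a rough sketch of what the invocation looks like (the script name, flags, checkpoint path, and batch-size environment variable are assumptions here, so use the exact commands provided in the repository):

```bash
# Illustrative only: script name, flags, and environment variables are assumptions;
# run the exact commands provided in the repository.
export GLOBAL_BATCH_SIZE=128
python qserve_benchmark.py \
  --model ./qserve_checkpoints/Llama-3-8B-Instruct-QServe \
  --benchmarking \
  --precision w4a8kv4 \
  --group-size -1
```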

This will run a few rounds of benchmarking with a sequence length of 1024, an output length of 512, and a batch size of 128. The throughput is logged to stdout and the results are saved to results.csv. Once completed, you should see Round 2 Throughput: 3568.7845477930728 tokens / second. (though your numbers may differ slightly).

Chat.py

We've included a simple chat script to show how to use the QServe Python library. To use the script, move it into the qserve root directory, then run the below command:
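Something along these lines, where the checkpoint path and engine flags are placeholders for whatever your setup uses:

```bash
# Illustrative only: the model path and engine flags are assumptions; pass the
# same engine arguments you used for benchmarking.
python chat.py \
  --model ./qserve_checkpoints/Llama-3-8B-Instruct-QServe \
  --ifb-mode \
  --precision w4a8kv4 \
  --group-size -1
```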

This will bring up a command-line chat interface; simply type a prompt and hit Enter to send it to the QServe engine. You'll see the assistant's response in stdout and can continue the conversation. Type exit and hit Enter when you want to terminate the script.

Within chat.py, you can see that we begin by parsing the engine arguments, which dictate the model being used, the quantization configuration, and so on.
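A rough sketch of that step, assuming a vLLM-style EngineArgs helper; the exact module paths and method names in QServe may differ.

```python
# Sketch of the argument-parsing step; import paths and helper names are assumptions.
import argparse

from qserve import EngineArgs  # assumed import path

parser = argparse.ArgumentParser(description="QServe chat example.")
parser = EngineArgs.add_cli_args(parser)      # adds --model, --precision, --ifb-mode, ...
args = parser.parse_args()
engine_args = EngineArgs.from_cli_args(args)  # bundle the CLI flags into an engine config
```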

Then, we instantiate the engine.
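Continuing the sketch, with a vLLM-style constructor standing in for QServe's:

```python
from qserve import LLMEngine  # assumed import path

# Build the engine from the parsed arguments; this loads the quantized checkpoint
# and allocates the KV cache.
engine = LLMEngine.from_engine_args(engine_args)
```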

In main, we register a conversation template (in this case, Llama3-8B-Instruct) and configure our sampling parameters.
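Sketched below; the template helper, its import path, and the sampling fields are assumptions rather than QServe's exact API.

```python
from qserve import SamplingParams                   # assumed import path
from qserve.conversation import get_conv_template   # hypothetical helper

# Register the Llama3-8B-Instruct chat template and pick sampling settings.
conv = get_conv_template("llama-3")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
```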

Then, we enter a loop where the bulk of the functionality is defined. To send a request to the engine, we first append the message to our conversation, which takes care of formatting and applying the model's template. By calling get_prompt(), we receive the conversation history in a format the LLM can generate from. Finally, we add the request to the engine, passing a request_id along with the prompt and sampling parameters.
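Roughly, with FastChat-style conversation methods and a vLLM-style add_request signature standing in for QServe's:

```python
import uuid

user_input = input("USER: ")
conv.append_message(conv.roles[0], user_input)  # add the user turn
conv.append_message(conv.roles[1], None)        # leave the assistant turn open
prompt = conv.get_prompt()                      # history formatted with the model's template

request_id = str(uuid.uuid4())                  # any unique identifier works
engine.add_request(request_id, prompt, sampling_params)
```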

If ifb_mode is on, the engine will automatically schedule and pack requests for continuous/in-flight batching with no changes to the code. For this single-user application you won't notice a difference, but it is a drastic improvement when serving multiple concurrent users.

To progress the engine, we call engine.step() and log the current outputs. We then check their status to see if any have finished. If we were serving concurrent users, we would use the request identifier to match results and route them back to the correct user.
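A sketch of that loop, with vLLM-style attribute names standing in for QServe's output objects:

```python
while engine.has_unfinished_requests():
    step_outputs = engine.step()        # advance generation by one scheduler step
    for output in step_outputs:
        if output.finished:
            # With concurrent users, output.request_id would be used here to route
            # the completion back to the right conversation.
            print(f"ASSISTANT: {output.outputs[0].text}")
```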

Clean Up

To clean up the resources used, run the below commands:
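The delete commands mirror the create step; the resource names below are placeholders for whatever you chose at creation time, and the exact subcommands should be confirmed with crusoe --help.

```bash
# Illustrative only: substitute the names of your VM and disk.
crusoe compute vms delete qserve-llama3
crusoe storage disks delete qserve-llama3-data
```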