
Author: Nomic Supercomputing Team

Run LLMs on Any GPU: GPT4All Universal GPU Support

Access to powerful machine learning models should not be concentrated in the hands of a few organizations. With GPT4All, Nomic AI has helped tens of thousands of ordinary people run LLMs on their own local computers, without the need for expensive cloud infrastructure or specialized hardware. While CPU inference with GPT4All is fast and effective, on most machines graphics processing units (GPUs) present an opportunity for faster inference. GPUs are ubiquitous in LLM training and inference because of their superior speed, but deep learning algorithms traditionally run only on top-of-the-line NVIDIA GPUs that most ordinary people (and more than a few companies) can't easily access.

Today we're excited to announce the next step in our effort to democratize access to AI: official support for quantized large language model inference on GPUs from a wide variety of vendors including AMD, Intel, Samsung, Qualcomm and NVIDIA with open-source Vulkan support in GPT4All. The Nomic AI Vulkan backend will enable accelerated inference of foundation models such as Meta's LLaMA2, Together's RedPajama, Mosaic's MPT, and many more on graphics cards found inside common edge devices.

These include modern consumer GPUs like:

As well as modern cloud inference machines, including:

You can consult this database to see if your device supports Vulkan API 1.2+.

This technology allows us to target a wide variety of devices from across the edge computing world with single code base. As long as the devices have enough RAM and compatible Vulkan drivers, these could include in the future devices like:

Using the Nomic Vulkan backend

You can currently run any LLaMA/LLaMA2 based model with the Nomic Vulkan backend in GPT4All.

Try it on your Windows, macOS, or Linux machine through the GPT4All Local LLM Chat Client.

It just works: no messy system dependency installs, no multi-gigabyte PyTorch binaries, no configuring your graphics card. GPT4All auto-detects compatible GPUs on your device and currently supports inference bindings for Python and the GPT4All Local LLM Chat Client. Other bindings are coming in the following days:


You can find Python documentation for how to explicitly target a GPU on a multi-GPU system here.

What is Vulkan?

Vulkan is a high-performance, low-level compute API that provides direct access to GPU hardware for efficient parallel processing. In contrast to high-level machine learning libraries like TensorFlow or PyTorch, which abstract away hardware details, Vulkan gives granular control over the GPU. And unlike proprietary solutions such as NVIDIA CUDA, which are limited to specific hardware, Vulkan's platform-agnostic design works across a wide variety of operating systems and hardware platforms. This broad compatibility makes Vulkan an optimal choice for deploying machine learning models in diverse hardware environments.

How fast is the Nomic Vulkan backend?

We benchmarked inference on GPUs manufactured by several hardware providers. Note that in this comparison, Nomic Vulkan is a single set of GPU kernels that runs on both AMD and NVIDIA GPUs. Nomic Vulkan already outperforms OpenCL on modern NVIDIA cards, and further improvements are imminent.

Nomic Vulkan benchmarks: single-batch inference token throughput.

When should I use the GPT4All Vulkan backend?

There are several large language model deployment options, and which one you choose depends on cost, memory, and deployment constraints. GPT4All Vulkan and CPU inference should be preferred when your LLM-powered application has:

You should currently use a specialized LLM inference server such as vLLM, FlexFlow, text-generation-inference or gpt4all-api with a CUDA backend if your application:

Nomic Vulkan in production.

Reducing ML serving costs with Nomic

If you'd like to reduce your machine learning infrastructure and serving costs, reach out to Nomic to secure early access to our Supercomputing Team's inference optimization stack. The Nomic Supercomputing inference stack allows custom machine learning architectures to run anywhere at low cost and with minimal resources.

For more information on Nomic Vulkan, read on!

What's next?

Democratized access to the building blocks of machine learning systems is crucial. In the next few GPT4All releases, the Nomic Supercomputing Team will introduce:

Nomic Vulkan License

The GPT4All Vulkan backend is released under the Software for Open Models License (SOM). The purpose of this license is to encourage the open release of machine learning models. You can find the full license text here. If an entity wants their machine learning model to be usable with the GPT4All Vulkan backend, that entity must openly release the model, including its weights and the logic required to execute it. Examples of models compatible with this license include LLaMA, LLaMA2, Falcon, MPT, T5, and fine-tuned versions of such models with openly released weights. Examples of models that are not compatible with this license, and thus cannot be used with GPT4All Vulkan, include gpt-3.5-turbo, Claude, and Bard, until they are openly released.

All code related to CPU inference of machine learning models in GPT4All retains its original open-source license.

We've thought a lot about how best to accelerate an ecosystem of open models and open model software. To design this license, we worked with Heather Meeker, a well-regarded thought leader in open-source licensing who has thought deeply about open licensing for LLMs. In the coming days we'll be talking more about this license and what we hope to achieve.

Code and Acknowledgements

You can find the Nomic Vulkan backend in our GitHub repository here.

Some examples of the GLSL kernels we've written can be found here, here, and here. If you are interested in improving the efficiency of these implementations, we are happy to provide thorough review.

We'd like to thank the ggml and llama.cpp community for a great codebase with which to launch this backend.

open-source the data, open-source the models, gpt4all.

cheers.


The Nomic Supercomputing Team has one open position. A strong candidate has a history of significant open-source contributions and experience optimizing embedded systems. Apply here.
