
The Inference Server - Llama.cpp - CUDA - NVIDIA Container - Ubuntu 22

Run AI inference on your own server for coding support, creative writing, summarizing, and more, without sharing data with other services. The Inference Server has everything you need to run state-of-the-art inference on GPU servers. It includes llama.cpp inference, the latest CUDA release, and NVIDIA Docker container support, plus support for llama-cpp-python, Open Interpreter, and the Tabby coding assistant.
Purchase this listing from Webvar in AWS Marketplace using your AWS account. In AWS Marketplace, you can quickly launch pre-configured software with just a few clicks. AWS handles billing and payments, and charges appear on your AWS bill.

About

The Inference Server provides the full infrastructure to run fast inference on GPUs.

It includes llama.cpp inference, the latest CUDA release, and the NVIDIA Container Toolkit for Docker.

Leverage the multitude of freely available models: 8-bit or lower quantization makes inference possible on GPUs with, for example, 16 GB or 24 GB of memory.
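As a rough illustration of why quantization matters for GPU memory, here is a minimal Python sketch (not part of the product; the bits-per-weight figures are approximations, and real inference needs extra room for the KV cache and scratch buffers):

```python
# Estimate the weight-file size of a quantized model to see why 8-bit or
# lower quantization fits on 16 GB or 24 GB GPUs. Values are approximate;
# GGUF files add metadata, and inference needs additional working memory.

BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K": 5.5, "Q4_K": 4.5, "Q2_K": 2.6}  # approximate

def model_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough weight size in GiB for a model at the given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1024**3

for quant in BITS_PER_WEIGHT:
    print(f"13B model at {quant}: ~{model_size_gb(13, quant):.1f} GB")
# 13B at Q8_0 is ~12.9 GB (fits a 16 GB GPU); at Q4_K it is ~6.8 GB.
```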

Llama.cpp offers efficient inference of quantized models in interactive and server modes. It features:

Plain C/C++ implementation without dependencies

2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support

Running inference on GPU and CPU simultaneously, which allows larger models to run when GPU memory alone is insufficient (see the sketch below)

AVX, AVX2 and AVX512 support for x86 architectures

Supported models: LLaMA, LLaMA 2, Falcon, Alpaca, GPT4All, Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2, Vigogne (French), Vicuna, Koala, OpenBuddy (Multilingual), Pygmalion 7B / Metharme 7B, WizardLM, Baichuan-7B and its derivations (such as baichuan-7b-sft), Aquila-7B / AquilaChat-7B, Starcoder models, Mistral AI v0.1, Refact
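To show how the GPU/CPU split works in practice, here is a minimal sketch using llama-cpp-python, which this server ships with. The model path and layer count are illustrative assumptions; n_gpu_layers controls how many transformer layers are offloaded to the GPU, and the rest run on the CPU:

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=35,  # offload 35 layers to the GPU; use -1 to offload all layers
    n_ctx=2048,       # context window size
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```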

Here is our guide: How to use the AI SP Inference Server.

In addition, the Inference Server supports:

llama-cpp-python: an OpenAI-API-compatible llama.cpp inference server (see the example after this list)

Open Interpreter: lets language models run code on your computer. An open-source, locally running implementation of OpenAI's Code Interpreter.

Tabby coding assistant: a self-hosted AI coding assistant, offering an open-source alternative to GitHub Copilot
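To illustrate the OpenAI-compatible mode, here is a hedged sketch: assuming the llama-cpp-python server has been started locally (for example with `python3 -m llama_cpp.server --model <path-to-gguf>`, which listens on port 8000 by default), the standard OpenAI Python client can talk to it without any data leaving the machine:

```python
# Minimal sketch: query a locally running llama-cpp-python server through the
# standard OpenAI Python client (openai>=1.0). The base_url, port, and model
# name below are assumptions based on the server's documented defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the local inference server
    api_key="sk-no-key-needed",           # the local server does not check the key
)

resp = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whichever model it loaded
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(resp.choices[0].message.content)
```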

Includes remote desktop access via NICE DCV high-performance remote desktop or via SSH (e.g., PuTTY).
