Llama.cpp
The llama.cpp Python library provides simple Python bindings for @ggerganov's llama.cpp. This package provides:
- Low-level access to C API via ctypes interface.
- High-level Python API for text completion
  - OpenAI-like API
  - LangChain compatibility
  - LlamaIndex compatibility
- OpenAI compatible web server
- Local Copilot replacement
- Function Calling support
- Vision API support
- Multiple Models
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ChatLlamaCpp | langchain-community | โ | โ | โ |
Model features
Tool calling | Structured output | JSON mode | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs |
---|---|---|---|---|---|---|---|---|---|
โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
Setup
To get started and use all the features shown below, we recommend using a model that has been fine-tuned for tool calling.
We will use Hermes-2-Pro-Llama-3-8B-GGUF from NousResearch.
Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities, but also excels at Function Calling.
See our guides on local models to go deeper.
Installation
The LangChain LlamaCpp integration lives in the langchain-community and llama-cpp-python packages:
%pip install -qU langchain-community llama-cpp-python
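After installing, it can be useful to confirm that both packages are importable before loading a multi-gigabyte model. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` list are hypothetical and not part of either package. Note that the import names differ from the pip distribution names.

```python
import importlib.util

# Import names differ from the pip distribution names:
# langchain-community -> langchain_community, llama-cpp-python -> llama_cpp.
REQUIRED_MODULES = ["langchain_community", "llama_cpp"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If a module is reported missing, re-running the install command above in the same environment (or kernel) is usually the first thing to try.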
Instantiation
Now we can instantiate our model object and generate chat completions:
import multiprocessing

from langchain_community.chat_models import ChatLlamaCpp

# Path to your model weights
local_model = "local/path/to/Hermes-2-Pro-Llama-3-8B-Q8_0.gguf"

llm = ChatLlamaCpp(
    temperature=0.5,
    model_path=local_model,
    n_ctx=10000,  # Context window size, in tokens.
    n_gpu_layers=8,  # Number of model layers to offload to the GPU.
    n_batch=300,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.
    max_tokens=512,  # Maximum number of tokens to generate per completion.
    # Leave one core free, but never request fewer than one thread.
    n_threads=max(1, multiprocessing.cpu_count() - 1),
    repeat_penalty=1.5,
    top_p=0.5,
    verbose=True,
)
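With the model object created, a chat completion can be requested through the standard LangChain chat model interface: `invoke` accepts a list of `(role, content)` pairs and returns a message whose text is in `.content`. The sketch below wraps that call in a small helper; the name `ask` and the default system prompt are illustrative assumptions, not part of LangChain.

```python
def ask(llm, question, system_prompt="You are a helpful assistant."):
    """Send one system + human turn to a LangChain chat model and return the reply text."""
    messages = [
        ("system", system_prompt),
        ("human", question),
    ]
    # Chat models return a message object; the generated text is in `.content`.
    reply = llm.invoke(messages)
    return reply.content


# Usage with the `llm` instantiated above (requires the GGUF file on disk):
# print(ask(llm, "Translate 'I love programming' into French."))
```

Keeping the model as a parameter means the same helper works with any LangChain chat model, which makes it easy to swap in a smaller GGUF file while experimenting.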