hf to gguf

Convert model to hf format

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the trained model and tokenizer

model = GPT2LMHeadModel.from_pretrained("./trained_model")

tokenizer = GPT2Tokenizer.from_pretrained("./trained_model")

# Define the directory to save the model and tokenizer

save_directory = "./trained_model/hf_model"

# Save the model and tokenizer

model.save_pretrained(save_directory)

tokenizer.save_pretrained(save_directory)

# Load the model and tokenizer from the saved directory

loaded_model = GPT2LMHeadModel.from_pretrained(save_directory)

loaded_tokenizer = GPT2Tokenizer.from_pretrained(save_directory)

# Test the loaded model

input_ids = loaded_tokenizer.encode("Hello, how are you?", return_tensors="pt")

outputs = loaded_model.generate(input_ids)

print(loaded_tokenizer.decode(outputs[0], skip_special_tokens=True))

Convert model to gguf

Source: https://www.substratus.ai/blog/converting-hf-model-gguf-model/

Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub library has been the most reliable. The git clone method occasionally results in OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named download.py with the following content:

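from huggingface_hub import snapshot_download

# Assumed source repo, inferred from the vicuna-13b-v1.5 file names used later; change it to the model you want
model_id = "lmsys/vicuna-13b-v1.5"

# Download every file of the repo's main branch into ./vicuna-hf as real files, not symlinks
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")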

Run the Python script:

python download.py

You should now have the model downloaded to a directory called vicuna-hf. Verify by running:

ls -lash vicuna-hf
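
The same check can be done from Python. This is a minimal sketch, assuming the snapshot has the usual HuggingFace layout (a config.json plus one or more .safetensors or .bin weight files); the helper name summarize_model_dir is hypothetical and not part of any library.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Return (has_config, weight_file_names, total_bytes) for a model directory."""
    root = Path(model_dir)
    # Collect every regular file below the directory, including subfolders
    files = [p for p in root.rglob("*") if p.is_file()]
    has_config = (root / "config.json").is_file()
    # Assumption: weights are stored as .safetensors or .bin files
    weights = sorted(p.name for p in files if p.suffix in {".safetensors", ".bin"})
    total_bytes = sum(p.stat().st_size for p in files)
    return has_config, weights, total_bytes
```

Calling summarize_model_dir("vicuna-hf") should report that config.json is present and list at least one weight file; an empty weight list suggests the download was incomplete.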

Converting the model

Convert the downloaded HuggingFace model to a GGUF model. Llama.cpp comes with a converter script to do this.

Get the script by cloning the llama.cpp repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the required python libraries:

pip install -r llama.cpp/requirements.txt

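Verify the script is there and review its options. (Newer llama.cpp checkouts no longer ship convert.py at the top level; the HuggingFace converter there is named convert_hf_to_gguf.py.)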

python llama.cpp/convert.py -h

Convert the HF model to GGUF model:

python llama.cpp/convert.py vicuna-hf \
  --outfile vicuna-13b-v1.5.gguf \
  --outtype q8_0

In this case we're also quantizing the model to 8 bit by setting --outtype q8_0. Quantizing shrinks the file and its memory footprint and helps inference speed, but it can reduce output quality. Use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve the original quality.
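
To make the trade-off concrete, here is a minimal pure-Python sketch of block-wise 8-bit quantization in the spirit of q8_0: each block of 32 weights shares one scale, and every weight is rounded to an integer in [-127, 127]. It is an illustration under those assumptions, not llama.cpp's actual code, and the function names are hypothetical.

```python
BLOCK_SIZE = 32  # q8_0 groups weights into blocks of 32 that share one scale


def quantize_block(values):
    """Quantize one block of floats to (scale, integers in [-127, 127])."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude in the block maps to 127; an all-zero block keeps scale 1.0
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats; each value is off by at most about scale / 2."""
    return [q * scale for q in quants]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16, quant_bits=8):
    """Storage cost per weight: the 8-bit values plus the shared 16-bit scale."""
    return (block_size * quant_bits + scale_bits) / block_size


if __name__ == "__main__":
    weights = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(weights)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"bits per weight: {bits_per_weight()}, worst round-trip error: {worst:.6f}")
```

The rounding error is where the quality loss comes from. At 8.5 bits per weight, a model with about 13 billion parameters comes to roughly 13.8 GB, compared with about 26 GB at 16 bits; the real file will differ somewhat because of metadata and any tensors kept at higher precision.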

Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf
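
Beyond checking that the file exists, you can confirm it really is a GGUF file by reading its header. This is a minimal sketch, assuming a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic "GGUF", a 32-bit version, a 64-bit tensor count and a 64-bit metadata key/value count; read_gguf_header is a hypothetical helper name.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)  # 4 magic bytes + uint32 + uint64 + uint64
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 kv count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling read_gguf_header("vicuna-13b-v1.5.gguf") should return a small version number and a non-zero tensor count; a ValueError means the conversion did not produce a valid file.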

Pushing the GGUF model to HuggingFace

Optionally, you can push the GGUF model back to HuggingFace.

Create a Python script named upload.py with the following content:

from huggingface_hub import HfApi

api = HfApi()

model_id = "substratusai/vicuna-13b-v1.5-gguf"

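api.create_repo(model_id, exist_ok=True, repo_type="model")

# Upload the GGUF file produced by the conversion step
api.upload_file(
    path_or_fileobj="vicuna-13b-v1.5.gguf",
    path_in_repo="vicuna-13b-v1.5.gguf",
    repo_id=model_id,
)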

Get a HuggingFace Token that has write permission from here: https://huggingface.co/settings/tokens

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>
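
Because the upload only fails once it reaches the Hub, it can help to check for the token first. This is a minimal sketch using only the standard library; it assumes the token lives in HUGGING_FACE_HUB_TOKEN (the name used above) or in HF_TOKEN (the name newer huggingface_hub releases prefer), and find_hf_token is a hypothetical helper name.

```python
import os


def find_hf_token(env=None):
    """Return the first non-empty HuggingFace token found in the environment, or None."""
    env = os.environ if env is None else env
    # Assumption: these are the variable names the installed huggingface_hub reads
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(name, "").strip()
        if token:
            return token
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No HuggingFace token set; export one before running upload.py")
    else:
        print("HuggingFace token found")
```

The env parameter exists only so the lookup can be exercised with a plain dict instead of the real environment.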

Run the upload.py script:

python upload.py


