Your own Qwen using HF

Large neural network models, with their extraordinary abilities, have become firmly rooted in our lives. Recognizing this as an opportunity for future development, large corporations began building their own versions of these models. The Chinese giant Alibaba didn’t stand aside: it created its own model, Qwen (Tongyi Qianwen), which became the basis for many other neural network models.

Prerequisites

Update cache and packages

Let’s update the package cache and upgrade the operating system before setting up Qwen. We also need to install pip, the Python package installer, if it isn’t already present on the system. Please note that this guide uses Ubuntu 22.04 LTS as the operating system:

sudo apt update && sudo apt -y upgrade && sudo apt install python3-pip
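To confirm that pip has been installed and is available, you can check its version:

pip3 --version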

Install NVIDIA® drivers

You can use the automated utility that is included in Ubuntu distributions by default:

sudo ubuntu-drivers autoinstall

Alternatively, you can install NVIDIA® drivers manually using our step-by-step guide. Don’t forget to reboot the server:

sudo shutdown -r now
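After the server comes back online, you can verify that the driver loaded correctly. The nvidia-smi utility should list every installed GPU:

nvidia-smi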

Text generation web UI

Clone the repository

Open the working directory on the SSD:

cd /mnt/fastdisk
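Here, /mnt/fastdisk is the SSD mount point used throughout this guide; substitute your own path if it differs. Since GGUF model files occupy tens of gigabytes, it is worth checking the available space first:

df -h /mnt/fastdisk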

Clone the project’s repository:

git clone https://github.com/oobabooga/text-generation-webui.git

Install requirements

Open the downloaded directory:

cd text-generation-webui

Check and install all missing components:

pip install -r requirements.txt
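The command above installs the dependencies system-wide. If you prefer to keep them isolated, you can install them into a virtual environment instead (a minimal sketch, assuming the python3-venv package is available in your distribution):

sudo apt install -y python3-venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt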

Add SSH key to HF

Before starting, you need to set up port forwarding (remote port 7860 to 127.0.0.1:7860) in your SSH client. You can find additional information in the following article: Connect to Linux server.
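With the standard OpenSSH command-line client, such a tunnel can be created at connection time; for example (replace the username and address with your own):

ssh -L 7860:127.0.0.1:7860 usergpu@gpuserver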

Update the package cache repository and installed packages:

sudo apt update && sudo apt -y upgrade

Generate an SSH key pair that you can use with Hugging Face:

mkdir -p ~/.ssh && cd ~/.ssh && ssh-keygen

Once the key pair has been generated, you can display the public key in the terminal:

cat id_rsa.pub

Copy all information starting from ssh-rsa and ending with usergpu@gpuserver as shown in the following screenshot:

Copy RSA key

Open a web browser, type https://huggingface.co/ into the address bar and press Enter. Log into your HF account and open Profile settings. Then choose SSH & GPG Keys and click the Add SSH Key button:

Add SSH key

Fill in the Key name and paste the copied SSH Public key from the terminal. Save the key by pressing Add key:

Paste the key
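To confirm that the key was registered correctly, you can open a test SSH session to Hugging Face. On success, the server greets you by username and closes the connection:

ssh -T git@hf.co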

Now your HF account is linked with the public SSH key; the second part, the private key, is stored on the server. The next step is to install the Git LFS (Large File Storage) extension, which Git uses to download large files such as neural network models. Open your home directory:

cd ~/

Download and run the shell script, which adds a third-party package repository providing git-lfs:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

Now, you can install it using the standard package manager:

sudo apt-get install git-lfs
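Depending on the package version, the Git LFS hooks may not be set up automatically for your user. You can initialize them explicitly; the command is safe to run even if they are already in place:

git lfs install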

Let’s configure Git to use your HF nickname:

git config --global user.name "John"

And the email address linked to your HF account:

git config --global user.email "john.doe@example.com"
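You can double-check both values before moving on:

git config --global --list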

Download the model

The next step is to download the model using the repository-cloning technique familiar to software developers. The only difference is that the previously installed Git LFS automatically processes the marked pointer files and downloads all the content. Open the target directory (/mnt/fastdisk in our example):

cd /mnt/fastdisk

This command may take some time to complete:

git clone git@hf.co:Qwen/Qwen1.5-32B-Chat-GGUF
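Once the clone finishes, it is worth checking that the model files were actually downloaded rather than left as small LFS pointer files (the exact filenames depend on the quantization variants published in the repository):

ls -lh /mnt/fastdisk/Qwen1.5-32B-Chat-GGUF

If any .gguf file is only a few hundred bytes in size, it is still a pointer; run git lfs pull inside the repository to fetch the real content.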

Run the model

Return to the text-generation-webui directory and execute the script that starts the web server, specifying /mnt/fastdisk as the directory containing the models. On first launch, the script may download some additional components.

cd /mnt/fastdisk/text-generation-webui
./start_linux.sh --model-dir /mnt/fastdisk

Open your web browser, navigate to http://127.0.0.1:7860 (through the SSH tunnel set up earlier), and select llama.cpp from the Model loader drop-down list:

llama.cpp settings

Be sure to set the n-gpu-layers parameter, which controls how many of the model’s layers are offloaded to the GPU. If you leave it at 0, all calculations are performed on the CPU, which is quite slow. Once all parameters are set, click the Load button. After that, go to the Chat tab and select Instruct mode. Now, you can enter any prompt and receive a response:

Qwen chat example
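The same settings can also be supplied on the command line so that the model loads automatically at startup. The flags below match current text-generation-webui releases, and the model filename (including whether it needs the subdirectory prefix) is illustrative; check ./start_linux.sh --help on your version:

./start_linux.sh --model-dir /mnt/fastdisk --model Qwen1.5-32B-Chat-GGUF/qwen1_5-32b-chat-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 99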

By default, processing is distributed across all available GPUs, taking into account the previously specified parameters:

Qwen task GPU loading
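While a response is being generated, you can monitor GPU utilization and memory consumption from a second terminal session:

watch -n 1 nvidia-smi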


Updated: 28.03.2025

Published: 20.01.2025