<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>LeaderGPU® | GPU solutions for high-performance computing</title>
    <link>https://www.leadergpu.com</link>
    <description>A catalog of solutions in which you will find the best libraries, programs and tools for high-performance computing in different categories and areas.</description>
    <language>en</language>
    <item>
      <title>Qwen3-Coder: A Broken Paradigm</title>
      <link>https://www.leadergpu.com/catalog/628-qwen3-coder-a-broken-paradigm</link>
      <description>&lt;p&gt;We’re used to thinking that open-source models always lag behind their commercial counterparts in quality. It may seem that they’re developed exclusively by enthusiasts who cannot afford to invest vast sums in creating high-quality datasets and training models on tens of thousands of modern GPUs.&lt;/p&gt;
&lt;p&gt;It’s a different story when large corporations like OpenAI, Anthropic, or Meta take on the task. They not only have the resources but also the world’s top neural network specialists. Unfortunately, the models they create, especially the latest versions, are closed-source. Developers explain this by citing the risks of uncontrolled use and the need to ensure AI safety.&lt;/p&gt;
&lt;p&gt;On one hand, their reasoning is understandable: many ethical questions remain unresolved, and the very nature of neural network models allows only indirect influence on the final output. On the other hand, keeping models closed and offering access only through their own API is also a solid business model.&lt;/p&gt;
&lt;p&gt;Not all companies behave this way, however. For instance, the French company Mistral AI offers both commercial and open-source models, enabling researchers and enthusiasts to use them in their projects. But special attention should be paid to the achievements of Chinese companies, most of which build open-weight and open-source models capable of seriously competing with proprietary solutions.&lt;/p&gt;
&lt;h2&gt;DeepSeek, Qwen3, and Kimi K2&lt;/h2&gt;
&lt;p&gt;The first major breakthrough came with DeepSeek-V3. This large language model from DeepSeek AI was built on the Mixture of Experts (MoE) approach with an impressive 671B parameters, of which only the 37B most relevant are activated for each token. Most importantly, all its components (model weights, inference code, and training pipelines) were released openly.&lt;/p&gt;
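&lt;p&gt;To illustrate the idea, here’s a toy sketch (not DeepSeek’s actual code) of how MoE routing activates only a few experts per token; the sizes and the gating function are invented for demonstration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    # The router scores every expert for the current token
    logits = x @ gate_w                        # (num_experts,)
    top = np.argsort(logits)[-top_k:]          # keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over the chosen experts
    # Only the selected experts run; the others stay idle for this token
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, num_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(dim, dim)): x @ W for _ in range(num_experts)]
gate_w = rng.normal(size=(dim, num_experts))
print(moe_layer(rng.normal(size=dim), experts, gate_w).shape)  # (8,)&lt;/code&gt;&lt;/pre&gt;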
&lt;p&gt;That openness instantly made DeepSeek-V3 one of the most attractive LLMs for AI application developers and researchers alike. The next headline-grabber was DeepSeek-R1 - the first open-source reasoning model. On its release day, it rattled the U.S. stock market after its developers claimed that training such an advanced model had cost only $6 million.&lt;/p&gt;
&lt;p&gt;While the hype around DeepSeek eventually cooled down, the next releases were no less important for the global AI industry. We’re talking, of course, about Qwen 3. We covered its features in detail in our &lt;a href=&quot;https://www.leadergpu.com/catalog/624-what-s-new-in-qwen-3&quot; target=&quot;_blank&quot;&gt;What&#39;s new in Qwen 3&lt;/a&gt; review, so we won’t linger on it here. Soon after, another player appeared: Kimi K2 from Moonshot AI.&lt;/p&gt;
&lt;p&gt;With its MoE architecture, 1T parameters (32B activated per token), and open-source code, Kimi K2 quickly drew community attention. Rather than focusing on reasoning, Moonshot AI aimed for state-of-the-art performance in mathematics, programming, and deep cross-disciplinary knowledge.&lt;/p&gt;
&lt;p&gt;The ace up Kimi K2’s sleeve was its optimization for integration into AI agents. This network was literally designed to make full use of all available tools. It excels in tasks requiring not only code writing but also iterative testing at each development stage. However, it has weaknesses too, which we’ll discuss later.&lt;/p&gt;
&lt;p&gt;Kimi K2 is a large language model in every sense. Running the full-size version requires ~2 TB of VRAM (FP8: ~1 TB). For obvious reasons, this isn’t something you can do at home, and even many GPU servers won’t handle it. The model needs at least 8 NVIDIA® H200 accelerators. Quantized versions can help, but at a noticeable cost to accuracy.&lt;/p&gt;
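&lt;p&gt;These figures are easy to sanity-check with back-of-the-envelope arithmetic: the weights alone take roughly two bytes per parameter in FP16 and one in FP8, and a real deployment also needs headroom for the KV cache and activations:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;params = 1.0e12                     # Kimi K2: ~1T total parameters
bytes_fp16, bytes_fp8 = 2, 1

print(params * bytes_fp16 / 1e12)   # ~2.0 TB for FP16 weights
print(params * bytes_fp8 / 1e12)    # ~1.0 TB for FP8 weights

h200_vram_gb = 141                  # NVIDIA H200: 141 GB of HBM3e
print(8 * h200_vram_gb)             # 1128 GB: 8 GPUs cover the FP8 weights&lt;/code&gt;&lt;/pre&gt;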
&lt;h2&gt;Qwen3-Coder&lt;/h2&gt;
&lt;p&gt;Seeing Moonshot AI’s success, Alibaba developed its own Kimi K2-like model, but with significant advantages that we’ll discuss shortly. Initially, it was released in two versions:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct&quot; target=&quot;_blank&quot;&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; (~250 GB VRAM)&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8&quot; target=&quot;_blank&quot;&gt;Qwen3-Coder-480B-A35B-Instruct-FP8&lt;/a&gt; (~120 GB VRAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A few days later, smaller models without the reasoning mechanism appeared, requiring far less VRAM:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct&quot; target=&quot;_blank&quot;&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; (~32 GB VRAM)&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8&quot; target=&quot;_blank&quot;&gt;Qwen3-Coder-30B-A3B-Instruct-FP8&lt;/a&gt; (~18 GB VRAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Qwen3-Coder was designed for integration with development tools. It includes a special parser for function calls (qwen3coder_tool_parser.py, analogous to OpenAI’s function calling). Alongside the model, a console utility was released, capable of tasks ranging from code compilation to querying a knowledge base. The idea isn’t new: essentially, it’s a heavily reworked fork of Google’s Gemini CLI.&lt;/p&gt;
&lt;p&gt;The model is compatible with the OpenAI API, allowing it to be deployed locally or on a remote server and connected to most systems that support this API. This includes both ready-made client apps and machine learning libraries. That makes it viable not only for the B2C segment but also for B2B, offering a seamless drop-in replacement for OpenAI’s product without any changes to application logic.&lt;/p&gt;
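&lt;p&gt;As a quick sketch of what “drop-in” means in practice, the standard openai Python client can talk to a locally served Qwen3-Coder simply by changing the base URL. The address and model path below assume the vLLM server we launch later in this guide:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
# launched in the inference section below; nothing else changes.
client = OpenAI(base_url=&quot;http://127.0.0.1:8000/v1&quot;, api_key=&quot;none&quot;)

response = client.chat.completions.create(
    model=&quot;/home/usergpu/Qwen3-30B&quot;,   # vLLM serves the model under its path
    messages=[
        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful coding assistant.&quot;},
        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Write a function that reverses a string.&quot;},
    ],
    max_tokens=180,
)
print(response.choices[0].message.content)&lt;/code&gt;&lt;/pre&gt;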
&lt;p&gt;One of its most in-demand features is extended context length. By default, it supports 256K tokens, and this can be increased to 1M using the &lt;b translate=&quot;no&quot;&gt;YaRN&lt;/b&gt; (Yet another RoPE extensioN) mechanism. Modern LLMs are typically trained on short sequences (2K–8K tokens), so at much larger context lengths they tend to lose track of earlier content.&lt;/p&gt;
&lt;p&gt;YaRN is an elegant “trick” that makes the model think it’s working with its usual short sequences while actually processing much longer ones. The key idea is to “stretch” or “dilate” the positional space while preserving the mathematical structure the model expects. This allows effective processing of sequences tens of thousands of tokens long without retraining or extra memory required by traditional context extension methods.&lt;/p&gt;
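&lt;p&gt;The real YaRN mechanism rescales different RoPE frequency bands unevenly and adjusts attention scaling, but the core “stretch” idea can be sketched in a few lines: positions are compressed by a factor so that a much longer input still lands inside the angle range the model saw during training:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Rotary embeddings turn each position into a set of rotation angles
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    # Dividing positions by the scale factor stretches the positional space
    return np.outer(positions / scale, inv_freq)

short = rope_angles(np.arange(8192), 64)               # training-length input
long = rope_angles(np.arange(32768), 64, scale=4.0)    # 4x longer input
print(short.max(), long.max())   # both ~8191: the long input stays in range&lt;/code&gt;&lt;/pre&gt;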
&lt;h2&gt;Downloading and Running Inference&lt;/h2&gt;
&lt;p&gt;Make sure you’ve installed CUDA® beforehand, either using NVIDIA®’s official instructions or the &lt;a href=&quot;https://www.leadergpu.com/articles/615-install-cuda-toolkit-in-linux&quot; target=&quot;_blank&quot;&gt;Install CUDA® toolkit in Linux&lt;/a&gt; guide. To check for the required compiler:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nvcc --version&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Expected output:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0&lt;/pre&gt;
&lt;p&gt;If you get:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Command &#39;nvcc&#39; not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit&lt;/pre&gt;
&lt;p&gt;you need to add the CUDA® binaries to your system’s $PATH.&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export PATH=/usr/local/cuda-12.4/bin:$PATH&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a temporary solution. For a permanent one, edit &lt;b translate=&quot;no&quot;&gt;~/.bashrc&lt;/b&gt; and add the same two lines at the end.&lt;/p&gt;
&lt;p&gt;Now, prepare your system for managing virtual environments. You can use Python’s built-in venv or the more advanced Miniforge. Assuming Miniforge is installed:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda create -n venv python=3.10&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda activate venv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install PyTorch with CUDA® support matching your system:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then install the essential libraries:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Transformers&lt;/b&gt; – Hugging Face’s main model library&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Accelerate&lt;/b&gt; – enables multi-GPU inference&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;HuggingFace Hub&lt;/b&gt; – for downloading/uploading models &amp; datasets&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Safetensors&lt;/b&gt; – safe model weight format&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;vLLM&lt;/b&gt; – recommended inference library for Qwen&lt;/li&gt;
&lt;/ul&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install transformers accelerate huggingface_hub safetensors vllm&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the model:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;hf download Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir ./Qwen3-30B&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run inference with tensor parallelism, which splits each layer’s tensors across multiple GPUs (8 in this example):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;python -m vllm.entrypoints.openai.api_server \
--model /home/usergpu/Qwen3-30B \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--dtype auto \
--host 0.0.0.0 \
--port 8000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This launches the vLLM OpenAI API Server.&lt;/p&gt;
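&lt;p&gt;Before connecting any clients, you can confirm the server is alive by listing the models it exposes; this is a standard endpoint of the OpenAI-compatible API. A quick check in Python:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import requests

# List the models the vLLM OpenAI API Server is currently serving
resp = requests.get(&quot;http://127.0.0.1:8000/v1/models&quot;, timeout=10)
resp.raise_for_status()
for model in resp.json()[&quot;data&quot;]:
    print(model[&quot;id&quot;])   # should print /home/usergpu/Qwen3-30B&lt;/code&gt;&lt;/pre&gt;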
&lt;h2&gt;Testing and Integration&lt;/h2&gt;
&lt;h3&gt;cURL&lt;/h3&gt;
&lt;p&gt;Install &lt;b translate=&quot;no&quot;&gt;jq&lt;/b&gt; for pretty-printing JSON:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;sudo apt -y install jq&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Test the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;curl -s http://127.0.0.1:8000/v1/chat/completions -H &quot;Content-Type: application/json&quot; -d &#39;{
  &quot;model&quot;: &quot;/home/usergpu/Qwen3-30B&quot;,
  &quot;messages&quot;: [
    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful assistant.&quot;},
    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Hello! What can you do?&quot;}
  ],
  &quot;max_tokens&quot;: 180
}&#39; | jq -r &#39;.choices[0].message.content&#39;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VSCode&lt;/h3&gt;
&lt;p&gt;To integrate with &lt;b translate=&quot;no&quot;&gt;Visual Studio Code&lt;/b&gt;, install the &lt;b translate=&quot;no&quot;&gt;Continue&lt;/b&gt; extension and add the following entry under the &lt;b translate=&quot;no&quot;&gt;models&lt;/b&gt; section of &lt;b translate=&quot;no&quot;&gt;config.yaml&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;- name: Qwen3-Coder 30B
  provider: openai
  apiBase: http://[server_IP_address]:8000/v1
  apiKey: none
  model: /home/usergpu/Qwen3-30B
  roles:
    - chat
    - edit
    - apply&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/183/original/sh_qwen3_coder_a_broken_paradigm_1.png?1755000294&quot; alt=&quot;Continue extension&quot;&gt;
&lt;h3&gt;Qwen-Agent&lt;/h3&gt;
&lt;p&gt;For a GUI-based setup with Qwen-Agent (including RAG, MCP, and code interpreter):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install -U &quot;qwen-agent[gui,rag,code_interpreter,mcp]&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the nano editor:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;nano script.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example Python script to launch Qwen-Agent with a Gradio WebUI:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from qwen_agent.agents import Assistant
from qwen_agent.gui import WebUI

llm_cfg = {
    &#39;model&#39;: &#39;/home/usergpu/Qwen3-30B&#39;,
    &#39;model_server&#39;: &#39;http://localhost:8000/v1&#39;,
    &#39;api_key&#39;: &#39;EMPTY&#39;,
    &#39;generate_cfg&#39;: {&#39;top_p&#39;: 0.8},
}

tools = [&#39;code_interpreter&#39;]

bot = Assistant(
    llm=llm_cfg,
    system_message=&quot;You are a helpful coding assistant.&quot;,
    function_list=tools
)

WebUI(bot).run()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;python script.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server will be available at: http://127.0.0.1:7860&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/184/original/sh_qwen3_coder_a_broken_paradigm_2.png?1755000323&quot; alt=&quot;Qwen-Agent with tools&quot;&gt;
&lt;p&gt;You can also integrate Qwen3-Coder into agent frameworks like CrewAI for automating complex tasks with toolsets such as web search or vector database memory.&lt;/p&gt;
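&lt;p&gt;As a minimal sketch of that integration, a CrewAI agent can be pointed at the same vLLM endpoint. The exact model prefix depends on your CrewAI/LiteLLM version, so treat the identifiers below as assumptions to adapt:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from crewai import Agent, LLM

# Route a CrewAI agent through the local OpenAI-compatible vLLM server;
# the openai/ prefix tells the library to use that protocol.
qwen = LLM(
    model=&quot;openai//home/usergpu/Qwen3-30B&quot;,
    base_url=&quot;http://127.0.0.1:8000/v1&quot;,
    api_key=&quot;none&quot;,
)

coder = Agent(
    role=&quot;Senior Python Developer&quot;,
    goal=&quot;Write and iteratively test small utilities&quot;,
    backstory=&quot;An experienced engineer who favors simple, tested code.&quot;,
    llm=qwen,
)&lt;/code&gt;&lt;/pre&gt;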
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/627-how-to-install-crewai-with-gui&quot; target=&quot;_blank&quot;&gt;How to install CrewAI with GUI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot; target=&quot;_blank&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/602-how-to-monitor-langflow-application&quot; target=&quot;_blank&quot;&gt;How to monitor LangFlow application&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/182/original/il_qwen3_coder_a_broken_paradigm.png?1755000263"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 12 Aug 2025 14:11:06 +0200</pubDate>
      <guid isPermaLink="false">628</guid>
      <dc:date>2025-08-12 14:11:06 +0200</dc:date>
    </item>
    <item>
      <title>How to install CrewAI with GUI</title>
      <link>https://www.leadergpu.com/catalog/627-how-to-install-crewai-with-gui</link>
      <description>&lt;p&gt;The capabilities of neural network models are growing every day. Researchers and commercial companies are investing more and more into training them. But on their own, these models can’t act autonomously. To solve specific tasks, they need guidance: context extension and direction setting. This approach isn’t always efficient, especially for complex problems.&lt;/p&gt;
&lt;p&gt;But what if we allowed a neural network to act autonomously? And what if we provided it with many tools to interact with the external world? You’d get an AI agent capable of solving tasks by independently determining which tools to use. Sounds complicated, but it works very well. However, even for an advanced user, creating an AI agent from scratch can be a non-trivial task.&lt;/p&gt;
&lt;p&gt;The reason is that most popular libraries lack a graphical user interface. They require interaction through a programming language like Python. This drastically raises the entry threshold and makes AI agents too complex for independent implementation. This is exactly the case with CrewAI.&lt;/p&gt;
&lt;h2&gt;What is CrewAI&lt;/h2&gt;
&lt;p&gt;CrewAI is a very popular and convenient library, but it doesn’t come with a GUI by default. This prompted independent developers to create an unofficial interface. The open source nature of CrewAI made the task much easier, and soon the community released the project CrewAI Studio.&lt;/p&gt;
&lt;p&gt;Developers and enthusiasts gained deeper insight into the system’s architecture and could build tools tailored to specific tasks. Regular users could create AI agents without writing a single line of code. It became easier to assign tasks and manage access to neural networks and tools. It also allowed for exporting and importing agents from server to server and sharing them with friends, colleagues, or the open source community.&lt;/p&gt;
&lt;p&gt;A separate advantage of CrewAI Studio is its deployment flexibility. It can be installed as a regular app or as a Docker container - the preferred method since it includes all necessary libraries and components for running the system.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;Update your OS packages and installed apps to the latest versions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use the automatic driver installation script or follow our guide &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot; target=&quot;_blank&quot;&gt;Install NVIDIA® drivers in Linux&lt;/a&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server for changes to take effect:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After reconnecting via SSH, install Apache 2 web server utilities, which will give you access to the &lt;b translate=&quot;no&quot;&gt;.htpasswd&lt;/b&gt; file generator used for basic user authentication:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install -y apache2-utils&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install Docker Engine using the official shell script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://get.docker.com/ | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add Docker Compose to the system:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install -y docker-compose&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clone the repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/strnad/CrewAI-Studio.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd CrewAI-Studio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a &lt;b translate=&quot;no&quot;&gt;.htpasswd&lt;/b&gt; file for the &lt;b translate=&quot;no&quot;&gt;usergpu&lt;/b&gt; user. You’ll be prompted to enter a password twice:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;htpasswd -c .htpasswd usergpu&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now edit the container deployment file. By default, there are two containers:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano docker-compose.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Delete the section:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;ports:
  - &quot;5432:5432&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And add the following service:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;nginx:
  image: nginx:latest
  container_name: crewai_nginx
  ports:
    - &quot;80:80&quot;
  volumes:
    - ./nginx.conf:/etc/nginx/nginx.conf:ro
    - ./.htpasswd:/etc/nginx/.htpasswd:ro
  depends_on:
    - web&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nginx will need a config file, so create one:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano nginx.conf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste in the following:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;events {}

http {
  server {
    listen 80;

    location / {
      proxy_pass http://web:8501;

      # WebSocket headers
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection &quot;upgrade&quot;;

      # Forward headers
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;

      auth_basic &quot;Restricted Content&quot;;
      auth_basic_user_file /etc/nginx/.htpasswd;
    }
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All important service variables for CrewAI are defined in the &lt;b translate=&quot;no&quot;&gt;.env&lt;/b&gt; file. Open the &lt;b translate=&quot;no&quot;&gt;.env_example&lt;/b&gt; file for editing:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano .env_example&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the following lines:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;OLLAMA_HOST=&quot;http://open-webui:11434&quot;
OLLAMA_MODELS=&quot;ollama/llama3.2:latest&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And add Postgres config:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;POSTGRES_USER=&quot;admin&quot;
POSTGRES_PASSWORD=&quot;your_password&quot;
POSTGRES_DB=&quot;crewai_db&quot;
AGENTOPS_ENABLED=&quot;False&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now copy the example file and rename it to &lt;b translate=&quot;no&quot;&gt;.env&lt;/b&gt; so the system can read it during container deployment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cp .env_example .env&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we’ll use local models with inference handled by Ollama. We recommend our guide &lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot; target=&quot;_blank&quot;&gt;Open WebUI: All in one&lt;/a&gt;, and during deployment add &lt;b translate=&quot;no&quot;&gt;-e OLLAMA_HOST=0.0.0.0&lt;/b&gt; to allow CrewAI to connect directly to the Ollama container. Download the desired model (e.g., llama3.2:latest) via WebUI or by connecting to the container console and running:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;ollama pull llama3.2:latest&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once everything is set up, launch the deployment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker-compose up -d --build&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, visiting &lt;b translate=&quot;no&quot;&gt;http://[your_server_ip]/&lt;/b&gt; will prompt for login credentials. Once they are entered correctly, the CrewAI interface will appear.&lt;/p&gt;
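&lt;p&gt;You can also verify the authentication wall without a browser. A small sketch using Python’s requests library (replace the placeholder address and password with your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import requests

url = &quot;http://[your_server_ip]/&quot;   # replace with your server address

# Anonymous requests should be rejected by Nginx with 401
print(requests.get(url, timeout=10).status_code)

# The .htpasswd credentials created earlier should return 200
print(requests.get(url, auth=(&quot;usergpu&quot;, &quot;your_password&quot;), timeout=10).status_code)&lt;/code&gt;&lt;/pre&gt;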
&lt;h2&gt;Features&lt;/h2&gt;
&lt;p&gt;Let’s explore the key entities CrewAI uses. This will help you understand how to configure workflows. The central entity is the &lt;b translate=&quot;no&quot;&gt;Agent&lt;/b&gt;, an autonomous task executor. Each agent has attributes that help it fulfill its duties:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Role&lt;/b&gt;. A brief, 2-3 word job description.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Backstory&lt;/b&gt;. Optional; helps the language model understand how the agent should behave and what experiences to rely on.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Goal&lt;/b&gt;. The objective the agent should pursue.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Allow delegation&lt;/b&gt;. Enables the agent to delegate tasks (or parts of them) to others.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Verbose&lt;/b&gt;. Tells the agent to log detailed actions.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;LLM Provider and Model&lt;/b&gt;. Specifies the model and provider to use.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Temperature&lt;/b&gt;. Determines response creativity. Higher = more creative.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Max iterations&lt;/b&gt;. Number of tries the agent has to succeed, acting as a safeguard (e.g., against infinite loops).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agents operate by iteratively analyzing input, reasoning, and drawing conclusions using available tools.&lt;/p&gt;
&lt;p&gt;Input is defined by a &lt;b translate=&quot;no&quot;&gt;Task&lt;/b&gt; entity. Each task includes a description, an assigned agent, and, optionally, an expected result. Tasks run sequentially by default but can be parallelized using the &lt;b translate=&quot;no&quot;&gt;Async execution&lt;/b&gt; flag.&lt;/p&gt;
&lt;p&gt;Autonomous agent work is supported by &lt;b translate=&quot;no&quot;&gt;Tools&lt;/b&gt; that enable real-world interaction. CrewAI includes tools for web searches, site parsing, API calls, and file handling, enhancing context and helping agents achieve goals.&lt;/p&gt;
&lt;p&gt;Lastly, there is the &lt;b translate=&quot;no&quot;&gt;Crew entity&lt;/b&gt;. It unites agents with different roles into a team to tackle complex problems. They can communicate, delegate, review, and correct one another, essentially forming a collective intelligence.&lt;/p&gt;
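&lt;p&gt;CrewAI Studio builds all of this visually, but the same entities exist in the underlying Python library. A minimal sketch, assuming a recent crewai release, looks like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from crewai import Agent, Task, Crew

analyst = Agent(
    role=&quot;Oncology Drug Pipeline Analyst&quot;,
    goal=&quot;Track new cancer drug developments up to clinical trials&quot;,
    backstory=&quot;A pharma analyst who follows early-stage research closely.&quot;,
    verbose=True,
    allow_delegation=False,
)

report = Task(
    description=&quot;Summarize this month&#39;s notable oncology pipeline updates.&quot;,
    expected_output=&quot;A short bulleted summary with sources.&quot;,
    agent=analyst,
)

# A Crew unites agents and tasks into one team and runs them
crew = Crew(agents=[analyst], tasks=[report])
print(crew.kickoff())&lt;/code&gt;&lt;/pre&gt;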
&lt;h2&gt;Using&lt;/h2&gt;
&lt;p&gt;Now that you’re familiar with the entities, let’s build and run a minimal CrewAI workflow. In this example, we’ll track global progress in cancer drug development.&lt;/p&gt;
&lt;p&gt;We’ll use three agents:&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Oncology Drug Pipeline Analyst&lt;/b&gt; - tracks new developments from early stages to clinical trials.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Regulatory and Approval Watchdog&lt;/b&gt; - monitors new drug approvals and regulatory changes.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Scientific Literature and Innovation Scout&lt;/b&gt; - scans scientific publications and patents related to oncology.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the Agents section and create the first agent:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/174/original/sh_how_to_install_crewai_with_gui_1.png?1753263006&quot; alt=&quot;Agent creation&quot;&gt;
&lt;p&gt;For now, we’re using the previously downloaded &lt;b translate=&quot;no&quot;&gt;llama3.2:latest&lt;/b&gt; model, but in a real scenario, choose the one that best fits the task. Repeat the process for the remaining agents and move on to task creation.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/175/original/sh_how_to_install_crewai_with_gui_2.png?1753263034&quot; alt=&quot;Task creation&quot;&gt;
&lt;p&gt;Gather all agents into a crew and assign the prepared task to them:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/176/original/sh_how_to_install_crewai_with_gui_3.png?1753263057&quot; alt=&quot;Crew creation&quot;&gt;
&lt;p&gt;Activate necessary tools from the list:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/177/original/sh_how_to_install_crewai_with_gui_4.png?1753263096&quot; alt=&quot;Tools selection&quot;&gt;
&lt;p&gt;Finally, go to the &lt;b translate=&quot;no&quot;&gt;Kickoff!&lt;/b&gt; page and click &lt;b translate=&quot;no&quot;&gt;Run Crew!&lt;/b&gt; After some iterations, the system will return a result, such as:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/178/original/sh_how_to_install_crewai_with_gui_5.png?1753263118&quot; alt=&quot;Example CrewAI result&quot;&gt;
&lt;p&gt;Before we finish, let’s check the &lt;b translate=&quot;no&quot;&gt;Import/export&lt;/b&gt; section. Your workflow or crew can be exported as JSON to transfer to another CrewAI server. You can also create a Single-Page Application (SPA) with a single click - perfect for production deployment:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/179/original/sh_how_to_install_crewai_with_gui_6.png?1753263147&quot; alt=&quot;Import and export settings&quot;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;CrewAI significantly simplifies the creation of AI agents, allowing integration into any application or standalone use. The library is based on the idea of distributed intelligence, where each agent is a domain expert, and the combined team outperforms a single generalist agent.&lt;/p&gt;
&lt;p&gt;Since it’s written in Python, CrewAI integrates easily with ML platforms and tools. Its open source nature allows for extension through third-party modules. Inter-agent communication reduces token usage by distributing context processing.&lt;/p&gt;
&lt;p&gt;As a result, complex tasks are completed faster and more efficiently. The lower entry barrier provided by CrewAI Studio expands the reach of AI agents and multi-agent systems. And support for local models ensures better control over sensitive data.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/622-how-to-install-n8n&quot;&gt;How to install N8N&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/623-mcp-server-based-on-n8n&quot;&gt;MCP server based on N8N&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/180/original/il_how_to_install_crewai_with_gui.png?1753275220"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 23 Jul 2025 15:05:43 +0200</pubDate>
      <guid isPermaLink="false">627</guid>
      <dc:date>2025-07-23 15:05:43 +0200</dc:date>
    </item>
    <item>
      <title>What&#39;s new in Qwen 3</title>
      <link>https://www.leadergpu.com/catalog/624-what-s-new-in-qwen-3</link>
      <description>&lt;p&gt;The global AI race is accelerating. Research institutions, private corporations, and even entire nations are now competing for leadership in the AI domain. Broadly speaking, this race can be divided into several phases. The first stage involved the creation of narrow AI. Existing neural network models such as GPT, MidJourney, and AlphaFold show that this stage has been successfully achieved.&lt;/p&gt;
&lt;p&gt;The next step envisions the evolution of AI into AGI (Artificial General Intelligence). AGI should match human intelligence in solving a wide range of tasks, from writing stories and performing scientific calculations to understanding social situations and learning independently. As of the time of writing, this level has not been reached yet.&lt;/p&gt;
&lt;p&gt;The ultimate stage in AI development is referred to as ASI (Artificial Super Intelligence). It would far exceed human capabilities in all areas. This would make it possible to develop technologies we can’t even imagine today and to manage global systems with a precision beyond human capabilities. However, this might only become a reality after decades (or even centuries) of continuous advancement.&lt;/p&gt;
&lt;p&gt;As a result, most AI race participants are focused on reaching AGI while retaining control over it. The development of AGI is closely tied to a host of complex technical, ethical, and legal challenges. Still, the potential rewards far outweigh the costs, which is why corporations like Alibaba Group are investing heavily in this area.&lt;/p&gt;
&lt;p&gt;The release of &lt;a href=&quot;https://github.com/QwenLM/Qwen3&quot; target=&quot;_blank&quot;&gt;Qwen 3&lt;/a&gt; marks a significant milestone not only for one company’s neural networks but also on the global stage. Compared to its predecessor, the model introduces several important innovations.&lt;/p&gt;
&lt;h2&gt;Features&lt;/h2&gt;
&lt;p&gt;Qwen 2.5 was pretrained on a dataset of 18T tokens, while the new model has doubled that amount to 36T tokens. This much larger dataset has significantly improved the base model’s accuracy. Interestingly, in addition to publicly available internet data gathered through parsing, the system was also trained on PDF documents. These are typically well-structured and knowledge-dense, which helps the model provide more accurate answers and better understand complex formulations.&lt;/p&gt;
&lt;p&gt;One of the most promising directions in AI development is building models capable of reasoning, which can expand the task context through an iterative process. On one hand, this allows for more comprehensive problem-solving, but on the other hand, reasoning tends to slow the process down considerably. Therefore, the developers of Qwen 3 have introduced two operational modes:&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Thinking mode.&lt;/b&gt; The model builds up context step-by-step before providing a final answer. This makes it possible to tackle complex problems that require deep understanding.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Non-thinking mode.&lt;/b&gt; The model responds almost instantly but may produce more superficial answers without in-depth analysis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This manual control over model behavior enhances user experience for handling many routine tasks. Reducing the use of thinking mode also significantly lowers GPU load, allowing more tokens to be processed within the same time frame.&lt;/p&gt;
&lt;p&gt;In addition to this binary choice, there’s also a soft-switching mechanism. This hybrid behavior allows the model to adapt to context using internal weighting mechanisms. If the model deems a task difficult, it will automatically trigger reasoning or even self-verification. It can also respond to user cues such as “Let’s think step by step”.&lt;/p&gt;
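&lt;p&gt;According to the Qwen 3 model cards, both switches are exposed through the chat template. A short sketch with Transformers; the enable_thinking flag and the /no_think tag are taken from Qwen’s documentation:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(&quot;Qwen/Qwen3-8B&quot;)
messages = [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Prove that sqrt(2) is irrational.&quot;}]

# Hard switch: disable the reasoning phase for this prompt entirely
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,   # True (the default) enables thinking mode
)

# Soft switch: steer a single turn with /think or /no_think in the message
messages[0][&quot;content&quot;] += &quot; /no_think&quot;&lt;/code&gt;&lt;/pre&gt;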
&lt;p&gt;Another significant improvement is expanded multilingual support. While Qwen 2.5 supported only 29 languages, version 3 can now understand and generate text in 119 languages and dialects. This has greatly improved instruction following and contextual comprehension. As a result, Qwen 3 can now be effectively used in non-English environments.&lt;/p&gt;
&lt;p&gt;In addition, Qwen 3 is now significantly better integrated with MCP servers, giving the model tools to dive deeper into problem-solving and execute actions. It can now interact with external sources and manage complex processes directly.&lt;/p&gt;
&lt;h2&gt;Model training&lt;/h2&gt;
&lt;h3&gt;Pre-Training&lt;/h3&gt;
&lt;p&gt;Such a substantial leap forward wouldn’t have been possible without a multi-stage training system. Initially, the model was pretrained on 30T tokens with a 4K context length, allowing it to acquire general knowledge and basic language skills.&lt;/p&gt;
&lt;p&gt;This was followed by a refinement stage using more scientific and well-structured data. During this stage, the model also gained the ability to effectively write applications in multiple programming languages.&lt;/p&gt;
&lt;p&gt;Finally, it was trained on a high-quality dataset with extended context. As a result, Qwen 3 now supports an effective context length of 128K tokens, which is roughly 350 pages of typed text, depending on the language. For instance, Cyrillic-based languages often have shorter tokens due to morphology and the use of prefixes, suffixes, etc.&lt;/p&gt;
&lt;h3&gt;Reasoning Pipeline&lt;/h3&gt;
&lt;p&gt;Building reasoning-capable models is a fascinating but labor-intensive process that combines various existing techniques aimed at simulating human thought. Based on publicly available information, we can assume that Qwen 3’s reasoning training involved four main stages:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Cold start for long chains of thought.&lt;/b&gt; Training the model to break problems into multiple steps without prior adaptation. This helps it learn iterative thinking and develop a basic layer of reasoning skills.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Reinforcement learning based on reasoning.&lt;/b&gt; At this stage, rewards depend not only on the final answer but also on how well the model constructs logical, interpretable, and structured reasoning chains. The absence of errors and hallucinations is also evaluated.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Merging reasoning modes.&lt;/b&gt; Humans typically rely on two thinking styles: fast (intuitive) and slow (analytical). Depending on the task type, the neural model should learn to both switch between and integrate these styles. This is usually done using examples that mix both styles or through special tokens indicating which style to apply.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;General reinforcement learning.&lt;/b&gt; This final stage resembles a sandbox environment where the model learns to interact with tools, perform multi-step tasks, and develop adaptive behavior. Here, it also becomes attuned to user preferences.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Qwen 3 is a major milestone for Alibaba Group. Its training quality and methodology make it a serious contender against established players like OpenAI and Anthropic. The improvements over the previous version are substantial.&lt;/p&gt;
&lt;p&gt;An added benefit is its open-source nature, with the codebase publicly available on GitHub under the Apache 2.0 license.&lt;/p&gt;
&lt;p&gt;Further development of the Qwen model family will help strengthen its position in the global AI arena and narrow the gap with closed-source commercial models. And all current achievements are, in one way or another, steps toward humanity’s progress in building AGI.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/579-qwen-2-vs-llama-3&quot;&gt;Qwen 2 vs Llama 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/168/original/il_whats_new_in_qwen_3.png?1752240562"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 14 Jul 2025 08:05:08 +0200</pubDate>
      <guid isPermaLink="false">624</guid>
      <dc:date>2025-07-14 08:05:08 +0200</dc:date>
    </item>
    <item>
      <title>MCP server based on N8N</title>
      <link>https://www.leadergpu.com/catalog/623-mcp-server-based-on-n8n</link>
      <description>&lt;p&gt;The development of generative neural networks has accelerated significantly in recent years. They’ve become noticeably faster and more accurate in their responses and have learned to reason. However, their capabilities are still fundamentally limited by their architecture. For example, every existing LLM at the time of writing has a knowledge cutoff date. This means that with each passing day, such an LLM becomes more likely to produce incorrect answers, simply because it lacks information about events that occurred after that date.&lt;/p&gt;
&lt;p&gt;This limitation necessitates retraining the model entirely on fresher data, which is expensive and time-consuming. But there is another way. If you enable the model to interact with the outside world, it can independently find and update the information requested during a user conversation, without requiring retraining.&lt;/p&gt;
&lt;p&gt;This is roughly how the RAG (Retrieval Augmented Generation) mechanism works. When answering a question, the model first queries a pre-prepared vector database, and if it finds relevant information, it incorporates it into the prompt. Thus, by expanding and updating the vector DB, the quality of LLM responses can be greatly improved.&lt;/p&gt;
&lt;p&gt;But there is another, even more interesting way to embed up-to-date context into prompts. It’s called MCP, which stands for Model Context Protocol. It was originally developed by Anthropic for its Claude model. The key moment came when the source code for MCP was made open-source, allowing thousands of AI researchers to build custom servers for various purposes.&lt;/p&gt;
&lt;p&gt;The essence of MCP is to give a neural network model access to tools with which it can independently update its knowledge and perform various actions to efficiently solve given tasks. The model itself decides which tool to use and whether it’s appropriate in each situation.&lt;/p&gt;
&lt;p&gt;Support for MCP soon appeared in various IDEs like Cursor, as well as in automation platforms like N8N. The latter is especially intuitive, as workflows are created visually, making it easier to understand. Within N8N, you can either connect to an existing MCP server or create your own. Moreover, you can even organize a direct connection within a single workflow. But let’s go step by step.&lt;/p&gt;
&lt;h2&gt;Creating a Simple AI Agent&lt;/h2&gt;
&lt;p&gt;Before getting started, make sure the main requirement is met: you have an LLM ready for connections. This could be a locally running model using Ollama or an external service like OpenAI’s ChatGPT. In the first case, you’ll need to know the local Ollama API address (and optionally its authentication), and in the second case, you’ll need an active OpenAI account with sufficient credits.&lt;/p&gt;
&lt;p&gt;Building an agent starts with the key AI Agent node. At a minimum, it must be linked with two other nodes, one to act as a trigger, and the other to connect to the LLM. If you don’t specify a trigger, the system will create one automatically, triggering the agent upon receiving any message in the internal chat:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/158/original/sh_mcp_server_based_on_n8n_1.png?1751458377&quot; alt=&quot;AI Agent only&quot;&gt;
&lt;p&gt;The only missing piece is the LLM. For instance, you can use our &lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot;&gt;Open WebUI: All in one&lt;/a&gt; guide to set up Ollama with a web interface. The only change required is that the containers for N8N and Open WebUI must be on the same network. For example, if the N8N container is on a network named &lt;b translate=&quot;no&quot;&gt;web&lt;/b&gt;, then in the deployment command for Open WebUI, replace &lt;b translate=&quot;no&quot;&gt;--network=host&lt;/b&gt; with &lt;b translate=&quot;no&quot;&gt;--network=web&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;In some cases, you will also need to manually set the &lt;b translate=&quot;no&quot;&gt;OLLAMA_HOST&lt;/b&gt; environment variable, for example: &lt;b translate=&quot;no&quot;&gt;-e OLLAMA_HOST=0.0.0.0&lt;/b&gt;. This allows connections to the Ollama API not only from localhost but also from other containers. Suppose Ollama is deployed in a container named &lt;b translate=&quot;no&quot;&gt;open-webui&lt;/b&gt;. Then the base URL for connecting from N8N would be:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;http://open-webui:11434&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before connecting the Ollama Chat Model node, don’t forget to download at least one model. You can do this either from the web interface or via the container CLI. The following command will download the Llama 3.1 model with 8 billion parameters:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ollama pull llama3.1:8b&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once downloaded and installed, the model will automatically appear in the list of available ones:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/159/original/sh_mcp_server_based_on_n8n_2.png?1751458416&quot; alt=&quot;Model select&quot;&gt;
&lt;p&gt;A minimal working AI Agent workflow looks like this:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/160/original/sh_mcp_server_based_on_n8n_3.png?1751458451&quot; alt=&quot;Minimal working AI Agent&quot;&gt;
&lt;p&gt;In this form, the agent can use only one model and doesn’t store input data or enhance prompts using external tools. So it makes sense to add at least the &lt;b translate=&quot;no&quot;&gt;Simple Memory&lt;/b&gt; node; for light loads, it’s sufficient for storing requests and responses.&lt;/p&gt;
&lt;p&gt;But let&#39;s go back to MCP. To start, create a server using the special &lt;b translate=&quot;no&quot;&gt;MCP Server Trigger&lt;/b&gt; node:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/161/original/sh_mcp_server_based_on_n8n_4.png?1751458483&quot; alt=&quot;MCP Server Trigger only&quot;&gt;
&lt;p&gt;This node is fully self-contained and doesn’t require external activation. It’s triggered solely by an incoming external request to its webhook address. By default, there are two URLs: &lt;b translate=&quot;no&quot;&gt;Test URL&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;Production URL&lt;/b&gt;. The first is used during development, while the second works only when the workflow is saved and activated.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/162/original/sh_mcp_server_based_on_n8n_5.png?1751458569&quot; alt=&quot;MCP Server Trigger settings&quot;&gt;
&lt;p&gt;The trigger is useless on its own; it needs connected tools. For example, let’s connect one of the simplest tools: a calculator. It will expect a mathematical expression as input. Nodes communicate using plain JSON, so for the calculator to compute 2 + 2, the input should be:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;[
  {
    &quot;query&quot;: {
      &quot;input&quot;: &quot;2 + 2&quot;
    }
  }
]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LLMs can easily generate such JSON from plain text task descriptions and send them to the node, which performs the calculations and returns the result. Let’s connect the MCP client to the agent:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/163/original/sh_mcp_server_based_on_n8n_6.png?1751458623&quot; alt=&quot;AI Agent with tools&quot;&gt;
&lt;p&gt;It’s worth noting that this node doesn’t need any additional connections. In its settings, it’s enough to specify the endpoint address where it will send data from the AI Agent. In our example, this address points to the container named &lt;b translate=&quot;no&quot;&gt;n8n&lt;/b&gt;.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/164/original/sh_mcp_server_based_on_n8n_7.png?1751458735&quot; alt=&quot;MCP Client Settings&quot;&gt;
&lt;p&gt;Of course, at this stage you can specify any external MCP server address available to you. But for this article, we’ll use a local instance running within N8N. Let’s see how the client and server behave when the AI Agent is asked to perform a simple math operation:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/165/original/sh_mcp_server_based_on_n8n_8.png?1751458808&quot; alt=&quot;MCP Client calculations example&quot;&gt;
&lt;p&gt;Upon receiving the request, the AI Agent will:&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;Search in Simple Memory to see if the user asked this before or if any context can be reused.&lt;/li&gt;
    &lt;li&gt;Send the prompt to the LLM, which will correctly break down the math expression and prepare the corresponding JSON.&lt;/li&gt;
    &lt;li&gt;Send the JSON to the Calculator tool and receive the result.&lt;/li&gt;
    &lt;li&gt;Use the LLM to generate the final response and insert the result into the reply.&lt;/li&gt;
    &lt;li&gt;Store the result in Simple Memory.&lt;/li&gt;
    &lt;li&gt;Output the message in the chat.&lt;/li&gt;
&lt;/ol&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/166/original/sh_mcp_server_based_on_n8n_9.png?1751458859&quot; alt=&quot;MCP Client calculations JSON&quot;&gt;
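&lt;p&gt;Outside of N8N, step 3 of this chain is easy to picture. Here’s a toy Python sketch of what the Calculator tool conceptually does with the JSON produced by the LLM (the real node is an N8N built-in, not this code):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import ast
import json

# Parse the payload the LLM prepared for the tool
payload = json.loads(&#39;[{&quot;query&quot;: {&quot;input&quot;: &quot;2 + 2&quot;}}]&#39;)
expression = payload[0][&quot;query&quot;][&quot;input&quot;]

# ast.literal_eval safely evaluates simple constant expressions like this one
result = ast.literal_eval(expression)
print(json.dumps({&quot;result&quot;: result}))   # {&quot;result&quot;: 4}&lt;/code&gt;&lt;/pre&gt;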
&lt;p&gt;Similarly, agents can work with other tools on the MCP server. Instead of Simple Memory, you can use more advanced options like MongoDB, Postgres, Redis, or even something like Zep. Of course, these require minimal database maintenance, but overall performance will increase significantly.&lt;/p&gt;
&lt;p&gt;There are also far more options for tool selection. Out of the box, the &lt;b translate=&quot;no&quot;&gt;MCP Server Trigger&lt;/b&gt; node supports over 200 tools. These can be anything, from simple HTTP requests to prebuilt integrations with public internet services. Within a single workflow, you can create both a server and a client. One important thing to note: these nodes can’t be visually connected in the editor, and that’s expected behavior:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/167/original/sh_mcp_server_based_on_n8n_10.png?1751458939&quot; alt=&quot;MCP Server and Client with tools&quot;&gt;
&lt;p&gt;Instead of the default trigger, you can use other options such as receiving a message via a messenger, submitting a website form, or executing on a schedule. This lets you set up workflows that react to events or perform routine operations like daily data exports from Google Ads.&lt;/p&gt;
&lt;p&gt;And that’s not the end of what’s possible with AI agents. You can build multi-agent systems using different neural network models that work together to solve tasks with greater accuracy, considering many more influencing factors in the process.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/622-how-to-install-n8n&quot;&gt;How to install N8N&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/602-how-to-monitor-langflow-application&quot;&gt;How to monitor LangFlow application&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/157/original/il_mcp_server_based_on_n8n.png?1751457996"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 02 Jul 2025 15:28:18 +0200</pubDate>
      <guid isPermaLink="false">623</guid>
      <dc:date>2025-07-02 15:28:18 +0200</dc:date>
    </item>
    <item>
      <title>How to install N8N</title>
      <link>https://www.leadergpu.com/catalog/622-how-to-install-n8n</link>
      <description>&lt;p&gt;AI agents in 2025 remain one of the most promising approaches for solving complex tasks using large language models. These agents are autonomous and capable of selecting various tools on their own to accomplish assigned tasks. This approach enables achieving results with less human involvement and higher quality. It also opens up opportunities for discovering more original and effective ways of dealing with problems.&lt;/p&gt;
&lt;p&gt;Instead of just formulating a task, you instruct the neural network to solve it independently, based on the resources allocated to it. However, for this scheme to work, there needs to be a mechanism that connects neural network interfaces with various tools, whether it’s web search or a vector database for storing intermediate results.&lt;/p&gt;
&lt;p&gt;n8n is an automation platform that supports integration with various neural networks and public services. Users can visually design how data will be processed and what final result needs to be achieved. Unlike classic no-code solutions, n8n allows arbitrary code to be included at any stage of the process, which is especially useful when built-in functionality is not sufficient.&lt;/p&gt;
&lt;p&gt;The result is a system that combines the simplicity of no-code with the flexibility of traditional programming. However, to fully understand it, you&#39;ll still need to spend some time exploring and reviewing workflow examples for better comprehension. In this article, we’ll walk you through how to deploy n8n on LeaderGPU servers.&lt;/p&gt;
&lt;h2&gt;Preparing the server&lt;/h2&gt;
&lt;h3&gt;Update the system&lt;/h3&gt;
&lt;p&gt;Update the package list and upgrade all installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Automatically install the recommended NVIDIA® driver (proprietary) or use our step-by-step guide &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot; target=&quot;_blank&quot;&gt;Install NVIDIA® drivers in Linux&lt;/a&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install Docker&lt;/h3&gt;
&lt;p&gt;You can use the official installation script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://get.docker.com/ | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s add the NVIDIA® container toolkit GPG key and repository for Docker integration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&amp;&amp; curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed &#39;s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g&#39; | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package list and install the NVIDIA® container toolkit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install nvidia-container-toolkit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Docker to apply the changes and enable the installed toolkit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl restart docker&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install n8n&lt;/h3&gt;
&lt;p&gt;To let n8n store its data persistently, create a volume before launching the container:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker volume create n8n_data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s launch a container that will open port 5678 for external connections and mount the created &lt;b translate=&quot;no&quot;&gt;n8n_data&lt;/b&gt; volume to the directory &lt;b translate=&quot;no&quot;&gt;/home/node/.n8n&lt;/b&gt; inside the container:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -d --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time you launch the application, you might be puzzled by the following error message:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/152/original/sh_how_to_install_n8n_1.png?1750667132&quot; alt=&quot;TLS-error N8N&quot;&gt;
&lt;p&gt;This isn’t exactly an error; it’s more of a warning about how to properly configure the system for access. The issue is that, by default, the system doesn’t have a TLS/HTTPS certificate, and without one the connection won’t be secure. You have three options:&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Connect your own certificate&lt;/b&gt;. You can do this by specifying the paths to the certificate files via environment variables, or by configuring a reverse proxy server.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Create an SSH tunnel and forward port 5678&lt;/b&gt; to localhost on the computer you’re connecting from. This way, you’ll immediately get a secure personal connection. However, no one else will be able to access the server externally.&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;Bypass the warning&lt;/b&gt;. If this is a test server not intended for production use and you don’t care about security, you can disable the warning by setting the &lt;b translate=&quot;no&quot;&gt;N8N_SECURE_COOKIE&lt;/b&gt; environment variable to &lt;b translate=&quot;no&quot;&gt;FALSE&lt;/b&gt;. This is strongly discouraged as it makes the server vulnerable to potential attacks. Still, it might be acceptable in specific scenarios.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This article will explore each option in detail so you can choose the right one.&lt;/p&gt;
&lt;h2&gt;Connecting to the server&lt;/h2&gt;
&lt;p&gt;If you don’t yet have an SSL certificate, we recommend ordering one from &lt;a href=&quot;https://www.leaderssl.com/&quot; target=&quot;_blank&quot;&gt;LeaderSSL&lt;/a&gt;. It can be used for any website or online store, or to verify an email’s authenticity.&lt;/p&gt;
&lt;h3&gt;Using Environment Variables&lt;/h3&gt;
&lt;p&gt;The simplest way to configure HTTPS is to upload your certificate to the server and specify it via Docker environment variables. Start by creating a directory for the certificate files:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir ~/n8n-certs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can upload these files (typically cert.crt and privkey.key) to this directory using any convenient method, for example with scp as shown below. For more detailed info, see:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/495-file-exchange-from-windows&quot; target=&quot;_blank&quot;&gt;File exchange from Windows&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/494-file-exchange-from-linux&quot; target=&quot;_blank&quot;&gt;File exchange from Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/496-file-exchange-from-macos&quot; target=&quot;_blank&quot;&gt;File exchange from macOS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
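&lt;p&gt;As a minimal example, from a Linux or macOS machine both files could be copied with scp (the username, server IP, and file names below are placeholders):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;scp cert.crt privkey.key username@your_server_ip:~/n8n-certs/&lt;/code&gt;&lt;/pre&gt;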
&lt;p&gt;Now, let’s launch the container using one full command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -d \
--name n8n \
-p 5678:5678 \
-v n8n_data:/home/node/.n8n \
-v ~/n8n-certs:/certs \
-e N8N_PROTOCOL=https \
-e N8N_SSL_CERT=&quot;/certs/cert.crt&quot; \
-e N8N_SSL_KEY=&quot;/certs/privkey.key&quot; \
docker.n8n.io/n8nio/n8n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s a breakdown of each argument:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;sudo docker run -d&lt;/b&gt; launches the Docker container in daemon (background) mode&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--name n8n&lt;/b&gt; assigns the container the name &lt;b translate=&quot;no&quot;&gt;n8n&lt;/b&gt;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-p 5678:5678&lt;/b&gt; forwards port &lt;b translate=&quot;no&quot;&gt;5678&lt;/b&gt; to the container&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-v n8n_data:/home/node/.n8n&lt;/b&gt; creates and mounts a volume named &lt;b translate=&quot;no&quot;&gt;n8n_data&lt;/b&gt; to the hidden directory &lt;b translate=&quot;no&quot;&gt;/home/node/.n8n&lt;/b&gt; inside the container&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-v ~/n8n-certs:/certs&lt;/b&gt; mounts the certificate directory&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-e N8N_PROTOCOL=https&lt;/b&gt; forces N8N to use the HTTPS protocol&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-e N8N_SSL_CERT=&quot;/certs/cert.crt&quot;&lt;/b&gt; sets the path to the certificate file&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-e N8N_SSL_KEY=&quot;/certs/privkey.key&quot;&lt;/b&gt; sets the path to the certificate key&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;docker.n8n.io/n8nio/n8n&lt;/b&gt; specifies the container image source&lt;/li&gt;
&lt;/ul&gt;
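&lt;p&gt;To make sure the container picked up the certificate and started without errors, you can follow its logs (assuming the container name n8n from the command above):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker logs -f n8n&lt;/code&gt;&lt;/pre&gt;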
&lt;h3&gt;Traefik&lt;/h3&gt;
&lt;p&gt;A slightly more complex but more flexible setup involves using the Traefik reverse proxy server to secure the connection to N8N. The configuration file below is based on the official method described in the documentation. First, install the &lt;b translate=&quot;no&quot;&gt;docker-compose&lt;/b&gt; tool:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install docker-compose&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’ll deploy Traefik and N8N together, so they need to be on the same network. Create a network called &lt;b translate=&quot;no&quot;&gt;web&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker network create web&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, create a &lt;b translate=&quot;no&quot;&gt;docker-compose.yml&lt;/b&gt; file to define and run both containers:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano docker-compose.yml&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;services:
  traefik:
    image: &quot;traefik&quot;
    container_name: &quot;proxy&quot;
    restart: always
    command:
      - &quot;--api.insecure=true&quot;
      - &quot;--providers.docker=true&quot;
      - &quot;--providers.docker.exposedbydefault=false&quot;
      - &quot;--entrypoints.web.address=:80&quot;
      - &quot;--entrypoints.web.http.redirections.entryPoint.to=websecure&quot;
      - &quot;--entrypoints.web.http.redirections.entrypoint.scheme=https&quot;
      - &quot;--entrypoints.websecure.address=:443&quot;
      - &quot;--certificatesresolvers.mytlschallenge.acme.tlschallenge=true&quot;
      - &quot;--certificatesresolvers.mytlschallenge.acme.email=${SSL_EMAIL}&quot;
      - &quot;--certificatesresolvers.mytlschallenge.acme.storage=/letsencrypt/acme.json&quot;
    ports:
      - &quot;80:80&quot;
      - &quot;443:443&quot;
    volumes:
      - traefik_data:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - web

  n8n:
    image: docker.n8n.io/n8nio/n8n
    container_name: &quot;n8n&quot;
    restart: always
    ports:
      - &quot;127.0.0.1:5678:5678&quot;
    labels:
      - traefik.enable=true
      - traefik.http.routers.n8n.rule=Host(`${SUBDOMAIN}.${DOMAIN_NAME}`)
      - traefik.http.routers.n8n.tls=true
      - traefik.http.routers.n8n.entrypoints=web,websecure
      - traefik.http.routers.n8n.tls.certresolver=mytlschallenge
      - traefik.http.middlewares.n8n.headers.SSLRedirect=true
      - traefik.http.middlewares.n8n.headers.STSSeconds=315360000
      - traefik.http.middlewares.n8n.headers.browserXSSFilter=true
      - traefik.http.middlewares.n8n.headers.contentTypeNosniff=true
      - traefik.http.middlewares.n8n.headers.forceSTSHeader=true
      - traefik.http.middlewares.n8n.headers.SSLHost=${DOMAIN_NAME}
      - traefik.http.middlewares.n8n.headers.STSIncludeSubdomains=true
      - traefik.http.middlewares.n8n.headers.STSPreload=true
      - traefik.http.routers.n8n.middlewares=n8n@docker
    environment:
      - N8N_HOST=${SUBDOMAIN}.${DOMAIN_NAME}
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      - NODE_ENV=production
      - WEBHOOK_URL=https://${SUBDOMAIN}.${DOMAIN_NAME}/
      - GENERIC_TIMEZONE=${GENERIC_TIMEZONE}
    volumes:
      - n8n_data:/home/node/.n8n
      - ./local-files:/files
    networks:
      - web

volumes:
  n8n_data:
  traefik_data:

networks:
  web:
    name: web&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition to the &lt;b translate=&quot;no&quot;&gt;docker-compose.yml&lt;/b&gt; file, we will create another file named &lt;b translate=&quot;no&quot;&gt;.env&lt;/b&gt;. This file will contain variables such as the domain name and email address used to request an SSL certificate from Let&#39;s Encrypt. If we ever need to change something, like the domain name, we&#39;ll only need to update it in this file and then recreate the container.&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano .env&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;DOMAIN_NAME=example.com
SUBDOMAIN=n8n
GENERIC_TIMEZONE=Europe/Amsterdam
SSL_EMAIL=user@example.com&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, deploy both containers:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker-compose up -d&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, N8N is available at &lt;b translate=&quot;no&quot;&gt;https://n8n.example.com&lt;/b&gt;.&lt;/p&gt;
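&lt;p&gt;If the page doesn’t open over HTTPS right away, certificate issuance may still be in progress. You can follow it in the logs of the Traefik container, which the compose file above names proxy:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker logs -f proxy&lt;/code&gt;&lt;/pre&gt;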
&lt;h3&gt;Nginx Proxy Manager&lt;/h3&gt;
&lt;p&gt;Unlike Traefik, which is configured via files, Nginx Proxy Manager offers a user-friendly web interface. However, it doesn’t detect services dynamically; you must add them manually. Still, it works well for static services like N8N.&lt;/p&gt;
&lt;p&gt;Create another &lt;b translate=&quot;no&quot;&gt;docker-compose.yml&lt;/b&gt; file in a separate directory with the following content:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;services:
  app:
    image: &#39;jc21/nginx-proxy-manager:latest&#39;
    container_name: proxy
    restart: unless-stopped
    ports:
      - &#39;80:80&#39;
      - &#39;443:443&#39;
      - &#39;81:81&#39;
    volumes:
      - ./data:/data
      - ./letsencrypt:/etc/letsencrypt
    networks:
      - web

  n8n:
    image: docker.n8n.io/n8nio/n8n
    container_name: n8n
    restart: unless-stopped
    environment:
      - N8N_HOST=n8n.example.com
      - N8N_PORT=5678
      - WEBHOOK_URL=https://n8n.example.com/
      - N8N_PROTOCOL=http
    volumes:
      - n8n_data:/home/node/.n8n
    networks:
      - web

volumes:
  n8n_data:

networks:
  web:
    external: true&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deploy with:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker-compose up -d&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then open the web interface at: &lt;b translate=&quot;no&quot;&gt;http://your_hostname_or_ip:81&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Username: &lt;b translate=&quot;no&quot;&gt;admin@example.com&lt;/b&gt;&lt;/li&gt;
    &lt;li&gt;Password: &lt;b translate=&quot;no&quot;&gt;changeme&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You’ll be prompted to update your credentials. After that, open &lt;b translate=&quot;no&quot;&gt;Hosts → Proxy Hosts → Add Proxy Host&lt;/b&gt;, enter your domain name (e.g., &lt;b translate=&quot;no&quot;&gt;n8n.example.com&lt;/b&gt;):&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/153/original/sh_how_to_install_n8n_2.png?1750667229&quot; alt=&quot;Add domain N8N&quot;&gt;
&lt;p&gt;Fill in the necessary fields:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Set &lt;b translate=&quot;no&quot;&gt;Forward Hostname / IP&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;n8n&lt;/b&gt;.&lt;/li&gt;
    &lt;li&gt;Set &lt;b translate=&quot;no&quot;&gt;Forward Port&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;5678&lt;/b&gt;.&lt;/li&gt;
    &lt;li&gt;Under the &lt;b translate=&quot;no&quot;&gt;SSL&lt;/b&gt; tab, choose &lt;b translate=&quot;no&quot;&gt;Request a new SSL certificate with Let’s Encrypt&lt;/b&gt;.&lt;/li&gt;
    &lt;li&gt;Enter your email and agree to the terms.&lt;/li&gt;
    &lt;li&gt;Enable &lt;b translate=&quot;no&quot;&gt;Websockets Support&lt;/b&gt;.&lt;/li&gt;
    &lt;li&gt;Optionally enable &lt;b translate=&quot;no&quot;&gt;Force SSL&lt;/b&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After pressing the &lt;b translate=&quot;no&quot;&gt;Save&lt;/b&gt; button, the certificate will be requested and installed:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/154/original/sh_how_to_install_n8n_3.png?1750667362&quot; alt=&quot;Nginx Proxy Manager ready&quot;&gt;
&lt;p&gt;Once done, opening your domain will lead to the N8N interface.&lt;/p&gt;
&lt;h3&gt;SSH tunnel&lt;/h3&gt;
&lt;p&gt;If you don’t need external access to N8N, you can forward port 5678 via SSH. This encrypts all traffic, and N8N will be available at &lt;b translate=&quot;no&quot;&gt;http://localhost:5678/&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;&lt;i&gt;Note: This setup won’t work for integrations with external services like messengers that require public HTTPS access.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;The easiest way to forward the port is with the popular SSH client &lt;a href=&quot;https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html&quot; target=&quot;_blank&quot;&gt;PuTTY&lt;/a&gt;. Once installed, open &lt;b translate=&quot;no&quot;&gt;SSH → Tunnels&lt;/b&gt;, set &lt;b translate=&quot;no&quot;&gt;Source port&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;5678&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;Destination&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;localhost:5678&lt;/b&gt;, then click &lt;b translate=&quot;no&quot;&gt;Add&lt;/b&gt;.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/155/original/sh_how_to_install_n8n_4.png?1750667446&quot; alt=&quot;PuTTY port forwarding&quot;&gt;
&lt;p&gt;Go back to &lt;b translate=&quot;no&quot;&gt;Session&lt;/b&gt;, enter your server’s IP, and click &lt;b translate=&quot;no&quot;&gt;Open&lt;/b&gt;. Once authenticated, the tunnel is active. Open &lt;b translate=&quot;no&quot;&gt;http://localhost:5678&lt;/b&gt; in a browser to access N8N.&lt;/p&gt;
&lt;p&gt;&lt;i&gt;Note: The connection only works while the SSH session is active. Closing PuTTY will terminate the tunnel.&lt;/i&gt;&lt;/p&gt;
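&lt;p&gt;On Linux or macOS, the same tunnel can be created with a single OpenSSH command (replace the username and IP address with your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ssh -L 5678:localhost:5678 username@your_server_ip&lt;/code&gt;&lt;/pre&gt;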
&lt;h3&gt;Bypass&lt;/h3&gt;
&lt;p&gt;This method is not recommended for use on public networks. If you launch the container with the &lt;b translate=&quot;no&quot;&gt;N8N_SECURE_COOKIE=false&lt;/b&gt; environment variable, the warning will disappear, and you’ll get access via HTTP:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -d --name n8n -p 5678:5678 -e N8N_SECURE_COOKIE=false -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Warning:&lt;/b&gt; this exposes the N8N admin panel via unencrypted HTTP, making it vulnerable to MITM (Man-In-The-Middle) attacks and potentially allowing an attacker to fully take over your server.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot;&gt;Open WebUI: All in one&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/602-how-to-monitor-langflow-application&quot;&gt;How to monitor LangFlow application&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/151/original/il_how_to_install_n8n.png?1750667003"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 23 Jun 2025 14:30:26 +0200</pubDate>
      <guid isPermaLink="false">622</guid>
      <dc:date>2025-06-23 14:30:26 +0200</dc:date>
    </item>
    <item>
      <title>Triton™ Inference Server</title>
      <link>https://www.leadergpu.com/catalog/614-triton-inference-server</link>
      <description>&lt;p&gt;Business requirements may vary, but they all share one core principle: systems must operate quickly and deliver the highest possible quality. When dealing with neural network inference, efficient use of computing resources becomes crucial. Any GPU underutilization or idle time directly translates to financial losses.&lt;/p&gt;
&lt;p&gt;Consider a marketplace as an example. These platforms host numerous products, each with multiple attributes: text descriptions, technical specifications, categories, and multimedia content like photos and videos. All content requires moderation to maintain fair conditions for sellers and prevent prohibited goods or illegal content from appearing on the platform.&lt;/p&gt;
&lt;p&gt;While manual moderation is possible, it’s slow and inefficient. In today’s competitive environment, sellers need to expand their product range quickly: the faster items appear on the marketplace, the better the chances of them being discovered and purchased. Manual moderation is also costly and prone to human error, potentially allowing inappropriate content through.&lt;/p&gt;
&lt;p&gt;Automatic moderation using specially trained neural networks offers a solution. This approach brings multiple benefits: it substantially reduces moderation costs while typically improving quality. Neural networks process content much faster than humans, allowing sellers to clear the moderation stage more quickly, especially when handling large product volumes.&lt;/p&gt;
&lt;p&gt;The approach does have its challenges. Implementing automated moderation requires developing and training neural network models, demanding both skilled personnel and substantial computing resources. However, the benefits become apparent quickly after initial implementation. Adding automated model deployment can significantly streamline ongoing operations.&lt;/p&gt;
&lt;h2&gt;Inference&lt;/h2&gt;
&lt;p&gt;Assume we’ve figured out the machine learning procedures. The next step is determining how to run model inference on a rented server. For a single model, you typically choose a tool that works well with the specific framework it was built on. However, when dealing with multiple models created in different frameworks, you have two options.&lt;/p&gt;
&lt;p&gt;You can either convert all models to a single format, or choose a tool that supports multiple frameworks. Triton™ Inference Server fits perfectly with the second approach. It supports the following backends:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;TensorRT™&lt;/li&gt;
    &lt;li&gt;TensorRT-LLM&lt;/li&gt;
    &lt;li&gt;vLLM&lt;/li&gt;
    &lt;li&gt;Python&lt;/li&gt;
    &lt;li&gt;PyTorch (LibTorch)&lt;/li&gt;
    &lt;li&gt;ONNX Runtime&lt;/li&gt;
    &lt;li&gt;TensorFlow&lt;/li&gt;
    &lt;li&gt;FIL&lt;/li&gt;
    &lt;li&gt;DALI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, you can use any application as a backend. For instance, if you need post-processing with a C/C++ application, you can integrate it seamlessly.&lt;/p&gt;
&lt;h2&gt;Scaling&lt;/h2&gt;
&lt;p&gt;Triton™ Inference Server efficiently manages computing resources on a single server by running multiple models simultaneously and distributing the workload across GPUs.&lt;/p&gt;
&lt;p&gt;Installation is done through a Docker container. DevOps engineers can control GPU allocation at startup, choosing to use all GPUs or limit their number. While the software doesn’t handle horizontal scaling directly, you can use traditional load balancers like HAProxy or deploy applications in a Kubernetes cluster for this purpose.&lt;/p&gt;
&lt;h2&gt;Preparing the system&lt;/h2&gt;
&lt;p&gt;To set up Triton™ on a LeaderGPU server running Ubuntu 22.04, first update the system using this command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, install the NVIDIA® drivers using the autoinstaller script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server to apply the changes:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the server is back online, install Docker using the following installation script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://get.docker.com/ | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since Docker can’t pass through GPUs to containers by default, you’ll need the NVIDIA® Container Toolkit. Add the NVIDIA® repository by downloading and registering its GPG key:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&amp;&amp; curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed &#39;s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g&#39; | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package cache and install the toolkit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install nvidia-container-toolkit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Docker to enable the new capabilities:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl restart docker&lt;/code&gt;&lt;/pre&gt;
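&lt;p&gt;To verify that containers can now access the GPUs, you can run nvidia-smi inside a disposable CUDA® container (the image tag below is just an example):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi&lt;/code&gt;&lt;/pre&gt;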
&lt;p&gt;The operating system is now ready to use.&lt;/p&gt;
&lt;h2&gt;Installing Triton™ Inference Server&lt;/h2&gt;
&lt;p&gt;Let’s download the project repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/triton-inference-server/server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This repository contains pre-configured neural network samples and a model download script. Navigate to the examples directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd server/docs/examples&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the models by running the following script, which will save them to &lt;b translate=&quot;no&quot;&gt;~/server/docs/examples/model_repository&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./fetch_models.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Triton™ Inference Server’s architecture requires models to be stored separately. You can store them either locally in any server directory or on network storage. When starting the server, you’ll need to mount this directory to the container at the /models mount point. This serves as a repository for all model versions.&lt;/p&gt;
&lt;p&gt;Launch the container with this command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ~/server/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:25.01-py3 tritonserver --model-repository=/models&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s what each parameter does:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--gpus=all&lt;/b&gt; specifies that all available GPUs will be used in the server;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--rm&lt;/b&gt; destroys the container after the process completes or is stopped;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-p8000:8000&lt;/b&gt; forwards port 8000 to receive HTTP requests;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-p8001:8001&lt;/b&gt; forwards port 8001 to receive gRPC requests;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-p8002:8002&lt;/b&gt; forwards port 8002 to request metrics;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-v ~/server/docs/examples/model_repository:/models&lt;/b&gt; forwards the directory with models;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;nvcr.io/nvidia/tritonserver:25.01-py3&lt;/b&gt; specifies the container image from the NGC™ catalog;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;tritonserver --model-repository=/models&lt;/b&gt; launches the Triton™ Inference Server with the model repository located at /models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The command output will show all available models in the repository, each ready to accept requests:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
+----------------------+---------+--------+&lt;/pre&gt;
&lt;p&gt;The three services have been successfully launched on ports 8000, 8001, and 8002:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;I0217 08:00:34.930188 1 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0217 08:00:34.930393 1 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
I0217 08:00:34.972340 1 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002&lt;/pre&gt;
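&lt;p&gt;You can additionally confirm readiness from another terminal using Triton’s standard HTTP health endpoint, which returns HTTP 200 once the server is ready to accept requests:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -v localhost:8000/v2/health/ready&lt;/code&gt;&lt;/pre&gt;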
&lt;p&gt;Using the nvtop utility, we can verify that all GPUs are ready to accept the load:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/132/original/sh_triton_inference_server_1.png?1740580538&quot; alt=&quot;8 x A6000 Triton Inference Server examples&quot;&gt;
&lt;h2&gt;Installing the client&lt;/h2&gt;
&lt;p&gt;To access our server, we’ll need to generate an appropriate request using the client included in the SDK. We can download this SDK as a Docker container:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker pull nvcr.io/nvidia/tritonserver:25.01-py3-sdk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the container in interactive mode to access the console:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -it --gpus=all --rm --net=host nvcr.io/nvidia/tritonserver:25.01-py3-sdk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s test this with the DenseNet model in ONNX format, using the INCEPTION method to preprocess and analyze image &lt;b translate=&quot;no&quot;&gt;mug.jpg&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The client will contact the server, which will create a batch and process it using the container’s available GPUs. Here’s the output:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;Request 0, batch size 1
Image &#39;/workspace/images/mug.jpg&#39;:
   15.349562 (504) = COFFEE MUG
   13.227461 (968) = CUP
   10.424891 (505) = COFFEEPOT&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Preparing the repository&lt;/h2&gt;
&lt;p&gt;For Triton™ to manage models correctly, you must prepare the repository in a specific way. Here’s the directory structure:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;model_repository/ 
        └── your_model/ 
                ├── config.pbtxt 
                └── 1/
                    └── model.*&lt;/pre&gt;
&lt;p&gt;Each model needs its own directory containing a &lt;b translate=&quot;no&quot;&gt;config.pbtxt&lt;/b&gt; configuration file with its description. Here’s an example:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;name: &quot;Test&quot;
platform: &quot;pytorch_libtorch&quot;
max_batch_size: 8
input [
  {
    name: &quot;INPUT_0&quot;
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: &quot;OUTPUT_0&quot;
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, a model named &lt;b translate=&quot;no&quot;&gt;Test&lt;/b&gt; will run on the PyTorch backend. The &lt;b translate=&quot;no&quot;&gt;max_batch_size&lt;/b&gt; parameter sets the maximum number of items that can be processed simultaneously, enabling efficient load balancing across resources. Setting this value to zero disables batching, causing the model to process requests sequentially.&lt;/p&gt;
&lt;p&gt;The model accepts one input and produces one output, both using the FP32 number type. The parameters must match the model’s requirements exactly. For image processing, a typical dimension specification is &lt;b translate=&quot;no&quot;&gt;dims: [ 3, 224, 224 ]&lt;/b&gt;, where:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;3&lt;/b&gt; - number of color channels (RGB);&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;224&lt;/b&gt; - image height in pixels;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;224&lt;/b&gt; - image width in pixels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The output &lt;b translate=&quot;no&quot;&gt;dims: [ 1000 ]&lt;/b&gt; represents a one-dimensional vector of 1000 elements, which suits image classification tasks. To determine the correct dimensionality for your model, consult its documentation. If the configuration file is incomplete, Triton™ will attempt to generate any missing parameters automatically.&lt;/p&gt;
&lt;h2&gt;Launching a custom model&lt;/h2&gt;
&lt;p&gt;Let’s launch the inference of the distilled DeepSeek-R1 model we &lt;a href=&quot;https://www.leadergpu.com/catalog/613-deepseek-r1-future-of-llms&quot;&gt;discussed&lt;/a&gt; earlier. First, we’ll create the necessary directory structure:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir -p ~/model_repository/deepseek/1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the model directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/model_repository/deepseek&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a configuration file &lt;b translate=&quot;no&quot;&gt;config.pbtxt&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano config.pbtxt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste the following:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;# Copyright 2023, NVIDIA CORPORATION &amp; AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS&#39;&#39; AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
    
# Note: You do not need to change any fields in this configuration.
    
backend: &quot;vllm&quot;
    
# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file by pressing &lt;b translate=&quot;no&quot;&gt;Ctrl + O&lt;/b&gt;, then exit the editor with &lt;b translate=&quot;no&quot;&gt;Ctrl + X&lt;/b&gt;. Navigate to the directory &lt;b translate=&quot;no&quot;&gt;1&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a model configuration file &lt;b translate=&quot;no&quot;&gt;model.json&lt;/b&gt; with the following parameters:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;{
    &quot;model&quot;:&quot;deepseek-ai/DeepSeek-R1-Distill-Llama-8B&quot;,
    &quot;disable_log_requests&quot;: true,
    &quot;gpu_memory_utilization&quot;: 0.9,
    &quot;enforce_eager&quot;: true
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the &lt;b translate=&quot;no&quot;&gt;gpu_memory_utilization&lt;/b&gt; value varies by GPU and should be determined experimentally. For this guide, we’ll use &lt;b translate=&quot;no&quot;&gt;0.9&lt;/b&gt;. Your directory structure inside &lt;b translate=&quot;no&quot;&gt;~/model_repository&lt;/b&gt; should now look like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;└── deepseek
        ├── 1
        │   └── model.json
        └── config.pbtxt&lt;/pre&gt;
&lt;p&gt;Set the &lt;b translate=&quot;no&quot;&gt;LOCAL_MODEL_REPOSITORY&lt;/b&gt; variable for convenience:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;LOCAL_MODEL_REPOSITORY=~/model_repository/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the inference server with this command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run --rm -it --net host --shm-size=2g  --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v $LOCAL_MODEL_REPOSITORY:/opt/tritonserver/model_repository  nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3 tritonserver --model-repository=model_repository/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s what each parameter does:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--rm&lt;/b&gt; automatically removes the container after stopping;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-it&lt;/b&gt; runs the container in interactive mode with terminal output;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--net host&lt;/b&gt; uses the host’s network stack instead of container isolation;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--shm-size=2g&lt;/b&gt; sets shared memory to 2 GB;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--ulimit memlock=-1&lt;/b&gt; removes memory lock limit;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--ulimit stack=67108864&lt;/b&gt; sets stack size to 64 MB;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--gpus all&lt;/b&gt; enables access to all server GPUs;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;-v $LOCAL_MODEL_REPOSITORY:/opt/tritonserver/model_repository&lt;/b&gt; mounts the local model directory in the container;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3&lt;/b&gt; specifies the container with vLLM backend support;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;tritonserver --model-repository=model_repository/&lt;/b&gt; launches the Triton™ Inference Server with the model repository located at model_repository/.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the server by sending a request with &lt;b translate=&quot;no&quot;&gt;curl&lt;/b&gt;, using a simple prompt and a 4096 token response limit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -X POST localhost:8000/v2/models/deepseek/generate -d &#39;{&quot;text_input&quot;: &quot;Tell me about the Netherlands?&quot;, &quot;max_tokens&quot;: 4096}&#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server successfully receives and processes the request.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/133/original/sh_triton_inference_server_2.png?1740580601&quot; alt=&quot;Triton Inference Server processed the test request&quot;&gt;
&lt;p&gt;The internal Triton™ task scheduler handles all incoming requests when the server is under load.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Triton™ Inference Server excels at deploying machine learning models in production by efficiently distributing requests across available GPUs. This maximizes the use of rented server resources and reduces computing infrastructure costs. The software works with various backends, including vLLM for large language models.&lt;/p&gt;
&lt;p&gt;Since it installs as a Docker container, you can easily integrate it into any modern CI/CD pipeline. Try it yourself by &lt;a href=&quot;https://www.leadergpu.com/#chose-best&quot;&gt;renting a server&lt;/a&gt; from LeaderGPU.&lt;/p&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/134/original/il_triton_inference_server.png?1740583888"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 26 Feb 2025 16:40:21 +0100</pubDate>
      <guid isPermaLink="false">614</guid>
      <dc:date>2025-02-26 16:40:21 +0100</dc:date>
    </item>
    <item>
      <title>DeepSeek-R1: future of LLMs</title>
      <link>https://www.leadergpu.com/catalog/613-deepseek-r1-future-of-llms</link>
      <description>&lt;p&gt;While generative neural networks have been developing rapidly, their progress in recent years has remained fairly steady. This changed with the arrival of DeepSeek, a Chinese neural network that not only impacted the stock market but also captured the attention of developers and researchers worldwide. In contrast to other major projects, DeepSeek’s code was released under the permissive MIT license. This move towards open source earned praise from the community, who eagerly began exploring the new model’s capabilities.&lt;/p&gt;
&lt;p&gt;The most impressive aspect was that training this new neural network reportedly cost 20 times less than competitors offering similar quality. The model required just 55 days and $5.6 million to train. When DeepSeek was released, it triggered one of the largest single-day drops in US stock market history. Though markets eventually stabilized, the impact was significant.&lt;/p&gt;
&lt;p&gt;This article will examine how accurately media headlines reflect reality and explore which LeaderGPU configurations are suitable for installing this neural network yourself.&lt;/p&gt;
&lt;h2&gt;Architectural features&lt;/h2&gt;
&lt;p&gt;DeepSeek has chosen a path of maximum optimization, unsurprising given China’s U.S. export restrictions. These restrictions prevent the country from officially using the most advanced GPU models for AI development.&lt;/p&gt;
&lt;p&gt;The model employs Multi Token Prediction (MTP) technology, which predicts multiple tokens in a single inference step instead of just one. This works through parallel token decoding combined with special masked layers that maintain autoregressivity.&lt;/p&gt;
&lt;p&gt;MTP testing has shown remarkable results, increasing generation speeds by 2-4 times compared to traditional methods. The technology’s excellent scalability makes it valuable for current and future natural language processing applications.&lt;/p&gt;
&lt;p&gt;The Multi-Head Latent Attention (MLA) model features an enhanced attention mechanism. As the model builds long chains of reasoning, it maintains focused attention on the context at each stage. This enhancement improves its handling of abstract concepts and text dependencies.&lt;/p&gt;
&lt;p&gt;MLA’s key feature is its ability to dynamically adjust attention weights across different abstraction levels. When processing complex queries, MLA examines data from multiple perspectives: word meanings, sentence structures, and overall context. These perspectives form distinct layers that influence the final output. To maintain clarity, MLA carefully balances each layer’s impact while staying focused on the primary task.&lt;/p&gt;
&lt;p&gt;DeepSeek’s developers incorporated Mixture of Experts (MoE) technology into the model. It contains 256 pre-trained expert neural networks, each specialized for different tasks. The system activates 8 of these networks for each token input, enabling efficient data processing without increasing computational costs.&lt;/p&gt;
&lt;p&gt;In the full model with 671B parameters, only 37B are activated for each token. The model intelligently selects the most relevant parameters for processing each incoming token. This efficient optimization saves computational resources while maintaining high performance.&lt;/p&gt;
&lt;p&gt;A crucial feature of any neural network chatbot is its context window length. Llama 2 has a context limit of 4,096 tokens, GPT-3.5 handles 16,384 tokens, while GPT-4 and DeepSeek can process up to 128,000 tokens (about 100,000 words, equivalent to 300 pages of typewritten text).&lt;/p&gt;
&lt;h2&gt;R stands for Reasoning&lt;/h2&gt;
&lt;p&gt;DeepSeek-R1 has acquired a reasoning mechanism similar to OpenAI o1, enabling it to handle complex tasks more efficiently and accurately. Instead of providing immediate answers, the model expands the context by generating step-by-step reasoning in small paragraphs. This approach enhances the neural network’s ability to identify complex data relationships, resulting in more comprehensive and precise answers.&lt;/p&gt;
&lt;p&gt;When faced with a complex task, DeepSeek uses its reasoning mechanism to break down the problem into components and analyze each one separately. The model then synthesizes these findings to generate a user response. While this appears to be an ideal approach for neural networks, it comes with significant challenges.&lt;/p&gt;
&lt;p&gt;All modern LLMs share a concerning trait - artificial hallucinations. When presented with a question it cannot answer, instead of acknowledging its limitations, the model might generate fictional answers supported by made-up facts.&lt;/p&gt;
&lt;p&gt;When applied to a reasoning neural network, these hallucinations could compromise the thought process by basing conclusions on fictional rather than factual information. This could lead to incorrect conclusions - a challenge that neural network researchers and developers will need to address in the future.&lt;/p&gt;
&lt;h2&gt;VRAM consumption&lt;/h2&gt;
&lt;p&gt;Let’s explore how to run and test DeepSeek R1 on a dedicated server, focusing on the GPU video memory requirements.&lt;/p&gt;
&lt;table style=&quot;margin: auto;&quot; width=&quot;50%&quot;&gt;
    &lt;tr&gt;
        &lt;th&gt;Model&lt;/th&gt;
        &lt;th&gt;VRAM (MB)&lt;/th&gt;
        &lt;th&gt;Model size (GB)&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:1.5b&lt;/td&gt;
        &lt;td&gt;1,952&lt;/td&gt;
        &lt;td&gt;1.1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:7b&lt;/td&gt;
        &lt;td&gt;5,604&lt;/td&gt;
        &lt;td&gt;4.7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:8b&lt;/td&gt;
        &lt;td&gt;6,482&lt;/td&gt;
        &lt;td&gt;4.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:14b&lt;/td&gt;
        &lt;td&gt;10,880&lt;/td&gt;
        &lt;td&gt;9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:32b&lt;/td&gt;
        &lt;td&gt;21,758&lt;/td&gt;
        &lt;td&gt;20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:70b&lt;/td&gt;
        &lt;td&gt;39,284&lt;/td&gt;
        &lt;td&gt;43&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;deepseek-r1:671b&lt;/td&gt;
        &lt;td&gt;470,091&lt;/td&gt;
        &lt;td&gt;404&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The first three options (1.5b, 7b, 8b) are basic models that can handle most tasks efficiently. These models run smoothly on any consumer GPU with 6-8 GB of video memory. The mid-tier versions (14b and 32b) are ideal for professional tasks but require more VRAM. The largest models (70b and 671b) require specialized GPUs and are primarily used for research and industrial applications.&lt;/p&gt;
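&lt;p&gt;The model tags in the table follow the Ollama naming scheme. Assuming Ollama is installed on the server, a mid-tier model can be pulled and started with a single command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ollama run deepseek-r1:14b&lt;/code&gt;&lt;/pre&gt;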
&lt;h2&gt;Server selection&lt;/h2&gt;
&lt;p&gt;To help you choose a server for DeepSeek inference, here are the ideal LeaderGPU configurations for each model group:&lt;/p&gt;
&lt;h3&gt;1.5b / 7b / 8b / 14b / 32b / 70b&lt;/h3&gt;
&lt;p&gt;For this group, any server with the following GPU types will be suitable. Most LeaderGPU servers will run these neural networks without any issues. Performance will mainly depend on the number of CUDA® cores. We recommend servers with multiple GPUs, such as:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/?fltr_type%5B%5D=a40#filter_block&quot;&gt;A40&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/?fltr_type%5B%5D=l20#filter_block&quot;&gt;L20&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;671b&lt;/h3&gt;
&lt;p&gt;Now for the most challenging case: how do you run inference on a model with a 404 GB base size? This means approximately 470 GB of video memory will be required. LeaderGPU offers multiple configurations with the following GPUs capable of handling this load:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/?fltr_type%5B%5D=a100#filter_block&quot;&gt;A100&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/?fltr_type%5B%5D=h100#filter_block&quot;&gt;H100&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both configurations handle the model load efficiently, distributing it evenly across multiple GPUs. For example, this is what a server with 8xH100 looks like after loading the deepseek-r1:671b model:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/126/original/sh_deepseek-r1_future_of_LLMs_1.png?1739199426&quot; alt=&quot;deepseek-r1:671b on 8xH100&quot;&gt;
&lt;p&gt;The computational load balances dynamically across GPUs, while high-speed NVLink® interconnects prevent data exchange bottlenecks, ensuring maximum performance.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;DeepSeek-R1 combines many innovative technologies like Multi Token Prediction, Multi-Head Latent Attention, and Mixture of Experts into one significant model. This open-source software demonstrates that LLMs can be developed more efficiently with fewer computational resources. The model comes in versions ranging from the small 1.5b to the huge 671b, the largest of which requires specialized hardware with multiple high-end GPUs working in parallel.&lt;/p&gt;
&lt;p&gt;By renting a server from LeaderGPU for DeepSeek-R1 inference, you get a wide range of configurations, reliability, and fault tolerance. Our technical support team will help you with any problems or questions, while the automatic operating system installation reduces deployment time.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.leadergpu.com/#chose-best&quot;&gt;Choose your LeaderGPU server&lt;/a&gt; and discover the possibilities that open up when using modern neural network models. If you have any questions, don’t hesitate to ask them in our chat or &lt;a href=&quot;mailto:info@leadergpu.com&quot;&gt;email&lt;/a&gt;.&lt;/p&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/125/original/il_deepseek-r1_future_of_LLMs.png?1739198303"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 19 Feb 2025 15:10:33 +0100</pubDate>
      <guid isPermaLink="false">613</guid>
      <dc:date>2025-02-19 15:10:33 +0100</dc:date>
    </item>
    <item>
      <title>Intel Habana Gaudi 2: install and test</title>
      <link>https://www.leadergpu.com/catalog/611-intel-habana-gaudi-2-install-and-test</link>
      <description>&lt;p&gt;Before you start installing the Gaudi 2 accelerator software, there is one important feature worth mentioning. We are accustomed to the fact that neural network training and inference are performed on GPUs. However, Intel Habana Gaudi 2 is very different from a GPU and represents a separate class of devices designed solely for accelerating AI tasks.&lt;/p&gt;
&lt;p&gt;Many familiar applications and frameworks will not work without first preparing the operating system and, in some cases, without a special &lt;a href=&quot;https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html&quot;&gt;GPU Migration Toolkit&lt;/a&gt;. This explains the large number of preparatory steps that we describe in this article. Let’s start in order.&lt;/p&gt;
&lt;h2&gt;Step 1. Install SynapseAI Software Stack&lt;/h2&gt;
&lt;p&gt;To start working with Intel Habana Gaudi 2 accelerators, you need to install the SynapseAI stack. It includes a special graph compiler that transforms the topology of the neural network model to effectively optimize execution on Gaudi architecture, API libraries for horizontal scaling, as well as a separate SDK for creating high-performance algorithms and machine learning models.&lt;/p&gt;
&lt;p&gt;Separately, we note that SynapseAI is the component that bridges popular frameworks like PyTorch and TensorFlow with the Gaudi 2 AI accelerators. This lets you work with familiar abstractions while Gaudi 2 independently optimizes the calculations. Specific operators for which the accelerators lack hardware support are executed on the CPU.&lt;/p&gt;
&lt;p&gt;To simplify the installation of individual SynapseAI components, a convenient shell script has been created. Let’s download it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget -nv https://vault.habana.ai/artifactory/gaudi-installer/latest/habanalabs-installer.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make the file executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;chmod +x habanalabs-installer.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./habanalabs-installer.sh install --type base&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Follow the system prompts during installation. A detailed report is written to a log file, showing which packages were installed and whether the accelerators were successfully found and initialized.&lt;/p&gt;
&lt;p&gt;The log is located at /var/log/habana_logs/install-YYYY-MM-DD-HH-MM-SS.log:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;[  +3.881647] habanalabs hl5: Found GAUDI2 device with 96GB DRAM
[  +0.008145] habanalabs hl0: Found GAUDI2 device with 96GB DRAM
[  +0.032034] habanalabs hl3: Found GAUDI2 device with 96GB DRAM
[  +0.002376] habanalabs hl4: Found GAUDI2 device with 96GB DRAM
[  +0.005174] habanalabs hl1: Found GAUDI2 device with 96GB DRAM
[  +0.000390] habanalabs hl2: Found GAUDI2 device with 96GB DRAM
[  +0.007065] habanalabs hl7: Found GAUDI2 device with 96GB DRAM
[  +0.006256] habanalabs hl6: Found GAUDI2 device with 96GB DRAM&lt;/pre&gt;
&lt;p&gt;Just as the nvidia-smi utility provides information about installed GPUs and running compute processes, SynapseAI has a similar program. You can run it to get a report on the current state of the Gaudi 2 AI accelerators:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;hl-smi&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/981/original/sh_intel_habana_gaudi_2_install_and_test_1.png?1714555709&quot; alt=&quot;hl-smi screenshot&quot;&gt;
&lt;h2&gt;Step 2. TensorFlow test&lt;/h2&gt;
&lt;p&gt;TensorFlow is one of the most popular platforms for machine learning. Using the same installation script, you can install a pre-built version of TensorFlow with support for Gaudi 2 accelerators. Let’s start by installing the general dependencies:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./habanalabs-installer.sh install -t dependencies&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we’ll install dependencies for TensorFlow:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./habanalabs-installer.sh install -t dependencies-tensorflow&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install the TensorFlow platform inside a Python virtual environment (venv):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./habanalabs-installer.sh install --type tensorflow --venv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s activate the created virtual environment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;source habanalabs-venv/bin/activate&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a simple Python code example that will utilize the capabilities of Gaudi 2 accelerators:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;nano example.py&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import os
# Import Habana Torch Library
import habana_frameworks.torch.core as htcore
class SimpleModel(nn.Module):
   def __init__(self):
       super(SimpleModel, self).__init__()
       self.fc1   = nn.Linear(784, 256)
       self.fc2   = nn.Linear(256, 64)
       self.fc3   = nn.Linear(64, 10)
   def forward(self, x):
       out = x.view(-1,28*28)
       out = F.relu(self.fc1(out))
       out = F.relu(self.fc2(out))
       out = self.fc3(out)
       return out
def train(net,criterion,optimizer,trainloader,device):
   net.train()
   train_loss = 0.0
   correct = 0
   total = 0
   for batch_idx, (data, targets) in enumerate(trainloader):
       data, targets = data.to(device), targets.to(device)
       optimizer.zero_grad()
       outputs = net(data)
       loss = criterion(outputs, targets)
       loss.backward()
       # API call to trigger execution
       htcore.mark_step()
       optimizer.step()
       # API call to trigger execution
       htcore.mark_step()
       train_loss += loss.item()
       _, predicted = outputs.max(1)
       total += targets.size(0)
       correct += predicted.eq(targets).sum().item()
   train_loss = train_loss/(batch_idx+1)
   train_acc = 100.0*(correct/total)
   print(&quot;Training loss is {} and training accuracy is {}&quot;.format(train_loss,train_acc))
def test(net,criterion,testloader,device):
   net.eval()
   test_loss = 0
   correct = 0
   total = 0
   with torch.no_grad():
       for batch_idx, (data, targets) in enumerate(testloader):
           data, targets = data.to(device), targets.to(device)
           outputs = net(data)
           loss = criterion(outputs, targets)
           # API call to trigger execution
           htcore.mark_step()
           test_loss += loss.item()
           _, predicted = outputs.max(1)
           total += targets.size(0)
           correct += predicted.eq(targets).sum().item()
   test_loss = test_loss/(batch_idx+1)
   test_acc = 100.0*(correct/total)
   print(&quot;Testing loss is {} and testing accuracy is {}&quot;.format(test_loss,test_acc))
def main():
   epochs = 20
   batch_size = 128
   lr = 0.01
   milestones = [10,15]
   load_path = &#39;./data&#39;
   save_path = &#39;./checkpoints&#39;
   if(not os.path.exists(save_path)):
       os.makedirs(save_path)
   # Target the Gaudi HPU device
   device = torch.device(&quot;hpu&quot;)
   # Data
   transform = transforms.Compose([
       transforms.ToTensor(),
   ])
   trainset = torchvision.datasets.MNIST(root=load_path, train=True,
                                           download=True, transform=transform)
   trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                           shuffle=True, num_workers=2)
   testset = torchvision.datasets.MNIST(root=load_path, train=False,
                                       download=True, transform=transform)
   testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                           shuffle=False, num_workers=2)
   net = SimpleModel()
   net.to(device)
   criterion = nn.CrossEntropyLoss()
   optimizer = optim.SGD(net.parameters(), lr=lr,
                       momentum=0.9, weight_decay=5e-4)
   scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
   for epoch in range(1, epochs+1):
       print(&quot;=====================================================================&quot;)
       print(&quot;Epoch : {}&quot;.format(epoch))
       train(net,criterion,optimizer,trainloader,device)
       test(net,criterion,testloader,device)
       torch.save(net.state_dict(), os.path.join(save_path,&#39;epoch_{}.pth&#39;.format(epoch)))
       scheduler.step()
if __name__ == &#39;__main__&#39;:
   main()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, execute the application:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;python3 example.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To exit the virtual environment, run the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;deactivate&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3. Clone training repository&lt;/h2&gt;
&lt;p&gt;Clone the repository with the MLPerf code:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/mlcommons/training_results_v3.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a separate directory that will be used by the Docker container with MLPerf:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir -p mlperf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Change the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd mlperf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s export some environment variables:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export MLPERF_DIR=/home/usergpu/mlperf&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export SCRATCH_DIR=/home/usergpu/mlperf/scratch&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export DATASETS_DIR=/home/usergpu/mlperf/datasets&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create new directories using the variables you just defined:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir -p $MLPERF_DIR/Habana&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir -p $SCRATCH_DIR&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir -p $DATASETS_DIR&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the benchmark app to $MLPERF_DIR/Habana:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cp -R training_results_v3.0/Intel-HabanaLabs/benchmarks/ $MLPERF_DIR/Habana&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export another variable that stores the name and tag of the Docker image to be pulled:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export MLPERF_DOCKER_IMAGE=vault.habana.ai/gaudi-docker-mlperf/ver3.1/pytorch-installer-2.0.1:1.13.99-41&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4. Install Docker&lt;/h2&gt;
&lt;p&gt;Our instance runs Ubuntu Linux 22.04 LTS, which does not ship with Docker preinstalled. So, before downloading and running containers, you need to install Docker. Let’s refresh the package cache and install some basic packages that you’ll need later:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install apt-transport-https ca-certificates curl software-properties-common&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To install Docker, you need to add the project’s digitally signed repository. Download the GPG signing key and add it to the operating system’s key store:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Docker can run on platforms with various architectures. The following command will detect your server’s architecture and add the corresponding repository line to the APT package manager list:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable&quot; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package cache, check the installation candidate, and install docker-ce (Docker Community Edition):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; apt-cache policy docker-ce &amp;&amp; sudo apt install docker-ce&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, check that Docker daemon is up and running:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status docker&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5. Run Docker container&lt;/h2&gt;
&lt;p&gt;Let’s launch the container in privileged mode using the previously specified variables:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run --privileged --security-opt seccomp=unconfined \
  --name mlperf3.0 -td                    \
  -v /dev:/dev                            \
  --device=/dev:/dev                      \
  -e LOG_LEVEL_ALL=6                      \
  -v /sys/kernel/debug:/sys/kernel/debug  \
  -v /tmp:/tmp                            \
  -v $MLPERF_DIR:/root/MLPERF             \
  -v $SCRATCH_DIR:/root/scratch           \
  -v $DATASETS_DIR:/root/datasets/        \
  --cap-add=sys_nice --cap-add=SYS_PTRACE \
  --user root --workdir=/root --net=host  \
  --ulimit memlock=-1:-1 $MLPERF_DOCKER_IMAGE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For convenience, you can enable SSH access to the container’s terminal by starting its SSH service:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker exec mlperf3.0 bash -c &quot;service ssh start&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To open a command shell (bash) in the current session, run the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker exec -it mlperf3.0 bash&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 6. Prepare a dataset&lt;/h2&gt;
&lt;p&gt;To run the BERT benchmark from MLPerf, you need a prepared dataset. The optimal method is to generate the dataset from preloaded data. The MLPerf repository includes a dedicated script, prepare_data.sh, which requires a specific set of packages to run. Let’s navigate to the following directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /root/MLPERF/Habana/benchmarks/bert/implementations/PyTorch&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install all required packages using the pre-generated list and the pip package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set the PYTORCH_BERT_DATA variable to instruct the script where to store data:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export PYTORCH_BERT_DATA=/root/datasets/pytorch_bert&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;bash input_preprocessing/prepare_data.sh -o $PYTORCH_BERT_DATA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generation procedure is quite long and can take several hours. Please be patient and do not interrupt the process. If you plan to disconnect from the SSH session, launch the &lt;a href=&quot;https://www.geeksforgeeks.org/screen-command-in-linux-with-examples/&quot;&gt;screen&lt;/a&gt; utility before starting the Docker container so the process keeps running after you disconnect.&lt;/p&gt;
&lt;h2&gt;Step 7. Pack the dataset&lt;/h2&gt;
&lt;p&gt;The next step is to “cut” the dataset into equal pieces for the subsequent MLPerf run. Let’s create a separate directory for the packed data:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir $PYTORCH_BERT_DATA/packed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the packing script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 pack_pretraining_data_pytorch.py \
  --input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed \
  --output_dir=$PYTORCH_BERT_DATA/packed \
  --max_predictions_per_seq=76&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 8. Run a test&lt;/h2&gt;
&lt;p&gt;Now that the dataset is prepared, it’s time to run the test. However, some preparation is still required: the BERT test authors left hard-coded values in the launch script that will interfere with the test execution. First, rename the following directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mv $PYTORCH_BERT_DATA/packed $PYTORCH_BERT_DATA/packed_data_500_pt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Change the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /root/MLPERF/Habana/benchmarks/bert/implementations/HLS-Gaudi2-PT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the GNU Nano editor isn’t installed inside the container, it must be installed separately. Alternatively, you can use the built-in Vi editor:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;apt update &amp;&amp; apt -y install nano&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, edit the test launch script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano launch_bert_pytorch.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find the first line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;DATA_ROOT=/mnt/weka/data/pytorch/bert_mlperf/packed_data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace with the following:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;DATA_ROOT=/root/datasets/pytorch_bert&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find the second line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;INPUT_DIR=$DATA_ROOT/packed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace with the following:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;INPUT_DIR=$DATA_ROOT/packed_data_500_pt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and exit.&lt;/p&gt;
&lt;p&gt;The test code includes a gradient clipping function that keeps gradients from exceeding certain values, preventing them from growing uncontrollably (the “exploding gradients” problem). For reasons unknown to us, this function is absent from the PyTorch build used in the container, causing the test to terminate abnormally during the warm-up stage.&lt;/p&gt;
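&lt;p&gt;For reference, here is a minimal sketch of what such a limiter does conceptually, using the standard torch.nn.utils.clip_grad_norm_ utility on a toy model. It illustrates global-norm gradient clipping in plain PyTorch and is not the Habana-specific fused implementation removed below:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
import torch.nn as nn

# Toy model and a single training step, for illustration only.
model = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs, targets = torch.randn(4, 10), torch.randn(4, 2)
loss = criterion(model(inputs), targets)
loss.backward()

# Rescale all gradients so that their global L2 norm does not
# exceed 1.0, preventing a single step from blowing up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()&lt;/code&gt;&lt;/pre&gt;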
&lt;p&gt;A potential workaround is to temporarily remove this function from the code in the fastddp.py file. To do this, open the file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano ../PyTorch/fastddp.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find and comment out the following three lines of code using the # (hash) symbol so they look like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;#from habana_frameworks.torch import _hpex_C
#    clip_global_grad_norm = _hpex_C.fused_lamb_norm(grads, 1.0)
#    _fusion_buffer.div_((clip_global_grad_norm * _all_reduce_group_size).to(_fusion_buffer.dtype))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and exit, then change back to the launch directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ../HLS-Gaudi2-PT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, run the script. It will take approximately 20 minutes to complete:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./launch_bert_pytorch.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/606-what-is-knowledge-distillation&quot;&gt;What is Knowledge Distillation&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/607-advantages-and-disadvantages-of-gpu-sharing&quot;&gt;Advantages and Disadvantages of GPU sharing&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/609-nvidia-rtx-50-expectations-and-reality&quot;&gt;NVIDIA® RTX™ 50: expectations vs reality&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/980/original/il_intel_habana_gaudi_2_install_and_test.png?1714555676"
        length="0"
        type="image/jpeg"/>
      <pubDate>Thu, 23 Jan 2025 13:41:09 +0100</pubDate>
      <guid isPermaLink="false">611</guid>
      <dc:date>2025-01-23 13:41:09 +0100</dc:date>
    </item>
    <item>
      <title>NVIDIA® RTX™ 50: expectations and reality</title>
      <link>https://www.leadergpu.com/catalog/609-nvidia-rtx-50-expectations-and-reality</link>
      <description>&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;The highlight of CES 2025 was NVIDIA® CEO Jensen Huang’s speech. The revelation of new GPU specifications within minutes caught many off guard. In this article, we’ll examine how expert predictions matched the actual announcements.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Let’s look at the lineup first. The RTX™ 40 series launched with 6 models, ranging from the RTX™ 4060 to the RTX™ 4090. While many expected a similar range for the RTX™ 50 series, that didn’t happen. Instead, the RTX™ 50 family includes just 4 models: RTX™ 5070, RTX™ 5070 Ti, RTX™ 5080, and RTX™ 5090. We may see both the RTX™ 5050 and RTX™ 5060 in the future, but no official sources have confirmed these graphics cards yet.&lt;/p&gt;
&lt;h2&gt;Technological process&lt;/h2&gt;
&lt;p&gt;Moore’s law, the empirical observation that “the number of transistors in an integrated circuit doubles about every two years”, is often said to be no longer relevant to chip performance. Since 2022, Jensen Huang has repeatedly declared Moore’s law dead. Instead, he proposed a new concept that emphasizes the simultaneous development of architecture, microchips, software libraries, and algorithms.&lt;/p&gt;
&lt;p&gt;This shift allows us to focus on overall system performance rather than transistor count alone. The concept of computing efficiency has sparked ongoing discussions in the tech community. While views on this topic vary, the industry clearly faces both physical and economic barriers to further miniaturization.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/098/original/sh_nvidia_rtx_50_expectation_and_reality_1.png?1736938750&quot; alt=&quot;RTX 5090 vs RTX 4090 - Core clock and boost clock&quot;&gt;
&lt;p&gt;Let’s take a look at the new generation GPU process technology. The presentation didn’t specifically mention this, but all previous generation cards were built on the 4N process. &lt;b translate=&quot;no&quot;&gt;The RTX™ 50 series uses a different 4NP process technology&lt;/b&gt;. At the same time, it’s important to understand that 4N and 4NP are just marketing names. The transistors themselves remain 5 nm in size.&lt;/p&gt;
&lt;p&gt;The improved 4NP process technology primarily enables higher transistor density on the chip and faster clock speeds. While experts predicted that the RTX™ 50 would use the same process technology as the RTX™ 40, they were technically incorrect, though not by much, since the transistor size remains unchanged and TSMC continues as the manufacturer.&lt;/p&gt;
&lt;h2&gt;Number of cores&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/099/original/sh_nvidia_rtx_50_expectation_and_reality_2.png?1736938790&quot; alt=&quot;RTX 5090 vs RTX 4090 - CUDA cores count&quot;&gt;
&lt;p&gt;Prior to the RTX™ 50 series release, numerous data leaks revealed the GPU’s basic characteristics. Initial insider reports from July 2024 suggested the flagship would feature 24,576 cores, 192 Ray-tracing cores, and 768 Tensor cores. However, subsequent leaks adjusted these numbers to more realistic values.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/100/original/sh_nvidia_rtx_50_expectation_and_reality_3.png?1736938814&quot; alt=&quot;RTX 5090 vs RTX 4090 - AI cores count&quot;&gt;
&lt;p&gt;The final RTX™ 5090 shipped with &lt;b translate=&quot;no&quot;&gt;21,760 CUDA® cores&lt;/b&gt; (up from the RTX™ 4090’s 16,384), &lt;b translate=&quot;no&quot;&gt;170 Ray-tracing cores&lt;/b&gt;, and &lt;b translate=&quot;no&quot;&gt;680 Tensor cores&lt;/b&gt;. This aligns with the company’s recent strategy of boosting performance not just through increased transistor count, but through comprehensive architectural optimization.&lt;/p&gt;
&lt;h2&gt;Memory&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/101/original/sh_nvidia_rtx_50_expectation_and_reality_4.png?1736938865&quot; alt=&quot;RTX 5090 vs RTX 4090 - Memory capacity&quot;&gt;
&lt;p&gt;The new GPUs’ use of GDDR7 memory came as no surprise. Industry experts had predicted this move in 2024 after the three major manufacturers (Samsung, Micron, and SK hynix) showcased their GDDR7 prototypes in succession. NVIDIA® was generous with memory distribution: the base &lt;b translate=&quot;no&quot;&gt;RTX™ 5070&lt;/b&gt; model features &lt;b translate=&quot;no&quot;&gt;12 GB GDDR7&lt;/b&gt; on a &lt;b translate=&quot;no&quot;&gt;192-bit&lt;/b&gt; bus, while the &lt;b translate=&quot;no&quot;&gt;RTX™ 5070 Ti and RTX™ 5080&lt;/b&gt; both carry &lt;b translate=&quot;no&quot;&gt;16 GB GDDR7&lt;/b&gt; on a &lt;b translate=&quot;no&quot;&gt;256-bit&lt;/b&gt; bus. At the top end, the flagship &lt;b translate=&quot;no&quot;&gt;RTX™ 5090&lt;/b&gt; comes with a massive &lt;b translate=&quot;no&quot;&gt;32 GB GDDR7&lt;/b&gt; on a &lt;b translate=&quot;no&quot;&gt;512-bit&lt;/b&gt; bus.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/102/original/sh_nvidia_rtx_50_expectation_and_reality_5.png?1736938884&quot; alt=&quot;RTX 5090 vs RTX 4090 - Memory throughput&quot;&gt;
&lt;p&gt;Experts initially predicted that the maximum throughput of this memory configuration would be 1.5 TB/s. However, reality surpassed these expectations, &lt;b translate=&quot;no&quot;&gt;achieving a throughput of 1.7 TB/s&lt;/b&gt;. This dramatic improvement primarily benefits the GPU’s AI processing capabilities rather than gaming performance. The new generation’s combination of high capacity and fast memory is particularly valuable for large language models and generative neural networks.&lt;/p&gt;
&lt;h2&gt;Technologies&lt;/h2&gt;
&lt;h3&gt;For gamers&lt;/h3&gt;
&lt;p&gt;Real-time ray tracing has become one of the most revolutionary GPU technologies, marking the beginning of the RTX™ line. For many consumers, this feature has been a key factor in their purchase decisions. In RTX™ 50 series cards, DLSS (Deep Learning Super Sampling) version 4 may play an equally important role. This technology significantly boosts GPU performance in games through its hybrid frame rendering approach.&lt;/p&gt;
&lt;p&gt;With DLSS enabled, instead of rendering every frame conventionally, some frames are generated in real time using AI. While early versions of this technology could only upscale frames to higher resolutions, DLSS 3 introduced a more advanced capability: for every conventionally rendered frame, it can generate an additional AI-created frame.&lt;/p&gt;
&lt;p&gt;DLSS 4 will generate three AI-powered frames for every traditionally rendered frame. &lt;b translate=&quot;no&quot;&gt;This significantly increases the frames per second (FPS) without putting heavy load on the GPU.&lt;/b&gt; For example, a game rendering 30 frames per second conventionally could display around 120 FPS. The AI analyzes object and scene movement to ensure the generated frames closely match conventionally rendered ones.&lt;/p&gt;
&lt;p&gt;This raises an important question: how do we handle input lag? Since frame generation takes time, each iteration adds to the response time. A smooth picture with slow response to player actions can severely impact the gaming experience. &lt;b translate=&quot;no&quot;&gt;To address this, NVIDIA® has improved their Reflex 2 technology alongside DLSS to minimize latency.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Specifically, Frame Warp was integrated into the system. This technology reduces game latency by updating rendered frames with the latest mouse input just before display. It enhances both multiplayer competition and single-player responsiveness.&lt;/p&gt;
&lt;h3&gt;For content creators&lt;/h3&gt;
&lt;p&gt;The RTX™ 50 series isn’t just for gaming. Video content creators will find significant value in these new GPUs. The flagship RTX™ 5090 model comes equipped with 3 encoders and 2 decoders, compared to the RTX™ 4090’s 2 encoders and 1 decoder. These components have been enhanced through collaborative development with industry leaders: Adobe, Blackmagic Design, ByteDance, and Wondershare. &lt;b translate=&quot;no&quot;&gt;As a result, the RTX™ 5090 renders video 60% faster than the RTX™ 4090 and four times faster than the RTX™ 3090.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Beyond raw speed improvements, the quality has also been enhanced. &lt;b translate=&quot;no&quot;&gt;The 9th generation NVENC encoder delivers 5% better quality in HEVC and AV1 tasks. The AV1 Ultra Quality mode achieves better data compression while maintaining image quality, reducing file sizes by 5%.&lt;/b&gt; This means faster video rendering on the RTX™ 5090 and less time between editing and final production.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Looking back six months, the experts’ predictions and expectations proved overly optimistic. As the release date approached, it became evident that the new GPUs would offer more than just additional computing units. &lt;b translate=&quot;no&quot;&gt;The key innovation would be new optimization and AI technologies enhancing existing frame rendering systems.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;At CES 2025, during the RTX™ 50 series presentation, a new AI era was unveiled. This vision portrayed a world where digital assistants and robots handle complex tasks. At its core would be an ecosystem combining supercomputers for AI training, affordable inference accelerators for consumer devices, and versatile software operating both locally and in the cloud. While the full extent of this future remains uncertain, one thing is clear: we stand at the threshold of turning science fiction into reality.&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;LeaderGPU remains committed to providing reliable access to these cutting-edge technologies. Order your first GPU server today and begin transforming your ideas into reality.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/606-what-is-knowledge-distillation&quot;&gt;What is Knowledge Distillation&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/607-advantages-and-disadvantages-of-gpu-sharing&quot;&gt;Advantages and Disadvantages of GPU sharing&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/611-intel-habana-gaudi-2-install-and-test&quot;&gt;Intel Habana Gaudi 2: install and test&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/098/original/sh_nvidia_rtx_50_expectation_and_reality_1.png?1736938750"
        length="0"
        type="image/jpeg"/>
      <pubDate>Thu, 23 Jan 2025 13:34:30 +0100</pubDate>
      <guid isPermaLink="false">609</guid>
      <dc:date>2025-01-23 13:34:30 +0100</dc:date>
    </item>
    <item>
      <title>Advantages and Disadvantages of GPU sharing</title>
      <link>https://www.leadergpu.com/catalog/607-advantages-and-disadvantages-of-gpu-sharing</link>
      <description>&lt;p&gt;Moore’s Law has remained relevant for nearly half a century. Processor chips continue to pack in more transistors, and technologies advance daily. As technology evolves, so does our approach to computing. The rise of certain computing tasks has significantly influenced hardware development. For instance, devices originally designed for graphics processing are now key, affordable tools for modern neural networks.&lt;/p&gt;
&lt;p&gt;The management of computing resources has also transformed. Mass services now rarely use mainframes, as they did in the 1970s and ‘80s. Instead, they prefer cloud services or building their own infrastructure. This shift has changed customer demands, with a focus on rapid, on-demand scaling and maximizing the use of allocated computing resources.&lt;/p&gt;
&lt;p&gt;Virtualization and containerization technologies emerged as solutions. Applications are now packaged in containers with all necessary libraries, simplifying deployment and scaling. However, manual management became impractical as container numbers soared into the thousands. Specialized orchestrators like Kubernetes now handle effective management and scaling. These tools have become an essential part of any modern IT infrastructure.&lt;/p&gt;
&lt;h2&gt;Server virtualization&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/075/original/sh_advantages_and_disadvantages_of_gpu_sharing_1.png?1731504150&quot; alt=&quot;Server virtualization&quot;&gt;
&lt;p&gt;Concurrently, virtualization technologies evolved, enabling the creation of isolated environments within a single physical server. Virtual machines behave identically to regular physical servers, allowing the use of standard management tools. Depending on the hypervisor, a specialized API is often included, facilitating the automation of routine procedures.&lt;/p&gt;
&lt;p&gt;However, this flexibility comes with reduced security. Attackers have shifted their focus from targeting individual virtual machines to exploiting hypervisor vulnerabilities. By gaining control of a hypervisor, attackers can access all associated virtual machines at will. Despite ongoing security improvements, modern hypervisors remain attractive targets.&lt;/p&gt;
&lt;p&gt;Traditional virtualization addresses two key issues. The first is ensuring the isolation of virtual machines from one another. Bare-metal solutions avoid this problem, since customers rent entire physical servers under their own control. For virtual machines, however, isolation is software-based at the hypervisor level. A code error or random bug can compromise this isolation, risking data leakage or corruption.&lt;/p&gt;
&lt;p&gt;The second issue concerns resource management. While it’s possible to guarantee resource allocation to specific virtual machines, managing numerous machines presents a dilemma. Resources can be underutilized, resulting in fewer virtual machines per physical server. This scenario is unprofitable for infrastructure and inevitably leads to price increases.&lt;/p&gt;
&lt;p&gt;Alternatively, you can use automatic resource management mechanisms. Although a virtual machine is nominally allocated its declared resources, in practice only the minimum currently required is provided within those limits. If the machine needs more processor time or RAM, the hypervisor will attempt to provide it, but can’t guarantee it. This situation is similar to airplane overbooking, where airlines sell more tickets than there are seats available.&lt;/p&gt;
&lt;p&gt;The logic is identical. If statistics show that about 10% of passengers don&#39;t show up for their flight, airlines can sell 10% more tickets with minimal risk. If all passengers do show up, some won’t fit on board. The airline will face minor consequences in the form of compensation but will likely continue this practice.&lt;/p&gt;
&lt;p&gt;Many infrastructure providers employ a similar strategy. Some are transparent about it, stating they don’t guarantee constant availability of computing resources but offer significantly reduced prices. Others use similar mechanisms without advertising it. They’re betting that not all customers will consistently use 100% of their server resources, and even if some do, they’ll be in the minority. Meanwhile, idle resources generate profit.&lt;/p&gt;
&lt;p&gt;In this context, bare-metal solutions have an advantage. They guarantee that allocated resources are fully managed by the customer and not shared with other users of the infrastructure provider. This eliminates scenarios where high load from a neighboring server’s user negatively impacts performance.&lt;/p&gt;
&lt;h2&gt;GPU virtualization&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/076/original/sh_advantages_and_disadvantages_of_gpu_sharing_2.png?1731504219&quot; alt=&quot;GPU virtualization&quot;&gt;
&lt;p&gt;Classic virtualization inevitably faces the challenge of emulating physical devices. To reduce this overhead, special technologies have been developed that allow virtual machines to directly access the server’s physical devices. This approach works well in many cases, but when applied to graphics processors, it creates immediate limitations. For instance, if a server has 8 GPUs installed, only 8 virtual machines can access them.&lt;/p&gt;
&lt;p&gt;To overcome this limitation, vGPU technology was invented. It divides one GPU into several logical ones, which can then be assigned to virtual machines. This allows each virtual machine to get its “piece of cake”, and their total number is no longer limited by the number of video cards installed in the server.&lt;/p&gt;
&lt;p&gt;Virtual GPUs are most commonly used when building VDI (Virtual Desktop Infrastructure) in areas where virtual machines require 3D acceleration. For example, a virtual workplace for a designer or planner typically involves graphics processing. Most applications in these fields perform calculations on both the central processor and the GPU. This hybrid approach significantly increases productivity and ensures optimal use of available computing resources.&lt;/p&gt;
&lt;p&gt;However, this technology has several drawbacks. It’s not supported by all GPUs and is only available in the server segment. Support also depends on the installed version of the operating system and the GPU driver. vGPU has a separate licensing mechanism, which substantially increases operating costs. Additionally, its software components can potentially serve as attack vectors.&lt;/p&gt;
&lt;p&gt;Recently, information &lt;a href=&quot;https://www.tomshardware.com/pc-components/gpu-drivers/nvidia-gpu-driver-addresses-eight-major-high-severity-vulnerabilities-nvidia-gpu-owners-should-update-asap&quot;&gt;was published&lt;/a&gt; about eight vulnerabilities affecting all users of NVIDIA® GPUs. Six vulnerabilities were identified in GPU drivers, and two were found in the vGPU software. These issues were quickly addressed, but the incident serves as a reminder that isolation mechanisms in such systems are not flawless. Constant monitoring and timely installation of updates remain the primary ways to ensure security.&lt;/p&gt;
&lt;p&gt;When building infrastructure to process confidential and sensitive user data, any virtualization becomes a potential risk factor. In such cases, a bare-metal approach may offer better quality and security.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Building a computing infrastructure always requires risk assessment. Key questions to consider include: Is customer data securely protected? Do the chosen technologies create additional attack vectors? How can potential vulnerabilities be isolated and eliminated? Answering these questions helps make informed choices and safeguard against future problems.&lt;/p&gt;
&lt;p&gt;At LeaderGPU, we’ve reached a clear conclusion: currently, bare-metal technology is superior in ensuring user data security while serving as an excellent foundation for building a bare-metal cloud. This approach allows our customers to maintain flexibility without taking on the added risks associated with GPU virtualization.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/606-what-is-knowledge-distillation&quot;&gt;What is Knowledge Distillation&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/609-nvidia-rtx-50-expectations-and-reality&quot;&gt;NVIDIA® RTX™ 50: expectations vs reality&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/611-intel-habana-gaudi-2-install-and-test&quot;&gt;Intel Habana Gaudi 2: install and test&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/076/original/sh_advantages_and_disadvantages_of_gpu_sharing_2.png?1731504219"
        length="0"
        type="image/jpeg"/>
      <pubDate>Thu, 23 Jan 2025 13:24:12 +0100</pubDate>
      <guid isPermaLink="false">607</guid>
      <dc:date>2025-01-23 13:24:12 +0100</dc:date>
    </item>
    <item>
      <title>What is Knowledge Distillation</title>
      <link>https://www.leadergpu.com/catalog/606-what-is-knowledge-distillation</link>
      <description>&lt;p&gt;Large Language Models (LLMs) have become an integral part of our lives through their unique capabilities. They comprehend context and generate coherent, extensive texts based on it. They can process and respond in any language while considering the cultural nuances of each.&lt;/p&gt;
&lt;p&gt;LLMs excel at complex problem-solving, programming, maintaining conversations, and more. This versatility comes from processing vast amounts of training data, hence the term &quot;large&quot;. These models can contain tens or hundreds of billions of parameters, making them resource-intensive for everyday use.&lt;/p&gt;
&lt;p&gt;Training is the most demanding process. Neural network models learn by processing enormous datasets, adjusting their internal &quot;weights&quot; to form stable connections between neurons. These connections store knowledge that the trained neural network can later use on end devices.&lt;/p&gt;
&lt;p&gt;However, most end devices lack the necessary computing power to run these models. For instance, running the full version of Llama 2 (70B parameters) requires a GPU with 48 GB of video memory, hardware that few users have at home, let alone on mobile devices.&lt;/p&gt;
&lt;p&gt;Consequently, most modern neural networks operate in cloud infrastructure rather than on portable devices, which access them through APIs. Still, device manufacturers are making progress in two ways: equipping devices with specialized computing units like NPUs, and developing methods to improve the performance of compact neural network models.&lt;/p&gt;
&lt;h2&gt;Reducing the size&lt;/h2&gt;
&lt;h3&gt;Cut off the excess&lt;/h3&gt;
&lt;p&gt;Quantization is the first and most effective method for reducing neural network size. Neural network weights typically use 32-bit floating point numbers, but we can shrink them by changing this format. Using 8-bit values cuts the size roughly fourfold, and binary weights in extreme cases reduce it even further, though this significantly decreases answer accuracy.&lt;/p&gt;
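&lt;p&gt;As a minimal sketch of the idea, here is naive symmetric per-tensor int8 quantization; production frameworks such as PyTorch’s quantization tooling add calibration and per-channel scales:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch

# Naive symmetric int8 quantization of a float32 weight tensor.
weights = torch.randn(256, 256)                # float32: 4 bytes per value
scale = weights.abs().max() / 127.0            # single scale for the tensor
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)

# Dequantize for computation: 1 byte per value instead of 4,
# at the cost of a small reconstruction error.
deq_weights = q_weights.float() * scale
print(torch.mean((weights - deq_weights) ** 2))&lt;/code&gt;&lt;/pre&gt;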
&lt;p&gt;Pruning is another approach, which removes unimportant connections in the neural network. It can be applied both during training and to fully trained networks. Beyond individual connections, pruning can remove neurons or entire layers. This reduction in parameters and connections leads to lower memory requirements.&lt;/p&gt;
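&lt;p&gt;A quick sketch of magnitude-based pruning using PyTorch’s torch.nn.utils.prune module (a simple unstructured variant; structured pruning of whole neurons or layers works similarly):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 64)
# Zero out the 50% of weights with the smallest absolute values.
prune.l1_unstructured(layer, name=&quot;weight&quot;, amount=0.5)
# Make the pruning permanent by removing the re-parametrization.
prune.remove(layer, &quot;weight&quot;)
print(float((layer.weight == 0).float().mean()))  # roughly 0.5&lt;/code&gt;&lt;/pre&gt;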
&lt;p&gt;Matrix or tensor decomposition is the third common size-reduction technique. Breaking down one large matrix into a product of three smaller matrices reduces the total parameter count while maintaining quality, and can shrink the network by an order of magnitude or more. Tensor decomposition offers even better results, though it requires more hyperparameters.&lt;/p&gt;
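&lt;p&gt;To see where the savings come from, here is an illustrative truncated-SVD calculation (the matrix size and rank are arbitrary choices for the example):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch

# A 1024x1024 weight matrix holds 1,048,576 parameters.
W = torch.randn(1024, 1024)
U, S, Vh = torch.linalg.svd(W)

# Keep only the top-64 singular values (a rank-64 approximation).
k = 64
W_approx = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Storing the three factors takes 1024*64 + 64 + 64*1024 = 131,136
# parameters instead of 1,048,576, roughly an 8x reduction.
print(W_approx.shape, ((W - W_approx).norm() / W.norm()).item())&lt;/code&gt;&lt;/pre&gt;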
&lt;p&gt;While these methods effectively reduce size, they all face the challenge of quality loss. Large compressed models outperform their smaller, uncompressed counterparts, but each compression risks reducing answer accuracy. Knowledge distillation represents an interesting attempt to balance quality with size.&lt;/p&gt;
&lt;h3&gt;Let’s try it together&lt;/h3&gt;
&lt;p&gt;Knowledge distillation is best explained through the analogy of a student and a teacher. While students learn, teachers teach while continuously updating their own knowledge. When both encounter something new, the teacher has an advantage: they can draw upon broad knowledge from other areas, a foundation the student still lacks.&lt;/p&gt;
&lt;p&gt;This principle applies to neural networks. When training two neural networks of the same type but different sizes on identical data, the larger network typically performs better. Its greater capacity for &quot;knowledge&quot; enables more accurate responses than its smaller counterpart. This raises an interesting possibility: why not train the smaller network not just on the dataset, but also on the more accurate outputs of the larger network?&lt;/p&gt;
&lt;p&gt;This process is knowledge distillation: a form of supervised learning where a smaller model learns to replicate the predictions of a larger one. While this technique helps offset the quality loss from reducing neural network size, it does require extra computational resources and training time.&lt;/p&gt;
&lt;h2&gt;Software and logic&lt;/h2&gt;
&lt;p&gt;With the theoretical foundation now clear, let&#39;s examine the process from a technical perspective. We&#39;ll begin with software tools that can guide you through the training and knowledge distillation stages.&lt;/p&gt;
&lt;p&gt;Python, along with the &lt;a href=&quot;https://pytorch.org/torchtune/stable/index.html&quot;&gt;TorchTune&lt;/a&gt; library from the &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt; ecosystem, offers the simplest approach for studying and fine-tuning large language models. Here&#39;s how the application works:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/092/original/il_what_is_knowledge_distillation.png?1733302736&quot; alt=&quot;What is Knowledge Distillation main illustration&quot;&gt;
&lt;p&gt;Two models are loaded: a full model (teacher) and a reduced model (student). During each training iteration, the teacher model generates high-temperature predictions while the student model processes the dataset to make its own predictions.&lt;/p&gt;
&lt;p&gt;Both models&#39; raw output values (logits) are evaluated through a loss function (a numerical measure of how much a prediction deviates from the correct value). Weight adjustments are then applied to the student model through backpropagation. This enables the smaller model to learn and replicate the teacher model&#39;s predictions.&lt;/p&gt;
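&lt;p&gt;A minimal sketch of such a distillation loss in PyTorch (the temperature T softens both distributions, alpha balances the soft teacher targets against the ground-truth labels; the names here are illustrative):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Soft part: the student mimics the teacher&#39;s softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction=&quot;batchmean&quot;,
    ) * (T * T)  # standard scaling to keep gradient magnitudes comparable
    # Hard part: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Random logits for a batch of 8 examples over 100 classes:
s, t = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, labels))&lt;/code&gt;&lt;/pre&gt;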
&lt;p&gt;The primary configuration file in the application code is called a recipe. This file stores all distillation parameters and settings, making experiments reproducible and allowing researchers to track how different parameters influence the final outcome.&lt;/p&gt;
&lt;p&gt;When selecting parameter values and iteration counts, maintaining balance is crucial. A model that&#39;s distilled too much may lose its ability to recognize subtle details and context, defaulting to templated responses. While perfect balance is nearly impossible to achieve, careful monitoring of the distillation process can substantially improve the prediction quality of even modest neural network models.&lt;/p&gt;
&lt;p&gt;It is also worth monitoring the training process itself, which helps you spot problems early and correct them promptly. For this, you can use the &lt;a href=&quot;https://www.tensorflow.org/tensorboard&quot;&gt;TensorBoard&lt;/a&gt; tool. It integrates seamlessly into PyTorch projects and lets you visually evaluate many metrics, such as accuracy and loss. It can also build a model graph and track memory usage and the execution time of operations.&lt;/p&gt;
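&lt;p&gt;Logging from PyTorch takes only a few lines via torch.utils.tensorboard (the log directory and tag names below are arbitrary examples):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=&quot;./runs/distillation&quot;)
# Inside the training loop, log any scalar you want to track:
for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder value for illustration
    writer.add_scalar(&quot;train/loss&quot;, loss, step)
writer.close()
# Then run: tensorboard --logdir ./runs and open http://localhost:6006&lt;/code&gt;&lt;/pre&gt;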
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Knowledge distillation is an effective method for optimizing neural networks to improve compact models. It works best when balancing performance with answer quality is essential.&lt;/p&gt;
&lt;p&gt;Though knowledge distillation requires careful monitoring, its results can be remarkable. Models become much smaller while maintaining prediction quality, and they perform better with fewer computing resources.&lt;/p&gt;
&lt;p&gt;When planned well with appropriate parameters, knowledge distillation serves as a key tool for creating compact neural networks without sacrificing quality.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/607-advantages-and-disadvantages-of-gpu-sharing&quot;&gt;Advantages and Disadvantages of GPU sharing&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/609-nvidia-rtx-50-expectations-and-reality&quot;&gt;NVIDIA® RTX™ 50: expectations vs reality&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/611-intel-habana-gaudi-2-install-and-test&quot;&gt;Intel Habana Gaudi 2: install and test&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/092/original/il_what_is_knowledge_distillation.png?1733302736"
        length="0"
        type="image/jpeg"/>
      <pubDate>Thu, 23 Jan 2025 13:21:29 +0100</pubDate>
      <guid isPermaLink="false">606</guid>
      <dc:date>2025-01-23 13:21:29 +0100</dc:date>
    </item>
    <item>
      <title>AudioCraft by MetaAI: create music by description</title>
      <link>https://www.leadergpu.com/catalog/604-audiocraft-by-metaai-create-music-by-description</link>
      <description>&lt;p&gt;Modern generative neural networks are becoming smarter. They are writing stories, engaging in conversations with people, and creating ultra-realistic images. Now, they can produce simple music tracks without the need for professional artists. This future has become a reality today. It’s expected, as musical harmonies and rhythms are rooted in mathematical principles.&lt;/p&gt;
&lt;p&gt;Meta has demonstrated its commitment to the world of open-source software by making three neural network models publicly available that enable the creation of sounds and music from text descriptions:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://musicgen.com/&quot;&gt;MusicGen&lt;/a&gt; — generates music from text.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://audiocraft.metademolab.com/audiogen.html&quot;&gt;AudioGen&lt;/a&gt; — generates audio from text.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/facebookresearch/encodec&quot;&gt;EnCodec&lt;/a&gt; — high quality neural audio compressor.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;MusicGen was trained on 20,000 hours of music. You can run it locally using dedicated LeaderGPU servers as a platform.&lt;/p&gt;
&lt;h2&gt;Standard installation&lt;/h2&gt;
&lt;p&gt;Update the package cache and upgrade the installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install the Python package manager, pip, and the ffmpeg libraries:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install python3-pip ffmpeg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install torch 2.0 or newer using pip:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install &#39;torch&gt;=2.0&#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next command automatically installs &lt;b translate=&quot;no&quot;&gt;audiocraft&lt;/b&gt; and all necessary dependencies:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -U audiocraft&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s write a simple Python app, using the &lt;a href=&quot;https://huggingface.co/facebook/musicgen-large&quot;&gt;large pre-trained MusicGen model&lt;/a&gt; with 3.3B parameters:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano generate.py&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
model = MusicGen.get_pretrained(&quot;facebook/musicgen-large&quot;)
model.set_generation_params(duration=30)  # generate a 30-second sample.
descriptions = [&quot;rock solo&quot;]
wav = model.generate(descriptions)  # generates one sample per description.
for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f&#39;{idx}&#39;, one_wav.cpu(), model.sample_rate, strategy=&quot;loudness&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the created app:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 generate.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After a few seconds, the generated file (0.wav) will appear in the directory.&lt;/p&gt;
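&lt;p&gt;Since model.generate accepts a list, you can batch several prompts in a single call. Below is a small variation on the example above (the prompts and file names are arbitrary):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained(&quot;facebook/musicgen-large&quot;)
model.set_generation_params(duration=15)  # shorter 15-second samples
# One track is generated for each description in the list.
descriptions = [&quot;rock solo&quot;, &quot;ambient piano&quot;, &quot;8-bit chiptune&quot;]
wav = model.generate(descriptions)
for idx, one_wav in enumerate(wav):
    audio_write(f&#39;batch_{idx}&#39;, one_wav.cpu(), model.sample_rate, strategy=&quot;loudness&quot;)&lt;/code&gt;&lt;/pre&gt;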
&lt;h2&gt;Coffee Vampir 3&lt;/h2&gt;
&lt;p&gt;Clone a project repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/CoffeeVampir3/audiocraft-webui.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the cloned directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd audiocraft-webui&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the command that prepares your system and installs all the necessary packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, run the Coffee Vampir 3 server with the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 webui.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Coffee Vampir 3 uses Flask as its web framework. By default, it listens on localhost, port 5000. If you want remote access, use the port forwarding feature of your SSH client or set up a VPN connection to the server.&lt;/p&gt;
&lt;p&gt;&lt;font color=&quot;red&quot;&gt;&lt;i&gt;Attention! This is a potentially dangerous action; use at your own risk:&lt;/i&gt;&lt;/font&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano webui.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scroll down to the end and replace &lt;b translate=&quot;no&quot;&gt;socketio.run(app)&lt;/b&gt; with &lt;b translate=&quot;no&quot;&gt;socketio.run(app, host=&#39;0.0.0.0&#39;, port=5000)&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Save the file and run the server using the command above. This allows access to the server from the public internet without any authentication.&lt;/p&gt;
&lt;p&gt;Don’t forget to &lt;b translate=&quot;no&quot;&gt;disable AdBlock software&lt;/b&gt;, as it can block the music player on the right side of the webpage. You can start by entering the prompt and confirming with the &lt;b translate=&quot;no&quot;&gt;Submit&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/902/original/sh_audiocraft_by_metaai_create_music_by_description_1.png?1713360831&quot; alt=&quot;Main page Audiocraft WebUI&quot;&gt;
&lt;h2&gt;TTS Generation WebUI&lt;/h2&gt;
&lt;h3&gt;Step 1. Drivers&lt;/h3&gt;
&lt;p&gt;Update the package cache and upgrade the installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install NVIDIA® drivers using automatic installer or our guide &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;Install NVIDIA® drivers in Linux&lt;/a&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2. Docker&lt;/h3&gt;
&lt;p&gt;The next step is to install Docker. Let’s install a few packages that are needed to add the Docker repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install apt-transport-https curl gnupg-agent ca-certificates software-properties-common&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the Docker GPG key and store it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo add-apt-repository &quot;deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install &lt;b translate=&quot;no&quot;&gt;Docker CE&lt;/b&gt; (Community Edition) with CLI and the &lt;b translate=&quot;no&quot;&gt;containerd&lt;/b&gt; runtime:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install docker-ce docker-ce-cli containerd.io&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the current user to the docker group:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo usermod -aG docker $USER&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the change without logging out and back in:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;newgrp docker&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3. GPU passthrough&lt;/h3&gt;
&lt;p&gt;Let’s enable NVIDIA® GPU passthrough in Docker. The following command reads the current OS version into the distribution variable, which we can use in the next step:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;distribution=$(. /etc/os-release;echo $ID$VERSION_ID)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the NVIDIA® repository GPG key and store it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the NVIDIA® repository list and store it for use by the standard APT package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package cache and install the GPU passthrough toolkit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get update &amp;&amp; sudo apt-get install -y nvidia-container-toolkit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart the Docker daemon:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl restart docker&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4. WebUI&lt;/h3&gt;
&lt;p&gt;Download the repository archive:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://github.com/rsxdalv/tts-generation-webui/archive/refs/heads/main.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unpack it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;unzip main.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the project’s directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd tts-generation-webui-main&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start building the image:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;docker build -t rsxdalv/tts-generation-webui .&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start a container from the built image:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;docker compose up -d&lt;/code&gt;&lt;/pre&gt;
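&lt;p&gt;You can check that the service started and follow its logs while the application initializes:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;docker compose ps
docker compose logs -f&lt;/code&gt;&lt;/pre&gt;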
&lt;p&gt;Now you can open &lt;b translate=&quot;no&quot;&gt;http://[server_ip]:7860&lt;/b&gt;, type your prompt, select the necessary model, and click the &lt;b translate=&quot;no&quot;&gt;Generate&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/903/original/sh_audiocraft_by_metaai_create_music_by_description_2.png?1713360865&quot; alt=&quot;Audiocraft generated sound&quot;&gt;
&lt;p&gt;The system automatically downloads the selected model during the first generation. Enjoy!&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion&quot;&gt;Stable Diffusion: Riffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/597-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/117/original/il_audiocraft_by_metaai_create_music_by_description.png?1737557205"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 15:51:35 +0100</pubDate>
      <guid isPermaLink="false">604</guid>
      <dc:date>2025-01-22 15:51:35 +0100</dc:date>
    </item>
    <item>
      <title>How to monitor LangFlow application</title>
      <link>https://www.leadergpu.com/catalog/602-how-to-monitor-langflow-application</link>
      <description>&lt;p&gt;In our article &lt;a href=&quot;https://www.leadergpu.com/articles/558-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt; we explored how to get started with this low-code AI app builder’s visual programming environment. It enables anyone, even those without programming knowledge, to build applications powered by large neural network models. These could be AI chatbots or document processing applications that can analyze and summarize content.&lt;/p&gt;
&lt;p&gt;Langflow uses a building-block approach where users connect pre-made components to create their desired application. However, two key challenges often arise: troubleshooting when neural networks behave unexpectedly, and managing costs. Neural networks require substantial computing resources, making it essential to monitor and predict infrastructure expenses.&lt;/p&gt;
&lt;p&gt;LangWatch addresses both challenges. This specialized tool helps Langflow developers monitor user requests, track costs, and detect anomalies, such as when applications are used in unintended ways.&lt;/p&gt;
&lt;p&gt;This tool was originally designed as a service but can be deployed on any server, including locally. It integrates with most LLM providers, whether cloud-based or on-premise. Being open source, LangWatch can be adapted to almost any project: adding new features or connecting with internal systems.&lt;/p&gt;
&lt;p&gt;LangWatch lets you set up alerts when specific metrics exceed defined thresholds. This helps you quickly detect unexpected increases in request costs or unusual response delays. Early detection helps prevent unplanned expenses and potential service attacks.&lt;/p&gt;
&lt;p&gt;For neural network researchers, this application enables both monitoring and optimization of common user requests. It also provides tools to evaluate model response quality and make adjustments when needed.&lt;/p&gt;
&lt;h2&gt;Quick start&lt;/h2&gt;
&lt;h3&gt;System preparation&lt;/h3&gt;
&lt;p&gt;Like Langflow, the simplest way to run the application is through a Docker container. Before installing LangWatch, you’ll need to install Docker Engine on your server. First, update your package cache and the packages to their latest versions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install additional packages required by Docker:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install apt-transport-https ca-certificates curl software-properties-common&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the GPG key to add the official Docker repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the repository to APT using the key you downloaded and installed earlier:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable&quot; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refresh the package list:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To ensure that Docker will be installed from the newly added repository and not from the system one, you can run the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;apt-cache policy docker-ce&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install Docker Engine:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install docker-ce&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that Docker has been installed successfully and the corresponding daemon is running and in the &lt;b translate=&quot;no&quot;&gt;active (running)&lt;/b&gt; status:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status docker&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;● docker.service - Docker Application Container Engine
    Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset&gt;
    Active: active (running) since Mon 2024-11-18 08:26:35 UTC; 3h 27min ago
TriggeredBy: ● docker.socket
      Docs: https://docs.docker.com
  Main PID: 1842 (dockerd)
     Tasks: 29
    Memory: 1.8G
       CPU: 3min 15.715s
    CGroup: /system.slice/docker.service&lt;/pre&gt;
&lt;h3&gt;Build and run&lt;/h3&gt;
&lt;p&gt;With Docker Engine installed and running, you can download the LangWatch application repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/langwatch/langwatch&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The application includes a sample configuration file with environment variables. Copy this file so the image build utility can process it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cp langwatch/.env.example langwatch/.env&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you’re ready for the first launch:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker compose up --build&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The system will take some time to download all the necessary container layers for LangWatch. Once complete, you’ll see a console message indicating the application is available at:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;http://[LeaderGPU_IP_address]:3000&lt;/pre&gt;
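&lt;p&gt;The command above runs the stack in the foreground. Once you’ve confirmed it starts correctly, you can stop it with &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; and relaunch it in the background with the standard detached flag:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker compose up -d&lt;/code&gt;&lt;/pre&gt;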
&lt;p&gt;Navigate to this page in your browser, where you’ll be prompted to create a user account:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/089/original/sh_how_to_monitor_langflow_application_1.png?1732712766&quot; alt=&quot;LangWatch login screen&quot;&gt;
&lt;p&gt;Unlike Langflow, this system has authentication enabled by default. After logging in, you’ll need to configure the system to collect data from your Langflow server.&lt;/p&gt;
&lt;h2&gt;Langflow integration&lt;/h2&gt;
&lt;p&gt;LangWatch needs a data source to function. The server listens on port 3000 and uses a RESTful API, which authenticates incoming data through an automatically generated API key.&lt;/p&gt;
&lt;p&gt;To enable data transfer, you’ll need to set two variables in the Langflow configuration files: &lt;b translate=&quot;no&quot;&gt;LANGWATCH_ENDPOINT&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;LANGWATCH_API_KEY&lt;/b&gt;. First, establish an SSH connection to your Langflow server (the Langflow service should be stopped during this process).&lt;/p&gt;
&lt;p&gt;Navigate to the directory with the sample configuration for Docker:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd langflow/docker_example&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the configuration file for editing:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano docker-compose.yml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the “environment:” section, add the following variables (without brackets [] or quotation marks):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;- LANGWATCH_API_KEY=[YOUR_API_KEY]
- LANGWATCH_ENDPOINT=http://[IP_ADDRESS]:3000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The YML file requires specific formatting. Follow these two key rules (you can validate the edited file afterwards, as shown below):&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;Use spaces (2 or 4) for indentation, never tabs.&lt;/li&gt;
    &lt;li&gt;Maintain proper hierarchical structure with consistent indentation.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Save the file with &lt;b translate=&quot;no&quot;&gt;Ctrl + O&lt;/b&gt; and exit the editor with &lt;b translate=&quot;no&quot;&gt;Ctrl + X&lt;/b&gt;.&lt;/p&gt;
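&lt;p&gt;Before relaunching, you can ask Docker Compose to parse the edited file; if the indentation is broken, it prints an error instead of the resolved configuration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;docker compose config&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Langflow is now ready to launch:&lt;/p&gt;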
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker compose up&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After launching, verify that everything works properly. Create a new project or open an existing one, then initiate a dialogue through Playground. Langflow will automatically send data to LangWatch for monitoring, which you can view in the web interface.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/090/original/sh_how_to_monitor_langflow_application_2.png?1732712788&quot; alt=&quot;LangWatch integration checks&quot;&gt;
&lt;p&gt;In the integration verification section, a check mark appears on the “Sync your first message” item. This indicates that data from Langflow is successfully flowing into LangWatch, confirming your setup is correct. Let’s examine what appears in the &lt;b translate=&quot;no&quot;&gt;Messages&lt;/b&gt; section:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/091/original/sh_how_to_monitor_langflow_application_3.png?1732712853&quot; alt=&quot;LangWatch messages lookup&quot;&gt;
&lt;p&gt;The Messages section displays the data entered into the application, the parameters used for response generation, and the neural network’s response itself. You can evaluate response quality and use various filters to sort through the data, even with hundreds or thousands of messages.&lt;/p&gt;
&lt;p&gt;After this initial setup, explore the application’s features systematically. In the &lt;b translate=&quot;no&quot;&gt;Evaluations&lt;/b&gt; section, you can set up dialogue verification algorithms for either dialogue moderation or data recognition, such as &lt;b translate=&quot;no&quot;&gt;PII Detection&lt;/b&gt;. This feature scans input for sensitive information like social security numbers or phone numbers.&lt;/p&gt;
&lt;p&gt;The application offers both local and cloud-based options through providers like Azure or Cloudflare. To use cloud features, you’ll need accounts with these services, along with their endpoint addresses and API keys. Keep in mind that these are third-party providers, so check their service costs directly.&lt;/p&gt;
&lt;p&gt;For local options, the application features sophisticated RAG (Retrieval-augmented generation) capabilities. You can measure the accuracy and relevance of RAG-generated content, and use the gathered statistics to optimize the RAG system for more accurate neural network responses.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/586-photogrammetry-with-meshroom&quot;&gt;Photogrammetry with Meshroom&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/588-blender-remote-rendering-with-flamenco&quot;&gt;Blender remote rendering with Flamenco&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/088/original/il_how_to_monitor_langflow_application.png?1732712732"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 15:14:55 +0100</pubDate>
      <guid isPermaLink="false">602</guid>
      <dc:date>2025-01-22 15:14:55 +0100</dc:date>
    </item>
    <item>
      <title>Low-code AI app builder Langflow</title>
      <link>https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow</link>
      <description>&lt;p&gt;Software development has evolved dramatically in recent years. Modern programmers now have access to hundreds of programming languages and frameworks. Beyond traditional imperative and declarative approaches, a new and exciting method of creating applications is emerging. This innovative approach harnesses the power of neural networks, opening up fantastic possibilities for developers.&lt;/p&gt;
&lt;p&gt;People have grown accustomed to AI assistants in IDEs helping with code autocompletion and modern neural networks easily generating code for simple Python games. However, new hybrid tools are emerging that could revolutionize the development landscape. One such tool is Langflow.&lt;/p&gt;
&lt;p&gt;Langflow serves multiple purposes. For professional developers, it offers better control over complex systems like neural networks. For those unfamiliar with programming, it enables the creation of simple yet practical applications. These goals are achieved through different means, which we’ll explore in more detail.&lt;/p&gt;
&lt;h2&gt;Neural networks&lt;/h2&gt;
&lt;p&gt;The concept of a neural network can be simplified for users. Imagine a black box that receives input data and parameters influencing the final result. This box processes the input using complex algorithms, often referred to as “magic”, and produces output data that can be presented to the user.&lt;/p&gt;
&lt;p&gt;The inner workings of this black box vary based on the neural network’s design and training data. It’s crucial to understand that developers and users can never achieve 100% certainty in results. Unlike traditional programming where 2 + 2 always equals 4, a neural network might give this answer with 99% certainty, always maintaining a margin of error.&lt;/p&gt;
&lt;p&gt;Control over a neural network&#39;s &quot;thinking&quot; process is indirect. We can only adjust certain parameters, such as &quot;temperature.&quot; This parameter determines how creative or constrained the neural network can be in its approach. A low temperature value limits the network to a more formal, structured approach to tasks and solutions. Conversely, high temperature values grant the network more freedom, potentially leading to reliance on less reliable facts or even the creation of fictional information.&lt;/p&gt;
&lt;p&gt;This example illustrates how users can influence the final output. For traditional programming, this uncertainty poses a significant challenge - errors may appear unexpectedly, and specific results become unpredictable. However, this unpredictability is primarily a problem for computers, not for humans who can adapt to and interpret varying outputs.&lt;/p&gt;
&lt;p&gt;If a neural network’s output is intended for a human, the specific wording used to describe it is generally less important. Given the context, people can correctly interpret various results from the machine’s perspective. While concepts like “positive value”, “result achieved”, or “positive decision” might mean roughly the same thing to a person, traditional programming would struggle with this flexibility. It would need to account for all possible answer variations, which is nearly impossible.&lt;/p&gt;
&lt;p&gt;On the other hand, if further processing is handed off to another neural network, it can correctly understand and process the obtained result. Based on this, it can then form its own conclusion with a certain degree of confidence, as mentioned earlier.&lt;/p&gt;
&lt;h2&gt;Low-code&lt;/h2&gt;
&lt;p&gt;Most programming languages involve writing code. Programmers create the logic for each part of an application in their minds, then describe it using language-specific expressions. This process forms an algorithm: a clear sequence of actions leading to a specific, predetermined result. It’s a complex task requiring significant mental effort and a deep understanding of the language’s capabilities.&lt;/p&gt;
&lt;p&gt;However, there is no need to reinvent the wheel. Many problems faced by modern developers have already been solved in various ways. Relevant code snippets can often be &lt;a href=&quot;https://stackoverflow.com/&quot;&gt;found&lt;/a&gt; on StackOverflow. Modern programming can be likened to assembling a whole from parts of different construction sets. The Lego system offers a successful model, having standardized different sets of parts to ensure compatibility.&lt;/p&gt;
&lt;p&gt;The low-code programming method follows a similar principle. Various code pieces are modified to fit together seamlessly and are presented to developers as ready-made blocks. Each block can have data inputs and outputs. Documentation specifies the task each block type solves and the format in which it accepts or outputs data.&lt;/p&gt;
&lt;p&gt;By connecting these blocks in a specific sequence, developers can form an application’s algorithm and clearly visualize its operational logic. Perhaps the most well-known example of this programming method is the turtle graphics method, commonly used in educational settings to introduce programming concepts and develop algorithmic thinking.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/078/original/sh_low-code_ai_app_builder_langflow_1.png?1732099423&quot; alt=&quot;Turtle graphics&quot;&gt;
&lt;p&gt;The essence of this method is simple: drawing images on the screen using a virtual turtle that leaves a trail as it crawls across the canvas. Using ready-made blocks, such as moving a set number of pixels, turning at specific angles, or raising and lowering the pen, developers can create programs that draw their desired pictures. Creating applications using a low-code constructor is similar to turtle graphics, but it allows users to solve a wide range of problems, not just drawing on a canvas.&lt;/p&gt;
&lt;p&gt;This method was best implemented in IBM’s Node-RED programming tool. It was developed as a universal means of ensuring the joint operation of diverse devices, online services, and APIs. The equivalent of code snippets were nodes from the standard library (palette).&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/079/original/sh_low-code_ai_app_builder_langflow_2.png?1732099465&quot; alt=&quot;Node-RED canvas&quot;&gt;
&lt;p&gt;Node-RED’s capabilities can be expanded by installing add-ons or creating custom nodes that perform specific data actions. Developers place nodes from the palette onto the desktop and build relationships between them. This process creates the application’s logic, with visualization helping to maintain clarity.&lt;/p&gt;
&lt;p&gt;Adding neural networks to this concept yields an intriguing system. Instead of processing data with specific mathematical formulas, you can feed it into a neural network and specify the desired output. Although the input data may vary slightly each time, the results remain suitable for interpretation by humans or other neural networks.&lt;/p&gt;
&lt;h2&gt;Retrieval Augmented Generation (RAG)&lt;/h2&gt;
&lt;p&gt;The accuracy of data in large language models is a pressing concern. These models rely solely on knowledge gained during training, which depends on the relevance of the datasets used. Consequently, large language models may lack sufficient relevant data, potentially leading to incorrect results.&lt;/p&gt;
&lt;p&gt;To address this issue, data updating methods are necessary. Allowing neural networks to extract context from additional sources, such as websites, can significantly improve the quality of answers. This is precisely how RAG (Retrieval-Augmented Generation) works. Additional data is converted into vector representations and stored in a database.&lt;/p&gt;
&lt;p&gt;In operation, neural network models can convert user requests into vector representations and compare them with those stored in the database. When similar vectors are found, the data is extracted and used in forming a response. Vector databases are fast enough to support this scheme in real-time.&lt;/p&gt;
&lt;p&gt;For this system to function correctly, interaction between the user, the neural network model, external data sources, and the vector database must be established. Langflow simplifies this setup with its visual component - users simply build standard blocks and &quot;link&quot; them, creating a path for data flow.&lt;/p&gt;
&lt;p&gt;The first step is to populate the vector database with relevant sources. These can include files from a local computer or web pages from the Internet. Here&#39;s a simple example of loading data into the database:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/080/original/sh_low-code_ai_app_builder_langflow_3.png?1732099495&quot; alt=&quot;RAG data load&quot;&gt;
&lt;p&gt;Now that we have a vector database in addition to the trained LLM, we can incorporate it into the general scheme. When a user submits a request in the chat, it simultaneously forms a prompt and queries the vector database. If similar vectors are found, the extracted data is parsed and added as context to the formed prompt. The system then sends a request to the neural network and outputs the received response to the user in the chat.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/081/original/sh_low-code_ai_app_builder_langflow_4.png?1732099527&quot; alt=&quot;RAG scheme&quot;&gt;
&lt;p&gt;While the example mentions cloud services like OpenAI and AstraDB, you can use any compatible services, including those deployed locally on LeaderGPU servers. If you can&#39;t find the integration you need in the list of available blocks, you can either write it yourself or add one created by someone else.&lt;/p&gt;
&lt;h2&gt;Quick start&lt;/h2&gt;
&lt;h3&gt;System preparation&lt;/h3&gt;
&lt;p&gt;The simplest way to deploy Langflow is within a Docker container. To set up the server, begin by installing Docker Engine. Then, update both the package cache and the packages to their latest versions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install additional packages required by Docker:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install apt-transport-https ca-certificates curl software-properties-common&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the GPG key to add the official Docker repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the repository to APT using the key you downloaded and installed earlier:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable&quot; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refresh the package list:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To ensure that Docker will be installed from the newly added repository and not from the system one, you can run the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;apt-cache policy docker-ce&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install Docker Engine:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install docker-ce&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that Docker has been installed successfully and the corresponding daemon is running and in the &lt;b translate=&quot;no&quot;&gt;active (running)&lt;/b&gt; status:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status docker&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;● docker.service - Docker Application Container Engine
  Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset&gt;
  Active: active (running) since Mon 2024-11-18 08:26:35 UTC; 3h 27min ago
TriggeredBy: ● docker.socket
    Docs: https://docs.docker.com
Main PID: 1842 (dockerd)
   Tasks: 29
  Memory: 1.8G
     CPU: 3min 15.715s
  CGroup: /system.slice/docker.service
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Build and run&lt;/h3&gt;
&lt;p&gt;Everything is ready to build and run a Docker container with Langflow. However, there&#39;s one caveat: at the time of writing this guide, the latest version (tagged v1.1.0) has an error and won&#39;t start. To avoid this issue, we&#39;ll use the previous version, v1.0.19.post2, which works flawlessly right after download.&lt;/p&gt;
&lt;p&gt;The simplest approach is to download the project repository from GitHub:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/langflow-ai/langflow&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the directory containing the sample deployment configuration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd langflow/docker_example&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you will need to do two things. First, change the release tag so that a working version (at the time of writing this guide) is built. Second, add simple authorization so that no one can use the system without knowing the login and password.&lt;/p&gt;
&lt;p&gt;Open the configuration file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano docker-compose.yml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find the following line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;image: langflowai/langflow:latest&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and replace the &lt;b translate=&quot;no&quot;&gt;latest&lt;/b&gt; tag with a specific version:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;image: langflowai/langflow:v1.0.19.post2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You also need to add three variables to the &lt;b translate=&quot;no&quot;&gt;environment&lt;/b&gt; section:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;  - LANGFLOW_AUTO_LOGIN=false
  - LANGFLOW_SUPERUSER=admin
  - LANGFLOW_SUPERUSER_PASSWORD=your_secure_password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first variable disables access to the web interface without authorization. The second adds the username that will receive system administrator rights. The third adds the corresponding password.&lt;/p&gt;
&lt;p&gt;If you plan to store the &lt;b translate=&quot;no&quot;&gt;docker-compose.yml&lt;/b&gt; file in a version control system, avoid writing the password directly in this file. Instead, create a separate file with a &lt;b translate=&quot;no&quot;&gt;.env&lt;/b&gt; extension in the same directory and store the variable value there.&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;LANGFLOW_SUPERUSER_PASSWORD=your_secure_password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;b translate=&quot;no&quot;&gt;docker-compose.yml&lt;/b&gt; file, you can now reference a variable instead of directly specifying a password:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;- LANGFLOW_SUPERUSER_PASSWORD=${LANGFLOW_SUPERUSER_PASSWORD}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To prevent accidentally exposing the &lt;b translate=&quot;no&quot;&gt;*.env&lt;/b&gt; file on GitHub, remember to add it to &lt;b translate=&quot;no&quot;&gt;.gitignore&lt;/b&gt;. This will keep your password reasonably secure from unwanted access.&lt;/p&gt;
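&lt;p&gt;Assuming you run the command from the project directory, appending the file name is enough:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;echo &quot;.env&quot; &gt;&gt; .gitignore&lt;/code&gt;&lt;/pre&gt;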
&lt;p&gt;Now, all that&#39;s left is to build our container and run it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker compose up&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the web page at &lt;b translate=&quot;no&quot;&gt;http://[LeaderGPU_IP_address]:7860&lt;/b&gt;, and you&#39;ll see the authorization form:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/082/original/sh_low-code_ai_app_builder_langflow_5.png?1732099559&quot; alt=&quot;Login screen&quot;&gt;
&lt;p&gt;Once you enter your login and password, the system grants access to the web interface where you can create your own applications. For more in-depth guidance, we suggest consulting &lt;a href=&quot;https://docs.langflow.org/&quot;&gt;the official documentation&lt;/a&gt;. It provides details on various environment variables that allow easy customization of the system to suit your needs.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/602-how-to-monitor-langflow-application&quot;&gt;How to monitor LangFlow application&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/586-photogrammetry-with-meshroom&quot;&gt;Photogrammetry with Meshroom&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/588-blender-remote-rendering-with-flamenco&quot;&gt;Blender remote rendering with Flamenco&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/077/original/il_low-code_ai_app_builder_langflow.png?1732099387"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 15:11:30 +0100</pubDate>
      <guid isPermaLink="false">601</guid>
      <dc:date>2025-01-22 15:11:30 +0100</dc:date>
    </item>
    <item>
      <title>Easy Diffusion UI</title>
      <link>https://www.leadergpu.com/catalog/598-easy-diffusion-ui</link>
      <description>&lt;p&gt;Easy Diffusion UI is an open source software available for download on GitHub. Here’s how to install it on Ubuntu 22.04 LTS. If you’ve just rented a server, install the GPU drivers and extend your home directory. Then, download the latest release of Easy Diffusion UI:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://github.com/cmdr2/stable-diffusion-ui/releases/latest/download/Easy-Diffusion-Linux.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unpack the downloaded ZIP archive:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;unzip Easy-Diffusion-Linux.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Change directory to easy-diffusion:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd easy-diffusion&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the installation:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./start.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a script collection that automatically downloads and installs all necessary components. It also downloads the standard Stable Diffusion model in SafeTensors format. Once all downloads and installations are complete, the Easy Diffusion UI will launch automatically.&lt;/p&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;The previous article, &lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;, outlines a method for accepting connections from the public internet and provides simple login and password authorization. In this case, we aim to demonstrate another universal method for forwarding ports through an SSH connection. We use PuTTY to establish a secure connection to the remote server. You can find more information about it in our guide &lt;a href=&quot;https://www.leadergpu.com/articles/488-connect-to-a-linux-server&quot;&gt;Connect to a Linux server&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To choose which ports to forward, please open &lt;b translate=&quot;no&quot;&gt;Connection &gt; SSH &gt; Tunnels&lt;/b&gt; in the left option tree. Type &lt;b translate=&quot;no&quot;&gt;9000&lt;/b&gt; in the &lt;b translate=&quot;no&quot;&gt;Source Port&lt;/b&gt; field and &lt;b translate=&quot;no&quot;&gt;127.0.0.1:9000&lt;/b&gt; in the &lt;b translate=&quot;no&quot;&gt;Destination&lt;/b&gt; field. Then click the &lt;b translate=&quot;no&quot;&gt;Add&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/823/original/sh_easy_diffusion_ui_1.png?1712299445&quot; alt=&quot;Port forwarding in PuTTY&quot;&gt;
&lt;p&gt;After that, you can return to &lt;b translate=&quot;no&quot;&gt;Session&lt;/b&gt; and save it for later use. Connect to the remote server as usual. Now, all data that you send or receive at port 9000 on the loopback address 127.0.0.1 will be redirected to the remote server. This method creates a virtual secure tunnel that exists as long as the connection does.&lt;/p&gt;
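&lt;p&gt;If you connect from a Linux or macOS machine instead of PuTTY, the same tunnel can be created with the standard OpenSSH &lt;b translate=&quot;no&quot;&gt;-L&lt;/b&gt; option (replace the username and server address with your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;ssh -L 9000:127.0.0.1:9000 usergpu@[server_ip]&lt;/code&gt;&lt;/pre&gt;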
&lt;p&gt;Once Easy Diffusion UI starts and port forwarding is on, you can open a web browser and navigate to the address &lt;a href=&quot;http://127.0.0.1:9000&quot;&gt;http://127.0.0.1:9000&lt;/a&gt;. We recommend downloading and installing custom models, as described in this article, instead of relying solely on the standard model to generate images. Don’t forget to increase the number of inference steps and adjust the desired image resolution (marked with asterisks).&lt;/p&gt;
&lt;p&gt;One of the major benefits of the Easy Diffusion UI is its support for multiple GPUs. When creating a batch of images, you can choose how many will be generated in parallel. For example, if you have a dual-GPU configuration:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/824/original/sh_easy_diffusion_ui_2.png?1712299546&quot; alt=&quot;Easy Diffusion UI change threads number&quot;&gt;
&lt;p&gt;You can monitor the GPUs’ load during the image generation process. Establish another SSH connection and execute a single command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;watch -n 1 nvidia-smi&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/825/original/sh_easy_diffusion_ui_3.png?1712299806&quot; alt=&quot;nvidia-smi two threads&quot;&gt;
&lt;p&gt;Easy Diffusion UI also simplifies prompt creation by providing numerous examples of image modifiers. You can mix them to achieve more accurate results:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/826/original/sh_easy_diffusion_ui_4.png?1712299873&quot; alt=&quot;Image modifiers&quot;&gt;
&lt;p&gt;It’s a good idea to explore &lt;a href=&quot;https://openart.ai/promptbook&quot;&gt;PromptBook by OpenArt&lt;/a&gt;. This guide can significantly enhance your prompt creation skills. With the Easy Diffusion UI, once an image is generated, you can download it, use it as an example for generating the next image, or make modifications with just one click:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/827/original/sh_easy_diffusion_ui_5.png?1712299912&quot; alt=&quot;Control elements&quot;&gt;
&lt;p&gt;The most common use of the &lt;b translate=&quot;no&quot;&gt;Upscale&lt;/b&gt; button is to increase an image’s resolution. The generative neural network uses the original image as a basis and adds additional pixels, thereby interpolating the source image to the desired size.&lt;/p&gt;
&lt;p&gt;When generating faces, issues may arise such as misaligned eyes, disproportionate sizes, or malformed parts. Fortunately, these problems can be solved using the &lt;b translate=&quot;no&quot;&gt;Fix Faces&lt;/b&gt; button. Additionally, negative prompts may be utilized to prevent incorrect faces from being generated.&lt;/p&gt;
&lt;h2&gt;Uninstall&lt;/h2&gt;
&lt;p&gt;All files, scripts, libraries, and models are stored in a single directory. If you want to remove Easy Diffusion UI from your server, just delete this directory along with all the content:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo rm -rf easy-diffusion&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot;&gt;Open WebUI: All in one&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/590-fooocus-rethinking-of-sd-and-mj&quot;&gt;Fooocus: Rethinking of SD and MJ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/822/original/il_easy_diffusion_ui.jpg?1712299313"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 12:13:37 +0100</pubDate>
      <guid isPermaLink="false">598</guid>
      <dc:date>2025-01-22 12:13:37 +0100</dc:date>
    </item>
    <item>
      <title>Stable Video Diffusion</title>
      <link>https://www.leadergpu.com/catalog/597-stable-video-diffusion</link>
      <description>&lt;p&gt;Generative neural networks can create various types of content. Stable Diffusion was created to generate images from text description. However, it can also be used to create music, sounds, and even videos. Today, we’ll show you how to create short videos from a single image using Stable Diffusion with WebUI and ComfyUI.&lt;/p&gt;
&lt;h2&gt;Install Stable Diffusion&lt;/h2&gt;
&lt;p&gt;Let’s begin by installing Stable Diffusion using our &lt;a href=&quot;https://www.leadergpu.com/articles/506-stable-diffusion-webui&quot;&gt;step-by-step guide&lt;/a&gt;. After installation, interrupt the &lt;b translate=&quot;no&quot;&gt;webui.sh&lt;/b&gt; script by pressing &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; and close the SSH connection. WebUI doesn’t allow installing extensions while the --listen (or --share) option is enabled. This means you need to set up port forwarding (ports 7860 and 8189) from your local machine to the remote server. The first port is needed for WebUI and the second for ComfyUI.&lt;/p&gt;
&lt;p&gt;For example, in PuTTY, you need to open &lt;b translate=&quot;no&quot;&gt;Connection &gt;&gt; SSH &gt;&gt; Tunnels&lt;/b&gt; and add two new forwarded ports as shown in the following screenshot:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/954/original/sh_stable_video_diffusion_1.png?1714024360&quot; alt=&quot;PuTTY port forwarding&quot;&gt;
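&lt;p&gt;If you prefer OpenSSH over PuTTY, both tunnels can be created in a single command (replace the username and server address with your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;ssh -L 7860:127.0.0.1:7860 -L 8189:127.0.0.1:8189 usergpu@[server_ip]&lt;/code&gt;&lt;/pre&gt;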
&lt;p&gt;Now, you can reconnect to the remote server and run ./webui.sh again.&lt;/p&gt;
&lt;p&gt;Open this URL in your browser:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;http://127.0.0.1:7860&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to &lt;b translate=&quot;no&quot;&gt;Extensions &gt;&gt; Available&lt;/b&gt;, then click on the &lt;b translate=&quot;no&quot;&gt;Load from:&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/955/original/sh_stable_video_diffusion_2.png?1714024393&quot; alt=&quot;Load available extensions&quot;&gt;
&lt;p&gt;The system will download the JSON file with all available extensions. Type &lt;b translate=&quot;no&quot;&gt;ComfyUI&lt;/b&gt; in the search input box and click the &lt;b translate=&quot;no&quot;&gt;Install&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/956/original/sh_stable_video_diffusion_3.png?1714024430&quot; alt=&quot;Download ComfyUI&quot;&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/957/original/sh_stable_video_diffusion_4.png?1714024463&quot; alt=&quot;Reload UI&quot;&gt;
&lt;p&gt;The web page will reload, and a new &lt;b translate=&quot;no&quot;&gt;ComfyUI&lt;/b&gt; tab will appear in the main panel. Switch to it and click &lt;b translate=&quot;no&quot;&gt;Install ComfyUI&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/958/original/sh_stable_video_diffusion_5.png?1714024493&quot; alt=&quot;Install ComfyUI&quot;&gt;
&lt;p&gt;When the installation is finished, interrupt the execution of the webui.sh script again by pressing &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt;.&lt;/p&gt;
&lt;h2&gt;Install Stable Video Diffusion model&lt;/h2&gt;
&lt;p&gt;Open the models directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd stable-diffusion-webui/models/Stable-diffusion/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the full Stable Video Diffusion model:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -L https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt.safetensors?download=true --output svd_xt.safetensors&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Return to the home directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And run the Stable Diffusion service again:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the &lt;a href=&quot;https://github.com/enikolair/comfyui-workflow-svd/blob/main/workflow.json&quot;&gt;example&lt;/a&gt; of the Stable Video Diffusion workflow in JSON format. Erase the ComfyUI default workflow by pressing &lt;b translate=&quot;no&quot;&gt;Clear&lt;/b&gt;, then &lt;b translate=&quot;no&quot;&gt;Load&lt;/b&gt; the downloaded example:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/959/original/sh_stable_video_diffusion_6.png?1714024532&quot; alt=&quot;ComfyUI workflow example&quot;&gt;
&lt;p&gt;Ensure that you have the correct model selected in the &lt;b translate=&quot;no&quot;&gt;Image Only Checkpoint Loader (img2vid model)&lt;/b&gt; node:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/960/original/sh_stable_video_diffusion_7.png?1714024570&quot; alt=&quot;Select CKPT model&quot;&gt;
&lt;p&gt;Click on the &lt;b translate=&quot;no&quot;&gt;choose file to upload&lt;/b&gt; button in the &lt;b translate=&quot;no&quot;&gt;Load Image&lt;/b&gt; node and select any single image that generative neural network will transform into a video:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/961/original/sh_stable_video_diffusion_8.png?1714024606&quot; alt=&quot;Upload an image to ComfyUI&quot;&gt;
&lt;p&gt;Try generating a video with all default parameters by clicking the &lt;b translate=&quot;no&quot;&gt;Queue Prompt&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/962/original/sh_stable_video_diffusion_9.png?1714024638&quot; alt=&quot;Send task to queue&quot;&gt;
&lt;p&gt;After the process is completed, you’ll get your video in WEBP format in the &lt;b translate=&quot;no&quot;&gt;SaveAnimatedWEBP&lt;/b&gt; node. Right-click on the generated video and choose &lt;b translate=&quot;no&quot;&gt;Save Image&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;Here is the &lt;a href=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/963/original/sh_stable_video_diffusion_10.gif?1714024668&quot;&gt;final result GIF&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;p&gt;If you get an error message: &lt;b translate=&quot;no&quot;&gt;ModuleNotFoundError: No module named &#39;utils.json_util&#39;; &#39;utils&#39; is not a package&lt;/b&gt;, please follow these steps:&lt;/p&gt;
&lt;p&gt;Rename the utils directory to utilities:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mv /home/usergpu/stable-diffusion-webui/extensions/sd-webui-comfyui/ComfyUI/utils /home/usergpu/stable-diffusion-webui/extensions/sd-webui-comfyui/ComfyUI/utilities&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Edit &lt;b translate=&quot;no&quot;&gt;custom_node_manager.py&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano /home/usergpu/stable-diffusion-webui/extensions/sd-webui-comfyui/ComfyUI/app/custom_node_manager.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace this line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from utils.json_util import merge_json_recursive&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;with:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;from utilities.json_util import merge_json_recursive&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file (&lt;b translate=&quot;no&quot;&gt;Ctrl + O&lt;/b&gt;) and exit the editor (&lt;b translate=&quot;no&quot;&gt;Ctrl + X&lt;/b&gt;). Then edit &lt;b translate=&quot;no&quot;&gt;main.py&lt;/b&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano /home/usergpu/stable-diffusion-webui/extensions/sd-webui-comfyui/ComfyUI/main.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace this line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import utils.extra_config&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;with:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot;&gt;import utilities.extra_config&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file, exit the editor, and run the Stable Diffusion service again:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion&quot;&gt;Stable Diffusion: Riffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/604-audiocraft-by-metaai-create-music-by-description&quot;&gt;AudioCraft by MetaAI: create music by description&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/953/original/il_stable_video_diffusion.png?1714024295"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 11:53:04 +0100</pubDate>
      <guid isPermaLink="false">597</guid>
      <dc:date>2025-01-22 11:53:04 +0100</dc:date>
    </item>
    <item>
      <title>PyTorch for Windows</title>
      <link>https://www.leadergpu.com/catalog/596-pytorch-for-windows</link>
      <description>&lt;p&gt;Before you begin installing PyTorch, you need to install the Python interpreter and Microsoft Visual C++ Redistributable. Open a web-browser and navigate to Python’s &lt;a href=&quot;https://www.python.org/downloads/windows/&quot;&gt;download page&lt;/a&gt;. Find the latest Python 3 release and click on the link:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/828/original/sh_pytorch_for_windows_1.png?1712305722&quot; alt=&quot;Download Python release&quot;&gt;
&lt;p&gt;Then scroll down the page and click on &lt;b translate=&quot;no&quot;&gt;Windows Installer (64-bit)&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/829/original/sh_pytorch_for_windows_2.png?1712305818&quot; alt=&quot;Select binary&quot;&gt;
&lt;p&gt;Open the downloaded file to proceed with installation:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/830/original/sh_pytorch_for_windows_3.png?1712306001&quot; alt=&quot;Run the installer&quot;&gt;
&lt;p&gt;Check the box for &lt;b translate=&quot;no&quot;&gt;Add python.exe to PATH&lt;/b&gt; and click on &lt;b translate=&quot;no&quot;&gt;Install Now&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/831/original/sh_pytorch_for_windows_4.png?1712306095&quot; alt=&quot;Select Install Now and Add to PATH&quot;&gt;
&lt;p&gt;Wait a minute for the installation process to complete:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/832/original/sh_pytorch_for_windows_5.png?1712314249&quot; alt=&quot;Python setup process&quot;&gt;
&lt;p&gt;You can optionally &lt;b translate=&quot;no&quot;&gt;Disable path length limit&lt;/b&gt; if you plan to use long names that could exceed the &lt;b translate=&quot;no&quot;&gt;MAX_PATH&lt;/b&gt; limits:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/833/original/sh_pytorch_for_windows_6.png?1712314332&quot; alt=&quot;Python setup complete&quot;&gt;
&lt;h2&gt;Install MS Visual C++&lt;/h2&gt;
&lt;p&gt;Next, download Microsoft Visual C++ Redistributable using &lt;a href=&quot;https://aka.ms/vs/16/release/vc_redist.x64.exe&quot;&gt;this link&lt;/a&gt; and click on the installer:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/834/original/sh_pytorch_for_windows_7.png?1712314944&quot; alt=&quot;Run Microsoft visual C++ redistributable installer&quot;&gt;
&lt;p&gt;You must tick the &lt;b translate=&quot;no&quot;&gt;I agree to the license terms and conditions&lt;/b&gt; box and click the &lt;b translate=&quot;no&quot;&gt;Install&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/835/original/sh_pytorch_for_windows_8.png?1712315044&quot; alt=&quot;Visual C++ accept EULA&quot;&gt;
&lt;p&gt;After a few seconds, this software will be installed and you can &lt;b translate=&quot;no&quot;&gt;Close&lt;/b&gt; the installer:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/836/original/sh_pytorch_for_windows_9.png?1712315122&quot; alt=&quot;Visual C++ installation complete&quot;&gt;
&lt;p&gt;Now, everything is ready for PyTorch installation. Click the &lt;b translate=&quot;no&quot;&gt;Start&lt;/b&gt; button and type &lt;b translate=&quot;no&quot;&gt;cmd&lt;/b&gt; on the keyboard. Right-click on &lt;b translate=&quot;no&quot;&gt;Command Prompt&lt;/b&gt; and select &lt;b translate=&quot;no&quot;&gt;Run as administrator&lt;/b&gt; from the context menu:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/837/original/sh_pytorch_for_windows_10.png?1712315294&quot; alt=&quot;PyTorch install using PIP&quot;&gt;
&lt;h2&gt;Install PyTorch&lt;/h2&gt;
&lt;p&gt;Execute the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;pip install torch torchvision&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to install a specific version of PyTorch, you can specify it during the installation:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;pip install torch==1.9.0 torchvision==0.10.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the installation is complete, let’s check that PyTorch is working properly. Execute the following command to open the Python interpreter:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Type these two lines, pressing the &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt; key after each:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
print(torch.__version__)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you get a result like this, it means that PyTorch was installed correctly:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;2.0.1+cu117&lt;/pre&gt;
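&lt;p&gt;On a GPU server, you may also want to confirm that PyTorch can access the GPU. Note that the default PyPI wheels for Windows may be CPU-only; CUDA-enabled builds are distributed through PyTorch’s own package index (see the selector on the official PyTorch website). A quick check from the command prompt:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;python -c &quot;import torch; print(torch.cuda.is_available())&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the command prints &lt;b translate=&quot;no&quot;&gt;True&lt;/b&gt;, PyTorch can use the GPU.&lt;/p&gt;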
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/595-pytorch-for-linux&quot;&gt;PyTorch for Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/581-privategpt-ai-for-documents&quot;&gt;PrivateGPT: AI for documents&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot;&gt;Open WebUI: All in one&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/114/original/il_pytorch_for_windows.png?1737541812"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 11:35:30 +0100</pubDate>
      <guid isPermaLink="false">596</guid>
      <dc:date>2025-01-22 11:35:30 +0100</dc:date>
    </item>
    <item>
      <title>PyTorch for Linux</title>
      <link>https://www.leadergpu.com/catalog/595-pytorch-for-linux</link>
      <description>&lt;p&gt;Modern Linux distributions are highly dependent on the installed version of Python. Therefore, before installing PyTorch, we recommend creating a virtual environment using our step-by-step guide &lt;a href=&quot;https://www.leadergpu.com/articles/510-linux-system-utilities&quot;&gt;Linux system utilities&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Activate the created venv and proceed with the pip3 upgrade:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip3 install --upgrade pip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the PyTorch installation:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip3 install torch torchvision&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to install a specific version of PyTorch, just type the required version number:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip3 install torch==1.9.0 torchvision==0.10.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the installation is finished, let’s check that PyTorch was installed correctly. Open the Python interpreter:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;python3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Type these two lines, pressing the &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt; key after each:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
print(torch.__version__)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you get a result like this, it means that PyTorch has been installed correctly:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;2.0.1+cu117&lt;/pre&gt;
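&lt;p&gt;As a final smoke test, you can run a small computation. The snippet below is a minimal sketch: it picks the GPU when one is available and falls back to the CPU otherwise:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch

# Use the GPU if available, otherwise fall back to the CPU
device = &quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;

# Run a small matrix multiplication on the selected device
x = torch.rand(3, 3, device=device)
print((x @ x).device)
&lt;/code&gt;&lt;/pre&gt;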
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/596-pytorch-for-windows&quot;&gt;PyTorch for Windows&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/601-low-code-ai-app-builder-langflow&quot;&gt;Low-code AI app builder Langflow&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/604-audiocraft-by-metaai-create-music-by-description&quot;&gt;AudioCraft by MetaAI: create music by description&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/113/original/il_pytorch_for_linux.png?1737536957"
        length="0"
        type="image/jpeg"/>
      <pubDate>Wed, 22 Jan 2025 10:14:16 +0100</pubDate>
      <guid isPermaLink="false">595</guid>
      <dc:date>2025-01-22 10:14:16 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion: Riffusion</title>
      <link>https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion</link>
      <description>&lt;p&gt;In our previous articles, we explored the fascinating capabilities of Stable Diffusion for generating captivating images. However, it’s important to note that this powerful generative neural network has even more to offer.&lt;/p&gt;
&lt;p&gt;Riffusion is a Stable Diffusion model for music creation and editing. With Riffusion, you can generate a spectrogram of a desired musical segment and effortlessly transform it into a musical excerpt. Let’s install Riffusion on a LeaderGPU server and try it in action.&lt;/p&gt;
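&lt;p&gt;To illustrate the core idea (a spectrogram image is treated as audio data and inverted back into a waveform), here is a minimal, self-contained sketch based on the Griffin-Lim algorithm from torchaudio. This is not Riffusion’s own code; the file names and STFT parameters are assumptions for illustration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import numpy as np
import torch
import torchaudio
from PIL import Image

# Load a generated spectrogram image as a magnitude matrix
# (assumed to be 256 pixels tall, matching n_fft // 2 + 1 below)
img = Image.open(&quot;spectrogram.png&quot;).convert(&quot;L&quot;)
mag = torch.from_numpy(np.array(img, dtype=np.float32) / 255.0)

# Reconstruct a waveform from the magnitudes with Griffin-Lim
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=510, hop_length=256, power=1.0)
waveform = griffin_lim(mag.unsqueeze(0))

torchaudio.save(&quot;excerpt.wav&quot;, waveform, sample_rate=44100)
&lt;/code&gt;&lt;/pre&gt;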
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Start by updating the package repository cache and the installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don’t forget to install NVIDIA® drivers using the &lt;b translate=&quot;no&quot;&gt;autoinstall&lt;/b&gt; command or manually, using our &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;step-by-step&lt;/a&gt; guide:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a virtual environment, the developers suggest using Anaconda. You can also use venv, which we covered in the Linux system utilities tutorial. Download the Anaconda installation script using curl:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl --output anaconda.sh https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make it executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;chmod +x anaconda.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And run:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./anaconda.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Answer YES to every question except the last one (installing Microsoft VSCode). Then re-login to the SSH console and create a new virtual environment with Python 3.9:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda create --name riffusion python=3.9&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Activate the new virtual environment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda activate riffusion&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to use audio formats other than WAV, you also need to install the FFmpeg library set:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda install -c conda-forge ffmpeg&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Install Riffusion&lt;/h2&gt;
&lt;p&gt;Clone the Riffusion repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/riffusion/riffusion.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd riffusion&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s make a few changes to the requirements file to prevent torch compatibility errors:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find these packages and pin their versions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;diffusers==0.9.0
torchaudio==2.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save changes and proceed with preparing a virtual environment. The following command installs all necessary packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python -m pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, you can open a “playground”. This is a simple web interface that helps you learn more about Riffusion’s features:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python -m riffusion.streamlit.playground&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open your favorite browser and enter the address &lt;b translate=&quot;no&quot;&gt;http://[SERVER_IP]:8501/&lt;/b&gt;&lt;/p&gt;
&lt;h2&gt;Test a playground&lt;/h2&gt;
&lt;p&gt;Now, you can create music using text prompts and by changing the other parameters:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/913/original/sh_stable_diffusion_riffusion_1.png?1713769543&quot; alt=&quot;Text to audio prompt line&quot;&gt;
&lt;p&gt;You can also do trickier things, like splitting audio into separate components. For example, you can extract the vocals from Bohemian Rhapsody by Queen:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/914/original/sh_stable_diffusion_riffusion_2.png?1713769583&quot; alt=&quot;Generated results&quot;&gt;
&lt;p&gt;Remember, this is just one example of how Riffusion can be used. By building your own application, you can achieve far more captivating results. Powerful LeaderGPU servers will take care of the computation.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/597-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/604-audiocraft-by-metaai-create-music-by-description&quot;&gt;AudioCraft by MetaAI: create music by description&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/912/original/il_stable_diffusion_riffusion.png?1713769486"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 14:12:29 +0100</pubDate>
      <guid isPermaLink="false">594</guid>
      <dc:date>2025-01-21 14:12:29 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion: Generate repeatable faces</title>
      <link>https://www.leadergpu.com/catalog/593-stable-diffusion-generate-repeatable-faces</link>
<description>&lt;p&gt;Repeatability is the most important aspect of creating graphical content with generative neural networks. This holds true regardless of the type of content you create, be it a film or game character, a landscape, or a scene environment. The main problem can be formulated as: “How can I repeat my result?”. Every time you start generating images with the same positive and negative prompts, you get different results. Sometimes the differences are minor and acceptable, but in most cases they pose a problem.&lt;/p&gt;
&lt;p&gt;Stable Diffusion was trained on a large dataset captured from the real world, which explains why repeatability isn’t this neural network’s strong point. However, this rule doesn’t apply to celebrity photos. They appear far more frequently in the real world and, therefore, in the dataset on which Stable Diffusion was trained. You can use such photos as a “constant” or a “starting point” in the generation process.&lt;/p&gt;
&lt;h2&gt;Method 1. “Shaken, not stirred”&lt;/h2&gt;
&lt;p&gt;Of course, you don’t need to create only celebrity images, but you can use multiple relevant prompts to get more or less consistent results. For example, we can take two famous Greek singers: Elena Paparizou and Marina Satti, and get repeatable results:&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Model&lt;/b&gt;: &lt;a href=&quot;https://civitai.com/models/4201/realistic-vision-v60-b1&quot;&gt;Realistic Vision v6.0 beta 1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Positive prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;Elena Paparizou, Marina Satti, fashion portrait, alone, solo, greek woman in beautiful clothes, natural skin, 8k uhd, high quality, film grain, Canon EOS&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Negative prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, huge eyes, 2girl, amputation, disconnected limbs, cartoon, cg, 3d, unreal, animate, nsfw, nude, censored&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/934/original/sh_stable_diffusion_generate_repeatable_faces_1.png?1713873195&quot; alt=&quot;Greek singer generated&quot;&gt;
&lt;p&gt;This works with any celebrities, as Stable Diffusion tries to reproduce their most prominent facial features. Here, we use the same model and “shake” two Hollywood stars (Dwayne Johnson and Danny Trejo) into one new synthetic character.&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Positive prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;Dwayne Johnson, Danny Trejo, fashion portrait, alone, solo, 8k uhd, high quality, film grain, Canon EOS&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Negative prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, huge eyes, amputation, disconnected limbs, cartoon, cg, 3d, unreal, animate, nsfw, nude, censored&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/935/original/sh_stable_diffusion_generate_repeatable_faces_2.png?1713873232&quot; alt=&quot;Hollywood stars generated&quot;&gt;
&lt;p&gt;Every time you mix the same celebrities, you get similar results. Let’s look at another method to generate repeatable characters.&lt;/p&gt;
&lt;h2&gt;Method 2. Name anchor&lt;/h2&gt;
&lt;p&gt;Celebrities are a good start, but let’s consider other methods for achieving repeatable results. The answer is quite simple: we can use a combination of human names. Every nation has unique names rooted in its language. For example, the Greek name Kostas can translate to “labor” or “effort”, while Nikos means “victory of the people”. Together, two such names create a unique image of a generated person, helping the neural network model understand our creative objectives.&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Positive prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;Portrait of [Kostas | Nikos] on a white background, greek man, short haircut, beard&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;Negative prompts:&lt;/b&gt;&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;woman, bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, huge eyes, 2girl, amputation, disconnected limbs, cartoon, cg, 3d, unreal, animate, nsfw, nude, censored&lt;/code&gt;&lt;/pre&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/936/original/sh_stable_diffusion_generate_repeatable_faces_3.png?1713873262&quot; alt=&quot;Greek person generated&quot;&gt;
&lt;p&gt;Let’s generate a large number of images (80-100) to build a dataset from later. The main prompt was chosen to produce convenient images whose backgrounds can be easily removed. The negative prompts keep random distorted images, as well as images of women, out of the dataset.&lt;/p&gt;
&lt;p&gt;&lt;i&gt;Tip: if the generated images differ too much from one another, try raising the CFG Scale parameter from 7.5 to 15. This forces the neural network to follow the prompts more strictly.&lt;/i&gt;&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/937/original/sh_stable_diffusion_generate_repeatable_faces_4.png?1713873299&quot; alt=&quot;Greek person dataset&quot;&gt;
&lt;p&gt;You can select your own unique names with a simple name generator, like &lt;a href=&quot;https://www.behindthename.com/names/list&quot;&gt;Behind the Name&lt;/a&gt;. Also, you can use the ControlNet feature to gain more control.&lt;/p&gt;
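&lt;p&gt;If you script generation with the diffusers library instead of a web UI, fixing the random seed is another way to make results repeatable. A minimal sketch, assuming a hypothetical base model ID and shortened prompts:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    &quot;runwayml/stable-diffusion-v1-5&quot;, torch_dtype=torch.float16
).to(&quot;cuda&quot;)

# The same seed with the same prompts and settings yields the same image
generator = torch.Generator(&quot;cuda&quot;).manual_seed(42)
image = pipe(
    &quot;Portrait of a greek man on a white background, short haircut, beard&quot;,
    negative_prompt=&quot;woman, bad anatomy, bad hands&quot;,
    guidance_scale=15.0,  # higher CFG Scale forces stricter prompt adherence
    generator=generator,
).images[0]
image.save(&quot;portrait.png&quot;)
&lt;/code&gt;&lt;/pre&gt;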
&lt;h2&gt;Method 3. Teach appearance&lt;/h2&gt;
&lt;p&gt;We can’t directly influence the final result, but we observe that some tokens (such as celebrity image tokens) carry more weight than others. This means we can create our conditional “celebrity” token by creating an appropriate prompt for it and further training the model on it. This is how LoRA (Low-Rank Adaptation of Large Language Models) operates. You can use &lt;a href=&quot;https://www.leadergpu.com/articles/546-stable-diffusion-lora-selfie&quot;&gt;our step-by-step guide&lt;/a&gt; to train your own LoRA model based on a self-made dataset.&lt;/p&gt;
&lt;p&gt;After removing the background, we obtain clear portraits and use them to create a specific LoRA model. This model helps to replicate a face with a few minor changes:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/938/original/sh_stable_diffusion_generate_repeatable_faces_5.png?1713873334&quot; alt=&quot;Dataset without background&quot;&gt;
&lt;p&gt;Now, we can generate this character in different locations, create stories, and place him in various roles: from gardener to businessman. His face will be consistent, recognizable, and repeatable:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/939/original/sh_stable_diffusion_generate_repeatable_faces_6.png?1713873384&quot; alt=&quot;Greek person with various backgrounds&quot;&gt;
&lt;p&gt;This method isn’t ideal, but it works well in a variety of situations. You don’t need to build a dataset from a real person; it can be generated entirely remotely:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/940/original/sh_stable_diffusion_generate_repeatable_faces_7.jpg?1713873419&quot; alt=&quot;Greek person generated result&quot;&gt;
&lt;p&gt;You can attempt to create such a virtual character yourself, without the assistance of a professional designer or 3D-modeling specialist. All you need are fast GPUs, which you can find in dedicated servers by &lt;a href=&quot;https://www.leadergpu.com/&quot;&gt;LeaderGPU&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion&quot;&gt;Stable Diffusion: Riffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/597-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/933/original/il_stable_diffusion_generate_repeatable_faces.jpg?1713873147"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 13:51:05 +0100</pubDate>
      <guid isPermaLink="false">593</guid>
      <dc:date>2025-01-21 13:51:05 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion: LoRA selfie</title>
      <link>https://www.leadergpu.com/catalog/592-stable-diffusion-lora-selfie</link>
<description>&lt;p&gt;You can create your first dataset using a simple camera and a fairly uniform background, such as a white wall or a monotone blackout curtain. For a sample dataset, I used an Olympus OM-D E-M5 Mark II mirrorless camera with a 14-42mm kit lens. This camera supports remote control from any smartphone and a very fast continuous shooting mode.&lt;/p&gt;
&lt;p&gt;I mounted the camera on a tripod and set focus priority to faces. After that, I selected a mode in which the camera captures 10 consecutive frames every 3 seconds and started shooting. While shooting, I slowly turned my head in one direction and changed direction after every 10 frames:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/916/original/sh_stable_diffusion_lora_selfie_1.jpg?1713785705&quot; alt=&quot;Face directions&quot;&gt;
&lt;p&gt;The result was around 100 frames with a monotone background:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/917/original/sh_stable_diffusion_lora_selfie_2.png?1713785735&quot; alt=&quot;Photos with background&quot;&gt;
&lt;p&gt;The next step is to remove the background and leave the portrait on a white background.&lt;/p&gt;
&lt;h2&gt;Delete background&lt;/h2&gt;
&lt;p&gt;You can use Adobe Photoshop’s standard &lt;b translate=&quot;no&quot;&gt;Remove background&lt;/b&gt; function together with batch processing. Let’s record the actions we want to apply to every picture in the dataset. Open any image, click the triangle icon, then click the &lt;b translate=&quot;no&quot;&gt;+&lt;/b&gt; symbol:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/918/original/sh_stable_diffusion_lora_selfie_3.png?1713785770&quot; alt=&quot;Create new PS action&quot;&gt;
&lt;p&gt;Type the name of the new action, for example, &lt;b translate=&quot;no&quot;&gt;Remove Background&lt;/b&gt; and click &lt;b translate=&quot;no&quot;&gt;Record&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/919/original/sh_stable_diffusion_lora_selfie_4.png?1713785802&quot; alt=&quot;Type the name of action&quot;&gt;
&lt;p&gt;On the &lt;b translate=&quot;no&quot;&gt;Layers&lt;/b&gt; tab, find the lock symbol and click on it:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/920/original/sh_stable_diffusion_lora_selfie_5.png?1713785831&quot; alt=&quot;Lock the layer&quot;&gt;
&lt;p&gt;Next click on the &lt;b translate=&quot;no&quot;&gt;Remove background&lt;/b&gt; button on the floating panel:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/921/original/sh_stable_diffusion_lora_selfie_6.png?1713785977&quot; alt=&quot;Click remove background&quot;&gt;
&lt;p&gt;Right-click on &lt;b translate=&quot;no&quot;&gt;Layer 0&lt;/b&gt; and select &lt;b translate=&quot;no&quot;&gt;Flatten Image&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/922/original/sh_stable_diffusion_lora_selfie_7.png?1713786013&quot; alt=&quot;Select Flatten Image&quot;&gt;
&lt;p&gt;All our actions have been recorded. Let’s stop this process:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/923/original/sh_stable_diffusion_lora_selfie_8.png?1713786077&quot; alt=&quot;Stop action recording&quot;&gt;
&lt;p&gt;Now, you can close the open file without saving changes and select &lt;b translate=&quot;no&quot;&gt;File &gt;&gt; Scripts &gt;&gt; Image Processor…&lt;/b&gt;&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/924/original/sh_stable_diffusion_lora_selfie_9.png?1713786112&quot; alt=&quot;Multiple image processor&quot;&gt;
&lt;p&gt;Select the input and output directories, choose the previously recorded &lt;b translate=&quot;no&quot;&gt;Remove Background&lt;/b&gt; action in step 4, and click the &lt;b translate=&quot;no&quot;&gt;Run&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/925/original/sh_stable_diffusion_lora_selfie_10.png?1713786144&quot; alt=&quot;Image processor options&quot;&gt;
&lt;p&gt;Please be patient. Adobe Photoshop will open every picture in the selected directory, repeat the recorded actions (turn off layer lock, delete background, flatten image) and save it in another selected directory. This process can take a couple of minutes, depending on the number of images.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/926/original/sh_stable_diffusion_lora_selfie_11.png?1713786182&quot; alt=&quot;Photos without background&quot;&gt;
&lt;p&gt;When the process is finished, you can go to the next step.&lt;/p&gt;
&lt;h2&gt;Upload to server&lt;/h2&gt;
&lt;p&gt;Use one of the following guides (tailored to your PC operating system) to upload the &lt;b translate=&quot;no&quot;&gt;dataset&lt;/b&gt; directory to the remote server. For example, place it in the default user’s home directory, &lt;b translate=&quot;no&quot;&gt;/home/usergpu&lt;/b&gt;:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/494-file-exchange-from-linux&quot;&gt;File exchange from Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/495-file-exchange-from-windows&quot;&gt;File exchange from Windows&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/496-file-exchange-from-macos&quot;&gt;File exchange from macOS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Pre-installation&lt;/h2&gt;
&lt;p&gt;Update existing system packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install two additional packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install -y python3-tk python3.10-venv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s install CUDA® Toolkit 11.8. First, download the repository pin file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command places the downloaded file into the system directory, which is controlled by the &lt;b translate=&quot;no&quot;&gt;apt&lt;/b&gt; package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to download the main CUDA® repository package:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, proceed with the package installation using the standard &lt;b translate=&quot;no&quot;&gt;dpkg&lt;/b&gt; utility:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo dpkg -i cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the GPG keyring to the system directory. This will make it available for use by operating system utilities, including the apt package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo cp /var/cuda-repo-ubuntu2204-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package repository cache:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get update&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install the CUDA® toolkit using apt:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get -y install cuda&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add CUDA® to PATH. Open the bash shell config:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano ~/.bashrc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the following lines at the end of the file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Install trainer&lt;/h2&gt;
&lt;p&gt;Clone the Kohya project’s repository to the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/bmaltais/kohya_ss.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd kohya_ss&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make the setup script executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;chmod +x ./setup.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./setup.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You’ll receive a warning message from the accelerate utility. Let’s resolve the issue. Activate the project’s virtual environment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;source venv/bin/activate&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install the missing package:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install scipy&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And manually configure the accelerate utility:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;accelerate config&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Be careful: selecting an odd number of GPUs here will cause an error. For example, if you have 5 GPUs, only 4 can be used with this software; otherwise, an error occurs when the process starts. You can immediately check the new configuration by running the built-in test:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;accelerate test&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything is okay, you’ll receive a message like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Test is a success! You are ready for your distributed training!&lt;/pre&gt;
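&lt;p&gt;A quick way to check how many GPUs PyTorch actually sees on the server (useful when choosing the number of processes during &lt;b translate=&quot;no&quot;&gt;accelerate config&lt;/b&gt;) is a minimal sketch like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch

# Number of CUDA GPUs visible to PyTorch on this server
print(torch.cuda.device_count())
&lt;/code&gt;&lt;/pre&gt;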
&lt;p&gt;Deactivate the virtual environment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;deactivate&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can start the trainer’s public server with a &lt;a href=&quot;https://www.gradio.app/&quot;&gt;Gradio GUI&lt;/a&gt; and simple login/password authentication (change the username/password to your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./gui.sh --share --username user --password password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You’ll see two URLs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Running on local URL: http://127.0.0.1:7860
Running on public URL: https://&lt;random_numbers_and_letters&gt;.gradio.live&lt;/pre&gt;
&lt;p&gt;Open your web browser and enter the public URL in the address bar. Type your username and password in the appropriate fields, then click Login:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/927/original/sh_stable_diffusion_lora_selfie_12.png?1713786225&quot; alt=&quot;Login screen&quot;&gt;
&lt;h2&gt;Prepare the dataset&lt;/h2&gt;
&lt;p&gt;Start by creating a new folder where you will store the trained LoRA model:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;mkdir /home/usergpu/myloramodel&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the following tabs: &lt;b translate=&quot;no&quot;&gt;Utilities &gt;&gt; Captioning &gt;&gt; BLIP captioning&lt;/b&gt;. Fill in the fields as shown in the picture and click &lt;b translate=&quot;no&quot;&gt;Caption images&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/928/original/sh_stable_diffusion_lora_selfie_13.png?1713786286&quot; alt=&quot;Set folders&quot;&gt;
&lt;p&gt;The trainer will download and run a dedicated neural network model (1.6 GB) that creates a text caption for each image file in the selected directory. It runs on a single GPU and takes around a minute.&lt;/p&gt;
&lt;p&gt;Switch to the &lt;b translate=&quot;no&quot;&gt;LoRA &gt;&gt; Tools &gt;&gt; Dataset preparation &gt;&gt; Dreambooth/LoRA folder preparation&lt;/b&gt; tab, fill in the fields, then press &lt;b translate=&quot;no&quot;&gt;Prepare training data&lt;/b&gt; followed by &lt;b translate=&quot;no&quot;&gt;Copy info to Folders Tab&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/929/original/sh_stable_diffusion_lora_selfie_14.png?1713786320&quot; alt=&quot;Set options&quot;&gt;
&lt;p&gt;In this example, we use the name &lt;b translate=&quot;no&quot;&gt;nikolai&lt;/b&gt; as the &lt;b translate=&quot;no&quot;&gt;Instance prompt&lt;/b&gt; and “person” as the &lt;b translate=&quot;no&quot;&gt;Class prompt&lt;/b&gt;. We also set &lt;b translate=&quot;no&quot;&gt;/home/usergpu/dataset&lt;/b&gt; as the &lt;b translate=&quot;no&quot;&gt;Training Images&lt;/b&gt; directory and &lt;b translate=&quot;no&quot;&gt;/home/usergpu/myloramodel&lt;/b&gt; as the &lt;b translate=&quot;no&quot;&gt;Destination training directory&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Switch to the &lt;b translate=&quot;no&quot;&gt;LoRA &gt;&gt; Training &gt;&gt; Folders&lt;/b&gt; tab again. Ensure that the &lt;b translate=&quot;no&quot;&gt;Image folder&lt;/b&gt;, &lt;b translate=&quot;no&quot;&gt;Output folder&lt;/b&gt;, and &lt;b translate=&quot;no&quot;&gt;Logging folder&lt;/b&gt; are correctly filled. If desired, you can change the &lt;b translate=&quot;no&quot;&gt;Model output name&lt;/b&gt; to your own. Finally, click the &lt;b translate=&quot;no&quot;&gt;Start training&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/930/original/sh_stable_diffusion_lora_selfie_15.png?1713786423&quot; alt=&quot;Start training&quot;&gt;
&lt;p&gt;The system will start downloading additional files and models (~10 GB). After that, the training process will begin. Depending on the quantity of images and the settings applied, this can take several hours. Once the training is completed, you can download the &lt;b translate=&quot;no&quot;&gt;/home/usergpu/myloramodel&lt;/b&gt; directory to your computer for future use.&lt;/p&gt;
&lt;h2&gt;Test your LoRA&lt;/h2&gt;
&lt;p&gt;We’ve prepared several articles about Stable Diffusion and its forks. You can install Easy Diffusion with our guide &lt;a href=&quot;https://www.leadergpu.com/articles/508-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;. Once the system is installed and running, you can upload your LoRA model in SafeTensors format directly to &lt;b translate=&quot;no&quot;&gt;/home/usergpu/easy-diffusion/models/lora&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Refresh the Easy Diffusion web page and select your model from the drop-down list:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/931/original/sh_stable_diffusion_lora_selfie_16.png?1713786465&quot; alt=&quot;Select LoRA model&quot;&gt;
&lt;p&gt;Let’s write a simple prompt, &lt;b translate=&quot;no&quot;&gt;portrait of &amp;lt;nikolai&amp;gt; wearing a cowboy hat&lt;/b&gt;, and generate our first images. Here, we used a &lt;a href=&quot;https://www.leadergpu.com/articles/507-stable-diffusion-models-customization-and-options&quot;&gt;custom Stable Diffusion model&lt;/a&gt; downloaded from &lt;a href=&quot;https://civitai.com/&quot;&gt;civitai.com&lt;/a&gt;: &lt;a href=&quot;https://civitai.com/models/4201/realistic-vision-v51&quot;&gt;Realistic Vision v6.0 B1&lt;/a&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/932/original/sh_stable_diffusion_lora_selfie_17.png?1713786492&quot; alt=&quot;Generate the image&quot;&gt;
&lt;p&gt;You can experiment with prompts and Stable Diffusion-based models to achieve better results. Enjoy!&lt;/p&gt;
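&lt;p&gt;If you prefer scripting to a web interface, the diffusers library can load the same SafeTensors LoRA. Below is a minimal sketch; the base model ID and the LoRA file name are assumptions for illustration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    &quot;runwayml/stable-diffusion-v1-5&quot;, torch_dtype=torch.float16
).to(&quot;cuda&quot;)

# Load the trained LoRA weights (hypothetical directory and file name)
pipe.load_lora_weights(&quot;/home/usergpu/myloramodel&quot;, weight_name=&quot;last.safetensors&quot;)

# The instance token &quot;nikolai&quot; triggers the trained appearance
image = pipe(&quot;portrait of nikolai wearing a cowboy hat&quot;).images[0]
image.save(&quot;nikolai.png&quot;)
&lt;/code&gt;&lt;/pre&gt;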
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/593-stable-diffusion-generate-repeatable-faces&quot;&gt;Stable Diffusion: Generate repeatable faces&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion&quot;&gt;Stable Diffusion: Riffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/597-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/915/original/il_stable_diffusion_lora_selfie.jpg?1713785674"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 13:44:25 +0100</pubDate>
      <guid isPermaLink="false">592</guid>
      <dc:date>2025-01-21 13:44:25 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion: What is ControlNet</title>
      <link>https://www.leadergpu.com/catalog/591-stable-diffusion-what-is-controlnet</link>
      <description>&lt;p&gt;A common misconception among those first encountering generative neural networks is that controlling the final output is tremendously challenging, especially when attempting to alter the output through different prompt phrasing. Currently, a suite of tools known as ControlNet exists to facilitate relatively straightforward and effective control over the generation results.&lt;/p&gt;
&lt;p&gt;In this article, we’ll demonstrate how to easily manipulate the pose of generated characters using pre-existing images and custom “skeletons”, with the help of one such tool, &lt;a href=&quot;https://huggingface.co/lllyasviel/ControlNet-v1-1/tree/main&quot;&gt;OpenPose&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Step 1. Install Stable Diffusion&lt;/h2&gt;
&lt;p&gt;Please use &lt;a href=&quot;https://www.leadergpu.com/articles/506-stable-diffusion-webui&quot;&gt;our step-by-step guide&lt;/a&gt; to install Stable Diffusion with the basic model and WebUI. This guide is based on the AUTOMATIC1111 script.&lt;/p&gt;
&lt;h2&gt;Step 2. Install ControlNet extension&lt;/h2&gt;
&lt;p&gt;We strongly advise against installing the ControlNet extension (sd-webui-controlnet) from the &lt;a href=&quot;https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui-extensions/master/index.json&quot;&gt;standard repository&lt;/a&gt; due to potential functionality issues. One significant issue we encountered during the preparation of this guide was the web interface freezing. Although the image is initially generated successfully, the WebUI becomes unresponsive when generating the image a second time. An alternative solution would be to install the same extension from an external source.&lt;/p&gt;
&lt;p&gt;Open WebUI and follow the tabs: &lt;b translate=&quot;no&quot;&gt;Extensions &gt; Install from URL&lt;/b&gt;. Paste this URL in the appropriate field:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;https://github.com/Mikubill/sd-webui-controlnet&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then click &lt;b translate=&quot;no&quot;&gt;Install&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/942/original/sh_stable_diffusion_what_is_controlnet_1.png?1713962546&quot; alt=&quot;Install sd-webui-controlnet&quot;&gt;
&lt;p&gt;When the process is completed successfully, the following message should appear:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Installed into /home/usergpu/stable-diffusion-webui/extensions/sd-webui-controlnet. Use Installed tab to restart.&lt;/pre&gt;
&lt;p&gt;Let’s restart the UI by pressing the &lt;b translate=&quot;no&quot;&gt;Apply and restart UI&lt;/b&gt; button on the &lt;b translate=&quot;no&quot;&gt;Installed&lt;/b&gt; tab:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/943/original/sh_stable_diffusion_what_is_controlnet_2.png?1713962703&quot; alt=&quot;ControlNet Restart UI&quot;&gt;
&lt;p&gt;After rebooting the interface, the new ControlNet element with many additional options will appear:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/944/original/sh_stable_diffusion_what_is_controlnet_3.png?1713962785&quot; alt=&quot;ControlNet enabled&quot;&gt;
&lt;h2&gt;Step 3. Download OpenPose&lt;/h2&gt;
&lt;h3&gt;Add HF key&lt;/h3&gt;
&lt;p&gt;Let’s generate an SSH key pair and add the public key to your Hugging Face account:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/.ssh &amp;&amp; ssh-keygen&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the keypair is generated, you can display the public key in the terminal emulator:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cat id_rsa.pub&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy all information starting from ssh-rsa and ending with usergpu@gpuserver, as shown in the following screenshot:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/907/original/sh_llama3_quick_start_3.png?1713533169&quot; alt=&quot;Copy RSA key&quot;&gt;
&lt;p&gt;Open a web browser, type &lt;a href=&quot;https://huggingface.co/&quot;&gt;https://huggingface.co/&lt;/a&gt; into the address bar, and press &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt;. Login into your HF-account and open &lt;a href=&quot;https://huggingface.co/settings/profile&quot;&gt;Profile settings&lt;/a&gt;. Then choose &lt;b translate=&quot;no&quot;&gt;SSH and GPG Keys&lt;/b&gt; and click on the &lt;b translate=&quot;no&quot;&gt;Add SSH Key&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/908/original/sh_llama3_quick_start_4.png?1713533229&quot; alt=&quot;Add SSH key&quot;&gt;
&lt;p&gt;Fill in the &lt;b translate=&quot;no&quot;&gt;Key name&lt;/b&gt; and paste the copied &lt;b translate=&quot;no&quot;&gt;SSH Public key&lt;/b&gt; from the terminal. Save the key by pressing &lt;b translate=&quot;no&quot;&gt;Add key&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/909/original/sh_llama3_quick_start_5.png?1713533267&quot; alt=&quot;Paste the key&quot;&gt;
&lt;p&gt;Now, your HF-account is linked with the public SSH-key. The second part (the private key) is stored on the server.&lt;/p&gt;
&lt;h3&gt;Install Git LFS&lt;/h3&gt;
&lt;p&gt;The next step is to install a specific Git LFS (Large File Storage) extension, which is used for downloading large files such as neural network models. Open your home directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download and run the shell script. This script installs a new third-party repository with git-lfs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can install it using the standard package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get install git-lfs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s configure git to use our HF nickname:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.name &quot;John&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the email linked to your HF account:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.email &quot;john.doe@example.com&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Download the repository&lt;/h3&gt;
&lt;p&gt;We recommend, if possible, using a local hard drive to download and store models. You can learn more about this from our guide, &lt;a href=&quot;https://www.leadergpu.com/articles/492-disk-partitioning-in-linux&quot;&gt;Disk partitioning in Linux&lt;/a&gt;. For this example, we have mounted an SSD-drive to the /mnt/fastdisk mountpoint. Let’s make it owned by the default user:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chown usergpu:usergpu /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clone the ControlNet repository from HuggingFace. Previously installed Git-LFS will automatically replace pointers with real files:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:lllyasviel/ControlNet-v1-1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we add only one model to Stable Diffusion WebUI. However, you can copy all available models from the repository (~18GB):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cp /mnt/fastdisk/ControlNet-v1-1/control_v11p_sd15_openpose.pth /home/usergpu/stable-diffusion-webui/models/ControlNet/&lt;/code&gt;&lt;/pre&gt;
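&lt;p&gt;As an alternative to the WebUI extension, the same OpenPose ControlNet model can also be driven programmatically with the diffusers library. A minimal sketch, where the base model ID and the input file name are assumptions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    &quot;lllyasviel/control_v11p_sd15_openpose&quot;, torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    &quot;runwayml/stable-diffusion-v1-5&quot;,
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to(&quot;cuda&quot;)

# A pre-rendered OpenPose &quot;skeleton&quot; image conditions the generation
pose = load_image(&quot;openpose_skeleton.png&quot;)
image = pipe(&quot;dancing bear, by Pixar&quot;, image=pose).images[0]
image.save(&quot;dancing_bear.png&quot;)
&lt;/code&gt;&lt;/pre&gt;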
&lt;h2&gt;Step 4. Run the generation process&lt;/h2&gt;
&lt;p&gt;The current model provided is quite basic and might not yield satisfactory results. Therefore, we suggest replacing it with a custom model. Guidelines on how to do this can be found in this article: &lt;a href=&quot;https://www.leadergpu.com/articles/507-stable-diffusion-models-customization-and-options&quot;&gt;Stable Diffusion Models: customization &amp; options&lt;/a&gt;. For this example, we downloaded &lt;a href=&quot;https://civitai.com/api/download/models/130072&quot;&gt;RealisticVision v6.0 B1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To generate your first image using OpenPose, open the &lt;b translate=&quot;no&quot;&gt;ControlNet&lt;/b&gt; tab, choose &lt;b translate=&quot;no&quot;&gt;OpenPose&lt;/b&gt;, and tick &lt;b translate=&quot;no&quot;&gt;Enable&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;Allow Preview&lt;/b&gt;. Then click &lt;b translate=&quot;no&quot;&gt;Upload&lt;/b&gt; to add an image containing the desired pose:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/945/original/sh_stable_diffusion_what_is_controlnet_4.png?1713962881&quot; alt=&quot;Enable OpenPose and Preview&quot;&gt;
&lt;p&gt;You can request the system to generate a pose preview by clicking the button with the explosion icon:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/946/original/sh_stable_diffusion_what_is_controlnet_5.png?1713963007&quot; alt=&quot;Show preview&quot;&gt;
&lt;p&gt;On the left, the original image is displayed. On the right, you can see the “skeleton” representing the pose as recognized by the neural network model:&lt;/p&gt;
&lt;table border=&quot;0&quot;&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/947/original/sh_stable_diffusion_what_is_controlnet_6.png?1713963067&quot; alt=&quot;Dancing woman&quot;&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/948/original/sh_stable_diffusion_what_is_controlnet_7.png?1713963111&quot; alt=&quot;OpenPose skeleton&quot;&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Now you can type the main prompt, for example “&lt;b translate=&quot;no&quot;&gt;dancing bear, by Pixar&lt;/b&gt;” or “&lt;b translate=&quot;no&quot;&gt;dancing fox, by Pixar&lt;/b&gt;” and click the &lt;b translate=&quot;no&quot;&gt;Generate&lt;/b&gt; button. After a few seconds you’ll get results like this:&lt;/p&gt;
&lt;table border=&quot;0&quot;&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/949/original/sh_stable_diffusion_what_is_controlnet_8.png?1713963180&quot; alt=&quot;Dancing bear&quot;&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/950/original/sh_stable_diffusion_what_is_controlnet_9.png?1713963213&quot; alt=&quot;Dancing fox&quot;&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The system will attempt to generate a new picture, given the “skeleton” obtained from the original image. In some cases, the pose may not be accurate, but this can be easily corrected by manually editing the “skeleton”.&lt;/p&gt;
&lt;h2&gt;Step 5. Changing pose&lt;/h2&gt;
&lt;p&gt;While it may seem like magic, the model isn’t perfect, and occasional errors can impact the final image. To avoid issues during image generation, you have the option to manually adjust the “skeleton” by clicking on the &lt;b translate=&quot;no&quot;&gt;Edit&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/951/original/sh_stable_diffusion_what_is_controlnet_10.png?1713963246&quot; alt=&quot;Edit the skeleton&quot;&gt;
&lt;p&gt;In the provided editor, you can easily adjust the pose by dragging and dropping, or remove unwanted points with a right-click. After that, just click the &lt;b translate=&quot;no&quot;&gt;Send pose to ControlNet&lt;/b&gt; button and the new pose will be applied:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/952/original/sh_stable_diffusion_what_is_controlnet_11.png?1713963279&quot; alt=&quot;Send pose to ControlNet&quot;&gt;
&lt;p&gt;Beyond OpenPose, ControlNet offers a variety of tools to customize and perfect your results. Moreover, the dedicated servers provided by LeaderGPU ensure a quick and convenient process.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/592-stable-diffusion-lora-selfie&quot;&gt;Stable Diffusion: LoRA selfie&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/593-stable-diffusion-generate-repeatable-faces&quot;&gt;Stable Diffusion: Generate repeatable faces&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/594-stable-diffusion-riffusion&quot;&gt;Stable Diffusion: Riffusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/941/original/il_stable_diffusion_what_is_controlnet.png?1713962506"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 10:42:39 +0100</pubDate>
      <guid isPermaLink="false">591</guid>
      <dc:date>2025-01-21 10:42:39 +0100</dc:date>
    </item>
    <item>
      <title>Fooocus: Rethinking of SD and MJ</title>
      <link>https://www.leadergpu.com/catalog/590-fooocus-rethinking-of-sd-and-mj</link>
      <description>&lt;p&gt;The advent of Stable Diffusion and MidJourney has revolutionized our understanding of the potential of generative neural networks. These tools have unveiled a fresh perspective on the process of image creation and the extent to which we can manipulate it. The primary approach involves providing the system with prompts about the desired outcome. Essentially, we highlight three important aspects: object, style, and environment.&lt;/p&gt;
&lt;p&gt;Additional prompts that provide more specific instructions, such as the desired composition, type of camera/lens, and colorization, are also important, but not indispensable. The more comprehensive the instructions, the easier it is for the neural network to process. The role of a prompt engineer has even emerged in the professional space. However, this role can be easily replaced by the same generative neural networks. By combining image creation with text creation skills, we can generate extra prompts to achieve an optimal outcome.&lt;/p&gt;
&lt;p&gt;This is the fundamental concept of Fooocus. It integrates the XL Stable Diffusion model and a GPT2-based prompt generator, which enriches and details your simple prompt. Moreover, Fooocus is equipped with various enhancements and extensions. These features facilitate the generation of spectacular images through a straightforward interface, devoid of complex tools. Let’s delve into its functionality and install Fooocus on a LeaderGPU dedicated server.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Begin with the installation prerequisites and reboot afterward:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade &amp;&amp; sudo ubuntu-drivers autoinstall &amp;&amp; sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the shell script that installs Anaconda for managing virtual environments:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make the script executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;chmod a+x Anaconda3-2023.09-0-Linux-x86_64.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the installation script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./Anaconda3-2023.09-0-Linux-x86_64.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the process is finished, we recommend disconnecting the SSH session and preparing for port forwarding. You need to forward port 7865 from the remote server to the local loopback address, 127.0.0.1:7865. For more information, please refer to one of our previous guides: &lt;a href=&quot;https://www.leadergpu.com/articles/528-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;. Then, reconnect and proceed with cloning the project’s repository from GitHub.&lt;/p&gt;
&lt;h2&gt;Fooocus install&lt;/h2&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/lllyasviel/Fooocus.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Change directory to Fooocus:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd Fooocus&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a virtual environment using Anaconda and the YAML-config prepared by the project’s author:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda env create -f environment.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s switch from the base environment to the newly created one:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda activate fooocus&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to install the Python libraries:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install -r requirements_versions.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, everything is ready to start:&lt;/p&gt;
&lt;h2&gt;Fooocus start&lt;/h2&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;python entry_with_update.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The initial startup may take some time as the application verifies and downloads all the necessary files for operation. You might want to grab a cup of coffee in the meantime. Once the process is complete, open your browser and type the following URL into the address bar:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;http://127.0.0.1:7865&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enter your simple prompt and click the &lt;b translate=&quot;no&quot;&gt;Generate&lt;/b&gt; button. If you want more control, tick &lt;b translate=&quot;no&quot;&gt;Advanced&lt;/b&gt; and select the necessary options:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/976/original/sh_fooocus_rethinking_of_sd_and_mj_1.png?1714481840&quot; alt=&quot;Fooocus WebUI&quot;&gt;
&lt;p&gt;The real magic unfolds behind the scenes. The moment you hit the &lt;b translate=&quot;no&quot;&gt;Generate&lt;/b&gt; button, your input prompt is transferred to the GPT2-based language model. This model transforms your brief prompt into a mix of elaborative positive and negative prompts. This mix is subsequently input into the Stable Diffusion XL model, fine-tuned to emulate MidJourney style. As a result, even a brief prompt can generate impressive results.&lt;/p&gt;
&lt;p&gt;Of course, there’s no restriction on writing your own detailed prompts. However, after a few iterations it becomes evident that even without them, the generated content remains intriguing and diverse.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/597-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/975/original/il_fooocus_rethinking_of_sd_and_mj.png?1714481802"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 10:36:52 +0100</pubDate>
      <guid isPermaLink="false">590</guid>
      <dc:date>2025-01-21 10:36:52 +0100</dc:date>
    </item>
    <item>
      <title>Blender remote rendering with Flamenco</title>
      <link>https://www.leadergpu.com/catalog/588-blender-remote-rendering-with-flamenco</link>
      <description>&lt;p&gt;When rendering heavy scenes in &lt;a href=&quot;https://www.blender.org/&quot;&gt;Blender&lt;/a&gt; begins to consume too much of your team’s time, you have two options: either upgrade each team member’s computer or outsource rendering to a dedicated farm. Many companies offer ready-made rendering solutions, but if you require full control over the infrastructure, these solutions may not be the most reliable option.&lt;/p&gt;
&lt;p&gt;An alternative approach could involve creating a hybrid infrastructure. In this setup, you would keep your data storage and rendering farm management within your existing infrastructure. The only element that would be located outside would be the rented &lt;a href=&quot;https://www.leadergpu.com/&quot;&gt;GPU servers&lt;/a&gt; on which the rendering would be performed.&lt;/p&gt;
&lt;p&gt;In general, the rendering farm infrastructure for Blender looks like this:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/888/original/sh_blender_remote_rendering_with_flamenco_1.jpg?1713174084&quot; alt=&quot;Basic components scheme&quot;&gt;
&lt;p&gt;Here, we have a central &lt;b translate=&quot;no&quot;&gt;Manager&lt;/b&gt; node that organizes all processes. It receives rendering tasks from users via a specific &lt;b translate=&quot;no&quot;&gt;Blender Add-on&lt;/b&gt; and moves all necessary files to &lt;b translate=&quot;no&quot;&gt;Shared Storage&lt;/b&gt;. Then, the &lt;b translate=&quot;no&quot;&gt;Manager&lt;/b&gt; distributes the tasks to &lt;b translate=&quot;no&quot;&gt;Worker nodes&lt;/b&gt;. They receive a job containing all information about where the Worker can find files to render and what to do with the results obtained. To implement this scheme, you can use a completely free and open-source application called &lt;a href=&quot;https://flamenco.blender.org/&quot;&gt;Flamenco&lt;/a&gt;. In this guide, we show how to prepare all nodes, especially the &lt;b translate=&quot;no&quot;&gt;Manager&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;Worker&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;b translate=&quot;no&quot;&gt;Storage&lt;/b&gt; node doesn’t have any specific requirements. It can be used with any operating system that supports SMB/CIFS or NFS protocols. The only requirement is that the storage directory needs to be mounted and accessible by the operating system. In your infrastructure, this can be any shared folder accessible to all nodes.&lt;/p&gt;
&lt;p&gt;Each node has a different IP address, and the &lt;b translate=&quot;no&quot;&gt;Wireguard VPN&lt;/b&gt; server will be the central point that joins them into a single virtual network. This server, located on the external perimeter, allows you to work without making changes to the existing NAT policy.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/889/original/sh_blender_remote_rendering_with_flamenco_2.jpg?1713174131&quot; alt=&quot;Virtual components scheme&quot;&gt;
&lt;p&gt;For this example, we create the following mixed configuration:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;10.0.0.1 - Wireguard VPN server&lt;/b&gt; (virtual server by any infrastructure provider) with an external IP;&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;10.0.0.2 - Worker node&lt;/b&gt; (dedicated server by LeaderGPU) with an external IP;&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;10.0.0.3 - Manager node&lt;/b&gt; (virtual server in office network) located behind NAT;&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;10.0.0.4 - Storage node&lt;/b&gt; (virtual server in office network) located behind NAT;&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;10.0.0.5 - User node&lt;/b&gt; (consumer laptop in office network) located behind NAT.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step 1. Wireguard&lt;/h2&gt;
&lt;h3&gt;VPN Server&lt;/h3&gt;
&lt;p&gt;You can install and configure Wireguard manually, using the official guide and examples. However, there is an easier alternative: an unofficial script by a software engineer from Paris (Stanislas, aka &lt;a href=&quot;https://github.com/angristan&quot;&gt;angristan&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Download the script from GitHub:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;wget https://raw.githubusercontent.com/angristan/wireguard-install/master/wireguard-install.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make it executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;sudo chmod +x wireguard-install.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;sudo ./wireguard-install.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Follow the instructions and set the IP address range &lt;b translate=&quot;no&quot;&gt;10.0.0.1/24&lt;/b&gt;. The script will ask you to immediately create a configuration file for the first client. According to the plan, this client will be the worker node, named &lt;b translate=&quot;no&quot;&gt;Worker&lt;/b&gt;, with the address &lt;b translate=&quot;no&quot;&gt;10.0.0.2&lt;/b&gt;. When the script completes, a configuration file will appear in the root directory: &lt;b translate=&quot;no&quot;&gt;/root/wg0-client-Worker.conf&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Execute the following command to view this configuration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;cat /root/wg0-client-Worker.conf&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Interface]
PrivateKey = [CLIENT_PRIVATE_KEY]
Address = 10.0.0.2/32,fd42:42:42::2/128
DNS = 1.1.1.1,1.0.0.1
[Peer]
PublicKey = [SERVER_PUBLIC_KEY]
PresharedKey = [SERVER_PRESHARED_KEY]
Endpoint = [IP_ADDRESS:PORT]
AllowedIPs = 10.0.0.0/24,::/0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the installation script again to create another client. Add all future clients this way; finally, you can check that all the configuration files were created:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;ls -l | grep wg0&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;-rw-r--r-- 1 root    root      529 Jul 14 12:59 wg0-client-Manager.conf
-rw-r--r-- 1 root    root      529 Jul 14 12:59 wg0-client-Storage.conf
-rw-r--r-- 1 root    root      529 Jul 14 12:59 wg0-client-User.conf
-rw-r--r-- 1 root    root      529 Jul 14 12:58 wg0-client-Worker.conf&lt;/pre&gt;
&lt;h3&gt;VPN Clients&lt;/h3&gt;
&lt;p&gt;VPN clients include all nodes that need to be connected to a single network. In our guide, this refers to the manager node, storage node, client node (if using Linux), and worker nodes. If the VPN server is running on a worker node, it does not need to be configured as a client (this step can be skipped).&lt;/p&gt;
&lt;p&gt;Update the package cache, then install the Wireguard and CIFS support packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install wireguard cifs-utils&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Elevate privileges to superuser:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo -i&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the Wireguard configuration directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;cd /etc/wireguard&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the &lt;b translate=&quot;no&quot;&gt;umask&lt;/b&gt; command so that only the superuser has access to files in this directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;umask 077&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate a private key and save it into a file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;wg genkey &gt; private-key&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate a public key using the private key:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;wg pubkey &gt; public-key &lt; private-key&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a configuration file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;nano /etc/wireguard/wg0.conf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste your own configuration, created for this client:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Interface]
PrivateKey = [CLIENT_PRIVATE_KEY]
Address = 10.0.0.2/32,fd42:42:42::2/128
DNS = 1.1.1.1,1.0.0.1
[Peer]
PublicKey = [SERVER_PUBLIC_KEY]
PresharedKey = [SERVER_PRESHARED_KEY]
Endpoint = [SERVER_IP_ADDRESS:PORT]
AllowedIPs = 10.0.0.0/24,::/0
PersistentKeepalive = 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don’t forget to add the &lt;b translate=&quot;no&quot;&gt;PersistentKeepalive = 1&lt;/b&gt; option (where 1 means 1 second) on every node located behind NAT. You can tune this period experimentally; the value recommended by Wireguard’s authors is 25. Save the file and exit using the &lt;b translate=&quot;no&quot;&gt;CTRL + X&lt;/b&gt; shortcut and the &lt;b translate=&quot;no&quot;&gt;Y&lt;/b&gt; key to confirm.&lt;/p&gt;
&lt;p&gt;If you want to pass all internet traffic through the tunnel, set &lt;b translate=&quot;no&quot;&gt;AllowedIPs&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;0.0.0.0/0,::/0&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Then, log out from the root account:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;exit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the connection using systemctl:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl start wg-quick@wg0.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check that everything is OK and the service has started successfully:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status wg-quick@wg0.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
Loaded: loaded (/lib/systemd/system/wg-quick@.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2023-10-23 09:47:53 UTC; 1h 45min ago
  Docs: man:wg-quick(8)
        man:wg(8)
        https://www.wireguard.com/
        https://www.wireguard.com/quickstart/
        https://git.zx2c4.com/wireguard-tools/about/src/man/wg-quick.8
        https://git.zx2c4.com/wireguard-tools/about/src/man/wg.8
Process: 4128 ExecStart=/usr/bin/wg-quick up wg0 (code=exited, status=0/SUCCESS)
Main PID: 4128 (code=exited, status=0/SUCCESS)
  CPU: 76ms&lt;/pre&gt;
&lt;p&gt;If you encounter an error such as «resolvconf: command not found» in Ubuntu 22.04, simply create a symbolic link:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ln -s /usr/bin/resolvectl /usr/local/bin/resolvconf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enable the new service to connect automatically while the operating system is booting:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl enable wg-quick@wg0.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can check connectivity by sending echo packets:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ping 10.0.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=145 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=72.0 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=72.0 ms
64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=72.2 ms
--- 10.0.0.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 71.981/90.230/144.750/31.476 ms&lt;/pre&gt;
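&lt;p&gt;You can also inspect the tunnel state directly. A recent handshake and non-zero transfer counters confirm the link is working (an optional check):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo wg show&lt;/code&gt;&lt;/pre&gt;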
&lt;h2&gt;Step 2. NAS node&lt;/h2&gt;
&lt;p&gt;Connect to the VPN server using the guide from Step 1. Then, install the server and client Samba packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install samba smbclient&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Backup your default configuration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo cp /etc/samba/smb.conf /etc/samba/smb.conf.bak&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a directory that will be used as a share:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mkdir /mnt/share&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a new user group that will get access to the new share:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo groupadd smbusers&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add an existing user to the created group:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo usermod -aG smbusers user&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set a password for this user. This is a necessary step because the system password and the Samba password are different entities:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo smbpasswd -a $USER&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remove the default configuration:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo rm /etc/samba/smb.conf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And create a new one:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/samba/smb.conf&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[global]
workgroup = WORKGROUP
security = user
map to guest = bad user
wins support = no
dns proxy = no
[private]
path = /mnt/share
valid users = @smbusers
guest ok = no
browsable = yes
writable = yes&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and test the new parameters:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;testparm -s&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart both Samba services:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo service smbd restart&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo service nmbd restart&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, set the ownership of the shared folder so the group can access it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chown user:smbusers /mnt/share&lt;/code&gt;&lt;/pre&gt;
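&lt;p&gt;To make sure the share is actually exported, you can list the server’s shares locally, authenticating as the user created above (an optional check):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;smbclient -L //localhost -U user&lt;/code&gt;&lt;/pre&gt;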
&lt;h2&gt;Step 3. Samba client connection&lt;/h2&gt;
&lt;p&gt;All nodes in Flamenco use a shared directory located at /mnt/flamenco. You must mount this directory on each node before running the flamenco-worker or flamenco-manager binaries. In this example, we use a worker node hosted on LeaderGPU with the username &lt;b translate=&quot;no&quot;&gt;usergpu&lt;/b&gt;. Please replace these details with your own if they differ.&lt;/p&gt;
&lt;p&gt;Create a hidden file where you can store SMB share credentials:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano /home/usergpu/.smbcredentials&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Type these two lines, replacing the values with your own Samba username and password. Don’t add inline comments or extra spaces: everything after the equals sign becomes part of the credential:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;username=user
password=password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save this file and exit. Then, secure this file by changing the access permissions:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chmod 600 /home/usergpu/.smbcredentials&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a new directory that can be used as a mount point to attach the remote storage:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mkdir /mnt/flamenco&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And make the user the owner of this directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chown usergpu:users /mnt/flamenco&lt;/code&gt;&lt;/pre&gt;
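&lt;p&gt;Optionally, you can test the mount by hand before automating it. This is a sketch that assumes the NAS address from Step 2 and the credentials file created above:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mount -t cifs //10.0.0.4/private /mnt/flamenco -o credentials=/home/usergpu/.smbcredentials,uid=usergpu,gid=users&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the share appears under &lt;b translate=&quot;no&quot;&gt;/mnt/flamenco&lt;/b&gt;, unmount it with &lt;b translate=&quot;no&quot;&gt;sudo umount /mnt/flamenco&lt;/b&gt; before proceeding.&lt;/p&gt;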
&lt;p&gt;The only thing left is to mount the network directory automatically. Create a systemd mount unit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/mnt-flamenco.mount&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Mount Remote Storage
[Mount]
What=//10.0.0.4/private
Where=/mnt/flamenco
Type=cifs
Options=mfsymlinks,credentials=/home/usergpu/.smbcredentials,uid=usergpu,gid=users
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
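&lt;p&gt;After saving the unit, reload systemd so it picks up the new file, and start the mount once to verify that it works:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl daemon-reload&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl start mnt-flamenco.mount&lt;/code&gt;&lt;/pre&gt;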
&lt;p&gt;Add two lines to the &lt;b translate=&quot;no&quot;&gt;[Interface]&lt;/b&gt; section of your VPN configuration, so the share is mounted when the tunnel comes up (the ping simply waits until the storage node becomes reachable):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo -i&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-root&quot;&gt;nano /etc/wireguard/wg0.conf&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;…
PostUp = ping 10.0.0.4 -c 4 &amp;&amp; systemctl start mnt-flamenco.mount
PostDown = systemctl stop mnt-flamenco.mount
…&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check that the services are loaded and the shared directory is successfully mounted:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;df -h&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;Filesystem          Size  Used Avail Use% Mounted on
tmpfs                35G  3.3M   35G   1% /run
/dev/sda2            99G   18G   77G  19% /
tmpfs               174G     0  174G   0% /dev/shm
tmpfs               5.0M     0  5.0M   0% /run/lock
tmpfs                35G  8.0K   35G   1% /run/user/1000
//10.0.0.4/private   40G  9.0G   31G  23% /mnt/flamenco&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4. Manager node&lt;/h2&gt;
&lt;p&gt;Set up a VPN connection using the guide from Step 1. Stop the VPN service before continuing:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl stop wg-quick@wg0.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, install the utilities required for automatic mounting over the CIFS protocol:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install cifs-utils&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next important step is to install Blender. You can do this using the standard APT package manager, but this will most likely install one of the older versions (below v3.6.4). Let’s use Snap to install the latest version:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo snap install blender --classic&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check the installed version using the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;blender --version&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;Blender 4.4.3
build date: 2025-04-29
build time: 15:12:13
build commit date: 2025-04-29
build commit time: 14:09
build hash: 802179c51ccc
build branch: blender-v4.4-release
build platform: Linux
build type: Release
…&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you receive an error message indicating missing libraries, simply install them. All these libraries are included in the XOrg package:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install xorg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the application:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://flamenco.blender.org/downloads/flamenco-3.7-linux-amd64.tar.gz&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unpack the downloaded archive:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;tar xvfz flamenco-3.7-linux-amd64.tar.gz&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Go to the created directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd flamenco-3.7-linux-amd64/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And start Flamenco for the first time:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./flamenco-manager&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the following address in your web browser: &lt;a href=&quot;http://10.0.0.3:8080/&quot;&gt;http://10.0.0.3:8080/&lt;/a&gt;. Click the &lt;b translate=&quot;no&quot;&gt;Let&#39;s go&lt;/b&gt; button. Type &lt;b translate=&quot;no&quot;&gt;/mnt/flamenco&lt;/b&gt; in the required field, then click &lt;b translate=&quot;no&quot;&gt;Next&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/890/original/sh_blender_remote_rendering_with_flamenco_3.png?1713174175&quot; alt=&quot;Shared storage setup&quot;&gt;
&lt;p&gt;Flamenco will attempt to locate the Blender executable file. If you have installed Blender from Snap, the path will be &lt;b translate=&quot;no&quot;&gt;/snap/bin/blender&lt;/b&gt;. Check this point and click &lt;b translate=&quot;no&quot;&gt;Next&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/891/original/sh_blender_remote_rendering_with_flamenco_4.png?1713174210&quot; alt=&quot;PATH environment setup&quot;&gt;
&lt;p&gt;Check the summary and click &lt;b translate=&quot;no&quot;&gt;Confirm&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/892/original/sh_blender_remote_rendering_with_flamenco_5.png?1713174240&quot; alt=&quot;Check summary settings&quot;&gt;
&lt;p&gt;Return to the SSH session and use the &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; keyboard shortcut to interrupt the application. The first launch generates the configuration file &lt;b translate=&quot;no&quot;&gt;flamenco-manager.yaml&lt;/b&gt;. Let’s add some options to the &lt;b translate=&quot;no&quot;&gt;variables&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;blenderArgs&lt;/b&gt; sections:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano flamenco-manager.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;# Configuration file for Flamenco.
# For an explanation of the fields, refer to flamenco-manager-example.yaml
#
# NOTE: this file will be overwritten by Flamenco Manager&#39;s web-based configuration system.
#
# This file was written on 2023-10-17 12:41:28 +00:00 by Flamenco 3.7
_meta:
  version: 3
manager_name: Flamenco Manager
database: flamenco-manager.sqlite
listen: :8080
autodiscoverable: true
local_manager_storage_path: ./flamenco-manager-storage
shared_storage_path: /mnt/flamenco
shaman:
  enabled: true
  garbageCollect:
    period: 24h0m0s
    maxAge: 744h0m0s
    extraCheckoutPaths: []
task_timeout: 10m0s
worker_timeout: 1m0s
blocklist_threshold: 3
task_fail_after_softfail_count: 3
variables:
  blender:
    values:
    - platform: linux
      value: blender
    - platform: windows
      value: blender
    - platform: darwin
      value: blender
  storage:
    is_twoway: true
    values:
    - platform: linux
      value: /mnt/flamenco
    - platform: windows
      value: Z:\
    - platform: darwin
      value: /Volumes/shared/flamenco
  blenderArgs:
    values:
    - platform: all
      value: -b -y -E CYCLES -P gpurender.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first added block describes &lt;a href=&quot;https://flamenco.blender.org/usage/variables/multi-platform/&quot;&gt;Two-way variables&lt;/a&gt;, which are needed for multi-platform farms. This solves the main problem with slashes in paths: Linux uses the forward slash (/) as a separator, while Windows uses the backslash (\). Here, we create the replacement rule for all available platforms: Linux, Windows, and macOS (&lt;a href=&quot;https://en.wikipedia.org/wiki/Darwin_(operating_system)&quot;&gt;Darwin&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;When you mount a network share in Windows, you need to choose a drive letter. For example, our &lt;b translate=&quot;no&quot;&gt;Storage&lt;/b&gt; is mounted as the &lt;b translate=&quot;no&quot;&gt;Z:&lt;/b&gt; drive. The replacement rule tells the system that, for the Windows platform, the &lt;b translate=&quot;no&quot;&gt;/mnt/flamenco&lt;/b&gt; path will be located at &lt;b translate=&quot;no&quot;&gt;Z:\&lt;/b&gt;. For macOS, this path will be &lt;b translate=&quot;no&quot;&gt;/Volumes/shared/flamenco&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Look at the second added block. It instructs Blender to use the &lt;a href=&quot;https://www.cycles-renderer.org/&quot;&gt;Cycles&lt;/a&gt; rendering engine and to run a simple Python script, &lt;b translate=&quot;no&quot;&gt;gpurender.py&lt;/b&gt;, whenever Blender executes (we create this script on the worker node in Step 5). This is a simple trick to select the GPU instead of the CPU: there is no standard option to do this directly, and you can’t invoke &lt;b translate=&quot;no&quot;&gt;blender --use-gpu&lt;/b&gt; or anything similar. However, you can invoke any external Python script using the &lt;b translate=&quot;no&quot;&gt;-P&lt;/b&gt; option. This setting instructs the &lt;b translate=&quot;no&quot;&gt;Worker&lt;/b&gt; to find the script in its local directory and execute it whenever an assigned job invokes the Blender executable.&lt;/p&gt;
&lt;p&gt;Now, we can delegate control of the application to the &lt;a href=&quot;https://systemd.io/&quot;&gt;systemd&lt;/a&gt; init subsystem. Let’s inform the system about the location of the working directory, the executable file, and the user privileges required for launching. Create a new file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/flamenco-manager.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fill it with the following strings:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Flamenco Manager service
[Service]
User=user
WorkingDirectory=/home/user/flamenco-3.7-linux-amd64
ExecStart=/home/user/flamenco-3.7-linux-amd64/flamenco-manager
Restart=always
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and exit the nano text editor. Then reload the systemd configuration, start the service, and check its status:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl daemon-reload&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl start flamenco-manager.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status flamenco-manager.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;● flamenco-manager.service - Flamenco Manager service
Loaded: loaded (/etc/systemd/system/flamenco-manager.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2023-10-17 11:03:50 UTC; 7s ago
Main PID: 3059 (flamenco-manage)
 Tasks: 7 (limit: 4558)
  Memory: 28.6M
     CPU: 240ms
CGroup: /system.slice/flamenco-manager.service
        └─3059 /home/user/flamenco-3.7-linux-amd64/flamenco-manager&lt;/pre&gt;
&lt;p&gt;Enable automatic start when the system boots:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl enable flamenco-manager.service&lt;/code&gt;&lt;/pre&gt;
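&lt;p&gt;Optionally, confirm that the Manager is listening on its default port (8080, as set by the &lt;b translate=&quot;no&quot;&gt;listen&lt;/b&gt; option in the configuration file):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ss -ltn | grep 8080&lt;/code&gt;&lt;/pre&gt;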
&lt;h2&gt;Step 5. Worker node&lt;/h2&gt;
&lt;p&gt;Connect to the VPN server using the guide from Step 1 and mount the share from Step 3. Then, install Blender via Snap, just as on the Manager node:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo snap install blender --classic&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Modern *.blend files are compressed with the Zstandard algorithm. To avoid errors, install support for this algorithm:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install python3-zstd&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download the application:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://flamenco.blender.org/downloads/flamenco-3.7-linux-amd64.tar.gz&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unpack the downloaded archive:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;tar xvfz flamenco-3.7-linux-amd64.tar.gz&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the created directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd flamenco-3.7-linux-amd64/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create an additional script that enables GPU rendering when a Flamenco job runs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano gpurender.py&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;import bpy

def enable_gpus(device_type, use_cpus=False):
    # Read the Cycles add-on preferences and refresh the list of compute devices
    preferences = bpy.context.preferences
    cycles_preferences = preferences.addons[&quot;cycles&quot;].preferences
    cycles_preferences.refresh_devices()
    devices = cycles_preferences.devices
    if not devices:
        raise RuntimeError(&quot;Unsupported device type&quot;)
    activated_gpus = []
    for device in devices:
        # Keep CPUs disabled (unless requested) and enable every GPU found
        if device.type == &quot;CPU&quot;:
            device.use = use_cpus
        else:
            device.use = True
            activated_gpus.append(device.name)
            print(&#39;activated gpu&#39;, device.name)
    # Select the compute backend and switch the scene to GPU rendering
    cycles_preferences.compute_device_type = device_type
    bpy.context.scene.cycles.device = &quot;GPU&quot;
    return activated_gpus

enable_gpus(&quot;CUDA&quot;)&lt;/code&gt;&lt;/pre&gt;
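&lt;p&gt;Save the file and exit. Before wiring the script into Flamenco jobs, you can test it with a standalone render. This sketch assumes a test scene named &lt;b translate=&quot;no&quot;&gt;test.blend&lt;/b&gt; on the shared storage and renders its first frame, using the same arguments as the Manager’s &lt;b translate=&quot;no&quot;&gt;blenderArgs&lt;/b&gt; section:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;blender -b /mnt/flamenco/test.blend -y -E CYCLES -P gpurender.py -f 1&lt;/code&gt;&lt;/pre&gt;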
&lt;p&gt;Next, create a separate service to run Flamenco from systemd:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/flamenco-worker.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Flamenco Worker service
[Service]
User=usergpu
WorkingDirectory=/home/usergpu/flamenco-3.7-linux-amd64
ExecStart=/home/usergpu/flamenco-3.7-linux-amd64/flamenco-worker
Restart=always
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reload configuration and start the new service:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl daemon-reload&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl start flamenco-worker.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status flamenco-worker.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;● flamenco-worker.service - Flamenco Worker service
Loaded: loaded (/etc/systemd/system/flamenco-worker.service; enabled; preset: enabled)
Active: active (running) since Tue 2023-10-17 13:56:18 EEST; 47s ago
Main PID: 636 (flamenco-worker)
 Tasks: 5 (limit: 23678)
Memory: 173.9M
   CPU: 302ms
CGroup: /system.slice/flamenco-worker.service
        └─636 /home/user/flamenco-3.7-linux-amd64/flamenco-worker&lt;/pre&gt;
&lt;p&gt;Enable automatic start when the system boots:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl enable flamenco-worker.service&lt;/code&gt;&lt;/pre&gt;
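&lt;p&gt;To watch the Worker register with the Manager and pick up tasks, you can follow its log:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo journalctl -u flamenco-worker.service -f&lt;/code&gt;&lt;/pre&gt;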
&lt;h2&gt;Step 6. User node&lt;/h2&gt;
&lt;p&gt;The user node can run any operating system. For this guide, we show how to set up a node with Windows 11 and the four necessary components:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;VPN connection&lt;/li&gt;
  &lt;li&gt;Mounted remote directory&lt;/li&gt;
  &lt;li&gt;Blender installed&lt;/li&gt;
  &lt;li&gt;Flamenco add-on&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Download and install Wireguard from the &lt;a href=&quot;https://download.wireguard.com/windows-client/wireguard-installer.exe&quot;&gt;official website&lt;/a&gt;. Create a new text file and paste the configuration generated for this client in Step 1. Rename the file to &lt;b translate=&quot;no&quot;&gt;flamenco.conf&lt;/b&gt; and add it to Wireguard using the &lt;b translate=&quot;no&quot;&gt;Add tunnel&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/893/original/sh_blender_remote_rendering_with_flamenco_6.png?1713174282&quot; alt=&quot;Wireguard Add Tunnel&quot;&gt;
&lt;p&gt;Connect to your server by pressing the &lt;b translate=&quot;no&quot;&gt;Activate&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/894/original/sh_blender_remote_rendering_with_flamenco_7.png?1713174312&quot; alt=&quot;Activate the tunnel&quot;&gt;
&lt;p&gt;Let’s mount a remote directory. Right-click on &lt;b translate=&quot;no&quot;&gt;This PC&lt;/b&gt; and select &lt;b translate=&quot;no&quot;&gt;Map network drive…&lt;/b&gt;&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/895/original/sh_blender_remote_rendering_with_flamenco_8.png?1713174340&quot; alt=&quot;Mount the remote directory&quot;&gt;
&lt;p&gt;Choose &lt;b translate=&quot;no&quot;&gt;Z:&lt;/b&gt; as the drive letter, type the Samba share address &lt;b translate=&quot;no&quot;&gt;\\10.0.0.4\private&lt;/b&gt; and don’t forget to tick &lt;b translate=&quot;no&quot;&gt;Connect using different credentials&lt;/b&gt;. Then click &lt;b translate=&quot;no&quot;&gt;Finish&lt;/b&gt;. The system will ask you to enter a username and password for the share. After that, the network directory will be mounted as the Z: drive.&lt;/p&gt;
&lt;p&gt;Download and install Blender from the &lt;a href=&quot;https://www.blender.org/download/&quot;&gt;official website&lt;/a&gt;. Then, open the URL &lt;a href=&quot;http://10.0.0.3:8080/flamenco3-addon.zip&quot;&gt;http://10.0.0.3:8080/flamenco3-addon.zip&lt;/a&gt; and install the Flamenco add-on. Activate it in preferences: &lt;b translate=&quot;no&quot;&gt;Edit &gt; Preferences &gt; Add-ons&lt;/b&gt;. Tick &lt;b translate=&quot;no&quot;&gt;System: Flamenco 3&lt;/b&gt;, enter the Manager URL &lt;a href=&quot;http://10.0.0.3:8080&quot;&gt;http://10.0.0.3:8080&lt;/a&gt;, and click the refresh button. The system will connect to the manager node and load storage settings automatically:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/896/original/sh_blender_remote_rendering_with_flamenco_9.png?1713174378&quot; alt=&quot;Enable Flamenco add-on&quot;&gt;
&lt;p&gt;Open the file that you need to render. On the &lt;b translate=&quot;no&quot;&gt;Scene&lt;/b&gt; tab, choose &lt;b translate=&quot;no&quot;&gt;Cycles&lt;/b&gt; from the Render &lt;b translate=&quot;no&quot;&gt;Engine&lt;/b&gt; drop-down list. Don’t forget to save the file, because these settings are stored directly in the *.blend file:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/897/original/sh_blender_remote_rendering_with_flamenco_10.png?1713174408&quot; alt=&quot;Select Render Engine&quot;&gt;
&lt;p&gt;Scroll down and find the &lt;b translate=&quot;no&quot;&gt;Flamenco 3&lt;/b&gt; section. Click &lt;b translate=&quot;no&quot;&gt;Fetch job types&lt;/b&gt; to get a list of available types. Select &lt;b translate=&quot;no&quot;&gt;Simple Blender Render&lt;/b&gt; from the drop-down list and set other options, such as the number of frames, chunk size, and output folder. Finally, click &lt;b translate=&quot;no&quot;&gt;Submit to Flamenco&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/898/original/sh_blender_remote_rendering_with_flamenco_11.png?1713174432&quot; alt=&quot;Set rendering parameters&quot;&gt;
&lt;p&gt;The Flamenco add-on creates a new job and uploads a blend file to shared storage. The system will submit the job to an available worker and start the rendering process:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/899/original/sh_blender_remote_rendering_with_flamenco_12.png?1713174462&quot; alt=&quot;Check added rendering job&quot;&gt;
&lt;p&gt;If you check the GPU load with nvtop or a similar utility, you will see that all GPUs are busy with compute tasks:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/900/original/sh_blender_remote_rendering_with_flamenco_13.png?1713174495&quot; alt=&quot;Check GPU load&quot;&gt;
&lt;p&gt;You will find the result in the directory that you selected in the previous step. Example &lt;a href=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/901/original/sh_blender_remote_rendering_with_flamenco_14.gif?1713175121&quot;&gt;here&lt;/a&gt; (&lt;a href=&quot;https://www.blender.org/download/demo-files/&quot;&gt;Ripple Dreams&lt;/a&gt; by &lt;a href=&quot;https://twitter.com/redjam_9&quot;&gt;James Redmond&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/586-photogrammetry-with-meshroom&quot;&gt;Photogrammetry with Meshroom&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/590-fooocus-rethinking-of-sd-and-mj&quot;&gt;Fooocus: Rethinking of SD and MJ&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/888/original/sh_blender_remote_rendering_with_flamenco_1.jpg?1713174084"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 09:47:24 +0100</pubDate>
      <guid isPermaLink="false">588</guid>
      <dc:date>2025-01-21 09:47:24 +0100</dc:date>
    </item>
    <item>
      <title>Photogrammetry with Meshroom</title>
      <link>https://www.leadergpu.com/catalog/586-photogrammetry-with-meshroom</link>
      <description>&lt;p&gt;Photogrammetry is a method of transforming physical objects into three-dimensional digital models that can be edited with 3D software. This process typically uses specialized devices called 3D scanners, which come in two main types: optical and laser.&lt;/p&gt;
&lt;p&gt;Optical scanners often use one or more digital cameras and special lighting to evenly illuminate the object during scanning. This allows for the creation of a 3D model. Laser scanners, on the other hand, use laser beams. These devices emit multiple laser beams and measure the time it takes for each beam to bounce back from the object. Using this data, along with information from position sensors, the scanner calculates the distance to each point on the object. This creates a “point cloud” that forms the basis of the 3D model.&lt;/p&gt;
&lt;h3&gt;Point cloud&lt;/h3&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/050/original/sh_photogrammetry_with_meshroom_1.png?1724833519&quot; alt=&quot;Points cloud rabbit&quot;&gt;
&lt;p&gt;To build the future framework of an object, the system needs to know the coordinates of each vertex in three-dimensional space. The set of vertices is called a point cloud. The more vertices there are, the more detailed the object will be. Creating a point cloud is the first and one of the most crucial steps in recreating a 3D model from photographs.&lt;/p&gt;
&lt;p&gt;It’s important to note that each vertex in the point cloud is initially unconnected to other vertices. This allows for easy filtering: keeping the necessary points and removing the rest, before starting to recreate the object’s mesh.&lt;/p&gt;
&lt;h3&gt;Mesh objects&lt;/h3&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/051/original/sh_photogrammetry_with_meshroom_2.png?1724833553&quot; alt=&quot;Mesh object rabbit&quot;&gt;
&lt;p&gt;A mesh object is a type of 3D model consisting of triangular geometric primitives, often referred to as meshes or polymeshes. Once object points are formed, the application can independently compose triangular primitives from them. By connecting these primitives, it’s possible to create a 3D model of almost any shape. At this stage, the model lacks color and remains unpainted.&lt;/p&gt;
&lt;p&gt;The subsequent texturing stage addresses this issue.&lt;/p&gt;
&lt;h3&gt;Texturing&lt;/h3&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/052/original/sh_photogrammetry_with_meshroom_3.png?1724833589&quot; alt=&quot;Textured rabbit&quot;&gt;
&lt;p&gt;In the final stage, the application stretches the image texture extracted from the photos onto the prepared mesh object. The quality of the photos and their resolution play a key role here: if they are low, the final result will not look its best. But if a sufficient number of good-quality shots were taken, the output will be a fully ready-to-use 3D model of a real object. Below, we’ll give some useful tips on preparing the original photos.&lt;/p&gt;
&lt;h2&gt;Camera settings&lt;/h2&gt;
&lt;p&gt;To avoid disappointment with your first attempts at creating a 3D model from photographs, consider these simple basic rules. Each rule will help prevent issues that typically arise during the mesh object creation stage.&lt;/p&gt;
&lt;p&gt;First, don’t rely on your digital camera’s automatic settings. Modern cameras try to balance four key parameters independently:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;ISO,&lt;/li&gt;
    &lt;li&gt;white balance,&lt;/li&gt;
    &lt;li&gt;shutter speed,&lt;/li&gt;
    &lt;li&gt;aperture.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In automatic mode, even slight changes in external conditions can cause these settings to vary between frames. These variations can lead to noticeable inconsistencies during the texturing stage.&lt;/p&gt;
&lt;p&gt;To maintain consistent parameters across frames, use the &lt;b translate=&quot;no&quot;&gt;Manual&lt;/b&gt; mode (M). The aperture is a crucial setting here. Depending on your lens, aim for a position where it’s nearly closed. This helps to achieve maximum depth of field: the less open the aperture, the better. However, avoid extreme values. If your lens can be closed down to &lt;b translate=&quot;no&quot;&gt;f/22&lt;/b&gt;, you’ll get good results using values between &lt;b translate=&quot;no&quot;&gt;f/11&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;f/20&lt;/b&gt;.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/053/original/sh_photogrammetry_with_meshroom_4.png?1724833619&quot; alt=&quot;Makarios aperture difference&quot;&gt;
&lt;p align=&quot;center&quot;&gt;&lt;sup&gt;&lt;i&gt;Left f/11, right f/22&lt;/i&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Closing the aperture, however, creates another problem: insufficient light. This can be addressed in two ways: by increasing ISO sensitivity or lengthening shutter speed. Both methods will affect the final result, albeit differently. Raising the ISO to 6400 introduces digital noise in the image, so it’s best to use the lowest possible values. For near-ideal results, setting the ISO to 100 makes sense. Yet, this means the issue of insufficient lighting persists:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/054/original/sh_photogrammetry_with_meshroom_5.png?1724833649&quot; alt=&quot;Makarios ISO difference&quot;&gt;
&lt;p align=&quot;center&quot;&gt;&lt;sup&gt;&lt;i&gt;Left ISO 100, right ISO 6400&lt;/i&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The most effective way to increase the light reaching the camera sensor in low-light conditions is to lengthen the shutter speed. The longer the shutter remains open, the more photons hit the sensor, resulting in better image quality. However, this approach presents a challenge: without a tripod, a shutter speed of 1/50 second or longer can blur the image. Using a tripod eliminates this problem.&lt;/p&gt;
&lt;p&gt;White balance is the final crucial parameter. It’s important to disable the automatic setting and choose either a preset profile (such as “Sunny day”) or a custom value in Kelvin. For instance, 5200K is a common setting. Lower values shift the hue towards yellow, while higher values lean towards blue. To avoid time-consuming color corrections in post-processing, use the same white balance profile for all photos in a series.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/055/original/sh_photogrammetry_with_meshroom_6.png?1724833681&quot; alt=&quot;Makarios white balance&quot;&gt;
&lt;p align=&quot;center&quot;&gt;&lt;sup&gt;&lt;i&gt;WB profiles. Left “Sunny day”, right “Auto”&lt;/i&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In summary, to capture high-quality photos for photogrammetry:&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;Use a tripod when there is insufficient light.&lt;/li&gt;
    &lt;li&gt;Close the aperture nearly to its minimum.&lt;/li&gt;
    &lt;li&gt;Set the ISO to its minimum value.&lt;/li&gt;
    &lt;li&gt;Choose a shutter speed that gives you the desired result (or use your camera’s built-in exposure meter).&lt;/li&gt;
    &lt;li&gt;Use the same white balance preset.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Taking photos&lt;/h2&gt;
&lt;p&gt;Let’s discuss how many photos to take and from which angles. The type of object and its background significantly influence the final result. Objects without shiny, transparent, or reflective surfaces are ideal for photogrammetry. In practice, objects like windows and glass often require correction in a 3D editor later. However, the general shooting technique remains the same.&lt;/p&gt;
&lt;p&gt;For small objects placed on a surface, imagine a sphere around the object. Take photos as if your camera is circling the object three times: once from below, once at the middle, and once from above.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/056/original/sh_photogrammetry_with_meshroom_7.png?1724833713&quot; alt=&quot;Rabbit camera positions&quot;&gt;
&lt;p&gt;It’s crucial that the object occupies at least half, preferably three-quarters, of each frame. Instead of using zoom, try to get physically closer to the object. When creating a point cloud, the software needs as many pixels as possible.&lt;/p&gt;
&lt;p&gt;When shooting, remember that the software combines frames into a single object for correct geometry. Make it a rule to take at least three frames from each angle. Once you’ve centered the object in the frame, mentally divide it vertically into three equal parts. Take three pictures, each focusing on one-third of the object. This provides the necessary overlap for the application to accurately calculate each point’s location in 3D space. After photographing the object from all possible sides and angles, you can start preparing the software.&lt;/p&gt;
&lt;h2&gt;Install Meshroom&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://alicevision.org/&quot;&gt;Meshroom&lt;/a&gt; is a free, cross-platform application that sequentially performs all processing stages, utilizing CPU and GPU resources. While it can run on a standard home computer, each stage may be time-consuming. For large-scale projects involving 3D reconstruction of numerous objects, such as creating an impressive 3D scene, renting a &lt;a href=&quot;https://www.leadergpu.com/&quot;&gt;dedicated GPU server&lt;/a&gt; might be a practical solution.&lt;/p&gt;
&lt;p&gt;Let’s consider a LeaderGPU server with the following configuration: &lt;b translate=&quot;no&quot;&gt;2 x NVIDIA® RTX™ 3090, 2 x Intel® Xeon® Silver 4210 (3.20 GHz), 128GB RAM&lt;/b&gt;. We’ll use Windows Server 2022 as the operating system. Before installing Meshroom, you’ll need to perform some preliminary steps:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/489-connect-to-a-windows-server&quot;&gt;Connect to a Windows server&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/500-install-nvidia-drivers-in-windows&quot;&gt;Install NVIDIA® drivers in Windows&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/513-gpu-rendering-in-rdp&quot;&gt;GPU rendering in RDP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Visit the project’s official website to &lt;a href=&quot;https://alicevision.org/#meshroom&quot;&gt;download Meshroom&lt;/a&gt;. Unpack the resulting archive to find a ready-to-use application that doesn’t require additional installation. Launch &lt;b translate=&quot;no&quot;&gt;Meshroom.exe&lt;/b&gt; to begin.&lt;/p&gt;
&lt;h3&gt;Upload images&lt;/h3&gt;
&lt;p&gt;The main window of the application is divided into two parts: upper and lower. The upper section contains the Image Gallery, Image Viewer, and 3D Viewer. The lower section houses the Graph editor and Task Manager. To start, drag and drop your captured photos into the designated area. Both compressed (for example, JPG) and RAW file formats are supported. It is recommended to use RAW files because they contain significantly more data for each frame.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/057/original/sh_photogrammetry_with_meshroom_8.png?1724833779&quot; alt=&quot;Meshroom main window&quot;&gt;
&lt;p&gt;Please note that you already have a ready-made standard pipeline by default, which is schematically displayed in the Graph Editor. This is one of the most important controls that helps to configure all aspects of image processing at each stage. You can manually run each stage by right-clicking and selecting &lt;b translate=&quot;no&quot;&gt;Compute&lt;/b&gt; from the drop-down menu.&lt;/p&gt;
&lt;p&gt;But for the first time, you can simply click the green &lt;b translate=&quot;no&quot;&gt;Start&lt;/b&gt; button, and the application will do everything for you. It will prompt you to save the project, so that you do not accidentally lose the results of the calculation. Click &lt;b translate=&quot;no&quot;&gt;Save&lt;/b&gt;, specify a name and directory and save the project:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/058/original/sh_photogrammetry_with_meshroom_9.png?1724833809&quot; alt=&quot;Meshroom save project&quot;&gt;
&lt;p&gt;Next, the application transfers all processing stages from the Graph Editor to the Task Manager, which handles their execution in a specific order. To check the status of each stage, select the corresponding block in the Graph Editor and click the &lt;b translate=&quot;no&quot;&gt;Log&lt;/b&gt; button in the lower right corner of the screen. You can also see in real time which stage is currently being processed:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/059/original/sh_photogrammetry_with_meshroom_10.png?1724833857&quot; alt=&quot;Meshroom task manager&quot;&gt;
&lt;p&gt;On the right side, you can see the point cloud you’ve built. The final result, generated using the standard pipeline, is available in the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Your_Project_Path]\MeshroomCache\Texturing\[Random_Symbols]\texturedMesh.obj&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, if you set the output path in the final node of the pipeline beforehand, the object will end up at the path you specified. You can then import it into any 3D editor to fix surfaces, add light sources, and apply other effects before rendering.&lt;/p&gt;
&lt;h2&gt;Integration&lt;/h2&gt;
&lt;p&gt;While the initial result may look impressive, it often requires refinement in a 3D editor. Meshroom simplifies this process by allowing you to import not just the model, but also the point cloud and camera positions into third-party editors like &lt;a href=&quot;https://www.sidefx.com/&quot;&gt;Houdini&lt;/a&gt; or &lt;a href=&quot;https://www.blender.org/&quot;&gt;Blender&lt;/a&gt;. In the following section, we’ll explore how to do this.&lt;/p&gt;
&lt;h3&gt;Houdini&lt;/h3&gt;
&lt;p&gt;In fact, Meshroom is a user-friendly interface for the AliceVision engine, which handles all computation-related operations. This interface implements the corresponding pipeline and task manager. If you use Houdini, you can create your own pipeline directly within the application and use it alongside other tools, eliminating the need to launch Meshroom separately.&lt;/p&gt;
&lt;p&gt;To get started, it’s best to &lt;a href=&quot;https://www.sidefx.com/download/download-houdini/120709/&quot;&gt;download&lt;/a&gt; and install a dedicated launcher that will manage Houdini updates and plugins. Next, add the SideFX Labs plugin, which offers numerous additional tools, including specific nodes for AliceVision. To do this, click the &lt;b translate=&quot;no&quot;&gt;+&lt;/b&gt; button, then select &lt;b translate=&quot;no&quot;&gt;Shelves&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/060/original/sh_photogrammetry_with_meshroom_11.png?1724833888&quot; alt=&quot;Houdini add Shelves Houdini add Shelves&quot;&gt;
&lt;p&gt;Scroll down the list and select &lt;b translate=&quot;no&quot;&gt;SideFX Labs&lt;/b&gt;, then click the &lt;b translate=&quot;no&quot;&gt;Update Toolset&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/061/original/sh_photogrammetry_with_meshroom_12.png?1724833916&quot; alt=&quot;Houdini SideFX Labs Update Toolset&quot;&gt;
&lt;p&gt;To install a plugin, follow these steps: Click the &lt;b translate=&quot;no&quot;&gt;Start Launcher&lt;/b&gt; button, navigate to the &lt;b translate=&quot;no&quot;&gt;Labs/Packages&lt;/b&gt; section in the left menu, and select &lt;b translate=&quot;no&quot;&gt;Install packages&lt;/b&gt;. This will open a window where you can choose packages to install:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/062/original/sh_photogrammetry_with_meshroom_13.png?1724833946&quot; alt=&quot;Add Houdini plugin&quot;&gt;
&lt;p&gt;Choose the &lt;b translate=&quot;no&quot;&gt;Production Build&lt;/b&gt; for your version of Houdini and click &lt;b translate=&quot;no&quot;&gt;Install&lt;/b&gt;. Afterward, restart the application to ensure the new effect icons appear at the top:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/063/original/sh_photogrammetry_with_meshroom_14.png?1724833974&quot; alt=&quot;Houdini new items&quot;&gt;
&lt;p&gt;It’s crucial to note that you won’t find any mention of AliceVision or Meshroom here. This is because the corresponding plugin only functions within the geometry context pipeline. To verify this, click the &lt;b translate=&quot;no&quot;&gt;+&lt;/b&gt; icon, then select &lt;b translate=&quot;no&quot;&gt;New Pane Tab Type&lt;/b&gt;, and choose &lt;b translate=&quot;no&quot;&gt;Network View&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/064/original/sh_photogrammetry_with_meshroom_15.png?1724834004&quot; alt=&quot;Houdini Network View&quot;&gt;
&lt;p&gt;Press the &lt;b translate=&quot;no&quot;&gt;Tab&lt;/b&gt; key and add a &lt;b translate=&quot;no&quot;&gt;Geometry&lt;/b&gt; node:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/065/original/sh_photogrammetry_with_meshroom_16.png?1724834038&quot; alt=&quot;Houdini add Geometry&quot;&gt;
&lt;p&gt;Double click to open the created node and type &lt;b translate=&quot;no&quot;&gt;av&lt;/b&gt; on your keyboard. The system will instantly display a list of available nodes starting with the Labs AV symbols. These nodes allow you to control the AliceVision engine and integrate it into your own pipelines:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/066/original/sh_photogrammetry_with_meshroom_17.png?1724834061&quot; alt=&quot;Houdini AliceVision nodes&quot;&gt;
&lt;p&gt;To create a proper pipeline, refer to the &lt;a href=&quot;https://www.sidefx.com/tutorials/alicevision-plugin/&quot;&gt;official documentation&lt;/a&gt; for the plugin. Additionally, consider adding the AliceVision directory to the list of environment variables in the houdini.env file. For a standard installation using the launcher, this file is typically located in the directory &lt;b translate=&quot;no&quot;&gt;C:\Users\Administrator\Documents\houdini20.5\&lt;/b&gt;.&lt;/p&gt;
&lt;p&gt;Open the houdini.env file with any text editor and add the following line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;ALICEVISION_PATH = [path to alicevision directory in Meshroom folder]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, if you installed Meshroom in the root directory of the D: drive, your path might look like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;ALICEVISION_PATH = D:\Meshroom\aliceVision&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file, then restart the Houdini application.&lt;/p&gt;
&lt;h3&gt;Blender&lt;/h3&gt;
&lt;p&gt;For Blender users, we recommend the &lt;b translate=&quot;no&quot;&gt;Meshroom2Blender&lt;/b&gt; plugin. While it functions differently from the Houdini plugin, it allows you to export point clouds and camera positions calculated by Meshroom to Blender. To access the plugin code, open the link in your browser:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;https://raw.githubusercontent.com/tibicen/meshroom2blender/master/view3d_point_cloud_visualizer.py&lt;/code&gt;&lt;/pre&gt;
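&lt;p&gt;If you prefer working in a terminal, you can also fetch the same file directly, for example with curl (available out of the box in modern Windows and Linux alike):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -o view3d_point_cloud_visualizer.py https://raw.githubusercontent.com/tibicen/meshroom2blender/master/view3d_point_cloud_visualizer.py&lt;/code&gt;&lt;/pre&gt;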
&lt;p&gt;Save the code as &lt;b translate=&quot;no&quot;&gt;view3d_point_cloud_visualizer.py&lt;/b&gt; in a convenient directory. Next, open Blender and navigate to &lt;b translate=&quot;no&quot;&gt;Edit&lt;/b&gt; - &lt;b translate=&quot;no&quot;&gt;Preferences&lt;/b&gt;. From there, select the &lt;b translate=&quot;no&quot;&gt;Add-ons&lt;/b&gt; tab:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/067/original/sh_photogrammetry_with_meshroom_18.png?1724834088&quot; alt=&quot;Blender Preferences&quot;&gt;
&lt;p&gt;Click the down arrow and select &lt;b translate=&quot;no&quot;&gt;Install from Disk&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/068/original/sh_photogrammetry_with_meshroom_19.png?1724834111&quot; alt=&quot;Blender install addons&quot;&gt;
&lt;p&gt;In the newly opened window, navigate to the directory where you saved the plugin. Select the plugin file and click the &lt;b translate=&quot;no&quot;&gt;Install from Disk&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/069/original/sh_photogrammetry_with_meshroom_20.png?1724834139&quot; alt=&quot;Blender choose plugin file&quot;&gt;
&lt;p&gt;The plugin is now installed. It’s recommended to restart the application. After restarting, you’ll see the &lt;b translate=&quot;no&quot;&gt;Point Cloud Visualizer&lt;/b&gt; item in the viewing mode. The plugin requires you to specify the path to a file with the &lt;b translate=&quot;no&quot;&gt;.ply&lt;/b&gt; extension:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/070/original/sh_photogrammetry_with_meshroom_21.png?1724834177&quot; alt=&quot;Blender new option&quot;&gt;
&lt;p&gt;By default, Meshroom doesn’t generate this type of file. To create it, open the pipeline and add the &lt;b translate=&quot;no&quot;&gt;ConvertSfMFormat&lt;/b&gt; node. Use the &lt;b translate=&quot;no&quot;&gt;SfMData&lt;/b&gt; from the &lt;b translate=&quot;no&quot;&gt;StructureFromMotion&lt;/b&gt; node as input. For output, specify the &lt;b translate=&quot;no&quot;&gt;Images Folder&lt;/b&gt; of the &lt;b translate=&quot;no&quot;&gt;Texturing&lt;/b&gt; node.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/071/original/sh_photogrammetry_with_meshroom_22.png?1724834206&quot; alt=&quot;Meshroom add Convert node&quot;&gt;
&lt;p&gt;The final step is to specify the format. Click on &lt;b translate=&quot;no&quot;&gt;SfM File Format&lt;/b&gt; in the &lt;b translate=&quot;no&quot;&gt;ConvertSfMFormat&lt;/b&gt; node and select &lt;b translate=&quot;no&quot;&gt;ply&lt;/b&gt; from the drop-down list:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/072/original/sh_photogrammetry_with_meshroom_23.png?1724834239&quot; alt=&quot;Meshroom Convert format&quot;&gt;
&lt;p&gt;Right click on the created node and select &lt;b translate=&quot;no&quot;&gt;Compute&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/073/original/sh_photogrammetry_with_meshroom_24.png?1724834267&quot; alt=&quot;Meshroom compute task&quot;&gt;
&lt;p&gt;Once the process is complete, you’ll find the required file in the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Your_Project_Path]\MeshroomCache\ConvertSfMFormat\[Random_Symbols]\sfm.ply&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can load it into Blender in two ways: through the aforementioned plugin or via the standard import process &lt;b translate=&quot;no&quot;&gt;File&lt;/b&gt; - &lt;b translate=&quot;no&quot;&gt;Import&lt;/b&gt; - &lt;b translate=&quot;no&quot;&gt;Stanford PLY (.ply)&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/074/original/sh_photogrammetry_with_meshroom_25.png?1724834291&quot; alt=&quot;Blender import points cloud&quot;&gt;
&lt;p&gt;For more information on using this plugin, we suggest consulting the &lt;a href=&quot;https://github.com/tibicen/meshroom2blender&quot;&gt;project repository&lt;/a&gt; or a specialized web resource.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Photogrammetry is a vast field of knowledge, and here we have covered only a few basic techniques for converting 2D images into a 3D model. These techniques are used in many industries, from architecture to computer game development.&lt;/p&gt;
&lt;p&gt;Having gained your first experience of shooting a dataset and progressively turning it into a 3D model, you will be able to improve your skills and transfer physical objects into virtual 3D space. And LeaderGPU will help you with computing power, reducing calculation time and freeing up your workstation for other, often higher-priority tasks.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/588-blender-remote-rendering-with-flamenco&quot;&gt;Blender remote rendering with Flamenco&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/590-fooocus-rethinking-of-sd-and-mj&quot;&gt;Fooocus: Rethinking of SD and MJ&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/049/original/il_photogrammetry_with_meshroom.png?1724833423"
        length="0"
        type="image/jpeg"/>
      <pubDate>Tue, 21 Jan 2025 09:38:44 +0100</pubDate>
      <guid isPermaLink="false">586</guid>
      <dc:date>2025-01-21 09:38:44 +0100</dc:date>
    </item>
    <item>
      <title>Open WebUI: All in one</title>
      <link>https://www.leadergpu.com/catalog/584-open-webui-all-in-one</link>
      <description>&lt;p&gt;Open WebUI was originally developed for Ollama, which we talked about in one of our articles. Previously, it was called Ollama WebUI, but over time, the focus shifted to universality of application, and the name was changed to Open WebUI. This software solves the key problem of convenient work with large neural network models placed locally or on user-controlled servers.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;The main and most preferred installation method is to deploy a Docker container. This frees you from worrying about dependencies and other components needed for the software to run correctly. However, you can also install Open WebUI by cloning the project repository from GitHub and building it from source code. In this article, we’ll consider both options.&lt;/p&gt;
&lt;p&gt;Before you begin, make sure that the GPU drivers are installed on the server. Our instruction &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;Install NVIDIA® drivers in Linux&lt;/a&gt; will help you do this.&lt;/p&gt;
&lt;h3&gt;Using Docker&lt;/h3&gt;
&lt;p&gt;If you’ve just ordered a server, then the Docker Engine itself and the necessary set of tools for passing GPUs to the container will be missing. We don’t recommend installing Docker from the standard Ubuntu repository, since it may be outdated and not support all modern options. It would be better to use the installation script posted on the official website:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://get.docker.com/ | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition to Docker, you need to install the NVIDIA® Container Toolkit, so enable the NVIDIA® repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&amp;&amp; curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed &#39;s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g&#39; | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update your package cache and install NVIDIA® Container Toolkit:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install nvidia-container-toolkit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the toolchain to work, you’ll need to restart the Docker daemon:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl restart docker&lt;/code&gt;&lt;/pre&gt;
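&lt;p&gt;Optionally, you can check that the GPU is now visible from inside containers. A quick diagnostic, assuming the drivers are installed correctly on the host, is to run nvidia-smi in a disposable container:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run --rm --gpus=all ubuntu nvidia-smi&lt;/code&gt;&lt;/pre&gt;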
&lt;p&gt;Now you can run the desired container. Note that the following command doesn&#39;t isolate the container from the host network, so that you can later enable additional integrations, such as image generation via the Stable Diffusion WebUI. The command automatically downloads all layers of the image and runs it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -d --network=host --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama&lt;/code&gt;&lt;/pre&gt;
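&lt;p&gt;Since the container starts in the background (the -d key), it may take a short while to initialize. You can check its status and follow the startup logs with standard Docker commands:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker ps
sudo docker logs -f open-webui&lt;/code&gt;&lt;/pre&gt;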
&lt;h3&gt;Using Git&lt;/h3&gt;
&lt;h4&gt;Ubuntu 22.04&lt;/h4&gt;
&lt;p&gt;First, you need to clone the contents of the repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/open-webui/open-webui.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd open-webui/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the example configuration (you can modify it if necessary), which will set the environment variables for the build:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cp -RPp .env.example .env&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install the NVM installer, which will help you install the required version of Node.js on the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, you need to close and reopen the SSH session so that the next command works correctly.&lt;/p&gt;
&lt;p&gt;Install Node Package Manager:&lt;/p&gt;  
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install npm&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Install Node.js version 22 (current at the time of writing this article):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nvm install 22&lt;/code&gt;&lt;/pre&gt;
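&lt;p&gt;You can verify which version is now active:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;node -v&lt;/code&gt;&lt;/pre&gt;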
&lt;p&gt;Install the dependencies required for further assembly:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;npm install&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s start the build. Please note that it requires more than 4GB of free RAM:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;npm run build&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The frontend is ready; now it’s time to prepare the backend. Go to the directory with the same name:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ./backend&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install pip and ffmpeg packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install python3-pip ffmpeg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before installing the Python dependencies, you need to add a new path to the PATH environment variable. Open your shell configuration file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano ~/.bashrc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the following line to the end of the file:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;export PATH=&quot;/home/usergpu/.local/bin:$PATH&quot;&lt;/code&gt;&lt;/pre&gt;
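&lt;p&gt;Apply the change to the current session (or reconnect via SSH):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;source ~/.bashrc&lt;/code&gt;&lt;/pre&gt;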
&lt;p&gt;Update pip itself to the latest version:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 -m pip install --upgrade pip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you can install the dependencies:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -r requirements.txt -U&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install Ollama:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://ollama.com/install.sh | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Everything is ready to launch the application:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;bash start.sh&lt;/code&gt;&lt;/pre&gt;
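&lt;p&gt;By default, the backend listens on port 8080 (the same port used later in this article). As a quick check that it is up, you can, for example, query it from another terminal:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -I http://localhost:8080&lt;/code&gt;&lt;/pre&gt;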
&lt;h4&gt;Ubuntu 24.04 / 24.10&lt;/h4&gt;
&lt;p&gt;When installing Open WebUI on Ubuntu 24.04/24.10, you&#39;ll face a key challenge: the operating system uses Python 3.12 by default, while Open WebUI only supports version 3.11. You can&#39;t simply downgrade Python, as doing so would break the operating system. Since the python3.11 package isn&#39;t available in the standard repositories, you&#39;ll need to create a virtual environment with the correct Python version.&lt;/p&gt;
&lt;p&gt;The best solution is to use the Conda package management system. Conda works like pip but adds virtual environment support similar to venv. Since you only need basic functionality, you&#39;ll use Miniforge, a lightweight Conda-based distribution. Download the latest release from GitHub:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -L -O &quot;https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;bash Miniforge3-$(uname)-$(uname -m).sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&#39;s create a virtual environment named pyenv and specify Python version 3.11:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda create -n pyenv python=3.11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Activate the created environment:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;conda activate pyenv&lt;/code&gt;&lt;/pre&gt;
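&lt;p&gt;Verify that the environment provides the expected interpreter:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python --version&lt;/code&gt;&lt;/pre&gt;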
&lt;p&gt;Now you can proceed with the standard Open WebUI installation steps for Ubuntu 22.04. The virtual environment ensures that all installation scripts run smoothly, without package version conflicts.&lt;/p&gt;
&lt;h2&gt;Models&lt;/h2&gt;
&lt;h3&gt;Ollama library&lt;/h3&gt;
&lt;p&gt;Open WebUI allows you to upload models directly from the web interface, specifying only the name in the format &lt;b translate=&quot;no&quot;&gt;model:size&lt;/b&gt;. To do this, navigate to &lt;a href=&quot;http://192.168.88.20:8080/admin/settings&quot;&gt;http://192.168.88.20:8080/admin/settings&lt;/a&gt; and click &lt;b translate=&quot;no&quot;&gt;Connections&lt;/b&gt;. Then click the wrench icon next to the &lt;b translate=&quot;no&quot;&gt;http://localhost:11434&lt;/b&gt; entry. Look up the model’s name in the &lt;a href=&quot;https://ollama.com/library&quot;&gt;library&lt;/a&gt;, enter it, and click the download icon:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/027/original/sh_open_webui_all_in_one_1.png?1722870065&quot; alt=&quot;Open WebUI manage models&quot;&gt;
&lt;p&gt;After that, the system will automatically download the required model, and it will immediately become available for use. Depending on the selected size, the download time will vary. Before downloading, make sure that there is enough space on the disk drive. For more information, see the article &lt;a href=&quot;https://www.leadergpu.com/articles/492-disk-partitioning-in-linux&quot;&gt;Disk partitioning in Linux&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Custom models&lt;/h3&gt;
&lt;p&gt;If you need to integrate a neural network model that is not in the Ollama library, you can use an experimental function and load an arbitrary model in GGUF format. To do this, go to &lt;b translate=&quot;no&quot;&gt;Settings - Admin Settings - Connections&lt;/b&gt; and click the wrench icon next to &lt;b translate=&quot;no&quot;&gt;http://localhost:11434&lt;/b&gt;. Click &lt;b translate=&quot;no&quot;&gt;Show&lt;/b&gt; in the &lt;b translate=&quot;no&quot;&gt;Experimental&lt;/b&gt; section. By default, file mode is active, which lets you upload a file from your local computer. If you click &lt;b translate=&quot;no&quot;&gt;File Mode&lt;/b&gt;, it switches to &lt;b translate=&quot;no&quot;&gt;URL Mode&lt;/b&gt;, which lets you specify the URL of the model file so the server downloads it automatically:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/028/original/sh_open_webui_all_in_one_2.png?1736411361&quot; alt=&quot;Open WebUI upload gguf model&quot;&gt;
&lt;h2&gt;RAG&lt;/h2&gt;
&lt;p&gt;In addition to a convenient and functional web interface, Open WebUI helps expand the capabilities of different models by making it possible to use them together. For example, it’s easy to upload documents to form a RAG (Retrieval-Augmented Generation) vector database. When generating a response to the user, the LLM can then rely not only on data obtained during training, but also on data placed in such a vector database.&lt;/p&gt;
&lt;h3&gt;Documents&lt;/h3&gt;
&lt;p&gt;By default, Open WebUI scans the /data/docs directory for files to embed into the vector database, performing the transformation with the built-in &lt;a href=&quot;https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2&quot;&gt;all-MiniLM-L6-v2&lt;/a&gt; model. This is not the only model suited to the task, so it makes sense to try other options as well.&lt;/p&gt;
&lt;p&gt;Text documents, cleared of tags and other special characters, are best suited for RAG. Of course, you can upload documents as is, but this can greatly affect the accuracy of the generated answers. For example, if you have a knowledge base in Markdown format, you can first clear it of formatting and only then upload it to /data/docs.&lt;/p&gt;
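&lt;p&gt;One possible way to strip the formatting in bulk is the pandoc converter (the file name knowledge.md here is just an illustration):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install pandoc
pandoc -f markdown -t plain knowledge.md -o knowledge.txt&lt;/code&gt;&lt;/pre&gt;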
&lt;h3&gt;Web search&lt;/h3&gt;
&lt;p&gt;In addition to local documents, the neural network model can be instructed to use any websites as a data source. This will allow it to answer questions using not only the data it was trained on, but also data hosted on websites specified by the user.&lt;/p&gt;
&lt;p&gt;In fact, this is a type of RAG that takes HTML pages as input, transforms them in a special way, and stores the result in the vector database. Searching such a database is very fast, so the neural network model can quickly generate a response based on the results. Open WebUI supports different search engines but can only work with one at a time, which is specified in the settings.&lt;/p&gt;
&lt;p&gt;To include web search results in neural network responses, click &lt;b translate=&quot;no&quot;&gt;+&lt;/b&gt; (plus symbol) and slide the &lt;b translate=&quot;no&quot;&gt;Web Search&lt;/b&gt; switch:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/029/original/sh_open_webui_all_in_one_3.png?1722870140&quot; alt=&quot;Open WebUI enable Web Search&quot;&gt;
&lt;h2&gt;Image generation&lt;/h2&gt;
&lt;p&gt;The highlight of Open WebUI is that this software allows you to combine several neural networks with different tasks to solve a single problem. For example, Llama 3.1 perfectly conducts a dialogue with the user in several languages, but its answers will be exclusively text. It can’t generate images, so there is no way to illustrate its answers.&lt;/p&gt;
&lt;p&gt;Stable Diffusion, which we often wrote about, is the opposite: this neural network generates images perfectly, but it can’t work with texts at all. The developers of Open WebUI tried to combine the strengths of both neural networks in one dialogue and implemented the following scheme of work.&lt;/p&gt;
&lt;p&gt;When you conduct a dialogue in Open WebUI, a special button appears next to each neural network response. By clicking on it, you’ll receive an illustration of this response directly in the chat:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/030/original/sh_open_webui_all_in_one_4.png?1722870173&quot; alt=&quot;Open WebUI images in dialogue&quot;&gt;
&lt;p&gt;This is achieved by calling the Stable Diffusion WebUI API; at the moment, connections to Automatic1111 builds and to ComfyUI are available. You can also generate images via the DALL-E neural network, but it can’t be deployed locally: it is a paid, closed-source image generation service.&lt;/p&gt;
&lt;p&gt;This feature will only work if, in addition to Open WebUI with Ollama, Stable Diffusion WebUI is installed on the server. You can find the installation instructions &lt;a href=&quot;https://www.leadergpu.com/articles/506-stable-diffusion-webui&quot;&gt;here&lt;/a&gt;. The only thing worth mentioning is that when running the ./webui.sh script, you’ll need to specify an additional key to enable the API:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh --listen --api --gradio-auth user:password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another pitfall may arise from a lack of video memory. If you encounter this, two useful keys can help: &lt;b translate=&quot;no&quot;&gt;--medvram&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;--lowvram&lt;/b&gt;. They help avoid out-of-memory errors when starting generation.&lt;/p&gt;
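&lt;p&gt;For example, to launch with the reduced-VRAM profile, simply append the key to the command shown above:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh --listen --api --medvram --gradio-auth user:password&lt;/code&gt;&lt;/pre&gt;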
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/583-how-does-ollama-work&quot;&gt;How does Ollama work&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/026/original/il_open_webui_all_in_one.png?1722870022"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 15:21:46 +0100</pubDate>
      <guid isPermaLink="false">584</guid>
      <dc:date>2025-01-20 15:21:46 +0100</dc:date>
    </item>
    <item>
      <title>How does Ollama work</title>
      <link>https://www.leadergpu.com/catalog/583-how-does-ollama-work</link>
      <description>&lt;p&gt;Ollama is a tool for running large neural network models locally. The use of public services is often perceived by businesses as a potential risk for leakage of confidential and sensitive data. Therefore, deploying LLM on a controlled server allows you to independently manage the data placed on it while utilizing the strengths of LLM.&lt;/p&gt;
&lt;p&gt;This also helps avoid the unpleasant situation of vendor lock-in, where any public service can unilaterally stop providing services. Of course, the initial goal is to enable the use of generative neural networks in locations where internet access is absent or difficult (for example, on an airplane).&lt;/p&gt;
&lt;p&gt;The idea was to simplify the launch, control, and fine-tuning of LLMs. Instead of complex multi-step instructions, Ollama allows you to execute one simple command and, after some time, receive a working result: a locally running neural network model that you can talk to through a web interface, plus an API for easy integration into other applications.&lt;/p&gt;
&lt;p&gt;For many developers, this became a very useful tool, since in most cases Ollama can be integrated with the IDE they use, providing recommendations or ready-made code right while they work on an application.&lt;/p&gt;
&lt;p&gt;Ollama was originally intended only for computers with the macOS operating system, but was later ported to Linux and Windows. A special version has also been released for working in containerized environments such as Docker. Currently, it works equally well on both desktops and any dedicated server with a GPU. Ollama supports the ability to switch between different models out-of-the-box and maximizes all available resources. Of course, these models may not perform as well on a regular desktop, but they function quite adequately.&lt;/p&gt;
&lt;h2&gt;How to install Ollama&lt;/h2&gt;
&lt;p&gt;Ollama can be installed in two ways: without containerization, using an installation script, or as a ready-made Docker container. The first method makes it easier to manage the components of the installed system and the models, but is less fault-tolerant. The second method is more fault-tolerant, but you need to take into account everything inherent to containers: slightly more complex management and a different approach to data storage.&lt;/p&gt;
&lt;p&gt;Regardless of the chosen method, several additional steps are needed to prepare the operating system.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;Update the package cache repository and installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install all necessary GPU drivers using the auto-install feature:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Installation via script&lt;/h3&gt;
&lt;p&gt;The following script detects the current operating system architecture and installs the appropriate version of Ollama:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://ollama.com/install.sh | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During operation, the script will create a separate &lt;b translate=&quot;no&quot;&gt;ollama&lt;/b&gt; user, under which the corresponding daemon will be launched. Incidentally, the same script functions well in WSL2, enabling the installation of the Linux version of Ollama on Windows Server.&lt;/p&gt;
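&lt;p&gt;Once the script finishes, you can make sure the daemon is active:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl status ollama&lt;/code&gt;&lt;/pre&gt;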
&lt;h3&gt;Installation via Docker&lt;/h3&gt;
&lt;p&gt;There are various methods to install Docker Engine on a server. The easiest way is to use a specific script that installs the current Docker version. This approach is effective for Ubuntu Linux, from version 20.04 (LTS) up to the latest version, Ubuntu 24.04 (LTS):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://get.docker.com/ | sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Docker containers to interact properly with the GPU, an additional toolkit must be installed. Since it’s not available in the basic Ubuntu repositories, you need to first add a third-party repository using the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&amp;&amp; curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed &#39;s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g&#39; | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package cache repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And install the &lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html&quot;&gt;nvidia-container-toolkit&lt;/a&gt; package:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install nvidia-container-toolkit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don’t forget to restart the docker daemon via systemctl:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl restart docker&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It’s time to download and run Ollama with the Open WebUI web interface:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open a web browser and navigate to &lt;b translate=&quot;no&quot;&gt;http://[server-ip]:3000&lt;/b&gt;.&lt;/p&gt;
&lt;h2&gt;Download and run the models&lt;/h2&gt;
&lt;h3&gt;Via command line&lt;/h3&gt;
&lt;p&gt;Just run the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ollama run llama3&lt;/code&gt;&lt;/pre&gt;
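&lt;p&gt;The same model can also be queried through the Ollama REST API, which listens on port 11434 by default. For example:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl http://localhost:11434/api/generate -d &#39;{
  &quot;model&quot;: &quot;llama3&quot;,
  &quot;prompt&quot;: &quot;Why is the sky blue?&quot;,
  &quot;stream&quot;: false
}&#39;&lt;/code&gt;&lt;/pre&gt;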
&lt;h3&gt;Via WebUI&lt;/h3&gt;
&lt;p&gt;Open &lt;b translate=&quot;no&quot;&gt;Settings &gt; Models&lt;/b&gt;, type the name of the required model, for example &lt;b translate=&quot;no&quot;&gt;llama3&lt;/b&gt;, and click the button with the download symbol:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/990/original/sh_how_does_ollama_work_1.png?1717153168&quot; alt=&quot;Models download&quot;&gt;
&lt;p&gt;The model will download and install automatically. Once completed, close the settings window and select the downloaded model. After this you can begin a dialogue with it:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/991/original/sh_how_does_ollama_work_2.png?1717153253&quot; alt=&quot;Start chatting&quot;&gt;
&lt;h2&gt;VSCode integration&lt;/h2&gt;
&lt;p&gt;If you have installed Ollama using the installation script, you can launch any of the supported models almost instantly. In the next example, we will run the default model expected by the Ollama Autocoder extension (&lt;b translate=&quot;no&quot;&gt;openhermes2.5-mistral:7b-q4_K_M&lt;/b&gt;):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ollama run openhermes2.5-mistral:7b-q4_K_M&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, Ollama’s API only accepts connections from the local host. So, before installing and using the extension for Visual Studio Code, you need to set up port forwarding: specifically, forward remote port &lt;b translate=&quot;no&quot;&gt;11434&lt;/b&gt; to your local computer. You can find an example of how to do this in our article about &lt;a href=&quot;https://www.leadergpu.com/articles/508-easy-diffusion-ui&quot;&gt;Easy Diffusion WebUI&lt;/a&gt;.&lt;/p&gt;
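&lt;p&gt;As a quick illustration, the forwarding can be set up with a single SSH command (substitute your own user name and server address):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ssh -L 11434:127.0.0.1:11434 usergpu@[LeaderGPU_server_IP_address]&lt;/code&gt;&lt;/pre&gt;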
&lt;p&gt;Type &lt;b translate=&quot;no&quot;&gt;Ollama Autocoder&lt;/b&gt; in a search field, then click &lt;b translate=&quot;no&quot;&gt;Install&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/992/original/sh_how_does_ollama_work_3.png?1717153306&quot; alt=&quot;Install Ollama Autocoder&quot;&gt;
&lt;p&gt;After installing the extension, a new item titled &lt;b translate=&quot;no&quot;&gt;Autocomplete with Ollama&lt;/b&gt; will be available in the command palette. Begin coding and initiate this command.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/993/original/sh_how_does_ollama_work_4.png?1717153542&quot; alt=&quot;Autocomplete with Ollama&quot;&gt;
&lt;p&gt;The extension will connect to the LeaderGPU server using port forwarding and within a few seconds, the generated code will display on your screen:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/994/original/sh_how_does_ollama_work_5.png?1717153572&quot; alt=&quot;Test Python example&quot;&gt;
&lt;p&gt;You can assign this command to a hotkey and use it whenever you want to supplement your code with a generated fragment. This is just one example of the available VSCode extensions. Forwarding a port from a remote server to local computers enables you to set up a single server with a running LLM for an entire development team, ensuring that the submitted code never reaches third-party companies or attackers.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/584-open-webui-all-in-one&quot;&gt;Open WebUI: All in one&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/989/original/il_how_does_ollama_work.png?1717153121"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 15:16:02 +0100</pubDate>
      <guid isPermaLink="false">583</guid>
      <dc:date>2025-01-20 15:16:02 +0100</dc:date>
    </item>
    <item>
      <title>PrivateGPT: AI for documents</title>
      <link>https://www.leadergpu.com/catalog/581-privategpt-ai-for-documents</link>
      <description>&lt;p&gt;Large language models have greatly evolved over the past few years and have become effective tools for many tasks. The only problem with their use is that most products based on these models utilize ready-made services from third-party companies. This usage has the potential to leak sensitive data, so many companies avoid uploading internal documents into public LLM services.&lt;/p&gt;
&lt;p&gt;A project like PrivateGPT could be a solution. It is initially designed for completely local use. Its strength is that you can submit various documents as input, and the neural network will read them for you and provide its own comments in response to your requests. For example, you can “feed” large texts to it and ask it to draw some conclusions based on the user’s request. This allows you to significantly save time on proofreading.&lt;/p&gt;
&lt;p&gt;This is particularly true for professional fields like medicine. For instance, a doctor can make a diagnosis and request the neural network to confirm it based on the uploaded array of documents. This enables obtaining an additional independent opinion, thereby reducing the number of medical errors. Since requests and documents do not leave the server, one can be assured that the received data will not appear in the public domain.&lt;/p&gt;
&lt;p&gt;Today, we’ll show you how to deploy a neural network on dedicated LeaderGPU servers with the Ubuntu 22.04 LTS operating system in just 20 minutes.&lt;/p&gt;
&lt;h2&gt;System preparation&lt;/h2&gt;
&lt;p&gt;Begin by updating your packages to the latest version:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, install additional packages, libraries, and the NVIDIA® graphics driver. All of these will be needed to successfully build the software and run it on the GPU:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install build-essential git gcc cmake make openssl libssl-dev libbz2-dev libreadline-dev libsqlite3-dev zlib1g-dev libncursesw5-dev libgdbm-dev libc6-dev tk-dev libffi-dev lzma liblzma-dev&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;CUDA® 12.4 install&lt;/h2&gt;
&lt;p&gt;In addition to the driver, you need to install the NVIDIA® CUDA® toolkit. These instructions were tested on CUDA® 12.4, but everything should also work on CUDA® 12.2. Keep in mind, however, that you’ll need to indicate the version you installed when specifying the path to the executable files.&lt;/p&gt;
&lt;p&gt;Run the following commands sequentially:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get update &amp;&amp; sudo apt-get -y install cuda-toolkit-12-4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More information on installing CUDA® can be &lt;a href=&quot;https://www.leadergpu.com/articles/615-install-cuda-toolkit-in-linux&quot;&gt;found&lt;/a&gt; in our Knowledge Base. Now, reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
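&lt;p&gt;After the reboot, you can confirm that the toolkit is in place by querying the compiler at the same path that will be used later in this guide:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;/usr/local/cuda-12/bin/nvcc --version&lt;/code&gt;&lt;/pre&gt;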
&lt;h2&gt;PyEnv install&lt;/h2&gt;
&lt;p&gt;It’s time to install a simple Python version control utility called PyEnv. This is a significantly improved fork of a similar project for Ruby (&lt;a href=&quot;https://github.com/rbenv/rbenv&quot;&gt;rbenv&lt;/a&gt;), adapted to work with Python. It can be installed with a one-line script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl https://pyenv.run | bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you need to add some variables to the end of the .bashrc file, which is executed at login. The first three lines are responsible for the correct operation of PyEnv, and the fourth is needed for Poetry, which will be installed later:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano .bashrc&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;export PYENV_ROOT=&quot;$HOME/.pyenv&quot;
[[ -d $PYENV_ROOT/bin ]] &amp;&amp; export PATH=&quot;$PYENV_ROOT/bin:$PATH&quot;
eval &quot;$(pyenv init -)&quot;
export PATH=&quot;/home/usergpu/.local/bin:$PATH&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the settings you’ve made:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;source .bashrc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install Python version 3.11:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pyenv install 3.11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set Python 3.11 as the version for the current directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pyenv local 3.11&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Poetry install&lt;/h2&gt;
&lt;p&gt;The next piece of the puzzle is Poetry. This is an analogue of pip for managing dependencies in Python projects. The author of Poetry was tired of constantly dealing with different configuration methods, such as &lt;b translate=&quot;no&quot;&gt;setup.cfg&lt;/b&gt;, &lt;b translate=&quot;no&quot;&gt;requirements.txt&lt;/b&gt;, &lt;b translate=&quot;no&quot;&gt;MANIFEST.ini&lt;/b&gt;, and others. This became the driver for the development of a new tool that uses a &lt;b translate=&quot;no&quot;&gt;pyproject.toml&lt;/b&gt; file, which stores all the basic information about a project, not just a list of dependencies.&lt;/p&gt;
&lt;p&gt;Install Poetry:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -sSL https://install.python-poetry.org | python3 -&lt;/code&gt;&lt;/pre&gt;
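&lt;p&gt;Check that Poetry is available in your PATH (this relies on the export line added to .bashrc earlier):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;poetry --version&lt;/code&gt;&lt;/pre&gt;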
&lt;h2&gt;PrivateGPT install&lt;/h2&gt;
&lt;p&gt;Now that everything is ready, you can clone the PrivateGPT repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/imartinez/privateGPT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Go to the downloaded repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd privateGPT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run dependency installation using Poetry while enabling additional components:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;ui&lt;/b&gt; - adds a &lt;a href=&quot;https://www.gradio.app/&quot;&gt;Gradio&lt;/a&gt; based management web interface to the backend application;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;embeddings-huggingface&lt;/b&gt; - enables support for embedding models downloaded from &lt;a href=&quot;https://huggingface.co/&quot;&gt;HuggingFace&lt;/a&gt;;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;llms-llama-cpp&lt;/b&gt; - adds support for direct inference of models in GGUF format;&lt;/li&gt;
    &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;vector-stores-qdrant&lt;/b&gt; - adds the qdrant vector database.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;poetry install --extras &quot;ui embeddings-huggingface llms-llama-cpp vector-stores-qdrant&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set your Hugging Face access token. For additional information please read &lt;a href=&quot;https://huggingface.co/docs/hub/security-tokens&quot; target=&quot;_blank&quot;&gt;this article&lt;/a&gt;:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export HF_TOKEN=&quot;YOUR_HUGGING_FACE_ACCESS_TOKEN&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, run the installation script, which will automatically download the model and weights (Meta Llama 3.1 8B Instruct by default):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;poetry run python scripts/setup&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following command recompiles &lt;b translate=&quot;no&quot;&gt;llms-llama-cpp&lt;/b&gt; separately with NVIDIA® CUDA® support enabled, in order to offload workloads to the GPU:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS=&quot;-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=native&quot; FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you get an error like &lt;b&gt;nvcc fatal : Unsupported gpu architecture &#39;compute_&#39;&lt;/b&gt;, just specify the exact architecture of the GPU you are using. For example: &lt;b&gt;-DCMAKE_CUDA_ARCHITECTURES=86&lt;/b&gt; for the NVIDIA® RTX™ 3090.&lt;/p&gt;
&lt;p&gt;The final step before beginning is to install support for asynchronous calls (async/await):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install asyncio&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;PrivateGPT run&lt;/h2&gt;
&lt;p&gt;Run PrivateGPT using a single command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;make run&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open your web browser and go to the page &lt;b translate=&quot;no&quot;&gt;http://[LeaderGPU_server_IP_address]:8001&lt;/b&gt;&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/984/original/sh_privategpt_ai_for_documents_1.png?1714731952&quot; alt=&quot;PrivateGPT WebUI&quot;&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/571-starcoder-your-local-coding-assistant&quot;&gt;StarCoder: your local coding assistant&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/590-fooocus-rethinking-of-sd-and-mj&quot;&gt;Fooocus: Rethinking of SD and MJ&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/983/original/il_privategpt_ai_for_documents.png?1714731899"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 12:01:00 +0100</pubDate>
      <guid isPermaLink="false">581</guid>
      <dc:date>2025-01-20 12:01:00 +0100</dc:date>
    </item>
    <item>
      <title>Qwen 2 vs Llama 3</title>
      <link>https://www.leadergpu.com/catalog/579-qwen-2-vs-llama-3</link>
      <description>&lt;p&gt;Large Language Models (LLMs) have significantly impacted our lives. Despite understanding their internal structure, these models remain a focal point for scientists who often liken them to a “black box”. The final result depends not only on the LLM’s design but also on its training and the data used for training.&lt;/p&gt;
&lt;p&gt;While scientists find research opportunities, end-users are primarily interested in two things: speed and quality. These criteria play a crucial role in the selection process. To accurately compare two LLMs, many seemingly unrelated factors need to be standardized.&lt;/p&gt;
&lt;p&gt;The equipment used for inference and the software environment, including the operating system, driver versions, and software packages, have the most significant impact. It’s essential to select an LLM version that operates on various equipment and choose a speed metric that’s easily comprehensible.&lt;/p&gt;
&lt;p&gt;We selected ‘tokens per second’ (tokens/s) as this metric. It’s important to note that a token ≠ a word. The LLM breaks words into simpler components, typical of a specific language, referred to as tokens.&lt;/p&gt;
&lt;p&gt;The statistical predictability of the next character varies across languages, so tokenization will differ. For instance, in English, approximately 100 tokens are derived from every 75 words. In languages using the Cyrillic alphabet, the number of tokens per word may be higher. So, 75 words in a Cyrillic language, like Russian, could equate to 120-150 tokens.&lt;/p&gt;
&lt;p&gt;You can verify this using OpenAI’s &lt;a href=&quot;https://platform.openai.com/tokenizer&quot;&gt;Tokenizer&lt;/a&gt; tool. It shows how many tokens a text fragment is broken into, making ‘tokens per second’ a good indicator of an LLM’s natural language processing speed and performance.&lt;/p&gt;
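&lt;p&gt;To put this metric into perspective: at a generation speed of 15 tokens/s, a typical 75-word English answer (roughly 100 tokens) takes about 100 / 15 ≈ 7 seconds to appear in full.&lt;/p&gt;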
&lt;p&gt;Each test was conducted on the Ubuntu 22.04 LTS operating system with NVIDIA® drivers version 535.183.01 and the NVIDIA® CUDA® 12.5 Toolkit installed. Questions were formulated to assess the LLM’s quality and speed. The processing speed of each answer was recorded and will contribute to the average value for each tested configuration.&lt;/p&gt;
&lt;p&gt;We began testing various GPUs, from the latest models to the older ones. A crucial condition for the test was that we measured the performance of only one GPU, even if multiple GPUs were present in the server configuration. This is because the performance of a configuration with multiple GPUs depends on additional factors, such as the presence of a high-speed interconnect between them (NVLink®).&lt;/p&gt;
&lt;p&gt;In addition to speed, we also attempted to evaluate the quality of responses on a 5-point scale, where 5 represents the best outcome. This information is provided here for general understanding only. Each time, we’ll pose the same questions to the neural network and attempt to discern how accurately each one comprehends what the user wants from it.&lt;/p&gt;
&lt;h2&gt;Qwen 2&lt;/h2&gt;
&lt;p&gt;Recently, a team of developers from Alibaba Group presented the second version of their generative neural network Qwen. It understands 27 languages and is well optimized for them. Qwen 2 comes in different sizes to make it easy to deploy on any device (from highly resource-constrained embedded systems to a dedicated server with GPUs):&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;0.5B: suitable for IoT and embedded systems;&lt;/li&gt;
    &lt;li&gt;1.5B: an extended version for embedded systems, used where the capabilities of 0.5B will not be enough;&lt;/li&gt;
    &lt;li&gt;7B: medium-sized model, well suited for natural language processing;&lt;/li&gt;
    &lt;li&gt;57B: high-performance large model suitable for demanding applications;&lt;/li&gt;
    &lt;li&gt;72B: the ultimate Qwen 2 model, designed to solve the most complex problems and process large volumes of data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Versions 0.5B and 1.5B were trained on datasets with a context length of 32K. Versions 7B and 72B were already trained on the 128K context. The compromise model 57B was trained on datasets with a context length of 64K. The creators position Qwen 2 as an analog of Llama 3 capable of solving the same problems, but much faster.&lt;/p&gt;
&lt;h2&gt;Llama 3&lt;/h2&gt;
&lt;p&gt;The third version of the generative neural network from the MetaAI Llama family was introduced in April 2024. It was released, unlike Qwen 2, in only two versions: 8B and 70B. These models were positioned as a universal tool for solving many problems in various cases. It continued the trend towards multilingualism and multimodality, while simultaneously becoming faster than the previous versions and supporting a longer context length.&lt;/p&gt;
&lt;p&gt;The creators of Llama 3 tried to fine-tune the models to reduce the percentage of statistical hallucinations and increase the variety of answers. So Llama 3 is quite capable of giving practical advice, helping to write a business letter, or speculating on a topic specified by the user. The datasets on which Llama 3 models were trained had a context length of 128K and more than 5% included data in 30 languages. However, as stated in the press release, generation performance in English will be significantly higher than in any other language.&lt;/p&gt;
&lt;h2&gt;Comparison&lt;/h2&gt;
&lt;h3&gt;NVIDIA® RTX™ A6000&lt;/h3&gt;
&lt;p&gt;Let’s start our speed measurements with the NVIDIA® RTX™ A6000 GPU, based on the Ampere architecture (not to be confused with the NVIDIA® RTX™ A6000 Ada). This card has rather modest specifications, but its 48 GB of VRAM allow it to work with fairly large neural network models. Unfortunately, its low clock speed and memory bandwidth result in low inference speed for text LLMs.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/995/original/il_qwen_2_vs_llama_3_1.png?1720184216&quot; alt=&quot;Nvidia A6000 chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;Immediately after launch, the Qwen 2 neural network began to outperform Llama 3. When answering the same questions, the average difference in speed was 24% in favor of Qwen 2. The speed of generating answers was in the range of 11-16 tokens per second. This is 2-3 times faster than trying to run generation even on a powerful CPU, but it is still the most modest result in our ranking.&lt;/p&gt;
&lt;h3&gt;NVIDIA® RTX™ 3090&lt;/h3&gt;
&lt;p&gt;The next GPU is also built on the Ampere architecture and has half the video memory, but it operates at a higher effective memory frequency (19500 MHz versus 16000 MHz). Its video memory bandwidth is also higher (936.2 GB/s versus 768 GB/s). Both of these factors considerably increase the performance of the RTX™ 3090, even taking into account the fact that it has 256 fewer CUDA® cores.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/996/original/il_qwen_2_vs_llama_3_2.png?1720184259&quot; alt=&quot;Nvidia RTX 3090 chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;Here you can clearly see that Qwen 2 is much faster (up to 23%) than Llama 3 when performing the same tasks. Regarding generation quality, the multilingual support of Qwen 2 is truly worthy of praise: the model always answers in the same language in which the question was asked. With Llama 3, it often happens that the model understands the question itself but prefers to formulate its answers in English.&lt;/p&gt;
&lt;h3&gt;NVIDIA® RTX™ 4090&lt;/h3&gt;
&lt;p&gt;Now for the most interesting part: let’s see how the NVIDIA® RTX™ 4090 copes with the same task. This card is built on the Ada Lovelace architecture, named after the English mathematician Augusta Ada King, Countess of Lovelace. She is famous as the first programmer in history, even though at the time she wrote her first program there was no assembled computer that could execute it. Nevertheless, her algorithm for calculating Bernoulli numbers is recognized as the world’s first program written to be executed by a machine.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/997/original/il_qwen_2_vs_llama_3_3.png?1720184288&quot; alt=&quot;Nvidia RTX 4090 chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;The graph clearly shows that the RTX™ 4090 handled inference for both models almost twice as fast. Interestingly, in one of the iterations Llama 3 managed to outperform Qwen 2 by 1.2%. However, taking the other iterations into account, Qwen 2 retained its leadership, remaining 7% faster than Llama 3. In all iterations, the quality of responses from both neural networks was high, with a minimal number of hallucinations. The only flaw was that, in rare cases, one or two Chinese characters were mixed into the answers, which did not affect the overall meaning in any way.&lt;/p&gt;
&lt;h3&gt;NVIDIA® RTX™ A40&lt;/h3&gt;
&lt;p&gt;The next card, the NVIDIA® RTX™ A40, on which we ran similar tests, is again built on the Ampere architecture and has 48 GB of video memory on board. Compared to the RTX™ 3090, this memory is slightly faster (20000 MHz vs. 19500 MHz) but has lower bandwidth (695.8 GB/s versus 936.2 GB/s). This is offset by the larger number of CUDA® cores (10752 versus 10496), which overall allows the RTX™ A40 to perform slightly faster than the RTX™ 3090.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/998/original/il_qwen_2_vs_llama_3_4.png?1720184316&quot; alt=&quot;Nvidia A40 chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;As for comparing the speed of the models, Qwen 2 is again ahead of Llama 3 in all iterations. When running on the RTX™ A40, the difference in speed is about 15% with comparable answers. In some tasks, Qwen 2 provided slightly more relevant information, while Llama 3 was as specific as possible and gave examples. Even so, everything has to be double-checked, since both models sometimes produce questionable answers.&lt;/p&gt;
&lt;h3&gt;NVIDIA® L20&lt;/h3&gt;
&lt;p&gt;The last participant in our testing was the NVIDIA® L20. Like the RTX™ 4090, this GPU is built on the Ada Lovelace architecture. It is a fairly new model, introduced in the fall of 2023, with 48 GB of video memory and 11776 CUDA® cores on board. Its memory bandwidth is lower than that of the RTX™ 4090 (864 GB/s versus 936.2 GB/s), as is its effective frequency, so the NVIDIA® L20 inference results for both models are closer to those of the RTX™ 3090 than the RTX™ 4090.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/999/original/il_qwen_2_vs_llama_3_5.png?1720184358&quot; alt=&quot;Nvidia L20 chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;The final test didn’t bring any surprises. Qwen 2 turned out to be faster than Llama 3 in all iterations.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Let’s combine all the collected results into one chart. Qwen 2 was faster than Llama 3 by 7% to 24%, depending on the GPU used. Based on this, we can conclude that if you need high-speed inference for models such as Qwen 2 or Llama 3 on single-GPU configurations, the undoubted leader is the RTX™ 4090, with the RTX™ 3090, A40, or L20 as possible alternatives. However, it isn’t worth running inference for these models on the Ampere-generation RTX™ A6000.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/001/000/original/il_qwen_2_vs_llama_3_6.png?1720184380&quot; alt=&quot;Conclusion chart qwen2-vs-llama3&quot;&gt;
&lt;p&gt;We deliberately didn’t include cards with a smaller amount of video memory, such as the NVIDIA® RTX™ 2080Ti, in the tests, since the above-mentioned 7B and 8B models can’t fit into their memory without quantization. The 1.5B version of Qwen 2, unfortunately, doesn’t produce high-quality answers and can’t serve as a full replacement for the 7B model.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/576-your-own-vicuna-in-linux&quot;&gt;Your own Vicuna in Linux&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/107/original/il_qwen_2_vs_llama_3.png?1737368521"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 11:27:11 +0100</pubDate>
      <guid isPermaLink="false">579</guid>
      <dc:date>2025-01-20 11:27:11 +0100</dc:date>
    </item>
    <item>
      <title>Your own Qwen using HF</title>
      <link>https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf</link>
<description>&lt;p&gt;Large neural network models, with their extraordinary abilities, are firmly rooted in our lives. Recognizing this as an opportunity for future development, large corporations began to develop their own versions of these models. The Chinese giant Alibaba didn’t stand by either. They created their own model, Qwen (Tongyi Qianwen), which became the basis for many other neural network models.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;h3&gt;Update cache and packages&lt;/h3&gt;
&lt;p&gt;Let’s update the package cache and upgrade your operating system before you start setting up Qwen. Also, we need to add Python Installer Packages (PIP), if it isn’t already present in the system. Please note that for this guide, we are using Ubuntu 22.04 LTS as the operating system:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade &amp;&amp; sudo apt install python3-pip&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install NVIDIA® drivers&lt;/h3&gt;
&lt;p&gt;You can use the automated utility that is included in Ubuntu distributions by default:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, you can install NVIDIA® drivers manually using our &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;step-by-step guide&lt;/a&gt;. Don’t forget to reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Text generation web UI&lt;/h2&gt;
&lt;h3&gt;Clone the repository&lt;/h3&gt;
&lt;p&gt;Open the working directory on the SSD:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clone the project’s repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/oobabooga/text-generation-webui.git&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install requirements&lt;/h3&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd text-generation-webui&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check and install all missing components:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Add SSH key to HF&lt;/h2&gt;
&lt;p&gt;Before starting, you need to set up port forwarding (remote port 7860 to 127.0.0.1:7860) in your SSH-client. You can find additional information in the following article: &lt;a href=&quot;https://www.leadergpu.com/articles/488-connect-to-a-linux-server&quot;&gt;Connect to Linux server&lt;/a&gt;.&lt;/p&gt;
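&lt;p&gt;With a command-line OpenSSH client, such forwarding can be set up like this (the username and server address below are placeholders; replace them with your own):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ssh -L 7860:127.0.0.1:7860 usergpu@[IP_ADDRESS]&lt;/code&gt;&lt;/pre&gt;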
&lt;p&gt;Update the package cache repository and installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate and add an SSH-key that you can use in Hugging Face:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/.ssh &amp;&amp; ssh-keygen&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the keypair is generated, you can display the public key in the terminal emulator:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cat id_rsa.pub&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy all information starting from &lt;b translate=&quot;no&quot;&gt;ssh-rsa&lt;/b&gt; and ending with &lt;b translate=&quot;no&quot;&gt;usergpu@gpuserver&lt;/b&gt; as shown in the following screenshot:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/907/original/sh_llama3_quick_start_3.png?1713533169&quot; alt=&quot;Copy RSA key&quot;&gt;
&lt;p&gt;Open a web browser, type &lt;a href=&quot;https://huggingface.co/&quot;&gt;https://huggingface.co/&lt;/a&gt; into the address bar and press &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt;. Log into your HF-account and open &lt;a href=&quot;https://huggingface.co/settings/profile&quot;&gt;Profile settings&lt;/a&gt;. Then choose &lt;b translate=&quot;no&quot;&gt;SSH and GPG Keys&lt;/b&gt; and click on the &lt;b translate=&quot;no&quot;&gt;Add SSH Key&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/908/original/sh_llama3_quick_start_4.png?1713533229&quot; alt=&quot;Add SSH key&quot;&gt;
&lt;p&gt;Fill in the &lt;b translate=&quot;no&quot;&gt;Key name&lt;/b&gt; and paste the copied &lt;b translate=&quot;no&quot;&gt;SSH Public key&lt;/b&gt; from the terminal. Save the key by pressing &lt;b translate=&quot;no&quot;&gt;Add key&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/909/original/sh_llama3_quick_start_5.png?1713533267&quot; alt=&quot;Paste the key&quot;&gt;
&lt;p&gt;Now, your HF-account is linked with the public SSH-key. The second part (private key) is stored on the server. The next step is to install a specific Git LFS (Large File Storage) extension, which is used for downloading large files such as neural network models. Open your home directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download and run the shell script. This script installs a new third-party repository with git-lfs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can install it using the standard package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get install git-lfs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s configure git to use our HF nickname:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.name &quot;John&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And link it to the email address of the HF account:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.email &quot;john.doe@example.com&quot;&lt;/code&gt;&lt;/pre&gt;
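&lt;p&gt;Before cloning anything, it’s worth checking that Hugging Face accepts your key. Assuming the key was added correctly, the following command should greet you with your HF username:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ssh -T git@hf.co&lt;/code&gt;&lt;/pre&gt;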
&lt;h2&gt;Download the model&lt;/h2&gt;
&lt;p&gt;The next step is to download the model using the repository cloning technique commonly used by software developers. The only difference is that the previously installed Git-LFS will automatically process the marked pointer files and download all the content. Open the necessary directory (/mnt/fastdisk in our example):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command may take some time to complete:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:Qwen/Qwen1.5-32B-Chat-GGUF&lt;/code&gt;&lt;/pre&gt;
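&lt;p&gt;Once cloning finishes, you can check that the large model files were actually fetched rather than left as LFS pointers. In the output of the following command, an asterisk next to a file means its content has been downloaded in full, while a dash marks a pointer only:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git -C Qwen1.5-32B-Chat-GGUF lfs ls-files&lt;/code&gt;&lt;/pre&gt;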
&lt;h2&gt;Run the model&lt;/h2&gt;
&lt;p&gt;Execute a script that will start the web server and specify /mnt/fastdisk as the working directory with models. This script may download some additional components upon first launch.&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./start_linux.sh --model-dir /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open your web browser and select &lt;b translate=&quot;no&quot;&gt;llama.cpp&lt;/b&gt; from the &lt;b translate=&quot;no&quot;&gt;Model loader&lt;/b&gt; drop-down list:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/986/original/sh_your_own_qwen_using_hf_1.png?1716463522&quot; alt=&quot;llama.cpp settings&quot;&gt;
&lt;p&gt;Be sure to set the &lt;b translate=&quot;no&quot;&gt;n-gpu-layers&lt;/b&gt; parameter. It determines how many of the model’s layers are offloaded to the GPU. If you leave it at 0, all calculations will be performed on the CPU, which is quite slow. Once all parameters are set, click the &lt;b translate=&quot;no&quot;&gt;Load&lt;/b&gt; button. After that, go to the &lt;b translate=&quot;no&quot;&gt;Chat&lt;/b&gt; tab and select &lt;b translate=&quot;no&quot;&gt;Instruct mode&lt;/b&gt;. Now, you can enter any prompt and receive a response:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/987/original/sh_your_own_qwen_using_hf_2.png?1716463543&quot; alt=&quot;Qwen chat example&quot;&gt;
&lt;p&gt;Processing will be performed by default on all available GPUs, taking into account the previously specified parameters:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/988/original/sh_your_own_qwen_using_hf_3.png?1716463565&quot; alt=&quot;Qwen task GPU loading&quot;&gt;
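&lt;p&gt;A convenient way to check how well the chosen &lt;b translate=&quot;no&quot;&gt;n-gpu-layers&lt;/b&gt; value fits into VRAM is to watch GPU memory usage in a second SSH session while the model is loaded:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;watch -n 1 nvidia-smi&lt;/code&gt;&lt;/pre&gt;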
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/576-your-own-vicuna-in-linux&quot;&gt;Your own Vicuna in Linux&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/985/original/il_your_own_qwen_using_hf.png?1716463472"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 09:43:46 +0100</pubDate>
      <guid isPermaLink="false">578</guid>
      <dc:date>2025-01-20 09:43:46 +0100</dc:date>
    </item>
    <item>
      <title>Your own Vicuna in Linux</title>
      <link>https://www.leadergpu.com/catalog/576-your-own-vicuna-in-linux</link>
      <description>&lt;p&gt;This article will guide you through the process of deploying a basic LLaMA alternative on a LeaderGPU server. We will utilize the &lt;a href=&quot;https://github.com/lm-sys/FastChat&quot;&gt;FastChat&lt;/a&gt; project and the freely available &lt;a href=&quot;https://lmsys.org/blog/2023-03-30-vicuna/&quot;&gt;Vicuna&lt;/a&gt; model for this purpose. &lt;/p&gt;
&lt;p&gt;The model we&#39;ll be using is based on Meta&#39;s LLaMA architecture but has been optimized for efficient deployment on consumer hardware. This setup provides a good balance between performance and resource requirements, making it suitable for both testing and production environments.&lt;/p&gt;
&lt;h2&gt;Preinstallation&lt;/h2&gt;
&lt;p&gt;Let’s prepare to install FastChat by updating the packages cache repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install NVIDIA® drivers automatically using the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also install these drivers manually with &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;our step-by-step guide&lt;/a&gt;. Then, reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to install PIP (Package Installer for Python):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install python3-pip&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Install FastChat&lt;/h2&gt;
&lt;h3&gt;From PyPi&lt;/h3&gt;
&lt;p&gt;There are two possible ways to install FastChat. You can install it directly from PyPi:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip3 install &quot;fschat[model_worker,webui]&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;From GitHub&lt;/h3&gt;
&lt;p&gt;Alternatively, you can clone the FastChat repository from GitHub and install it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/lm-sys/FastChat.git&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd FastChat&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don’t forget to upgrade PIP before proceeding:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip3 install --upgrade pip&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip3 install -e &quot;.[model_worker,webui]&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Run FastChat&lt;/h2&gt;
&lt;h3&gt;First start&lt;/h3&gt;
&lt;p&gt;To ensure a successful initial launch, it’s recommended to manually call FastChat directly from the command line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This action automatically retrieves and downloads the designated model, which should be specified using the --model-path parameter. The 7b in the name denotes a model with 7 billion parameters. This is the lightest model, suitable for GPUs with 16 GB of video memory. Links to models with a larger number of parameters can be found in the project’s &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/README.md&quot;&gt;Readme&lt;/a&gt; file.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/965/original/sh_your_own_vicuna_in_linux_1.png?1714043790&quot; alt=&quot;Sample Vicuna conversation&quot;&gt;
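&lt;p&gt;If your GPU has more than 16 GB of video memory, the same command works with larger checkpoints; for example, the 13B version (the model name here is taken from the FastChat model list, so verify it in the Readme before use):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 -m fastchat.serve.cli --model-path lmsys/vicuna-13b-v1.5&lt;/code&gt;&lt;/pre&gt;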
&lt;p&gt;Now you can engage in a conversation with the chatbot directly within the command-line interface, or you can set up a web interface, which consists of three components:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Controller&lt;/li&gt;
    &lt;li&gt;Workers&lt;/li&gt;
    &lt;li&gt;Gradio web server&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Set up services&lt;/h3&gt;
&lt;p&gt;Let’s transform each component into a separate systemd service. Create 3 separate files with the following contents:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/vicuna-controller.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Vicuna controller service
[Service]
User=usergpu
WorkingDirectory=/home/usergpu
ExecStart=python3 -m fastchat.serve.controller
Restart=always
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/vicuna-worker.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Vicuna worker service
[Service]
User=usergpu
WorkingDirectory=/home/usergpu
ExecStart=python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
Restart=always
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/systemd/system/vicuna-webserver.service&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;[Unit]
Description=Vicuna web server
[Service]
User=usergpu
WorkingDirectory=/home/usergpu
ExecStart=python3 -m fastchat.serve.gradio_web_server
Restart=always
[Install]
WantedBy=multi-user.target&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Systemd usually updates its daemons database during the system&#39;s startup process. However, you can do this manually using the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl daemon-reload&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s add three new services to the startup and immediately launch them using the &lt;b translate=&quot;no&quot;&gt;--now&lt;/b&gt; option:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl enable vicuna-controller.service --now &amp;&amp; sudo systemctl enable vicuna-worker.service --now &amp;&amp; sudo systemctl enable vicuna-webserver.service --now&lt;/code&gt;&lt;/pre&gt;
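&lt;p&gt;You can confirm that all three units are active before moving on:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;systemctl status vicuna-controller.service vicuna-worker.service vicuna-webserver.service&lt;/code&gt;&lt;/pre&gt;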
&lt;p&gt;However, if you attempt to open a web interface at http://[IP_ADDRESS]:7860, you’ll encounter a completely unusable interface with no available models. To resolve this issue, stop the Web interface service:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl stop vicuna-webserver.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the web service manually:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 -m fastchat.serve.gradio_web_server&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This action calls another script, which registers the previously downloaded model in Gradio’s internal database. Wait a few seconds, then interrupt the process using the &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; shortcut.&lt;/p&gt;
&lt;h3&gt;Add authentication&lt;/h3&gt;
&lt;p&gt;We’ll also take care of security and activate a simple authentication mechanism for accessing the web interface. Open the following file if you installed FastChat from PyPI:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /home/usergpu/.local/lib/python3.10/site-packages/fastchat/serve/gradio_web_server.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /home/usergpu/FastChat/fastchat/serve/gradio_web_server.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scroll down to the end. Find this line:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;auth=auth,&lt;/pre&gt;
&lt;p&gt;Change it by setting whatever username and password you want:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;auth=(&quot;username&quot;, &quot;password&quot;),&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file and exit using the &lt;b translate=&quot;no&quot;&gt;Ctrl + X&lt;/b&gt; shortcut. Finally, start the web interface:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo systemctl start vicuna-webserver.service&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open &lt;b translate=&quot;no&quot;&gt;http://[IP_ADDRESS]:7860&lt;/b&gt; in your browser and enjoy FastChat with Vicuna:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/966/original/sh_your_own_vicuna_in_linux_2.png?1714043825&quot; alt=&quot;Sample Vicuna poem&quot;&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/964/original/il_your_own_vicuna_in_linux.jpg?1714043750"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 09:25:01 +0100</pubDate>
      <guid isPermaLink="false">576</guid>
      <dc:date>2025-01-20 09:25:01 +0100</dc:date>
    </item>
    <item>
      <title>Your own LLaMa 2 in Linux</title>
      <link>https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux</link>
      <description>&lt;h2&gt;Step 1. Prepare operating system&lt;/h2&gt;
&lt;h3&gt;Update cache and packages&lt;/h3&gt;
&lt;p&gt;Let’s update the package cache and upgrade your operating system before you start setting up LLaMa 2. Please note that for this guide, we are using Ubuntu 22.04 LTS as the operating system:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we need to add Python Installer Packages (PIP), if it isn’t already present in the system:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install python3-pip&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install NVIDIA® drivers&lt;/h3&gt;
&lt;p&gt;You can use the automated utility that is included in Ubuntu distributions by default:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, you can install NVIDIA® drivers manually using &lt;a href=&quot;https://www.leadergpu.com/articles/499-install-nvidia-drivers-in-linux&quot;&gt;our step-by-step guide&lt;/a&gt;. Don’t forget to reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 2. Get models from MetaAI&lt;/h2&gt;
&lt;h3&gt;Official request&lt;/h3&gt;
&lt;p&gt;Open the following address in your browser: &lt;a href=&quot;https://ai.meta.com/resources/models-and-libraries/llama-downloads/&quot;&gt;https://ai.meta.com/resources/models-and-libraries/llama-downloads/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Fill in all necessary fields, read the user agreement, and click on the &lt;b translate=&quot;no&quot;&gt;Agree and Continue&lt;/b&gt; button. After a few minutes (or hours, or days), you’ll receive a special download URL, which grants you permission to download the models for a 24-hour period.&lt;/p&gt;
&lt;h3&gt;Clone the repository&lt;/h3&gt;
&lt;p&gt;Before downloading, please check the available storage:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;df -h&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;Filesystem      Size  Used Avail Use% Mounted on
tmpfs            38G  3.3M   38G   1% /run
/dev/sda2        99G   24G   70G  26% /
tmpfs           189G     0  189G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1    1.8T   26G  1.7T   2% /mnt/fastdisk
tmpfs            38G  8.0K   38G   1% /run/user/1000&lt;/pre&gt;
&lt;p&gt;If you have unmounted local disks, please follow the instructions in &lt;a href=&quot;https://www.leadergpu.com/articles/492-disk-partitioning-in-linux&quot;&gt;Disk partitioning in Linux&lt;/a&gt;. This is important because the downloaded models can be very large, and you need to plan their storage location in advance. In this example, we have a local SSD mounted in the /mnt/fastdisk directory. Let’s open it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a copy of the original LLaMa repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/facebookresearch/llama&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you encounter a permission error, simply grant permissions to the usergpu user:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chown -R usergpu:usergpu /mnt/fastdisk/&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Download via script&lt;/h3&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd llama&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./download.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste the URL provided by MetaAI and select all the necessary models. We recommend downloading all available models to avoid having to request permission again. However, if you only need a specific model, download just that one.&lt;/p&gt;
&lt;h3&gt;Fast test via example app&lt;/h3&gt;
&lt;p&gt;To begin, we can check for any missing components. If any libraries or applications are missing, the package manager will automatically install them:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -e .&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to add new binaries to PATH:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;export PATH=/home/usergpu/.local/bin:$PATH&lt;/code&gt;&lt;/pre&gt;
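&lt;p&gt;This export only lasts for the current shell session. To make it permanent, you can append the same line to your shell profile:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;echo &#39;export PATH=/home/usergpu/.local/bin:$PATH&#39; &gt;&gt; ~/.bashrc&lt;/code&gt;&lt;/pre&gt;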
&lt;p&gt;Run the demo example:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;torchrun --nproc_per_node 1 /mnt/fastdisk/llama/example_chat_completion.py --ckpt_dir /mnt/fastdisk/llama-2-7b-chat/ --tokenizer_path /mnt/fastdisk/llama/tokenizer.model --max_seq_len 512 --max_batch_size 6&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The application will create a compute process on the first GPU and simulate a simple dialog with typical requests, generating answers using LLaMa 2.&lt;/p&gt;
&lt;h2&gt;Step 3. Get llama.cpp&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ggerganov/llama.cpp/tree/master&quot;&gt;LLaMa C++&lt;/a&gt; is a project created by Bulgarian physicist and software developer Georgi Gerganov. It has many useful utilities that make working with this neural network model easier. All parts of llama.cpp are open source software and are distributed under the &lt;a href=&quot;https://github.com/ggerganov/llama.cpp/blob/master/LICENSE&quot;&gt;MIT license&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Clone the repository&lt;/h3&gt;
&lt;p&gt;Open the working directory on the SSD:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clone the project’s repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/ggerganov/llama.cpp.git&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Compile apps&lt;/h3&gt;
&lt;p&gt;Open the cloned directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd llama.cpp&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the compilation process with the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;make&lt;/code&gt;&lt;/pre&gt;
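&lt;p&gt;By default, this builds CPU-only binaries. At the time of writing, llama.cpp also offered an optional cuBLAS build for GPU offloading; the flag below comes from the project’s Makefile of that period, and build options change frequently, so check the current Readme before using it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;make clean &amp;&amp; make LLAMA_CUBLAS=1&lt;/code&gt;&lt;/pre&gt;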
&lt;h2&gt;Step 4. Get text-generation-webui&lt;/h2&gt;
&lt;h3&gt;Clone the repository&lt;/h3&gt;
&lt;p&gt;Open the working directory on the SSD:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clone the project’s repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/oobabooga/text-generation-webui.git&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install requirements&lt;/h3&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd text-generation-webui&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check and install all missing components:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5. Convert PTH to GGUF&lt;/h2&gt;
&lt;h3&gt;Common formats&lt;/h3&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;PTH (Python TorcH)&lt;/b&gt; — A consolidated format. Essentially, it’s a standard ZIP-archive with a serialized PyTorch state dictionary. However, this format has faster alternatives such as GGML and GGUF.&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;GGML (Georgi Gerganov’s Machine Learning)&lt;/b&gt; — This is a file format created by Georgi Gerganov, the author of llama.cpp. It is based on a library of the same name, written in C++, which has significantly increased the performance of large language models. It has now been replaced with the modern GGUF format.&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;GGUF (Georgi Gerganov’s Unified Format)&lt;/b&gt; — A widely used file format for LLMs, supported by various applications. It offers enhanced flexibility, scalability, and compatibility for most use cases.&lt;/p&gt;
&lt;h3&gt;llama.cpp convert.py script&lt;/h3&gt;
&lt;p&gt;Edit the parameters of the model before converting:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;nano /mnt/fastdisk/llama-2-7b-chat/params.json&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Correct &lt;b translate=&quot;no&quot;&gt;&quot;vocab_size&quot;: -1&lt;/b&gt; to &lt;b translate=&quot;no&quot;&gt;&quot;vocab_size&quot;: 32000&lt;/b&gt;. Save the file and exit. Then, open the llama.cpp directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk/llama.cpp&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the script, which will convert the model to GGUF format:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;python3 convert.py /mnt/fastdisk/llama-2-7b-chat/ --vocab-dir /mnt/fastdisk/llama&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If all the previous steps are correct, you’ll receive a message like this:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Wrote /mnt/fastdisk/llama-2-7b-chat/ggml-model-f16.gguf&lt;/pre&gt;
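&lt;p&gt;The resulting f16 file is large. If needed, the &lt;b translate=&quot;no&quot;&gt;quantize&lt;/b&gt; utility compiled earlier in Step 3 can shrink it at the cost of some accuracy; for example, a 4-bit variant (the quantization type name is taken from the llama.cpp of that period, so verify it against your build):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./quantize /mnt/fastdisk/llama-2-7b-chat/ggml-model-f16.gguf /mnt/fastdisk/llama-2-7b-chat/ggml-model-q4_k_m.gguf q4_K_M&lt;/code&gt;&lt;/pre&gt;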
&lt;h2&gt;Step 6. WebUI&lt;/h2&gt;
&lt;h3&gt;How to start WebUI&lt;/h3&gt;
&lt;p&gt;Open the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk/text-generation-webui/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the start script with some useful parameters:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--model-dir&lt;/b&gt; indicates the correct path to the models&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--share&lt;/b&gt; creates a temporary public link (if you don’t want to forward a port through SSH)&lt;/li&gt;
  &lt;li&gt;&lt;b translate=&quot;no&quot;&gt;--gradio-auth&lt;/b&gt; adds authorization with a login and password (replace user:password with your own)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./start_linux.sh --model-dir /mnt/fastdisk/llama-2-7b-chat/ --share --gradio-auth user:password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After successful launch, you’ll receive a local and temporary share link for access:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://e9a61c21593a7b251f.gradio.live
&lt;/pre&gt;
&lt;p&gt;This share link expires in 72 hours.&lt;/p&gt;
&lt;h3&gt;Load the model&lt;/h3&gt;
&lt;p&gt;Authorize in the WebUI using the selected username and password and follow these 5 simple steps:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Navigate to the &lt;b translate=&quot;no&quot;&gt;Model&lt;/b&gt; tab.&lt;/li&gt;
  &lt;li&gt;Select &lt;b translate=&quot;no&quot;&gt;ggml-model-f16.gguf&lt;/b&gt; from the drop-down menu.&lt;/li&gt;
  &lt;li&gt;Choose how many layers you want to compute on the GPU (&lt;b translate=&quot;no&quot;&gt;n-gpu-layers&lt;/b&gt;).&lt;/li&gt;
  &lt;li&gt;Choose how many threads you want to start (&lt;b translate=&quot;no&quot;&gt;threads&lt;/b&gt;).
  &lt;/li&gt;
  &lt;li&gt;Click on the &lt;b translate=&quot;no&quot;&gt;Load&lt;/b&gt; button.&lt;/li&gt;
&lt;/ol&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/967/original/sh_your_own_llama_2_in_linux_1.png?1714136367&quot; alt=&quot;Loading the model&quot;&gt;
&lt;h3&gt;Start the dialog&lt;/h3&gt;
&lt;p&gt;Change the tab to &lt;b translate=&quot;no&quot;&gt;Chat&lt;/b&gt;, type your prompt, and click &lt;b translate=&quot;no&quot;&gt;Generate&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/968/original/sh_your_own_llama_2_in_linux_2.png?1714136407&quot; alt=&quot;Start the dialog&quot;&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face&quot;&gt;Llama 3 using Hugging Face&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/579-qwen-2-vs-llama-3&quot;&gt;Qwen 2 vs Llama 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/001/025/original/il_your_own_llama_2_in_Linux.png?1721999193"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 09:13:25 +0100</pubDate>
      <guid isPermaLink="false">574</guid>
      <dc:date>2025-01-20 09:13:25 +0100</dc:date>
    </item>
    <item>
      <title>Llama 3 using Hugging Face</title>
      <link>https://www.leadergpu.com/catalog/573-llama-3-using-hugging-face</link>
<description>&lt;p&gt;On April 18, 2024, the newest major language model from MetaAI, Llama 3, was released. Two versions were presented to users: 8B and 70B. Both were trained on over 15T tokens: the 8B version on data available up to March 2023, and the 70B version on data available up to December 2023.&lt;/p&gt;

&lt;h2&gt;Step 1. Prepare operating system&lt;/h2&gt;

&lt;h3&gt;Update cache and packages&lt;/h3&gt;

&lt;p&gt;Let’s update the package cache and upgrade your operating system before you start setting up LLaMa 3. Please note that for this guide, we are using Ubuntu 22.04 LTS as the operating system:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;amp;&amp;amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Also, we need to add Python Installer Packages (PIP), if it isn’t already present in the system:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install python3-pip&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Install NVIDIA® drivers&lt;/h3&gt;

&lt;p&gt;You can use the automated utility that is included in Ubuntu distributions by default:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo ubuntu-drivers autoinstall&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Alternatively, you can install NVIDIA® drivers manually. Don’t forget to reboot the server:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Step 2. Get the model&lt;/h2&gt;

&lt;p&gt;Log in to &lt;a href=&quot;https://huggingface.co/&quot;&gt;Hugging Face&lt;/a&gt; using your username and password. Go to the page corresponding to the desired LLM version: &lt;a href=&quot;https://huggingface.co/meta-llama/Meta-Llama-3-8B&quot;&gt;Meta-Llama-3-8B&lt;/a&gt; or &lt;a href=&quot;https://huggingface.co/meta-llama/Meta-Llama-3-70B&quot;&gt;Meta-Llama-3-70B&lt;/a&gt;. At the time of publication of this article, access to the model is granted on an individual basis. Fill in a short form and click the &lt;b translate=&quot;no&quot;&gt;Submit&lt;/b&gt; button:&lt;/p&gt;

&lt;h3&gt;Request access from HF&lt;/h3&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/905/original/sh_llama3_quick_start_1.png?1713533099&quot; alt=&quot;Fill the form&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;Then you will receive a message that your request has been submitted:&lt;/p&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/906/original/sh_llama3_quick_start_2.png?1713533131&quot; alt=&quot;Form submitted&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;You will be granted access within 30-40 minutes and notified about it via email.&lt;/p&gt;

&lt;h3&gt;Add SSH key to HF&lt;/h3&gt;

&lt;p&gt;Generate and add an SSH-key that you can use in Hugging Face:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/.ssh &amp;amp;&amp;amp; ssh-keygen&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the keypair is generated, you can display the public key in the terminal emulator:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cat id_rsa.pub&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Copy all information starting from &lt;b translate=&quot;no&quot;&gt;ssh-rsa&lt;/b&gt; and ending with &lt;b translate=&quot;no&quot;&gt;usergpu@gpuserver&lt;/b&gt; as shown in the following screenshot:&lt;/p&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/907/original/sh_llama3_quick_start_3.png?1713533169&quot; alt=&quot;Copy RSA key&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;Open Hugging Face &lt;a href=&quot;https://huggingface.co/settings/profile&quot;&gt;Profile settings&lt;/a&gt;. Then choose &lt;b translate=&quot;no&quot;&gt;SSH and GPG Keys&lt;/b&gt; and click on the Add SSH Key button:&lt;/p&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/908/original/sh_llama3_quick_start_4.png?1713533229&quot; alt=&quot;Add SSH key&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;Fill in the &lt;b translate=&quot;no&quot;&gt;Key name&lt;/b&gt; and paste the copied &lt;b translate=&quot;no&quot;&gt;SSH Public key&lt;/b&gt; from the terminal. Save the key by pressing &lt;b translate=&quot;no&quot;&gt;Add key&lt;/b&gt;:&lt;/p&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/909/original/sh_llama3_quick_start_5.png?1713533267&quot; alt=&quot;Paste the key&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;Now, your HF-account is linked with the public SSH-key. The second part (private key) is stored on the server. The next step is to install a specific Git LFS (Large File Storage) extension, which is used for downloading large files such as neural network models. Open your home directory:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Download and run the shell script. This script installs a new third-party repository with git-lfs:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, you can install it using the standard package manager:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get install git-lfs&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s configure git to use our HF nickname:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.name &quot;John&quot;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And link it to the email address of the HF account:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.email &quot;john.doe@example.com&quot;&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Download the model&lt;/h3&gt;

&lt;p&gt;Open the target directory:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And start downloading the repository. For this example, we chose the 8B version:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:meta-llama/Meta-Llama-3-8B&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This process takes up to 5 minutes. You can monitor it by executing the following command in another SSH-console:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;watch -n 0.5 df -h&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, you’ll see how the free disk space on the mounted disk decreases, ensuring that the download is progressing and the data is being saved. The status will refresh every half-second. To stop viewing manually, press the &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; shortcut.&lt;/p&gt;

&lt;p&gt;Alternatively, you can install &lt;a href=&quot;https://github.com/aristocratos/btop&quot;&gt;btop&lt;/a&gt; and monitor the process using this utility:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install btop &amp;amp;&amp;amp; btop&lt;/code&gt;&lt;/pre&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/910/original/sh_llama3_quick_start_6.png?1713533300&quot; alt=&quot;Btop view&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;To quit the btop utility, press the &lt;b translate=&quot;no&quot;&gt;Esc&lt;/b&gt; key and select &lt;b translate=&quot;no&quot;&gt;Quit&lt;/b&gt;.&lt;/p&gt;

&lt;h2&gt;Step 3. Run the model&lt;/h2&gt;

&lt;p&gt;Open the directory:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Download the Llama 3 repository:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/meta-llama/llama3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Change the directory:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd llama3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run the example:&lt;/p&gt;


&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir /mnt/fastdisk/Meta-Llama-3-8B/original \
--tokenizer_path /mnt/fastdisk/Meta-Llama-3-8B/original/tokenizer.model \
--max_seq_len 128 \
--max_batch_size 4&lt;/code&gt;&lt;/pre&gt;

&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/911/original/sh_llama3_quick_start_7.png?1713533328&quot; alt=&quot;Llama3 example result&quot; unselectable=&quot;on&quot;&gt;
&lt;p&gt;Now you can use Llama 3 in your applications.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/579-qwen-2-vs-llama-3&quot;&gt;Qwen 2 vs Llama 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/904/original/il_llama3_quick_start.jpg?1713533056"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 20 Jan 2025 09:05:10 +0100</pubDate>
      <guid isPermaLink="false">573</guid>
      <dc:date>2025-01-20 09:05:10 +0100</dc:date>
    </item>
    <item>
      <title>StarCoder: your local coding assistant</title>
      <link>https://www.leadergpu.com/catalog/571-starcoder-your-local-coding-assistant</link>
<description>&lt;p&gt;GitHub Copilot from Microsoft has brought about a revolution in the field of software development. This AI assistant greatly helps developers with various coding tasks, making their lives easier. However, one drawback is that it isn’t a standalone application but rather a cloud-based service. This means that users must agree to the terms and conditions of the service and pay for a subscription.&lt;/p&gt;
&lt;p&gt;Fortunately, the world of open-source software provides us with numerous alternatives. As of the time of writing this article, the most notable alternative to Copilot is StarCoder, developed by the BigCode project. StarCoder is a large neural network model with 15.5B parameters, trained on more than 80 programming languages.&lt;/p&gt;
&lt;p&gt;This model is distributed on Hugging Face (HF) as a &lt;a href=&quot;https://huggingface.co/docs/hub/models-gated&quot; target=&quot;_blank&quot;&gt;gated model&lt;/a&gt; under the &lt;a href=&quot;https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement&quot; target=&quot;_blank&quot;&gt;BigCode OpenRAIL-M v1 license agreement&lt;/a&gt;. You can download and use this model for free, but you need an HF account with a linked SSH key. There are a few additional steps to take before you can download it.&lt;/p&gt;
&lt;h2&gt;Add SSH key to HF&lt;/h2&gt;
&lt;p&gt;Before starting, you need to set up port forwarding (remote port 7860 to 127.0.0.1:7860) in your SSH-client. You can find additional information in the following articles:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/528-stable-video-diffusion&quot;&gt;Stable Video Diffusion&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/articles/488-connect-to-a-linux-server&quot;&gt;Connect to a Linux server&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Update the package cache repository and installed packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s install Python’s system package manager (PIP):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install python3-pip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate and add an SSH-key that you can use in Hugging Face:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/.ssh &amp;&amp; ssh-keygen&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the keypair is generated, you can display the public key in the terminal emulator:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cat id_rsa.pub&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy all information starting from &lt;b translate=&quot;no&quot;&gt;ssh-rsa&lt;/b&gt; and ending with &lt;b translate=&quot;no&quot;&gt;usergpu@gpuserver&lt;/b&gt; as shown in the following screenshot:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/907/original/sh_llama3_quick_start_3.png?1713533169&quot; alt=&quot;Copy RSA key&quot;&gt;
&lt;p&gt;Open a web browser, type &lt;a href=&quot;https://huggingface.co/&quot; target=&quot;_blank&quot;&gt;https://huggingface.co/&lt;/a&gt; into the address bar and press &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt;. Log into your HF-account and open &lt;a href=&quot;https://huggingface.co/settings/profile&quot; target=&quot;_blank&quot;&gt;Profile settings&lt;/a&gt;. Then choose &lt;b translate=&quot;no&quot;&gt;SSH and GPG Keys&lt;/b&gt; and click on the &lt;b translate=&quot;no&quot;&gt;Add SSH Key&lt;/b&gt; button:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/908/original/sh_llama3_quick_start_4.png?1713533229&quot; alt=&quot;Add SSH key&quot;&gt;
&lt;p&gt;Fill in the &lt;b translate=&quot;no&quot;&gt;Key name&lt;/b&gt; and paste the copied &lt;b translate=&quot;no&quot;&gt;SSH Public key&lt;/b&gt; from the terminal. Save the key by pressing &lt;b translate=&quot;no&quot;&gt;Add key&lt;/b&gt;:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/909/original/sh_llama3_quick_start_5.png?1713533267&quot; alt=&quot;Paste the key&quot;&gt;
&lt;p&gt;Now, your HF account is linked with the public SSH key, while the private key remains stored on the server. The next step is to install Git LFS (Large File Storage), a git extension used for downloading large files such as neural network models. Open your home directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd ~/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Download and run the shell script. It adds a third-party repository that provides git-lfs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can install it using the standard package manager:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt-get install git-lfs&lt;/code&gt;&lt;/pre&gt;
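&lt;p&gt;It’s also worth initializing the LFS hooks once for your user, so that git fetches large files automatically during cloning:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git lfs install&lt;/code&gt;&lt;/pre&gt;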
&lt;p&gt;Let’s configure git to use our HF nickname:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.name &quot;John&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the email address linked to the HF account:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git config --global user.email &quot;john.doe@example.com&quot;&lt;/code&gt;&lt;/pre&gt;
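&lt;p&gt;Before cloning, you can optionally check that Hugging Face accepts your key; if everything is linked correctly, the service should greet you with your username:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ssh -T git@hf.co&lt;/code&gt;&lt;/pre&gt;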
&lt;h2&gt;Download the model&lt;/h2&gt;
&lt;p&gt;&lt;font color=&quot;red&quot;&gt;&lt;i&gt;Please note that StarCoder in binary format may take up a significant amount of disk space (&gt;75 GB). Don’t forget to refer to &lt;a href=&quot;https://www.leadergpu.com/articles/492-disk-partitioning-in-linux&quot;&gt;this article&lt;/a&gt; to ensure you’re using the correct mounted partition.&lt;/i&gt;&lt;/font&gt;&lt;/p&gt;
&lt;p&gt;Everything is ready for the model download. Open the target directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And start downloading the repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:bigcode/starcoder&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This process takes up to 15 minutes, so please be patient. You can monitor the progress by executing the following command in another SSH session:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;watch -n 0.5 df -h&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, you’ll see the free disk space on the mounted disk decrease, confirming that the download is progressing and the data is being saved. The status refreshes every half-second. To stop watching, press the &lt;b translate=&quot;no&quot;&gt;Ctrl + C&lt;/b&gt; shortcut.&lt;/p&gt;
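&lt;p&gt;Once the clone completes, you can check the final size of the downloaded repository (the path assumes the target directory used above):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;du -sh /mnt/fastdisk/starcoder&lt;/code&gt;&lt;/pre&gt;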
&lt;h2&gt;Run the full model with WebUI&lt;/h2&gt;
&lt;p&gt;Clone the project’s repository:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/oobabooga/text-generation-webui.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the downloaded directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd text-generation-webui&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the start script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./start_linux.sh --model-dir /mnt/fastdisk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script will check for the presence of the necessary dependencies on the server. Any missing dependencies will be installed automatically. When the application starts, open your web browser and type the following address:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;http://127.0.0.1:7860&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the &lt;b translate=&quot;no&quot;&gt;Model&lt;/b&gt; tab and select the downloaded model &lt;b translate=&quot;no&quot;&gt;starcoder&lt;/b&gt; from the drop-down list. Click on the &lt;b translate=&quot;no&quot;&gt;Model loader&lt;/b&gt; list and choose &lt;b translate=&quot;no&quot;&gt;Transformers&lt;/b&gt;. Set the GPU memory slider to its maximum for each installed GPU. This is very important: leaving it at 0 restricts VRAM usage and prevents the model from loading correctly. Also set the maximum RAM usage. Now, click the &lt;b translate=&quot;no&quot;&gt;Load&lt;/b&gt; button and wait for the loading process to complete:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/969/original/sh_starcoder_your_local_coding_assistant_1.png?1714386546&quot; alt=&quot;Load StarCoder model&quot;&gt;
&lt;p&gt;Switch to the &lt;b translate=&quot;no&quot;&gt;Chat&lt;/b&gt; tab and test a conversation with the model. Please note that StarCoder isn’t intended for dialogue like ChatGPT. However, it can be useful for checking code for errors and suggesting solutions.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/970/original/sh_starcoder_your_local_coding_assistant_2.png?1714386599&quot; alt=&quot;Run the StarCoder&quot;&gt;
&lt;p&gt;If you want a full-fledged dialogue model, you could try two other models: &lt;a href=&quot;https://huggingface.co/HuggingFaceH4/starchat-alpha&quot; target=&quot;_blank&quot;&gt;starchat-alpha&lt;/a&gt; and &lt;a href=&quot;https://huggingface.co/HuggingFaceH4/starchat-beta&quot; target=&quot;_blank&quot;&gt;starchat-beta&lt;/a&gt;. These models were fine-tuned to conduct a dialogue just like ChatGPT does. The following commands download these models:&lt;/p&gt;
&lt;p&gt;For starchat-alpha:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:HuggingFaceH4/starchat-alpha&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For starchat-beta:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone git@hf.co:HuggingFaceH4/starchat-beta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The loading procedure is the same as described above. There is also a &lt;a href=&quot;https://github.com/bigcode-project/starcoder.cpp/tree/main&quot; target=&quot;_blank&quot;&gt;C++ implementation&lt;/a&gt; of StarCoder, which is effective for CPU inference.&lt;/p&gt;
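&lt;p&gt;As a rough sketch, getting starcoder.cpp usually starts with cloning and building it; the exact build and model-conversion steps may differ, so consult the repository’s README:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;git clone https://github.com/bigcode-project/starcoder.cpp
cd starcoder.cpp
make&lt;/code&gt;&lt;/pre&gt;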
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/574-your-own-llama-2-in-linux&quot;&gt;Your own LLaMa 2 in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/576-your-own-vicuna-in-linux&quot;&gt;Your own Vicuna in Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/578-your-own-qwen-using-hf&quot;&gt;Your own Qwen using HF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/971/original/il_starcoder_your_local_coding_assistant.jpg?1714386646"
        length="0"
        type="image/jpeg"/>
      <pubDate>Fri, 17 Jan 2025 14:52:58 +0100</pubDate>
      <guid isPermaLink="false">571</guid>
      <dc:date>2025-01-17 14:52:58 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion Models: customization and options</title>
      <link>https://www.leadergpu.com/catalog/566-stable-diffusion-models-customization-and-options</link>
      <description>&lt;p&gt;Tuning is an excellent way to enhance every car or gadget. Generative neural networks can be tuned as well. Today, we don&#39;t want to delve deeply into the structure of Stable Diffusion, but we aim to achieve better results than a standard setup.&lt;/p&gt;
&lt;p&gt;There are two easy ways to do this: installing custom models and utilizing standard optimization options. In this article, we’ll learn how to install new models into Stable Diffusion and which options allow us to use hardware more effectively.&lt;/p&gt;
&lt;p&gt;If you want to share funny pictures of cute cats or great looking food, you usually post them on Instagram. If you develop applications and want to make the code available to everyone, you post it on GitHub. But if you train a graphical AI-model and want to share it, you should pay attention to &lt;a href=&quot;https://civitai.com/&quot;&gt;CivitAI&lt;/a&gt;. This is a huge platform to share knowledge and results with community members.&lt;/p&gt;
&lt;p&gt;Before you start downloading, you need to change the working directory. All AI models in Stable Diffusion are placed in the &quot;models&quot; directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;cd stable-diffusion-webui/models/Stable-diffusion&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&#39;s check which models are provided by default:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ls -a&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&#39;Put Stable Diffusion checkpoints here.txt&#39;
v1-5-pruned-emaonly.safetensors&lt;/pre&gt;
&lt;p&gt;There is only one model with the name “v1-5-pruned-emaonly” and the extension “safetensors”. This model is a good starting point, but we have five more interesting models. Let’s download and compare them with the standard model.&lt;/p&gt;
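&lt;p&gt;For models hosted on Hugging Face, the checkpoint can usually be fetched straight into this directory with wget; the file name below is an assumption, so check the repository’s file list for the actual checkpoint:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://huggingface.co/XpucT/Deliberate/resolve/main/Deliberate_v2.safetensors&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Models from CivitAI can be downloaded the same way using the download link on the model’s page.&lt;/p&gt;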
&lt;h2&gt;Stable Diffusion prompts&lt;/h2&gt;
&lt;p&gt;To visually show the difference between them, we came up with a simple prompt:&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;princess, magic, fairy tales, portrait, 85mm, colorful&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;For many models, accurately representing geometry and facial features can be a significant challenge. To address this, add negative prompts to ensure images are generated without these characteristics:&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;poorly rendered face, poorly drawn face, poor facial details, poorly drawn hands, poorly rendered hands, low resolution, bad composition, mutated body parts, blurry image, disfigured, oversaturated, bad anatomy, deformed body features&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Set the maximum value of sampling steps (150) to get more details in the result.&lt;/p&gt;
&lt;h3&gt;Standard model&lt;/h3&gt;
&lt;p&gt;The standard model performs well in such tasks. However, some details are not quite accurate. For example, there is a problem with the eyes: they are clearly out of proportion:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/816/original/sh_stable_diffusion_models_customization_and_options_1.png?1712233278&quot; alt=&quot;Stable Diffusion Models standard&quot;&gt;
&lt;p&gt;If you look at the diadem, it is also crooked and asymmetrical. The rest of the details are well-executed and correspond to the given prompts. The background is blurry because we set the prompt “85mm”. This is a very commonly used focal length for portraits in professional photography.&lt;/p&gt;
&lt;h3&gt;Realistic Vision&lt;/h3&gt;
&lt;p&gt;This model is great for portraits. The image appears as if taken with a quality lens with the specified focal length. The proportions of the face and body are accurate, the dress fits perfectly, and the diadem on the head looks aesthetically pleasing:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/817/original/sh_stable_diffusion_models_customization_and_options_2.png?1712233379&quot; alt=&quot;Stable Diffusion Models Realistic Vision&quot;&gt;
&lt;p&gt;By the way, the author recommends using the following template for negative prompts:&lt;/p&gt;
&lt;p&gt;&lt;b translate=&quot;no&quot;&gt;deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;But even with our quite simple prompts, the result is excellent.&lt;/p&gt;
&lt;p&gt;Download the model here: &lt;a href=&quot;https://civitai.com/models/4201/realistic-vision-v20&quot;&gt;Realistic Vision&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Deliberate&lt;/h3&gt;
&lt;p&gt;Another amazing model for such purposes. The details are also well worked out here, but be careful and monitor the number of fingers. This is a very common problem with neural networks: they can often draw extra fingers or even entire limbs.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/818/original/sh_stable_diffusion_models_customization_and_options_3.png?1712233625&quot; alt=&quot;Stable Diffusion Models Deliberate&quot;&gt;
&lt;p&gt;Leading lines are one of cinematography’s favorite techniques, so this model also chose to place the subject against the backdrop of a forest path.&lt;/p&gt;
&lt;p&gt;Download the model here: &lt;a href=&quot;https://huggingface.co/XpucT/Deliberate&quot;&gt;Deliberate&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;OpenJourney&lt;/h3&gt;
&lt;p&gt;Among generative neural networks, Midjourney (MJ) has received special attention. MJ was a pioneer in this field and is often held up as an example to others. The images it creates have a unique style. OpenJourney is inspired by the MJ style and is a suitably tuned Stable Diffusion.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/819/original/sh_stable_diffusion_models_customization_and_options_4.png?1712233730&quot; alt=&quot;Stable Diffusion Models OpenJourney&quot;&gt;
&lt;p&gt;The generated images look like cartoons: vibrant and bright. For better results, add the &lt;b translate=&quot;no&quot;&gt;mdjrny-v4&lt;/b&gt; style prompt.&lt;/p&gt;
&lt;p&gt;Download the model here: &lt;a href=&quot;https://huggingface.co/prompthero/openjourney&quot;&gt;OpenJourney&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Anything&lt;/h3&gt;
&lt;p&gt;This model creates images akin to those of a professional manga artist (an artist who draws Japanese comics). Thus, we got an anime-style princess.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/820/original/sh_stable_diffusion_models_customization_and_options_5.png?1712233804&quot; alt=&quot;Stable Diffusion Models Anything&quot;&gt;
&lt;p&gt;This model was trained on images with a resolution of 768x768, so you may set this resolution to get better results than the standard 512x512.&lt;/p&gt;
&lt;p&gt;Download the model here: &lt;a href=&quot;https://civitai.com/models/66/anything-v3&quot;&gt;Anything&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Corporate Memphis&lt;/h3&gt;
&lt;p&gt;This style of images gained wild popularity in the early 2020s and was widely used as a corporate style in different high-tech companies. Despite criticism, it is often found in presentations and websites.&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/821/original/sh_stable_diffusion_models_customization_and_options_6.png?1712233943&quot; alt=&quot;Stable Diffusion Models Corporate Memphis&quot;&gt;
&lt;p&gt;The princess turned out to be minimalistic, but quite pretty. Particularly amusing were the details that the model placed on the background.&lt;/p&gt;
&lt;p&gt;Download the model here: &lt;a href=&quot;https://huggingface.co/jinofcoolnes/corporate_memphis&quot;&gt;Corporate Memphis&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Stable Diffusion Options&lt;/h2&gt;
&lt;p&gt;Stable Diffusion consumes a lot of resources, so many launch options have been developed for it. The most popular of them is &lt;b translate=&quot;no&quot;&gt;--xformers&lt;/b&gt;. This option enables two optimization mechanisms: the first reduces memory consumption, and the second increases speed.&lt;/p&gt;
&lt;p&gt;If you try to add --xformers without additional steps, you will get an error saying that the packages (&lt;a href=&quot;https://pypi.org/project/torch/&quot;&gt;torch&lt;/a&gt; and &lt;a href=&quot;https://pypi.org/project/torchvision/&quot;&gt;torchvision&lt;/a&gt;) are compiled for different versions of CUDA®. To fix this, we need to enter the Python virtual environment (venv) used by Stable Diffusion. After that, install the packages built for the desired version of CUDA® (11.8).&lt;/p&gt;
&lt;p&gt;First, update the apt package cache and install the package installer for Python (pip), for example:&lt;/p&gt;
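&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y install python3-pip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to activate the Python venv with the &lt;b translate=&quot;no&quot;&gt;activate&lt;/b&gt; script:&lt;/p&gt;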
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;source stable-diffusion-webui/venv/bin/activate&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, the command line prompt changes to &lt;b translate=&quot;no&quot;&gt;(venv) username@hostname:~$&lt;/b&gt;. Let’s install the torch and torchvision packages built for CUDA® 11.8:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 --index-url https://download.pytorch.org/whl/cu118&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This process may take several minutes because the packages are quite large; you’ll have just enough time to pour yourself some coffee. Finally, deactivate the virtual environment and start Stable Diffusion with the &lt;b translate=&quot;no&quot;&gt;--xformers&lt;/b&gt; option (replace &lt;b translate=&quot;no&quot;&gt;[user]&lt;/b&gt; and &lt;b translate=&quot;no&quot;&gt;[password]&lt;/b&gt; with your own values):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-venv&quot;&gt;deactivate&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh --xformers --listen --gradio-auth [user]:[password]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An alternative to &lt;b translate=&quot;no&quot;&gt;--xformers&lt;/b&gt; is &lt;b translate=&quot;no&quot;&gt;--opt-sdp-no-mem-attention&lt;/b&gt;: it consumes more memory but works a bit faster, and it requires no additional steps.&lt;/p&gt;
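&lt;p&gt;For example, launching with this option instead (same placeholders as above):&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh --opt-sdp-no-mem-attention --listen --gradio-auth [user]:[password]&lt;/code&gt;&lt;/pre&gt;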
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Today, we examined the capabilities of Stable Diffusion when combined with custom models and optimization options. Remember, by increasing or decreasing the number of sampling steps, you can adjust the level of detail in the final image.&lt;/p&gt;
&lt;p&gt;Of course, this is only a small part of what you can do with such a generative neural network. So &lt;a href=&quot;https://www.leadergpu.com/#chose-best&quot;&gt;order a GPU-server right now&lt;/a&gt; and start experimenting. Many more discoveries and opportunities await you. High-speed and powerful video cards will help you save time and generate cool images.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/565-stable-diffusion-webui&quot;&gt;Stable Diffusion WebUI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/595-pytorch-for-linux&quot;&gt;PyTorch for Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/596-pytorch-for-windows&quot;&gt;PyTorch for Windows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/815/original/il_stable_diffusion_models_customization_and_options.png?1712233216"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 25 Nov 2024 13:30:16 +0100</pubDate>
      <guid isPermaLink="false">566</guid>
      <dc:date>2024-11-25 13:30:16 +0100</dc:date>
    </item>
    <item>
      <title>Stable Diffusion WebUI</title>
      <link>https://www.leadergpu.com/catalog/565-stable-diffusion-webui</link>
      <description>&lt;p&gt;Generative neural networks seem magical. They answer questions, create images, and even write code in various programming languages. The success of these networks has two components: pre-trained models and hardware accelerators. Certainly, it&#39;s possible to use CPU cores for this workload, but it would be like a snail race. Generating one small picture can take a significant amount of time - tens of minutes. Generating the same picture on a GPU would take hundreds of times less.&lt;/p&gt;
&lt;p&gt;The first secret lies in the number of cores. CPU cores are universal and can handle complex instructions. However, conventional server processors have a maximum of 64 cores. Even in multiprocessor systems, the number of cores rarely exceeds 256. GPU cores are simpler, but as a result, many more of them fit on the chip. For example, one NVIDIA® RTX™ 4090 has 16,384 cores.&lt;/p&gt;
&lt;p&gt;The second secret is that the workload can be divided into many simple tasks, which can be run in parallel threads on dedicated GPU cores. This trick significantly speeds up data processing. Today, we will see how it works and deploy a generative neural network &lt;a href=&quot;https://github.com/Stability-AI/stablediffusion&quot;&gt;Stable Diffusion Web UI&lt;/a&gt; on the &lt;a href=&quot;https://www.leadergpu.com/&quot;&gt;LeaderGPU&lt;/a&gt; infrastructure. Take, for example, a server with an NVIDIA® RTX™ 4090 which has 16,384 GPU cores. As an operating system, we selected the current LTS-release Ubuntu 22.04 and chose the “Install NVIDIA® drivers and CUDA® 11.8” option.&lt;/p&gt;
&lt;h2&gt;System preparation&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/811/original/sh_stable_diffusion_webui_1.png?1712212269&quot; alt=&quot;Stable Diffusion WebUI system prepare&quot;&gt;
&lt;p&gt;Before we start, let&#39;s consider storage. Stable Diffusion is a large system that can occupy up to 13G on your hard disk. The standard virtual disk in this LeaderGPU installation is 50G, and the operating system takes up about 25G of it. If we deploy Stable Diffusion without extending the home partition, we’ll exhaust all free space and encounter a &quot;No space left on device&quot; error. It&#39;s a good idea to extend our home directory.&lt;/p&gt;
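&lt;p&gt;You can check how much space is currently available in the home partition before starting:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;df -h /home&lt;/code&gt;&lt;/pre&gt;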
&lt;h3&gt;Extend home directory&lt;/h3&gt;
&lt;p&gt;First, we need to check all available disks.&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo fdisk -l&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;Disk /dev/sda: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: INTEL SSDSC2KB48
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: VIRTUAL-DISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 9D4C1F0C-D4A7-406E-AECB-BF57E4726437&lt;/pre&gt;
&lt;p&gt;Then we need to create a new Linux partition on our physical SSD-drive, /dev/sda:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo fdisk /dev/sda&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the following keys, one by one: &lt;b translate=&quot;no&quot;&gt;g → n → Enter → Enter → Enter → w&lt;/b&gt;. This will result in a new /dev/sda1 partition without a filesystem. Now, create an ext4 filesystem on it:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo mkfs.ext4 /dev/sda1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the process is finished, we move to the next step.&lt;/p&gt;
&lt;p&gt;&lt;font color=&quot;red&quot;&gt;&lt;i&gt;Warning! Please proceed with the following operation with great care. Any mistake made while modifying the fstab file can result in your server being unable to boot normally and may require a complete reset of the operating system.&lt;/i&gt;&lt;/font&gt;&lt;/p&gt;
&lt;p&gt;First, list the block devices and their UUIDs:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo blkid&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;/dev/sdb2: UUID=&quot;6b17e542-0934-4dba-99ca-a00bd260c247&quot; BLOCK_SIZE=&quot;4096&quot; TYPE=&quot;ext4&quot; PARTUUID=&quot;70030755-75d8-4339-a4e0-26a97f1d1c5d&quot;
/dev/loop1: TYPE=&quot;squashfs&quot;
/dev/sdb1: PARTUUID=&quot;63ff1714-bd29-4062-be04-21af32423c0a&quot;
/dev/loop4: TYPE=&quot;squashfs&quot;
/dev/loop0: TYPE=&quot;squashfs&quot;
/dev/sda1: UUID=&quot;fb2ba455-2b8d-4da0-8719-ce327d0026bc&quot; BLOCK_SIZE=&quot;4096&quot; TYPE=&quot;ext4&quot; PARTUUID=&quot;6e0108df-b000-5848-8328-b187daf37a4f&quot;
/dev/loop5: TYPE=&quot;squashfs&quot;
/dev/loop3: TYPE=&quot;squashfs&quot;&lt;/pre&gt;
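&lt;p&gt;If you prefer to extract just the UUID non-interactively, blkid can print that single field:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo blkid -s UUID -o value /dev/sda1&lt;/code&gt;&lt;/pre&gt;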
&lt;p&gt;Copy the &lt;b translate=&quot;no&quot;&gt;UUID&lt;/b&gt; of the &lt;b translate=&quot;no&quot;&gt;/dev/sda1&lt;/b&gt; partition (fb2ba455-2b8d-4da0-8719-ce327d0026bc in this example). Next, we will instruct the system to automatically mount this drive by its UUID at boot time:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo nano /etc/fstab&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add the following line before the &lt;b translate=&quot;no&quot;&gt;/swap.img&lt;/b&gt;… line, replacing the UUID placeholder with the value you copied:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash&quot;&gt;/dev/disk/by-uuid/&lt;PARTITION UUID&gt; /home/usergpu ext4 defaults 0 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;# /etc/fstab: static file system information.
#
# Use &#39;blkid&#39; to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# &lt;file system&gt; &lt;mount point&gt;   &lt;type&gt;  &lt;options&gt;       &lt;dump&gt;  &lt;pass&gt;
# / was on /dev/sdb2 during curtin installation
/dev/disk/by-uuid/6b17e542-0934-4dba-99ca-a00bd260c247 / ext4 defaults,_netdev 0 1
/dev/disk/by-uuid/fb2ba455-2b8d-4da0-8719-ce327d0026bc /home/usergpu ext4 defaults 0 0
/swap.img       none    swap    sw      0       0&lt;/pre&gt;
&lt;p&gt;Exit with the &lt;b translate=&quot;no&quot;&gt;Ctrl + X&lt;/b&gt; keyboard shortcut, press &lt;b translate=&quot;no&quot;&gt;Y&lt;/b&gt; to confirm saving the file, then press &lt;b translate=&quot;no&quot;&gt;Enter&lt;/b&gt;. The new settings will be applied at the next system start. Let’s reboot the server:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After rebooting, we can check all mounted directories with the following command:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;df -h&lt;/code&gt;&lt;/pre&gt;
&lt;pre translate=&quot;no&quot;&gt;Filesystem      Size  Used Avail Use% Mounted on
tmpfs           6.3G  1.7M  6.3G   1% /run
/dev/sdb2        49G   23G   24G  50% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda1       440G   28K  417G   1% /home/usergpu
tmpfs           6.3G  4.0K  6.3G   1% /run/user/1000&lt;/pre&gt;
&lt;p&gt;Superb! But we can’t write to our home directory yet, because the freshly mounted partition is owned by root. It’s time to reclaim ownership of the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo chown -R usergpu /home/usergpu&lt;/code&gt;&lt;/pre&gt;
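&lt;p&gt;You can confirm the new owner of the directory:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;ls -ld /home/usergpu&lt;/code&gt;&lt;/pre&gt;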
&lt;p&gt;Good job! Let’s move to the next step.&lt;/p&gt;
&lt;h3&gt;Install basic packages&lt;/h3&gt;
&lt;p&gt;Update the software cache from the official Ubuntu repositories and upgrade some packages:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt update &amp;&amp; sudo apt -y upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The system may inform you that a new kernel was installed and will become operational after a reboot. Select &lt;b translate=&quot;no&quot;&gt;OK&lt;/b&gt; twice.&lt;/p&gt;
&lt;p&gt;Next, we need to install the dependencies that Stable Diffusion requires. The first package adds Python virtual environment functionality:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install python3-venv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second package adds Google’s customized implementation of the C programming language’s &lt;b translate=&quot;no&quot;&gt;malloc()&lt;/b&gt; function (TCMalloc). It prevents the &lt;b translate=&quot;no&quot;&gt;“Cannot locate TCMalloc”&lt;/b&gt; error and improves system memory usage:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt install -y --no-install-recommends google-perftools&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, reboot the server again:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Stable Diffusion AUTOMATIC1111: install script&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/812/original/sh_stable_diffusion_webui_2.png?1712212341&quot; alt=&quot;Stable Diffusion WebUI install script&quot;&gt;
&lt;p&gt;The easiest way to install Stable Diffusion with WebUI is by using the premade script written by GitHub user &lt;a href=&quot;https://github.com/AUTOMATIC1111&quot;&gt;AUTOMATIC1111&lt;/a&gt;. This script downloads and sets up both components while resolving all necessary dependencies.&lt;/p&gt;
&lt;p&gt;Let’s download the script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;wget https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui/master/webui.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, make it executable:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;chmod a+x webui.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Execute the downloaded script:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This process may take a couple of minutes. Everything is ready to create perfect images with Stable Diffusion.&lt;/p&gt;
&lt;h3&gt;Troubleshooting&lt;/h3&gt;
&lt;p&gt;If you encounter the error “Torch is not able to use GPU”, you can fix it by reinstalling the NVIDIA® driver via apt:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo apt -y install nvidia-driver-535&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You need to reboot the operating system to enable the driver:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;sudo shutdown -r now&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Generate&lt;/h2&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/813/original/sh_stable_diffusion_webui_3.png?1712212549&quot; alt=&quot;Stable Diffusion WebUI run script&quot;&gt;
&lt;p&gt;The installation script &lt;b translate=&quot;no&quot;&gt;./webui.sh&lt;/b&gt; has another function. It simultaneously serves both the Stable Diffusion backend and the WebUI. However, if you run it without arguments, the server will only be available locally at &lt;a href=&quot;http://127.0.0.1:7860&quot;&gt;http://127.0.0.1:7860&lt;/a&gt;. This can be solved in two ways: port forwarding through an SSH tunnel or allowing connections from external IPs.&lt;/p&gt;
&lt;p&gt;The second way is simpler: just add the &lt;b translate=&quot;no&quot;&gt;--listen&lt;/b&gt; option and you can connect to the web interface at &lt;b translate=&quot;no&quot;&gt;http://[YOUR_LEADERGPU_SERVER_IP_ADDRESS]:7860&lt;/b&gt;. However, this is completely insecure, as every internet user will have access. To prevent unauthorized usage, add the &lt;b translate=&quot;no&quot;&gt;--gradio-auth&lt;/b&gt; option with a username and password separated by a colon:&lt;/p&gt;
&lt;pre translate=&quot;no&quot;&gt;&lt;code translate=&quot;no&quot; class=&quot;bash-user&quot;&gt;./webui.sh --listen --gradio-auth user:password&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This adds a login page to your WebUI instance. On the first run, the script will download the basic models and required dependencies:&lt;/p&gt;
&lt;img src=&quot;https://assets.getwildcard.com/system/images/imgs/000/000/814/original/sh_stable_diffusion_webui_4.png?1712212654&quot; alt=&quot;Stable Diffusion WebUI Gradio&quot;&gt;
&lt;p&gt;You can enjoy the result. Just enter a few prompts, separate them by commas, and click the Generate button. After a few seconds, an image generated by the neural network will be displayed.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We&#39;ve come all the way from an empty LeaderGPU server with just a pre-installed operating system to a ready instance with Stable Diffusion and a WebUI interface. Next time, we’ll learn more about software performance tuning and how to properly enhance your Stable Diffusion instance with new versions of drivers and packages.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/566-stable-diffusion-models-customization-and-options&quot;&gt;Stable Diffusion Models: customization and options&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/598-easy-diffusion-ui&quot;&gt;Easy Diffusion UI&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/595-pytorch-for-linux&quot;&gt;PyTorch for Linux&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.leadergpu.com/catalog/596-pytorch-for-windows&quot;&gt;PyTorch for Windows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <enclosure url="https://assets.getwildcard.com/system/images/imgs/000/000/810/original/il_stable_diffusion_webui.png?1712212156"
        length="0"
        type="image/jpeg"/>
      <pubDate>Mon, 25 Nov 2024 13:24:45 +0100</pubDate>
      <guid isPermaLink="false">565</guid>
      <dc:date>2024-11-25 13:24:45 +0100</dc:date>
    </item>
  </channel>
</rss>