Immich GPU Acceleration: Empowering the Home AI Photo Library with Integrated Graphics

3 minute read


Engraving memories with computing power.

Requirements

A previous post already outlined the rough architecture of my NAS. The photo library is managed mainly with Synology's built-in Synology Photos, a fairly mature and reliable solution. Its AI fuzzy search, however, is almost nonexistent, so I still rely on iOS Photos / Google Photos as a supplement. Some NAS systems, such as FeiNiu OS, do ship newer features of this kind, but on Synology there is also a fully local, open-source alternative: Immich.

Immich's flagship AI feature is text-to-image search. Modern cross-modal retrieval is essentially built on CLIP-like models, which align vision and text in the same embedding space. You pre-embed every image in the library into a vector; at search time the query text is embedded into a vector as well, and results are ranked by cosine similarity: the closer the two vectors, the better the image matches the text description. The same model can also power image-to-image search and duplicate detection, simply by checking whether two image vectors are overly similar. Other AI features such as OCR and face recognition can be implemented too, but their models are far less resource-hungry than CLIP, so they are not the focus here.
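To make the retrieval step concrete, here is a minimal sketch using the sentence-transformers CLIP wrapper. It is not Immich's actual pipeline (Immich serves its own ONNX model exports through the machine-learning container), and the model name and file paths are just placeholders:

# Minimal sketch of CLIP-style text-to-image search.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # any CLIP-like model works

# 1. At indexing time, pre-embed every photo into a vector (done once).
paths = ["beach.jpg", "birthday.jpg", "cat_on_sofa.jpg"]  # placeholder files
image_vecs = model.encode([Image.open(p) for p in paths],
                          convert_to_tensor=True, normalize_embeddings=True)

# 2. At query time, embed the search text into the same vector space.
query_vec = model.encode("a cat sleeping on the couch",
                         convert_to_tensor=True, normalize_embeddings=True)

# 3. Rank photos by cosine similarity; a higher score is a better match.
scores = util.cos_sim(query_vec, image_vecs)[0]
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")

Image-to-image search and duplicate detection follow the same pattern, only with an image vector on the query side as well.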

How it works

Out of the box, Immich uses OpenAI's original CLIP, whose multilingual support (Chinese included) is weak. Most Chinese-language tutorials recommend switching to one of two multilingual models: XLM-Roberta-Large-Vit-B-16Plus or nllb-clip-large-siglip__v1. The latter is newer, larger, and benchmarks higher, but given my limited memory I chose the former.

Another issue is that, apart from PP-OCRv5_mobile (a Baidu model), the CLIP and face-recognition models all have to be downloaded from HuggingFace. From within mainland China that means going through HF-Mirror, or grabbing the cloud-drive links shared in those Chinese tutorials (with no guarantee of integrity), and then mounting the files under the /cache directory of the immich-machine-learning container (a scripted download sketch follows the directory listing below). My directory structure looks like this; with everything in place, CPU inference should already work:

❯ tree -L 3
.
├── clip
│   ├── XLM-Roberta-Large-Vit-B-16Plus
│   │   ├── README.md
│   │   ├── config.json
│   │   ├── textual
│   │   └── visual
│   └── nllb-clip-large-siglip__v1
│       ├── README.md
│       ├── config.json
│       ├── textual
│       └── visual
├── facial-recognition
│   └── buffalo_l
│       ├── README.md
│       ├── detection
│       └── recognition
└── ocr
    └── PP-OCRv5_mobile
        ├── detection
        └── recognition
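For reference, pulling the models through HF-Mirror can also be scripted instead of downloading by hand. The sketch below uses huggingface_hub, and the immich-app repository names are an assumption on my part; check which repos your Immich version actually expects:

import os

# HF_ENDPOINT must be set before huggingface_hub is imported,
# otherwise the default endpoint (huggingface.co) is used.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Assumed repository names; adjust them to whatever your setup needs.
MODELS = {
    "clip/XLM-Roberta-Large-Vit-B-16Plus": "immich-app/XLM-Roberta-Large-Vit-B-16Plus",
    "facial-recognition/buffalo_l": "immich-app/buffalo_l",
}

for local_subdir, repo_id in MODELS.items():
    # Download every file of the repo into the directory that will be
    # mounted at /cache inside the immich-machine-learning container.
    snapshot_download(repo_id=repo_id, local_dir=local_subdir)
    print(f"downloaded {repo_id} -> {local_subdir}")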

GPU Acceleration

The natural next step is to accelerate inference with a GPU. Besides CPU, Immich also offers CUDA, ROCm, and OpenVINO backends. I'm running an i5-13500H, so OpenVINO lets me offload inference to its 80-EU Iris Xe iGPU. It's hardly a powerful AI accelerator, but in theory it should still beat the CPU and deliver better performance per watt.

Previously, for Jellyfin hardware transcoding, I had already passed through the host’s iGPU into the LXC that runs Docker on PVE, so I only needed to pass it through again inside Docker:

services:
  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, rocm, openvino, rknn] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    # The image URL here points at an unofficial mirror registry so it can be pulled from within mainland China
    image: ghcr.precu.re/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino
    device_cgroup_rules:
      - 'c 189:* rmw'
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /dev/bus/usb:/dev/bus/usb

If everything goes smoothly, you should see load on the GPU whenever Immich's smart-search indexing job is running:

[Screenshot: intel_gpu_top showing compute load on the iGPU]

This confirms that the ML workload is indeed being handled by the GPU. For my roughly 300K photos, embedding everything into vectors appears to take less than one night, which is perfectly acceptable.
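As a rough sanity check on throughput: assuming a "night" means about eight hours, 300,000 images in roughly 28,800 seconds works out to around 10 embeddings per second on the iGPU.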