Not long after my recent experience training LoRAs using kohya_ss scripts for Stable Diffusion, I noticed that a new version was released, claiming, “The issues in multi-GPU training are fixed.” This statement piqued my interest in giving multi-GPU training a shot to see what challenges I might encounter and to determine what performance benefits could be realized.
To train Stable Diffusion effectively, I prefer using kohya-ss/sd-scripts, a collection of scripts designed to streamline the training process. These scripts support a variety of training methods, including native fine-tuning, Dreambooth, and LoRA. The bmaltais/kohya_ss implementation adds a Gradio GUI to the scripts, which I find incredibly helpful for navigating the myriad training options, providing a more user-friendly alternative to the manual process of discovering, choosing, and inputting training arguments.
With these tools at my disposal, I aimed to investigate whether utilizing multiple GPUs is now a viable option for training. And if it works, how much time could it save compared to a single GPU setup?
For the most part, I used the same hardware and software configuration that was employed in our previous LoRA training analyses. However, I updated the kohya_ss UI to v22.4.1 and PyTorch to 2.1.2. For the GPU configuration, I used two NVIDIA GeForce RTX 4090 Founders Edition cards. The dataset comprised thirteen 1024×1024 photos, configured for 40 repeats apiece, resulting in a total of 520 steps per training run. Consistent with our earlier LoRA testing results, I employed SDPA cross-attention in all tests.
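For reference, in kohya's DreamBooth-style dataset layout the repeat count is encoded in the image folder name, so a dataset like the one described above could be laid out roughly as follows (the folder and subject names here are placeholders, and the same repeats can also be set through the GUI or a dataset config file):

train_data/
    40_subject/    <- the thirteen 1024×1024 photos; 13 images × 40 repeats = 520 steps per epoch on a single GPU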
Additionally, based on the release notes for scripts version 22.4.0, two new arguments are recommended for multi-GPU training:
--ddp_gradient_as_bucket_view
--ddp_bucket_view
Both options were suggested for use in sdxl_train.py, but I found that --ddp_bucket_view was not recognized as a valid argument and doesn't appear anywhere in the code. This left me uncertain about the accuracy of that statement. Moreover, these arguments were actually added to train_util.py, with a comment indicating they should eventually be moved to SDXL training, as they are not supported by SD1/2. Consequently, I opted to include only the --ddp_gradient_as_bucket_view argument in my training setup.
It’s important to note that all training results provided were obtained with full bf16 training enabled, as it was essential for completing Dreambooth and Fine-tuning. Without it, the training would run out of memory, hindering any progress.
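To make the setup more concrete, here is a rough sketch of the kind of multi-GPU launch this corresponds to. The paths are placeholders, and the usual optimizer, learning rate, and output arguments are omitted; the point is only to show where the Accelerate, bf16, and DDP options fit in:

accelerate launch --multi_gpu --num_processes=2 sdxl_train.py \
    --pretrained_model_name_or_path=/path/to/sd_xl_base_1.0.safetensors \
    --train_data_dir=/path/to/train_data \
    --resolution=1024,1024 \
    --sdpa \
    --mixed_precision=bf16 --full_bf16 \
    --ddp_gradient_as_bucket_view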
According to kohya-ss, “In multiple GPU training, the number of images multiplied by GPU count is trained in a single step. Therefore, it is recommended to use --max_train_epochs for training the same amount as with single GPU training.”
This statement implies that using the same configuration as we would for single GPU training results in twice as many epochs as we have configured. For instance, if we expect a single epoch of 1000 steps with a single GPU, with two GPUs, we would get two epochs of 500 steps each. This effectively doubles the total workload, as each GPU processes an image during each step. Thus, the first scenario can be summarized as:

1 epoch × 1000 steps × 1 image per step = 1000 images trained

Conversely, the second scenario would be:

2 epochs × 500 steps × 2 images per step = 2000 images trained
To mitigate this issue, one straightforward solution is to set max_train_epochs to 1, as suggested by kohya-ss. Referring back to the previous example, this adjustment results in a single epoch with 500 steps. Since each step consists of one image trained per GPU, we can equate these 500 steps to the 1000 steps of the single GPU training run.
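On the command line, that cap is a single additional argument appended to a launch like the sketch above:

--max_train_epochs=1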
However, being able to train over multiple epochs and compare the results against each other is incredibly helpful for fine-tuning the output. Therefore, to use multiple epochs without inflating the step count, the training data could instead be prepared with half as many steps per training image, leading to the same number of total steps as a dataset prepared for training with a single GPU.
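As a concrete example using the dataset from this article: thirteen images at 40 repeats works out to 520 steps per epoch on a single GPU. Halving the repeats to 20 gives 260 steps per epoch on two GPUs, and since each step trains two images, the per-epoch workload stays the same:

13 images × 20 repeats = 260 steps per epoch
260 steps × 2 images per step = 520 images trained per epoch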
I also attempted to train using distributed training optimizations like DeepSpeed and FSDP. However, I could not complete any training runs despite various configurations I tested. This suggests that there may be additional performance optimizations available that could be leveraged if these options are configured correctly.
To kick off my testing, I focused on how a second GPU impacts performance while training a LoRA with 128 dimensions. I have two charts to illustrate this: one for iterations per second and another for total training time. Alongside the single-GPU results, the training time chart includes two multi-GPU results, demonstrating the difference in total training time between runs where maximum epochs were uncapped and those limited to one.
At first glance, it appears there is a performance decrease when we solely consider the raw iterations per second or the duration of a training run without capping epochs. However, once we set the maximum epochs to one, we can observe the performance benefits of training with two GPUs. Although each iteration takes roughly 23% longer, we ultimately complete the training run 36% faster than with a single GPU due to processing two training images simultaneously.
When evaluating the Dreambooth performance, the results tell a different story. Contrary to the LoRA results, where a moderate performance drop was evident with the introduction of another GPU, the Dreambooth results revealed a stark decrease in performance. The distributed training ran at only about one-third of the speed of a single card. Consequently, even when limiting the multi-GPU training run to a single epoch, it still fell short of outperforming a single GPU.
The finetuning results echoed what I discovered with Dreambooth. The significant reduction in iterations per second meant that even with the simultaneous completion of training steps, the training time did not decrease compared to the single GPU training run.
While it is possible to achieve performance benefits when training Stable Diffusion with kohya's scripts and multiple GPUs, the process is not as straightforward as merely adding a second GPU and launching a training run. Beyond configuring Accelerate to utilize multiple GPUs, we must also factor in the multiplication of epochs. This can be managed by either capping the maximum epochs to 1 or preparing our dataset with fewer repeats per image.
Furthermore, I only realized performance benefits in LoRA training; both Dreambooth and Finetuning exhibited significantly reduced performance. At this point, I remain uncertain if these results stem from a lack of multi-GPU optimizations, such as DeepSpeed, in my configuration or if they result from issues inherent in the scripts themselves.
If any readers have successfully utilized DeepSpeed or other distributed training optimizations with kohya's scripts, I would love to hear from you! Please share your insights or experiences in the comments. Your feedback could greatly aid others navigating similar challenges in multi-GPU training setups.