A PyTorch implementation of conditional Denoising Diffusion Probabilistic Models (DDPM) for generating i-CLEVR dataset images based on object specifications.
- Conditional Generation: Generate images based on object specifications (shapes, colors)
- Advanced UNet Architecture: Custom UNet with attention mechanisms and residual blocks
- Flexible Training: Support for different beta schedules, loss functions, and optimization strategies
- Comprehensive Evaluation: Built-in metrics for assessing generation quality
- Memory Efficient: Gradient accumulation and memory optimization options
- Experiment Tracking: Integration with Weights & Biases for monitoring training progress
- UNet with Attention: Custom UNet architecture with self-attention layers for improved long-range dependencies
- Conditional DDPM: Diffusion model that supports conditioning on multiple object labels
- Multi-Object Support: Handle images with 1-3 objects using multi-hot encoding
- Classifier Guidance: Optional classifier-guided sampling for improved label accuracy
- Residual Blocks: Enhanced with time embedding and up/down sampling
- Self-Attention: Captures long-range spatial dependencies
- Time Embedding: Sinusoidal embedding for diffusion timesteps
- Class Embedding: Multi-hot encoding for conditional generation
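The sinusoidal time embedding listed above can be illustrated in plain Python. This is a minimal sketch of the standard formulation, not the code in `layers.py`: the function name `sinusoidal_embedding`, the `max_period` value of 10000, and the sines-then-cosines layout are assumptions made for illustration.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Return a length-`dim` sinusoidal embedding of diffusion timestep `t`.

    Hypothetical helper: the layout (sines first, then cosines) and
    `max_period` are assumptions, not taken from this repository.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall geometrically from 1 towards roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds the sine terms, second half the cosine terms.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


embedding = sinusoidal_embedding(t=37, dim=128)
```

In a UNet the resulting vector is typically passed through a small MLP and added to the feature maps inside each residual block, so every block knows which noise level it is denoising.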
- Python 3.8+
- PyTorch 2.0+
- CUDA (optional, for GPU acceleration)
# Clone the repository
git clone <repository-url>
cd diffusion-model
# Install dependencies
pip install -r requirements.txt
# Prepare your data directory structure:
# data/
# └── iclevr/
#     ├── train.json
#     ├── test.json
#     ├── new_test.json
#     ├── objects.json
#     └── images/

# Basic training
python -m src.main --mode train --run_name my_experiment
# Training with custom parameters
python -m src.main \
--mode train \
--run_name ddpm_experiment \
--data_dir ./data/iclevr \
--batch_size 32 \
--epochs 200 \
--lr 2e-4 \
--timesteps 400 \
--beta_schedule cosine \
--use_attention \
--use_wandb

# Generate samples using a trained model
python -m src.main \
--mode sample \
--model_ckpt ./results/my_experiment/checkpoints/best_model.pth \
--data_dir ./data/iclevr
# Run complete evaluation pipeline
python -m src.main \
--mode inference \
--model_ckpt ./results/my_experiment/checkpoints/best_model.pth \
--evaluator_ckpt ./pretrained/evaluator.pth \
--data_dir ./data/iclevr

data/iclevr/
├── train.json      # Training labels: {"filename": ["object1", "object2", ...]}
├── test.json       # Test conditions: [["object1"], ["object1", "object2"], ...]
├── new_test.json   # Additional test conditions
├── objects.json    # Object definitions: {"object_name": index}
└── images/         # Training images directory
- Shapes: cube, sphere, cylinder
- Colors: red, green, blue, yellow, cyan, purple, brown, gray
- Total Objects: 24 (3 shapes × 8 colors)
- Objects per Image: 1-3 objects
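With 24 object classes and one to three objects per image, a condition can be represented as a 24-dimensional multi-hot vector. The sketch below shows one way to build it from an `objects.json`-style mapping. The helper name `multi_hot` and the three sample entries are hypothetical; the real object names and indices come from the dataset file.

```python
import json


def multi_hot(labels, object_to_index, num_classes=24):
    """Encode a list of object names as a multi-hot vector of floats."""
    vec = [0.0] * num_classes
    for name in labels:
        # A KeyError here means the label is missing from objects.json.
        vec[object_to_index[name]] = 1.0
    return vec


# Hypothetical excerpt of objects.json; actual names and indices may differ.
objects = json.loads('{"gray cube": 0, "red cube": 1, "blue sphere": 2}')

# One test condition in the test.json format: a list of object names.
cond = multi_hot(["red cube", "blue sphere"], objects)
```

Unlike a one-hot vector, the encoding carries no order or count information beyond presence, which matches a dataset where each object type appears at most once per image.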
| Parameter | Default | Description |
|---|---|---|
| `--batch_size` | 64 | Training batch size |
| `--epochs` | 100 | Number of training epochs |
| `--lr` | 1e-4 | Learning rate |
| `--timesteps` | 400 | Number of diffusion timesteps |
| `--sampling_timesteps` | 50 | Sampling steps (DDIM acceleration) |
| `--beta_schedule` | cosine | Beta schedule ('linear' or 'cosine') |
| `--use_attention` | False | Enable attention layers |
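To make the relationship between `--timesteps` and `--sampling_timesteps` concrete: DDIM-style sampling visits only a subsequence of the training timesteps. The sketch below picks an evenly spaced subsequence for the default values (400 and 50). The function `ddim_timesteps` is hypothetical, and the spacing used by this project may differ.

```python
def ddim_timesteps(train_steps, sampling_steps):
    """Return `sampling_steps` evenly spaced timesteps, highest first.

    Hypothetical helper; the exact spacing rule is an assumption.
    """
    if not 1 <= sampling_steps <= train_steps:
        raise ValueError("sampling_steps must be in [1, train_steps]")
    ascending = [(i * train_steps) // sampling_steps for i in range(sampling_steps)]
    # Sampling runs from the noisiest timestep down to timestep 0.
    return ascending[::-1]


steps = ddim_timesteps(train_steps=400, sampling_steps=50)

# Each denoising update moves from timestep t to t_prev; -1 marks the final image.
pairs = list(zip(steps, steps[1:] + [-1]))
```

With these defaults each sampling update skips 8 training timesteps, so generation needs 50 network evaluations instead of 400.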
| Parameter | Default | Description |
|---|---|---|
| `--base_channels` | 128 | Base number of UNet channels |
| `--num_res_blocks` | 2 | Residual blocks per level |
| `--channel_multipliers` | [1,2,4,8] | Channel scaling factors |
| `--attention_resolutions` | [16,8] | Resolutions for attention |
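The architecture defaults are easier to read as a per-level plan. The sketch below assumes a 64×64 input, a 2× downsample between levels, and that `--attention_resolutions` refers to feature-map sizes. None of these assumptions is confirmed by this README (some implementations interpret attention resolutions as downsampling factors), and `unet_level_plan` is a hypothetical helper.

```python
def unet_level_plan(image_size, base_channels=128,
                    channel_multipliers=(1, 2, 4, 8),
                    attention_resolutions=(16, 8)):
    """List resolution, channel width and attention use for each encoder level."""
    plan = []
    resolution = image_size
    for mult in channel_multipliers:
        plan.append({
            "resolution": resolution,
            "channels": base_channels * mult,
            # Assumption: attention is applied where the feature-map size matches.
            "attention": resolution in attention_resolutions,
        })
        resolution //= 2  # assumed 2x downsampling between levels
    return plan


plan = unet_level_plan(image_size=64)
for level in plan:
    print(level)
```

Under these assumptions attention runs only on the 16×16 and 8×8 feature maps, where the quadratic cost of self-attention over spatial positions is much lower than at full resolution.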
| Parameter | Default | Description |
|---|---|---|
| `--gradient_accumulation_steps` | 1 | Gradient accumulation |
| `--reduce_memory` | False | Enable memory optimization |
| `--grad_clip` | 1.0 | Gradient clipping threshold |
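Gradient accumulation trades memory for wall-clock time: for example, 4 accumulation steps at batch size 16 produce the same averaged gradient as one batch of 64. The toy example below demonstrates that equivalence in plain Python on a one-parameter model. It is an illustration of the idea, not the trainer's code, and the function names are hypothetical.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / number of steps."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    grad = 0.0
    for chunk in chunks:
        # Scaling each micro-batch loss by 1/len(chunks) keeps the total an average.
        grad += mse_grad(w, chunk) / len(chunks)
    return grad


data = [(float(x), 3.0 * x + 1.0) for x in range(64)]
full = mse_grad(0.5, data)                               # one batch of 64
accum = accumulated_grad(0.5, data, micro_batch_size=16)  # 4 steps of 16
```

The match is exact (up to floating-point rounding) only when all micro-batches have the same size. Layers that depend on batch statistics, such as batch normalization, would still see the smaller micro-batch.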
- Training logs: results/{run_name}/logs/training.log
- Checkpoints: results/{run_name}/checkpoints/
- Generated samples: results/{run_name}/samples/
- Training curves: results/{run_name}/training_curves.png
# Enable W&B logging
python -m src.main \
--mode train \
--use_wandb \
--wandb_project "diffusion-iclevr" \
--wandb_name "experiment-1"

For comprehensive experimental analysis, preprocessing optimization results, model architecture comparisons, and detailed technical insights, see TECHNICAL_REPORT.md.
- Batch Size: Use the largest batch size that fits in memory (32-128)
- Learning Rate: Start with 2e-4, adjust based on convergence
- Beta Schedule: Cosine generally works better than linear
- Attention: Enable for better quality but slower training
- Gradient Accumulation: Use when memory is limited
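For readers comparing the two beta schedules mentioned in the tips, the sketch below computes both in plain Python. The linear endpoints (1e-4 to 0.02) and the cosine offset `s = 0.008` with a 0.999 clip are the values commonly used in the literature. They are assumptions here, since this README does not state the project's exact constants.

```python
import math


def linear_betas(timesteps, beta_start=1e-4, beta_end=0.02):
    """Betas rising linearly from beta_start to beta_end (assumed endpoints)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cosine_betas(timesteps, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative signal level."""
    def alpha_bar(t):
        return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - alpha_bar(i + 1) / alpha_bar(i), max_beta)
            for i in range(timesteps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal kept at each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


linear = linear_betas(400)
cosine = cosine_betas(400)
signal_linear = cumulative_alphas(linear)
signal_cosine = cumulative_alphas(cosine)
```

Comparing `signal_linear` and `signal_cosine` at a few timesteps shows how each schedule distributes noise over the diffusion process, which is a quick sanity check before committing to a long training run.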
# For limited GPU memory
python -m src.main \
--batch_size 16 \
--gradient_accumulation_steps 4 \
--reduce_memory \
--num_workers 2

# High quality (slower)
--timesteps 1000 --sampling_timesteps 250 --use_attention
# Balanced (recommended)
--timesteps 400 --sampling_timesteps 50 --use_attention
# Fast (lower quality)
--timesteps 200 --sampling_timesteps 25

diffusion-model/
├── src/
│   ├── models/
│   │   ├── __init__.py
│   │   ├── ddpm.py              # DDPM implementation
│   │   ├── unet.py              # UNet architecture
│   │   └── layers.py            # Neural network layers
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py           # Main training loop
│   │   ├── config.py            # Configuration management
│   │   ├── dataset.py           # Data loading utilities
│   │   └── utils.py             # Training utilities
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── evaluator.py         # Model evaluation
│   │   └── metrics.py           # Evaluation metrics
│   ├── evaluator.py             # Pre-trained ResNet18 evaluator
│   ├── preprocess_images.py     # Image preprocessing utilities
│   └── main.py                  # Entry point
├── demo/
│   ├── train.sh                 # Training demo script
│   ├── inference.sh             # Inference demo script
│   └── quickstart.py            # Python demo
├── data/                        # Dataset directory
├── results/                     # Training outputs
├── requirements.txt             # Dependencies
├── TECHNICAL_REPORT.md          # Detailed technical implementation report
├── README.md                    # This file
└── .gitignore                   # Git ignore rules
- CUDA Out of Memory: reduce the batch size and enable gradient accumulation, e.g. `--batch_size 16 --gradient_accumulation_steps 4`
- Slow Training: disable attention or reduce the model size, e.g. `--base_channels 64 --num_res_blocks 1`
- Poor Generation Quality: increase model capacity and training time, e.g. `--base_channels 256 --epochs 300 --use_attention`
- Monitor training loss for convergence
- Check generated samples during training
- Use W&B for real-time monitoring
- Validate with evaluation metrics
- Denoising Diffusion Probabilistic Models
- Classifier-Free Diffusion Guidance
- Attention Is All You Need
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
For questions and support, please open an issue on the repository.