Fix SpatialTransformer
#578
Conversation
src/diffusers/models/attention.py
```diff
@@ -144,10 +144,10 @@ def forward(self, hidden_states, context=None):
         residual = hidden_states
         hidden_states = self.norm(hidden_states)
         hidden_states = self.proj_in(hidden_states)
-        hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, channel)
+        hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, -1)
```
Review comment on the diff above: the last dimension of the reshape should be `-1` (which resolves to `inner_dim`) rather than `channel`, the number of channels of the initial `hidden_states`, because the tensor has already been projected by `self.proj_in(hidden_states)`, which changes the number of channels.
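A minimal sketch of why the original reshape can fail, using hypothetical shapes and a stand-in 1×1 convolution for `proj_in` (the assumption here is that `proj_in` maps `in_channels` to `inner_dim`); it is only an illustration of the shape mismatch, not the module itself:

```python
import torch

# Hypothetical shapes chosen so that inner_dim != in_channels.
batch, in_channels, height, width = 1, 6, 16, 16
n_heads, d_head = 2, 8
inner_dim = n_heads * d_head  # 16

# Assumed stand-in for SpatialTransformer.proj_in: a 1x1 conv mapping in_channels -> inner_dim.
proj_in = torch.nn.Conv2d(in_channels, inner_dim, kernel_size=1)

hidden_states = proj_in(torch.randn(batch, in_channels, height, width))
print(hidden_states.shape)  # torch.Size([1, 16, 16, 16]) -- the channel dim is now inner_dim

# Reshaping with the original `channel` (= in_channels) would raise a RuntimeError:
# hidden_states.permute(0, 2, 3, 1).reshape(batch, height * width, in_channels)

# Reshaping with -1 (or inner_dim) works:
hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * width, inner_dim)
print(hidden_states.shape)  # torch.Size([1, 256, 16])
```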
The documentation is not available anymore as the PR was closed or merged.
Thanks for the PR. As can be seen below, `d_head` is computed as `in_channels // n_head`, so `inner_dim = n_head * d_head = in_channels`:

diffusers/src/diffusers/models/unet_blocks.py, lines 507 to 510 in c01ec2d:

```python
SpatialTransformer(
    out_channels,
    attn_num_head_channels,
    out_channels // attn_num_head_channels,
```

So it's fine to leave it as is, because specifying the size as a variable is much more readable than `-1`.

Maybe we could make it more clear that `inner_dim = n_head * d_head = in_channels`.
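A quick sanity check of that argument (the numbers below are illustrative only, not taken from any real config): with the constructor call quoted above, `inner_dim` collapses back to the input channel count whenever the channel count is divisible by `attn_num_head_channels`, which is why `channel` and `-1` coincide in the current UNet usage:

```python
# Illustrative values: check that
# SpatialTransformer(out_channels, attn_num_head_channels, out_channels // attn_num_head_channels)
# implies inner_dim == out_channels, i.e. inner_dim == in_channels for this usage.
out_channels = 320
attn_num_head_channels = 8

n_heads = attn_num_head_channels
d_head = out_channels // attn_num_head_channels
inner_dim = n_heads * d_head

assert inner_dim == out_channels  # holds whenever out_channels % attn_num_head_channels == 0
```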
Hi @patil-suraj: We can force [...]. Otherwise, let's fix it :-) I do agree that [...]

```python
hidden_states = self.proj_in(hidden_states)
inner_dim = hidden_states.shape[1]
hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, inner_dim)
```

I made the change to make the shape more clear.
Hmmm - I'm also not too sure about this here @ydshieh, are we fixing a bug here? If the current code is not buggy, it's the better, more readable option IMO. +1 on what @patil-suraj said.
Hi @patrickvonplaten, it is indeed buggy as long as `d_head` is not `in_channels // n_heads`. In terms of readability, the latest change should be fine:

```python
inner_dim = hidden_states.shape[1]
hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, inner_dim)
```

Leaving the current code as it is makes the code somewhat confusing, and it is also not good for testing purposes.

To reproduce:

```python
import numpy as np
import torch

from diffusers.models.attention import SpatialTransformer

N, H, W, C = (1, 16, 16, 6)
heads = 2
dim_head = 8
context_dim = 4
context_seq_len = 3

sample = np.random.default_rng().standard_normal(size=(N, C, H, W), dtype=np.float32)
context = np.random.default_rng().standard_normal(size=(N, context_seq_len, context_dim), dtype=np.float32)

pt_sample = torch.tensor(sample, dtype=torch.float32)
pt_context = torch.tensor(context, dtype=torch.float32)

pt_layer = SpatialTransformer(in_channels=C, context_dim=context_dim, n_heads=heads, d_head=dim_head, num_groups=3)

with torch.no_grad():
    pt_output = pt_layer(pt_sample, context=pt_context)
```

Error:

```
    hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, channel)
RuntimeError: shape '[1, 256, 6]' is invalid for input of size 4096
```
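For context, the numbers in that error message follow directly from the shapes used in the repro; a quick check with the same values (nothing new assumed):

```python
# "input of size 4096" is the element count after proj_in, whose channel dimension is
# inner_dim = heads * dim_head = 16, while the requested shape [1, 256, 6] still uses
# the original channel count C = 6.
N, H, W, C = 1, 16, 16, 6
heads, dim_head = 2, 8
inner_dim = heads * dim_head                  # 16

numel_after_proj_in = N * H * W * inner_dim   # 4096 -> "input of size 4096"
requested_numel = N * (H * W) * C             # 1536 -> invalid shape [1, 256, 6]

assert numel_after_proj_in == 4096
assert requested_numel != numel_after_proj_in  # hence the RuntimeError
```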
Hey @ydshieh, sorry, I think [...]
The usage seems to be the case, but this is not mentioned in the [...]. However, feel free to close this PR if you and @patil-suraj think the change is not really necessary :-).
Upon second reflection this is actually quite clean!
Merged commit message:

* Fix SpatialTransformer
* Fix SpatialTransformer

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>