
Commit 62d4462

* Remove (de)parallelize stuff
* Edit shape comments
* Update README.md
* make fix-copies
1 parent: 37270ae

6 files changed, +45 -268 lines

README_ko.md

Lines changed: 1 addition & 1 deletion
@@ -259,7 +259,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
+1. **[LongT5](https://huggingface.co/docs/transformers/main/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

README_zh-hans.md

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ conda install -c huggingface transformers
 1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (来自 Microsoft Research Asia) 伴随论文 [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) 由 Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei 发布。
 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (来自 AllenAI) 伴随论文 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 由 Iz Beltagy, Matthew E. Peters, Arman Cohan 发布。
 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (来自 AllenAI) 伴随论文 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 由 Iz Beltagy, Matthew E. Peters, Arman Cohan 发布。
-1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (来自 Google AI) released 伴随论文 [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) 由 Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang 发布。
+1. **[LongT5](https://huggingface.co/docs/transformers/main/model_doc/longt5)** (来自 Google AI) released 伴随论文 [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) 由 Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang 发布。
 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (来自 Studio Ousia) 伴随论文 [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) 由 Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto 发布。
 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (来自 UNC Chapel Hill) 伴随论文 [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) 由 Hao Tan and Mohit Bansal 发布。
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (来自 Facebook) 伴随论文 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 由 Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 发布。

README_zh-hant.md

Lines changed: 1 addition & 1 deletion
@@ -295,7 +295,7 @@ conda install -c huggingface transformers
 1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
+1. **[LongT5](https://huggingface.co/docs/transformers/main/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

src/transformers/models/longt5/configuration_longt5.py

Lines changed: 3 additions & 3 deletions
@@ -29,9 +29,9 @@

 class LongT5Config(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a [`LongT5Model`] or a [`FlaxLongT5Model`]. It is used
-    to instantiate a LongT5 model according to the specified arguments, defining the model architecture. Instantiating
-    a configuration with the defaults will yield a similar configuration to that of the LongT5
+    This is the configuration class to store the configuration of a [`LongT5Model`] or a [`FlaxLongT5Model`]. It is
+    used to instantiate a LongT5 model according to the specified arguments, defining the model architecture.
+    Instantiating a configuration with the defaults will yield a similar configuration to that of the LongT5
     [](https://huggingface.co/) architecture.

     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
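
For context, the reflowed docstring describes the usual `PretrainedConfig` workflow. A minimal sketch of that pattern (illustrative, not part of this commit): instantiate the config with defaults, then build a model from it.

```python
from transformers import LongT5Config, LongT5Model

# Default-initialized configuration; arguments such as d_model could be
# overridden here to change the architecture.
config = LongT5Config()

# A randomly initialized LongT5 model built from that configuration.
model = LongT5Model(config)

# The configuration that was used to build the model stays accessible.
print(model.config)
```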

src/transformers/models/longt5/modeling_flax_longt5.py

Lines changed: 2 additions & 67 deletions
@@ -138,7 +138,8 @@ def _get_local_attention_mask(attention_mask: np.ndarray, block_len: int) -> jnp
     # [batch_size, num_block, block_len, 3 * block_len]
     local_attention_mask = jnp.logical_and(_blocked_attention_mask, _3blocked_attention_mask)
     local_attention_mask = _mask_local_attention_mask(local_attention_mask, block_len)
-    return local_attention_mask[:, None, ...]  # [batch_size, 1, num_block, block_len, 3 * block_len]
+    # [batch_size, 1, num_block, block_len, 3 * block_len]
+    return local_attention_mask[:, None, ...]


 def _make_global_fixed_block_ids(attention_mask: np.ndarray, global_block_size: int) -> Tuple[jnp.ndarray, np.ndarray]:
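
To make the edited shape comment concrete, here is a minimal NumPy sketch of how a padding mask of shape `[batch_size, seq_len]` ends up as `[batch_size, 1, num_block, block_len, 3 * block_len]`. This is an illustrative re-implementation (the function name is hypothetical), not the library's `_get_local_attention_mask`, and it skips the `_mask_local_attention_mask` locality step:

```python
import numpy as np

def local_attention_mask_sketch(attention_mask: np.ndarray, block_len: int) -> np.ndarray:
    """Illustrative only: reproduces the shapes from the comment above."""
    batch_size, seq_len = attention_mask.shape
    num_block = -(-seq_len // block_len)  # ceil(seq_len / block_len)

    # Pad the sequence so it splits evenly into blocks of block_len queries.
    pad = num_block * block_len - seq_len
    mask = np.pad(attention_mask, ((0, 0), (0, pad)))
    blocked = mask.reshape(batch_size, num_block, block_len)

    # Each query block attends to its left neighbour, itself and its right
    # neighbour, giving 3 * block_len key positions per block.
    padded = np.pad(blocked, ((0, 0), (1, 1), (0, 0)))
    three_blocked = np.concatenate([padded[:, i : i + num_block] for i in range(3)], axis=-1)

    # [batch_size, num_block, block_len, 3 * block_len]
    local = np.logical_and(blocked[..., :, None], three_blocked[..., None, :])
    # [batch_size, 1, num_block, block_len, 3 * block_len]
    return local[:, None, ...]

mask = np.ones((2, 100), dtype=bool)  # batch of 2, sequence length 100
print(local_attention_mask_sketch(mask, block_len=16).shape)  # (2, 1, 7, 16, 48)
```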
@@ -712,39 +713,6 @@ def _split_heads(self, hidden_states):
     def _merge_heads(self, hidden_states):
         return hidden_states.reshape(hidden_states.shape[0], -1, self.inner_dim)

-    @nn.compact
-    def _concatenate_to_cache(self, key, value, query, attention_mask):
-        """
-        This function takes projected key, value states from a single input token and concatenates the states to cached
-        states from previous steps. This function is slighly adapted from the official Flax repository:
-        https://github.com/google/flax/blob/491ce18759622506588784b4fca0e4bf05f8c8cd/flax/linen/attention.py#L252
-        """
-        # detect if we're initializing by absence of existing cache data.
-        is_initialized = self.has_variable("cache", "cached_key")
-        cached_key = self.variable("cache", "cached_key", jnp.zeros, key.shape, key.dtype)
-        cached_value = self.variable("cache", "cached_value", jnp.zeros, value.shape, value.dtype)
-        cache_index = self.variable("cache", "cache_index", lambda: jnp.array(0, dtype=jnp.int32))
-
-        if is_initialized:
-            *batch_dims, max_length, num_heads, depth_per_head = cached_key.value.shape
-            # update key, value caches with our new 1d spatial slices
-            cur_index = cache_index.value
-            indices = (0,) * len(batch_dims) + (cur_index, 0, 0)
-            key = jax.lax.dynamic_update_slice(cached_key.value, key, indices)
-            value = jax.lax.dynamic_update_slice(cached_value.value, value, indices)
-            cached_key.value = key
-            cached_value.value = value
-            num_updated_cache_vectors = query.shape[1]
-            cache_index.value = cache_index.value + num_updated_cache_vectors
-            # causal mask for cached decoder self-attention: our single query position should only attend to those key positions
-            # that have already been generated and cached, not the remaining zero elements.
-            pad_mask = jnp.broadcast_to(
-                jnp.arange(max_length) < cur_index + num_updated_cache_vectors,
-                tuple(batch_dims) + (1, num_updated_cache_vectors, max_length),
-            )
-            attention_mask = combine_masks(pad_mask, attention_mask)
-        return key, value, attention_mask
-
     def _create_position_bias(self, block_len: int, attention_mask: Optional[np.ndarray]) -> np.ndarray:
         # position_bias shape: # (1, 1, n_heads, block_len, 3 * block_len)
         if self.has_relative_attention_bias:
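
The deleted `_concatenate_to_cache` helper is the generic Flax pattern for writing one decoding step's key/value projections into a preallocated cache buffer (the `self.variable("cache", ...)` state) and masking the not-yet-filled slots. For reference, a self-contained sketch of that update pattern, with toy shapes and values chosen here purely for illustration:

```python
import jax.numpy as jnp
from jax import lax

# Toy dimensions: one sequence, room for 8 cached positions, 2 heads.
batch, max_length, num_heads, head_dim = 1, 8, 2, 4

cached_key = jnp.zeros((batch, max_length, num_heads, head_dim))  # preallocated cache
cur_index = 3                                         # positions already written
new_key = jnp.ones((batch, 1, num_heads, head_dim))   # projection for this step

# Write the new slice at position cur_index along the length axis.
cached_key = lax.dynamic_update_slice(cached_key, new_key, (0, cur_index, 0, 0))

# Mask out cache slots that have not been generated yet, so the current
# query only attends to already-cached positions.
num_updated = new_key.shape[1]
pad_mask = jnp.broadcast_to(
    jnp.arange(max_length) < cur_index + num_updated,
    (batch, 1, num_updated, max_length),
)

print(cached_key[0, :, 0, 0])          # only position 3 holds the new values
print(pad_mask.astype(jnp.int32))      # 1s up to and including position 3
```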
@@ -995,39 +963,6 @@ def _split_heads(self, hidden_states):
     def _merge_heads(self, hidden_states):
         return hidden_states.reshape(hidden_states.shape[0], -1, self.inner_dim)

-    @nn.compact
-    def _concatenate_to_cache(self, key, value, query, attention_mask):
-        """
-        This function takes projected key, value states from a single input token and concatenates the states to cached
-        states from previous steps. This function is slighly adapted from the official Flax repository:
-        https://github.com/google/flax/blob/491ce18759622506588784b4fca0e4bf05f8c8cd/flax/linen/attention.py#L252
-        """
-        # detect if we're initializing by absence of existing cache data.
-        is_initialized = self.has_variable("cache", "cached_key")
-        cached_key = self.variable("cache", "cached_key", jnp.zeros, key.shape, key.dtype)
-        cached_value = self.variable("cache", "cached_value", jnp.zeros, value.shape, value.dtype)
-        cache_index = self.variable("cache", "cache_index", lambda: jnp.array(0, dtype=jnp.int32))
-
-        if is_initialized:
-            *batch_dims, max_length, num_heads, depth_per_head = cached_key.value.shape
-            # update key, value caches with our new 1d spatial slices
-            cur_index = cache_index.value
-            indices = (0,) * len(batch_dims) + (cur_index, 0, 0)
-            key = jax.lax.dynamic_update_slice(cached_key.value, key, indices)
-            value = jax.lax.dynamic_update_slice(cached_value.value, value, indices)
-            cached_key.value = key
-            cached_value.value = value
-            num_updated_cache_vectors = query.shape[1]
-            cache_index.value = cache_index.value + num_updated_cache_vectors
-            # causal mask for cached decoder self-attention: our single query position should only attend to those key positions
-            # that have already been generated and cached, not the remaining zero elements.
-            pad_mask = jnp.broadcast_to(
-                jnp.arange(max_length) < cur_index + num_updated_cache_vectors,
-                tuple(batch_dims) + (1, num_updated_cache_vectors, max_length),
-            )
-            attention_mask = combine_masks(pad_mask, attention_mask)
-        return key, value, attention_mask
-
     def _create_position_bias(self, block_len: int, attention_mask: Optional[np.ndarray]) -> np.ndarray:
         # position_bias shape: # (1, 1, n_heads, block_len, 3 * block_len)
         if self.has_relative_attention_bias:
