
Download refactor #8020


Merged: 44 commits, Mar 8, 2024
Commits (44)
66744bb download (LOVE-YOURSELF-1, Feb 23, 2024)
40b27c4 modified file (LOVE-YOURSELF-1, Feb 23, 2024)
68b5f8c modified from_pretrained (LOVE-YOURSELF-1, Feb 26, 2024)
e342983 modified config (LOVE-YOURSELF-1, Feb 26, 2024)
fcc392b modified download (LOVE-YOURSELF-1, Feb 26, 2024)
3aa76ab test_tokenizer (LOVE-YOURSELF-1, Feb 27, 2024)
d6dfcf0 Delete tests/transformers/from_pretrained/run.sh (LOVE-YOURSELF-1, Feb 27, 2024)
0705617 Update test_tokenizer.py (LOVE-YOURSELF-1, Feb 27, 2024)
f9c5af7 Update tokenizer_utils_base.py (LOVE-YOURSELF-1, Feb 27, 2024)
275e52b test_model (LOVE-YOURSELF-1, Feb 27, 2024)
76cd0da test_model (LOVE-YOURSELF-1, Feb 27, 2024)
9bdc94e test_model (LOVE-YOURSELF-1, Feb 27, 2024)
df82769 Remove comments (LOVE-YOURSELF-1, Feb 28, 2024)
5148bc6 Remove comments (LOVE-YOURSELF-1, Feb 28, 2024)
6a0085b add requirements (LOVE-YOURSELF-1, Feb 28, 2024)
7006332 update bos download (JunnYu, Feb 28, 2024)
620aacc Update test_model.py (LOVE-YOURSELF-1, Feb 28, 2024)
ae6169f clear unused import (LOVE-YOURSELF-1, Feb 29, 2024)
7268671 modified bug tokenizer_utils_base.py (LOVE-YOURSELF-1, Feb 29, 2024)
fe24034 change safetensors (LOVE-YOURSELF-1, Feb 29, 2024)
85f37cb modified load generation config (LOVE-YOURSELF-1, Feb 29, 2024)
37b3c25 add requestion (LOVE-YOURSELF-1, Feb 29, 2024)
d8c552d update (JunnYu, Mar 1, 2024)
c22851a modified error (LOVE-YOURSELF-1, Mar 1, 2024)
e392644 fix bug (LOVE-YOURSELF-1, Mar 1, 2024)
40842fd Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 1, 2024)
b44f8ed add \n (JunnYu, Mar 1, 2024)
a18ca41 Update __init__.py (LOVE-YOURSELF-1, Mar 4, 2024)
03d5047 Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 4, 2024)
6bb0544 Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 4, 2024)
0364a65 Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 5, 2024)
b60d218 add requestion (LOVE-YOURSELF-1, Mar 5, 2024)
850796f modified download (LOVE-YOURSELF-1, Mar 5, 2024)
8ce5dfe retest (LOVE-YOURSELF-1, Mar 5, 2024)
af7bb9d Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 6, 2024)
3109368 Update test_tokenizer.py (LOVE-YOURSELF-1, Mar 6, 2024)
d25e6cd Update requirements-dev.txt (LOVE-YOURSELF-1, Mar 6, 2024)
ee497e5 Update requirements.txt (LOVE-YOURSELF-1, Mar 6, 2024)
ed4d372 Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 6, 2024)
d829bc5 delete from_pretrained (LOVE-YOURSELF-1, Mar 6, 2024)
eb06571 Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 6, 2024)
793784f make superior (LOVE-YOURSELF-1, Mar 7, 2024)
286b80a Merge branch 'PaddlePaddle:develop' into download (LOVE-YOURSELF-1, Mar 7, 2024)
119c648 Update run_pretrain_trainer.py (LOVE-YOURSELF-1, Mar 7, 2024)
43 changes: 18 additions & 25 deletions paddlenlp/experimental/model_utils.py
@@ -24,6 +24,7 @@
from paddle.framework import core

from paddlenlp.transformers import PretrainedModel
from paddlenlp.utils.download import get_file

# TODO(fangzeyang) Temporary fix and replace by paddle framework downloader later
from paddlenlp.utils.downloader import COMMUNITY_MODEL_PREFIX, get_path_from_url
@@ -96,6 +97,11 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
pretrained_models = list(cls.pretrained_init_configuration.keys())
resource_files = {}
init_configuration = {}
pretrained_model_name_or_path = str(pretrained_model_name_or_path)
[Review thread]
Collaborator: Is the code in paddlenlp/experimental/model_utils.py covered by CI tests?
Contributor (author): No dedicated unit tests were added under the experimental directory, but new unit tests were added under transformers. Enabling them makes CI fail, even though they run fine locally.
Collaborator: @JunnYu can CE cover this? The risk to inference is fairly high.
Member: My CE runs are all dynamic-graph, so they don't touch the experimental parts.

cache_dir = kwargs.pop("cache_dir", None)
from_hf_hub = kwargs.pop("from_hf_hub", False)
from_aistudio = kwargs.pop("from_aistudio", False)
subfolder = kwargs.pop("subfolder", "")

# From built-in pretrained models
if pretrained_model_name_or_path in pretrained_models:
@@ -106,40 +112,27 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
elif os.path.isdir(pretrained_model_name_or_path):
for file_id, file_name in cls.resource_files_names.items():
full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
resource_files[file_id] = full_file_name
if os.path.isfile(full_file_name):
resource_files[file_id] = full_file_name
resource_files["model_config_file"] = os.path.join(pretrained_model_name_or_path, cls.model_config_file)
else:
# Assuming from community-contributed pretrained models
for file_id, file_name in cls.resource_files_names.items():
full_file_name = "/".join([COMMUNITY_MODEL_PREFIX, pretrained_model_name_or_path, file_name])
resource_files[file_id] = full_file_name
resource_files["model_config_file"] = "/".join(
[COMMUNITY_MODEL_PREFIX, pretrained_model_name_or_path, cls.model_config_file]
)
resource_files[file_id] = file_name

default_root = os.path.join(MODEL_HOME, pretrained_model_name_or_path)
# default_root = os.path.join(MODEL_HOME, pretrained_model_name_or_path)
resolved_resource_files = {}
for file_id, file_path in resource_files.items():
if file_path is None or os.path.isfile(file_path):
resolved_resource_files[file_id] = file_path
continue
path = os.path.join(default_root, file_path.split("/")[-1])
if os.path.exists(path):
logger.info("Already cached %s" % path)
resolved_resource_files[file_id] = path
else:
logger.info("Downloading %s and saved to %s" % (file_path, default_root))
try:
resolved_resource_files[file_id] = get_path_from_url(file_path, default_root)
except RuntimeError as err:
logger.error(err)
raise RuntimeError(
f"Can't load weights for '{pretrained_model_name_or_path}'.\n"
f"Please make sure that '{pretrained_model_name_or_path}' is:\n"
"- a correct model-identifier of built-in pretrained models,\n"
"- or a correct model-identifier of community-contributed pretrained models,\n"
"- or the correct path to a directory containing relevant modeling files(model_weights and model_config).\n"
)
resolved_resource_files[file_id] = get_file(
pretrained_model_name_or_path,
[file_path],
subfolder,
cache_dir=cache_dir,
from_aistudio=from_aistudio,
from_hf_hub=from_hf_hub,
)

# Prepare model initialization kwargs
# Did we saved some inputs and kwargs to reload ?
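The net effect of this hunk: all remote resolution now funnels through a single get_file call, and the download source is chosen by keyword arguments popped from kwargs rather than by separate code paths. A minimal usage sketch of the new calling convention, assuming FasterPretrainedModel is the class this hunk edits; the subclass name, repo id, and cache path are illustrative placeholders, not values from the PR:

```python
# Sketch only: names below are placeholders, not taken from this diff.
from paddlenlp.experimental.model_utils import FasterPretrainedModel


class MyFasterModel(FasterPretrainedModel):
    pass  # a real subclass would define resource_files_names, configs, etc.


model = MyFasterModel.from_pretrained(
    "some-org/some-model",      # built-in name, community id, or local dir
    cache_dir="./model_cache",  # popped from kwargs by the new code
    from_hf_hub=False,          # True resolves resource files from the HF Hub
    from_aistudio=False,        # True resolves resource files from AI Studio
    subfolder="",               # forwarded to get_file unchanged
)
```

All four options have defaults (None, False, False, ""), so existing call sites that pass only a model name keep working.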
6 changes: 2 additions & 4 deletions paddlenlp/experimental/transformers/chatglm/modeling.py
@@ -581,12 +581,10 @@ def __init__(self, config: ChatGLMConfig):
self.lm_head = self.model.get_input_embeddings()

@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path, from_hf_hub: bool = False, subfolder: str | None = None, *args, **kwargs
):
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
# TODO: Support safetensors loading.
kwargs["use_safetensors"] = False
return super().from_pretrained(pretrained_model_name_or_path, from_hf_hub, subfolder, *args, **kwargs)
return super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)

@classmethod
def get_cache_kvs_shape(
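The same simplification repeats for the GPT, Llama, and OPT inference models below: the explicit from_hf_hub/subfolder parameters disappear from the override and now travel inside **kwargs to the base class. A hedged before/after sketch; ChatGLMForCausalLMInferenceModel is our assumed name for the class defined in this file, and the repo id is illustrative:

```python
from paddlenlp.experimental.transformers.chatglm.modeling import (
    ChatGLMForCausalLMInferenceModel,  # assumed class name for this file
)

# Before this PR (positional download-source parameters):
#   model = ChatGLMForCausalLMInferenceModel.from_pretrained(
#       "THUDM/chatglm-6b", False, None)
# After this PR, the options ride in kwargs like the non-experimental API:
model = ChatGLMForCausalLMInferenceModel.from_pretrained(
    "THUDM/chatglm-6b",  # illustrative repo id
    from_hf_hub=False,
    subfolder="",
    dtype="float16",     # unrelated kwargs still pass through unchanged
)
```

Note that the override still forces use_safetensors to False, per the remaining TODO about safetensors support.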
6 changes: 2 additions & 4 deletions paddlenlp/experimental/transformers/gpt/modeling.py
@@ -444,12 +444,10 @@ def __init__(self, config):
self.gpt = GPTInferenceModel(config)

@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path, from_hf_hub: bool = False, subfolder: str | None = None, *args, **kwargs
):
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
# TODO: Support safetensors loading.
kwargs["use_safetensors"] = False
return super().from_pretrained(pretrained_model_name_or_path, from_hf_hub, subfolder, *args, **kwargs)
return super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)

@classmethod
def get_cache_kvs_shape(
14 changes: 5 additions & 9 deletions paddlenlp/experimental/transformers/llama/modeling.py
@@ -865,12 +865,10 @@ def __init__(self, config):
self.lm_head = LlamaLMHead(config)

@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path, from_hf_hub: bool = False, subfolder: str | None = None, *args, **kwargs
):
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
# TODO: Support safetensors loading.
kwargs["use_safetensors"] = False
return super().from_pretrained(pretrained_model_name_or_path, from_hf_hub, subfolder, *args, **kwargs)
return super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)

@classmethod
def get_cache_kvs_shape(
@@ -1106,17 +1104,15 @@ def get_tensor_parallel_split_mappings(num_layers):
return mappings

@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path, from_hf_hub: bool = False, subfolder: str | None = None, *args, **kwargs
):
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
# TODO: Support safetensors loading.
kwargs["use_safetensors"] = False
from paddlenlp.transformers.utils import (
ContextManagers,
is_safetensors_available,
resolve_cache_dir,
)

from_hf_hub = kwargs.pop("from_hf_hub", False)
config = kwargs.pop("config", None)
from_aistudio = kwargs.get("from_aistudio", False)
subfolder = kwargs.get("subfolder", None)
@@ -1125,7 +1121,7 @@ def from_pretrained(
convert_from_torch = kwargs.pop("convert_from_torch", None)
cache_dir = kwargs.pop("cache_dir", None)

cache_dir = resolve_cache_dir(pretrained_model_name_or_path, from_hf_hub, cache_dir)
# cache_dir = resolve_cache_dir(pretrained_model_name_or_path, from_hf_hub, cache_dir)

init_contexts = []
with ContextManagers(init_contexts):
6 changes: 2 additions & 4 deletions paddlenlp/experimental/transformers/opt/modeling.py
@@ -327,12 +327,10 @@ def __init__(self, config: OPTConfig, **kwargs):
self.lm_head = OPTLMHead(config)

@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path, from_hf_hub: bool = False, subfolder: str | None = None, *args, **kwargs
):
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
# TODO: Support safetensors loading.
kwargs["use_safetensors"] = kwargs.get("use_safetensors", False)
return super().from_pretrained(pretrained_model_name_or_path, from_hf_hub, subfolder, *args, **kwargs)
return super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)

@classmethod
def get_cache_kvs_shape(
58 changes: 12 additions & 46 deletions paddlenlp/generation/configuration_utils.py
@@ -25,6 +25,7 @@
from paddlenlp import __version__
from paddlenlp.transformers.configuration_utils import PretrainedConfig
from paddlenlp.transformers.utils import resolve_cache_dir
from paddlenlp.utils.download import get_file
from paddlenlp.utils.log import logger

from ..transformers.aistudio_utils import aistudio_download
@@ -413,52 +414,17 @@ def from_pretrained(
if subfolder is None:
subfolder = ""

cache_dir = resolve_cache_dir(from_hf_hub, from_aistudio, cache_dir)

# 1. get the configuration file from local file, eg: /cache/path/model_config.json
if os.path.isfile(pretrained_model_name_or_path):
resolved_config_file = pretrained_model_name_or_path

# 2. get the configuration file from url, eg: https://ip/path/to/model_config.json
elif is_url(pretrained_model_name_or_path):
resolved_config_file = get_path_from_url_with_filelock(
pretrained_model_name_or_path,
cache_dir=os.path.join(cache_dir, pretrained_model_name_or_path, subfolder),
check_exist=not force_download,
)
# 3. get the configuration file from local dir with default name, eg: /local/path
elif os.path.isdir(pretrained_model_name_or_path):
configuration_file = os.path.join(pretrained_model_name_or_path, subfolder, config_file_name)
if os.path.exists(configuration_file):
resolved_config_file = configuration_file
else:
# try to detect old-school config file
raise FileNotFoundError("please make sure there is `generation_config.json` under the dir")
# 4. get the configuration file from aistudio
elif from_aistudio:
resolved_config_file = aistudio_download(
repo_id=pretrained_model_name_or_path,
filename=config_file_name,
cache_dir=cache_dir,
subfolder=subfolder,
)
# 5. get the configuration file from HF hub
elif from_hf_hub:
resolved_config_file = resolve_hf_generation_config_path(
repo_id=pretrained_model_name_or_path, cache_dir=cache_dir, subfolder=subfolder
)
else:
url_list = [COMMUNITY_MODEL_PREFIX, pretrained_model_name_or_path, config_file_name]
cache_dir = os.path.join(cache_dir, pretrained_model_name_or_path, subfolder)
if subfolder != "":
url_list.insert(2, subfolder)
community_url = "/".join(url_list)
if url_file_exists(community_url):
resolved_config_file = get_path_from_url_with_filelock(
community_url, cache_dir, check_exist=not force_download
)
else:
raise FileNotFoundError(f"configuration file<{GENERATION_CONFIG_NAME}> not found")
# cache_dir = resolve_cache_dir(from_hf_hub, from_aistudio, cache_dir)

resolved_config_file = get_file(
pretrained_model_name_or_path,
[config_file_name],
subfolder,
cache_dir=cache_dir,
force_download=force_download,
from_aistudio=from_aistudio,
from_hf_hub=from_hf_hub,
)

try:
logger.info(f"Loading configuration file {resolved_config_file}")
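Here the old code's five numbered branches (local file, raw URL, local directory, AI Studio, HF Hub) plus the community fallback collapse into a single get_file call. A sketch of the user-facing behavior, assuming the class is the GenerationConfig exported from paddlenlp.generation; the identifiers and path are placeholders:

```python
from paddlenlp.generation import GenerationConfig

# Every variant below goes down the same get_file path internally;
# only the flags differ.
local_cfg = GenerationConfig.from_pretrained("./my-model-dir")
community_cfg = GenerationConfig.from_pretrained("some-org/some-model")
hub_cfg = GenerationConfig.from_pretrained(
    "some-org/some-model",
    from_hf_hub=True,      # resolve generation_config.json from the HF Hub
    force_download=False,  # reuse a cached copy when one exists
)
```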
95 changes: 26 additions & 69 deletions paddlenlp/transformers/auto/configuration.py
@@ -23,6 +23,7 @@
from huggingface_hub import hf_hub_download

from ... import __version__
from ...utils.download import get_file
from ...utils.downloader import (
COMMUNITY_MODEL_PREFIX,
get_path_from_url_with_filelock,
@@ -170,13 +171,8 @@ def from_pretrained(cls, pretrained_model_name_or_path: str, *model_args, **kwargs):
config = AutoConfig.from_pretrained("bert-base-uncased")
config.save_pretrained('./bert-base-uncased')
"""
subfolder = kwargs.get("subfolder", "")
if subfolder is None:
subfolder = ""
from_aistudio = kwargs.pop("from_aistudio", False)
from_hf_hub = kwargs.pop("from_hf_hub", False)
cache_dir = kwargs.pop("cache_dir", None)
cache_dir = resolve_cache_dir(from_hf_hub=from_hf_hub, from_aistudio=from_aistudio, cache_dir=cache_dir)

# cache_dir = resolve_cache_dir(from_hf_hub=from_hf_hub, from_aistudio=from_aistudio, cache_dir=cache_dir)

if not cls.name2class:
cls.name2class = {}
@@ -192,72 +188,33 @@ def from_pretrained(cls, pretrained_model_name_or_path: str, *model_args, **kwargs):
pretrained_model_name_or_path, *model_args, **kwargs
)

# From local dir path
elif os.path.isdir(pretrained_model_name_or_path):
config_file = os.path.join(pretrained_model_name_or_path, subfolder, cls.config_file)
if not os.path.exists(config_file):
# try to load legacy config file
legacy_config_file = os.path.join(pretrained_model_name_or_path, subfolder, cls.legacy_config_file)
if not os.path.exists(legacy_config_file):
raise ValueError(
f"config file<{cls.config_file}> or legacy config file<{cls.legacy_config_file}> not found"
)
subfolder = kwargs.get("subfolder", "")
if subfolder is None:
subfolder = ""
from_aistudio = kwargs.pop("from_aistudio", False)
from_hf_hub = kwargs.pop("from_hf_hub", False)
cache_dir = kwargs.pop("cache_dir", None)

logger.warning(f"loading legacy config file<{cls.legacy_config_file}> ...")
config_file = legacy_config_file
config_file = get_file(
pretrained_model_name_or_path,
[cls.config_file, cls.legacy_config_file],
subfolder,
cache_dir=cache_dir,
from_hf_hub=from_hf_hub,
from_aistudio=from_aistudio,
)

if os.path.exists(config_file):
[Review thread]
Collaborator: Is this guaranteed to exist here? If it doesn't, is the error raised inside get_file?
Contributor (author): If the download fails, the error is raised inside get_file; if the repo simply doesn't have the file, get_file returns None and the error is raised here.

config_class = cls._get_config_class_from_config(pretrained_model_name_or_path, config_file)
logger.info("We are using %s to load '%s'." % (config_class, pretrained_model_name_or_path))
if config_class is cls:
return cls.from_file(config_file)
return config_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
elif from_aistudio:
file = aistudio_download(
repo_id=pretrained_model_name_or_path,
filename=cls.config_file,
subfolder=subfolder,
cache_dir=cache_dir,
)
return cls.from_pretrained(os.path.dirname(file))
elif from_hf_hub:
file = hf_hub_download(
repo_id=pretrained_model_name_or_path,
filename=cls.config_file,
cache_dir=cache_dir,
subfolder=subfolder,
library_name="PaddleNLP",
library_version=__version__,
)
# from local dir path
return cls.from_pretrained(os.path.dirname(file))

# Assuming from community-contributed pretrained models
return config_class.from_pretrained(config_file, *model_args, **kwargs)
else:
url_list = [COMMUNITY_MODEL_PREFIX, pretrained_model_name_or_path, cls.config_file]
legacy_url_list = [COMMUNITY_MODEL_PREFIX, pretrained_model_name_or_path, cls.legacy_config_file]
cache_dir = os.path.join(cache_dir, pretrained_model_name_or_path, subfolder)
if subfolder != "":
url_list.insert(2, subfolder)
legacy_url_list.insert(2, subfolder)
community_config_path = "/".join(url_list)
legacy_community_config_path = "/".join(legacy_url_list)

if not url_file_exists(community_config_path):
if not url_file_exists(legacy_community_config_path):
raise RuntimeError(
f"Can't load Config for '{pretrained_model_name_or_path}'.\n"
f"Please make sure that '{pretrained_model_name_or_path}' is:\n"
"- a correct model-identifier of built-in pretrained models,\n"
"- or a correct model-identifier of community-contributed pretrained models,\n"
"- or the correct path to a directory containing relevant config files.\n"
)
logger.warning(f"loading legacy config file<{cls.legacy_config_file}> ...")
community_config_path = legacy_community_config_path

resolved_config_file = get_path_from_url_with_filelock(community_config_path, cache_dir)
config_class = cls._get_config_class_from_config(pretrained_model_name_or_path, resolved_config_file)
logger.info("We are using %s to load '%s'." % (config_class, pretrained_model_name_or_path))
if config_class is cls:
return cls.from_file(resolved_config_file, **kwargs)

return config_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
raise RuntimeError(
f"Can't load config for '{pretrained_model_name_or_path}'.\n"
f"Please make sure that '{pretrained_model_name_or_path}' is:\n"
"- a correct model-identifier of built-in pretrained models,\n"
"- or a correct model-identifier of community-contributed pretrained models,\n"
"- or the correct path to a directory containing relevant config files.\n"
)