-
Notifications
You must be signed in to change notification settings - Fork 29
Suggestion: Drop .pth
models and move to safetensors
#42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
There're many outside pths like uvr5. We can drop those support first, then considering the safetensor implementation. P.S. The hash is not only for the safety, but also for the models are large. In case of the models are broken during downloading, a check at start time is necessary. Well, we can wrap a lazy-check to check the hash when the model is to be loaded. |
That sounds like a good plan!
Huh, I see. Lazy check does sound like the best solution in that case. |
Check just one pth can only spend less than 1s, maybe acceptable. |
I mean, the check is simply just "Calculate hash of the pth and compare to the stored hash", this can be done in C for a check that takes <0.1s to do. Not sure if this is a good solution right now, since RVC is not yet structured like a module, so the hash checker has to be built externally and used as a python module (at least until the RVC is made into a module) I can start working on the hash checker in C and bind it to python functions for easy access as a module. |
Well, I don't know you know it or not, the python standard hashlib is not a pure python one, but written in c, if my memory is correct. |
Huh, I didn't know that. Since the number of files is known beforehand and all of their hashes are accessible, this might improve performance since we'll do a single call to C for hash-checking of all of the files. So, if we don't hardcode the values in, it'll be: import RVC_Hash
def check_all_assets(update=False) -> bool:
BASE_DIR = Path(__file__).resolve().parent.parent.parent
logger.info("checking hubret & rmvpe...")
if not check_model(
BASE_DIR / "assets" / "hubert",
"hubert_base.pt",
os.environ["sha256_hubert_base_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.pt",
os.environ["sha256_rmvpe_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.onnx",
os.environ["sha256_rmvpe_onnx"],
update,
):
return False
rvc_models_dir = BASE_DIR / "assets" / "pretrained"
logger.info("checking pretrained models...")
model_names = [
"D32k.pth",
"D40k.pth",
"D48k.pth",
"G32k.pth",
"G40k.pth",
"G48k.pth",
"f0D32k.pth",
"f0D40k.pth",
"f0D48k.pth",
"f0G32k.pth",
"f0G40k.pth",
"f0G48k.pth",
]
RVC_Hash.check_hashes(model_names) Or in case if we hardcode the values in: import RVC_Hash
def check_all_assets(update=False) -> bool:
BASE_DIR = Path(__file__).resolve().parent.parent.parent
logger.info("checking hubret & rmvpe...")
if not check_model(
BASE_DIR / "assets" / "hubert",
"hubert_base.pt",
os.environ["sha256_hubert_base_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.pt",
os.environ["sha256_rmvpe_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.onnx",
os.environ["sha256_rmvpe_onnx"],
update,
):
return False
rvc_models_dir = BASE_DIR / "assets" / "pretrained"
logger.info("checking pretrained models...")
RVC_Hash.check_hashes() Instead of: def check_all_assets(update=False) -> bool:
BASE_DIR = Path(__file__).resolve().parent.parent.parent
logger.info("checking hubret & rmvpe...")
if not check_model(
BASE_DIR / "assets" / "hubert",
"hubert_base.pt",
os.environ["sha256_hubert_base_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.pt",
os.environ["sha256_rmvpe_pt"],
update,
):
return False
if not check_model(
BASE_DIR / "assets" / "rmvpe",
"rmvpe.onnx",
os.environ["sha256_rmvpe_onnx"],
update,
):
return False
rvc_models_dir = BASE_DIR / "assets" / "pretrained"
logger.info("checking pretrained models...")
model_names = [
"D32k.pth",
"D40k.pth",
"D48k.pth",
"G32k.pth",
"G40k.pth",
"G48k.pth",
"f0D32k.pth",
"f0D40k.pth",
"f0D48k.pth",
"f0G32k.pth",
"f0G40k.pth",
"f0G48k.pth",
]
for model in model_names:
menv = model.replace(".", "_")
if not check_model(
rvc_models_dir, model, os.environ[f"sha256_v1_{menv}"], update
):
return False As Python's |
Well, if you want to write a specialized program to do this stuff, I will not refuse it but it should be a platform-independent program, which can be run under Windows, Linux, MacOS, etc. and with the architecture of amd64, arm64, etc. |
Noted! Will also attempt to make it work under both little endian and big endian! |
some interesting info here https://huggingface.co/docs/hub/security-pickle |
Alternatively, you can explicitly call |
This idea has been around for quite some time, the main reasoning is that
.pth
files are fundamentally unsafe, allowing for remote code execution if injected. This brings us to the solution the original repo had - usintg SHA256 for the main weights to ensure that at least the official weights haven't been tampered with.However, this problem has a much simpler fix: Using safetensors instead of raw weights,
pth
. This way no validity check is even required, therefore startup time is sped up significantly. This also improves security as users will not be able to inject any malicious code in the raw weights.I, sadly, don't remember who was the author of the original idea; however i believe it was
notfelt
from discord server AIHubThe text was updated successfully, but these errors were encountered: