Suggestion: Drop .pth models and move to safetensors #42

Open
alexlnkp opened this issue Jun 12, 2024 · 10 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@alexlnkp
Contributor

alexlnkp commented Jun 12, 2024

This idea has been around for quite some time. The main reasoning is that .pth files are fundamentally unsafe, allowing remote code execution if malicious code is injected. This brings us to the solution the original repo had: using SHA256 hashes for the main weights to ensure that at least the official weights haven't been tampered with.
However, this problem has a much simpler fix: using safetensors instead of raw .pth weights. That way no validity check is required at all, so startup time is sped up significantly. It also improves security, since users will not be able to inject any malicious code into the raw weights.
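
For illustration, a minimal sketch of what the switch could look like, assuming the checkpoint holds a flat tensor state dict (the file names and the model variable are placeholders, not the repo's actual paths or API):

from safetensors.torch import load_file, save_file
import torch

# One-time conversion of an existing .pth checkpoint (assuming it stores a
# plain str -> Tensor dict; nested objects would need to be flattened first).
state_dict = torch.load("G48k.pth", map_location="cpu")
save_file(state_dict, "G48k.safetensors")

# Loading later: no pickle is involved, so no arbitrary code can execute.
state_dict = load_file("G48k.safetensors", device="cpu")
# model.load_state_dict(state_dict)  # 'model' is a placeholder for the RVC net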

Sadly, I don't remember who the author of the original idea was; however, I believe it was notfelt from the AIHub Discord server.

@fumiama added the enhancement and help wanted labels on Jun 12, 2024
@fumiama
Owner

fumiama commented Jun 12, 2024

There are many external .pth models, like uvr5. We can drop support for those first, then consider the safetensors implementation.

P.S. The hash is not only for safety; the models are also large, and in case a model gets corrupted during download, a check at startup is necessary. Still, we could wrap it in a lazy check that verifies the hash only when the model is about to be loaded.
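
A rough sketch of such a lazy check, for illustration only (the function names here are made up, not the repo's actual API):

import hashlib
from pathlib import Path

import torch

def verify_sha256(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through SHA256 and compare against the expected digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected

def lazy_load(path: Path, expected_sha256: str):
    """Verify the hash only when the model is actually about to be loaded."""
    if not verify_sha256(path, expected_sha256):
        raise RuntimeError(f"{path} is corrupted or has been tampered with")
    return torch.load(path, map_location="cpu")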

@alexlnkp
Contributor Author

There are many external .pth models, like uvr5. We can drop support for those first, then consider the safetensors implementation.

That sounds like a good plan!

P.S. The hash is not only for safety; the models are also large, and in case a model gets corrupted during download, a check at startup is necessary. Still, we could wrap it in a lazy check that verifies the hash only when the model is about to be loaded.

Huh, I see. A lazy check does sound like the best solution in that case.

@fumiama
Owner

fumiama commented Jun 12, 2024

A lazy check does sound like the best solution in that case.

Checking just one .pth takes less than 1 s, which may be acceptable.

@alexlnkp
Contributor Author

alexlnkp commented Jun 12, 2024

Checking just one .pth takes less than 1 s, which may be acceptable.

I mean, the check is simply "calculate the hash of the .pth and compare it to the stored hash"; this could be done in C so the check takes <0.1 s. I'm not sure this is a good solution right now, though, since RVC is not yet structured as a module, so the hash checker would have to be built externally and used as a Python module (at least until RVC is turned into a module).

I can start working on the hash checker in C and bind it to Python functions for easy access as a module.

@fumiama
Owner

fumiama commented Jun 12, 2024

Well, I don't know whether you know this or not, but Python's standard hashlib is not pure Python; it's written in C, if my memory is correct.
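
(For reference, a plain hashlib check of one file already uses the C-backed SHA256; on Python 3.11+ hashlib.file_digest avoids a manual read loop. The file name is just an example.)

import hashlib

with open("rmvpe.pt", "rb") as f:
    digest = hashlib.file_digest(f, "sha256").hexdigest()
print(digest)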

@alexlnkp
Contributor Author

alexlnkp commented Jun 12, 2024

Well, I don't know whether you know this or not, but Python's standard hashlib is not pure Python; it's written in C, if my memory is correct.

Huh, I didn't know that.
However, I meant more of a fully RVC-specific implementation that checks all of the files against their hashes, instead of using hashlib on each file independently.

Since the number of files is known beforehand and all of their hashes are accessible, this might improve performance, because we'd make a single call into C to hash-check all of the files.

So, if we don't hardcode the values in, it would be:

import os
from pathlib import Path

import RVC_Hash  # the proposed C-backed hash checker

# logger and check_model are assumed to be defined elsewhere in this module

def check_all_assets(update=False) -> bool:
    BASE_DIR = Path(__file__).resolve().parent.parent.parent

    logger.info("checking hubret & rmvpe...")

    if not check_model(
        BASE_DIR / "assets" / "hubert",
        "hubert_base.pt",
        os.environ["sha256_hubert_base_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.pt",
        os.environ["sha256_rmvpe_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.onnx",
        os.environ["sha256_rmvpe_onnx"],
        update,
    ):
        return False

    rvc_models_dir = BASE_DIR / "assets" / "pretrained"
    logger.info("checking pretrained models...")
    model_names = [
        "D32k.pth",
        "D40k.pth",
        "D48k.pth",
        "G32k.pth",
        "G40k.pth",
        "G48k.pth",
        "f0D32k.pth",
        "f0D40k.pth",
        "f0D48k.pth",
        "f0G32k.pth",
        "f0G40k.pth",
        "f0G48k.pth",
    ]
    return RVC_Hash.check_hashes(model_names)  # single call into C for all files

Or, in case we hardcode the values in:

import os
from pathlib import Path

import RVC_Hash  # the proposed C-backed hash checker

# logger and check_model are assumed to be defined elsewhere in this module

def check_all_assets(update=False) -> bool:
    BASE_DIR = Path(__file__).resolve().parent.parent.parent

    logger.info("checking hubret & rmvpe...")

    if not check_model(
        BASE_DIR / "assets" / "hubert",
        "hubert_base.pt",
        os.environ["sha256_hubert_base_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.pt",
        os.environ["sha256_rmvpe_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.onnx",
        os.environ["sha256_rmvpe_onnx"],
        update,
    ):
        return False

    rvc_models_dir = BASE_DIR / "assets" / "pretrained"
    logger.info("checking pretrained models...")
    return RVC_Hash.check_hashes()  # file list and hashes hardcoded on the C side

Instead of:

def check_all_assets(update=False) -> bool:
    BASE_DIR = Path(__file__).resolve().parent.parent.parent

    logger.info("checking hubret & rmvpe...")

    if not check_model(
        BASE_DIR / "assets" / "hubert",
        "hubert_base.pt",
        os.environ["sha256_hubert_base_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.pt",
        os.environ["sha256_rmvpe_pt"],
        update,
    ):
        return False
    if not check_model(
        BASE_DIR / "assets" / "rmvpe",
        "rmvpe.onnx",
        os.environ["sha256_rmvpe_onnx"],
        update,
    ):
        return False

    rvc_models_dir = BASE_DIR / "assets" / "pretrained"
    logger.info("checking pretrained models...")
    model_names = [
        "D32k.pth",
        "D40k.pth",
        "D48k.pth",
        "G32k.pth",
        "G40k.pth",
        "G48k.pth",
        "f0D32k.pth",
        "f0D40k.pth",
        "f0D48k.pth",
        "f0G32k.pth",
        "f0G40k.pth",
        "f0G48k.pth",
    ]
    for model in model_names:
        menv = model.replace(".", "_")
        if not check_model(
            rvc_models_dir, model, os.environ[f"sha256_v1_{menv}"], update
        ):
            return False

This is because Python's for loops might have a small overhead compared to C loops.

@fumiama
Owner

fumiama commented Jun 12, 2024

Well, if you want to write a specialized program to do this stuff, I will not refuse it, but it should be a platform-independent program that can run under Windows, Linux, macOS, etc., and on architectures such as amd64, arm64, etc.

@alexlnkp
Contributor Author

Well, if you want to write a specialized program to do this stuff, I will not refuse it, but it should be a platform-independent program that can run under Windows, Linux, macOS, etc., and on architectures such as amd64, arm64, etc.

Noted! I will also attempt to make it work under both little-endian and big-endian!

@blaisewf
Contributor

Some interesting info here: https://huggingface.co/docs/hub/security-pickle
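
(For illustration, a minimal demo of why unpickling untrusted data is dangerous: pickle rebuilds objects by calling whatever __reduce__ tells it to, so merely loading a crafted file runs code. The payload below only echoes a string, but it could be anything.)

import os
import pickle

class Payload:
    def __reduce__(self):
        # pickle will "reconstruct" this object by calling os.system(...)
        return (os.system, ("echo arbitrary code executed",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # loading the blob prints "arbitrary code executed"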

@TheTrustedComputer

Alternatively, you can explicitly call torch.load with the argument weights_only=True. This will be the default in future PyTorch releases.
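
(For example, with a placeholder path; weights_only=True restricts the unpickler to tensors and other allow-listed types, so a malicious .pth cannot execute code.)

import torch

state_dict = torch.load("G48k.pth", map_location="cpu", weights_only=True)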
