
added conversion script and example #1

Open · wants to merge 2 commits into master
Conversation

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic commented Jan 17, 2024

Added a simple example that loads a GPTQ model from the HF hub and converts it into Marlin format.
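For reference, the intended invocation (this mirrors the command run later in this thread; the model ID and save path are examples, and the behavior of `--do-generation` is assumed from its name):

```shell
# Convert a GPTQ checkpoint from the HF hub into Marlin format;
# --do-generation presumably runs a quick generation check afterwards.
python3 marlin/conversion/convert.py \
    --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" \
    --save-path "./marlin-chat" \
    --do-generation
```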

@efrantar

@rosario-purple

@rib-2 Thanks for this! Unfortunately it doesn't work on my machine (8xA100), presumably because it's designed for only one GPU?

alyssavance@7e72bd4e-02:/scratch/brr$ python3 marlin/conversion/convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-chat" --do-generation
Loading gptq model...
generation_config.json: 100%|█████████████████████████████████████████████████████| 137/137 [00:00<00:00, 987kB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████| 727/727 [00:00<00:00, 7.70MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 41.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 64.4MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████| 411/411 [00:00<00:00, 4.56MB/s]
Validating compatibility...
Converting model...
--- Converting Module: model.layers.0.self_attn.k_proj
Traceback (most recent call last):
  File "/scratch/brr/marlin/conversion/convert.py", line 143, in <module>
    model = convert_model(model).to("cpu")
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/brr/marlin/conversion/convert.py", line 80, in convert_model
    new_module.pack(linear_module, scales=copy.deepcopy(module.scales.data.t()))
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/marlin/__init__.py", line 117, in pack
    w = torch.round(w / s).int()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:1!
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxyeacbfe'>
  _warnings.warn(warn_message, ResourceWarning)

@robertgshaw2-neuralmagic (Author)

@rosario-purple Just set CUDA_VISIBLE_DEVICES=0; you don't need multiple GPUs for this.
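The same restriction can be applied from inside Python, as long as the variable is set before torch (or any CUDA library) is imported, so that only one device is visible to the process — a minimal sketch; setting it in-process rather than on the command line is an assumption about the script's import order:

```python
import os

# Make only physical GPU 0 visible to this process. This must happen
# before the first CUDA context is created (i.e. before `import torch`),
# so every tensor the conversion script allocates lands on cuda:0 and
# no cross-device mismatch can occur.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0
```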
