Quantization failed #1237

Open
2 of 4 tasks
endomorphosis opened this issue Aug 11, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@endomorphosis

System Info

The examples provided do not work correctly. I think there have been updates to the Intel Neural Compressor toolkit, which is now at 3.0, and to the Habana quantization toolkit, so the documentation is out of date. I will look into fixing this on my own in the meantime.

I did run the Neural Compressor toolkit 2.4.1 and got some config files from it. I have not grokked the entire Habana stack and am just trying to work my way through the different packages so I can get an idea of how it all works together as a unified whole.

https://github.com/endomorphosis/optimum-habana/tree/main/examples/text-generation

root@c6a6613a6f4c:~/optimum-habana/examples/text-generation#   USE_INC=0  QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
08/11/2024 03:47:15 - INFO - __main__ - Single-device run.
08/11/2024 03:47:32 - WARNING - accelerate.big_modeling - Some parameters are on the meta device device because they were offloaded to the cpu and disk.
QUANT PACKAGE: Loading ./quantization_config/maxabs_quant.json
HQT Git revision =  16.0.526

HQT Configuration =  Fp8cfg(cfg={'dump_stats_path': './hqt_output/measure', 'fp8_config': torch.float8_e4m3fn, 'hp_dtype': torch.bfloat16, 'blocklist': {'names': [], 'types': []}, 'allowlist': {'names': [], 'types': []}, 'mode': <QuantMode.QUANTIZE: 1>, 'scale_method': <ScaleMethod.MAXABS_HW: 4>, 'scale_params': {}, 'observer': 'maxabs', 'mod_dict': {'Matmul': 'matmul', 'Linear': 'linear', 'FalconLinear': 'linear', 'KVCache': 'kv_cache', 'Conv2d': 'linear', 'LoRACompatibleLinear': 'linear', 'LoRACompatibleConv': 'linear', 'Softmax': 'softmax', 'ModuleFusedSDPA': 'fused_sdpa', 'LinearLayer': 'linear', 'LinearAllreduce': 'linear', 'ScopedLinearAllReduce': 'linear', 'LmHeadLinearAllreduce': 'linear'}, 'local_rank': None, 'global_rank': None, 'world_size': 1, 'seperate_measure_files': True, 'verbose': False, 'device_type': 4, 'measure_exclude': <MeasureExclude.OUTPUT: 4>, 'method': 'HOOKS', 'dump_stats_base_path': './hqt_output/', 'shape_file': './hqt_output/measure_hooks_shape', 'scale_file': './hqt_output/measure_hooks_maxabs_MAXABS_HW', 'measure_file': './hqt_output/measure_hooks_maxabs'})

Total modules : 961
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 265, in setup_model
    model = setup_quantization(model, args)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 206, in setup_quantization
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 34, in prep_model
    return _prep_model_with_predefined_config(model, config=config)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in _prep_model_with_predefined_config
    prepare_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/__init__.py", line 57, in prepare_model
    return quantize(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/quantize.py", line 62, in quantize
    measurement=load_measurements(model, config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/measure.py", line 136, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/common.py", line 106, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz
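For context on the FileNotFoundError above: the FP8 quantization step apparently expects a calibration/measurement pass to have been run first, so that ./hqt_output/measure_hooks_maxabs.npz exists before quantization. A minimal sketch of the two-pass flow I believe is intended (the maxabs_measure.json file name and the reduced --max_new_tokens value are assumptions on my part, not taken from this log):

# pass 1 (assumed): collect calibration statistics into ./hqt_output/
QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --use_hpu_graphs --use_kv_cache --bf16 --batch_size 1 --max_new_tokens 128

# pass 2: quantize using the measurements produced by pass 1
QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --use_hpu_graphs --use_kv_cache --bf16 --batch_size 1 --max_new_tokens 128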

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute

Expected behavior

trying to use quantized llama 3.1 70b models

@endomorphosis endomorphosis added the bug Something isn't working label Aug 11, 2024
@endomorphosis
Author

Also, the JSON files in the example are no longer supported by the Intel Neural Compressor; it claims that this key-value pair is invalid (as of version 3.0):

"method": "HOOKS",

https://github.com/endomorphosis/optimum-habana/blob/main/examples/text-generation/quantization_config/maxabs_quant.json
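For reference, reconstructing from the HQT configuration dump in the log above, the maxabs_quant.json being loaded presumably looks roughly like this (a sketch, not copied verbatim from the repo):

{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure"
}

It is the "method": "HOOKS" entry that Neural Compressor 3.0 rejects.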

@endomorphosis
Author

endomorphosis commented Aug 11, 2024

root@8fb421541c5d:~/optimum-habana/examples/text-generation# QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graphs \
    --bucket_size 128 \
    --max_new_tokens 2048 \
    --batch_size 16 \
    --bf16
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
08/11/2024 06:00:46 - INFO - __main__ - Single-device run.
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 261, in setup_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3376, in from_pretrained
    hf_quantizer.validate_environment(
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_fbgemm_fp8.py", line 68, in validate_environment
    raise RuntimeError("Using FP8 quantized models with fbgemm kernels requires a GPU")
RuntimeError: Using FP8 quantized models with fbgemm kernels requires a GPU

@endomorphosis
Author

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graphs \
    --bucket_size 128 \
    --max_new_tokens 2048 \
    --batch_size 16 \
    --bf16

tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 891kB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 19.3MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 639kB/s]
08/11/2024 06:03:56 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='meta-llama/Meta-Llama-3.1-8B', bf16=True, max_new_tokens=2048, max_input_tokens=0, batch_size=16, warmup=3, n_iterations=5, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, do_sample=False, num_beams=1, top_k=None, penalty_alpha=None, trim_logits=False, seed=27, profiling_warmup_steps=0, profiling_steps=0, profiling_record_shapes=False, prompt=None, bad_words=None, force_words=None, assistant_model=None, peft_model=None, num_return_sequences=1, token=None, model_revision='main', attn_softmax_bf16=False, output_dir=None, bucket_size=128, bucket_internal=False, dataset_max_samples=-1, limit_hpu_graphs=True, reuse_cache=False, verbose_workers=False, simulate_dyn_prompt=None, reduce_recompile=False, use_flash_attention=False, flash_attention_recompute=False, flash_attention_causal_mask=False, flash_attention_fast_softmax=False, book_source=False, torch_compile=False, ignore_eos=True, temperature=1.0, top_p=1.0, const_serialization_path=None, disk_offload=False, trust_remote_code=False, load_quantized_model=False, parallel_strategy='none', quant_config='', world_size=0, global_rank=0)
08/11/2024 06:03:56 - INFO - __main__ - device: hpu, n_hpu: 0, bf16: True
08/11/2024 06:03:56 - INFO - __main__ - Model initialization took 23.027s
08/11/2024 06:03:56 - INFO - __main__ - Graph compilation...
Warming up iteration 1/3
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:567: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:572: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
  warnings.warn(
The attention layers in this model are transitioning from computing the RoPE embeddings internally through position_ids (2D tensor with the indexes of the tokens), to using externally computed position_embeddings (Tuple of tensors, containing cos and sin). In v4.45 position_ids will be removed and position_embeddings will be mandatory.
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 461, in main
    generate(None, args.reduce_recompile)
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 432, in generate
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1287, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 2333, in _sample
    model_kwargs = self._update_model_kwargs_for_generation(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 358, in _update_model_kwargs_for_generation
    cache_name, cache = self._extract_past_from_model_output(
TypeError: GenerationMixin._extract_past_from_model_output() got an unexpected keyword argument 'standardize_cache_format'
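This TypeError looks like a transformers/optimum-habana version mismatch: the installed transformers release no longer accepts the standardize_cache_format keyword that this optimum-habana build still passes. A quick way to check, assuming the compatible pin is listed in optimum-habana's setup.py (I have not verified the exact version number):

pip show transformers optimum-habana
# reinstall the transformers release that this optimum-habana version pins, e.g.
pip install "transformers==<version pinned by optimum-habana>"   # placeholder; take the pin from setup.py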

@endomorphosis
Author

root@8fb421541c5d:/optimum-habana/examples/text-generation# python quantization_tools/unify_measurements.py -g 01234567 -m /root/optimum-habana/examples/text-generation/quantization_config/ -o /root/optimum-habana/examples/text-generation/test_1x_measure/
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/quantization_tools/unify_measurements.py", line 198, in <module>
    main(sys.argv[1:])
  File "/root/optimum-habana/examples/text-generation/quantization_tools/unify_measurements.py", line 187, in main
    unify_measurements(
  File "/root/optimum-habana/examples/text-generation/quantization_tools/unify_measurements.py", line 38, in unify_measurements
    with open(measurement_path, "r") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
root@8fb421541c5d:/optimum-habana/examples/text-generation#
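The NoneType error presumably means unify_measurements.py found no per-rank measurement files under the directory passed with -m; quantization_config/ only contains the JSON quantization configs. Assuming the measurement pass writes its outputs under ./hqt_output/ (as the dump_stats_base_path in the earlier log suggests), the invocation would look more like:

python quantization_tools/unify_measurements.py -g 01234567 \
    -m /root/optimum-habana/examples/text-generation/hqt_output/ \
    -o /root/optimum-habana/examples/text-generation/test_1x_measure/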

@endomorphosis
Author

SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_lm_eval.py -o llama_405b_load_uint4_model.txt --model_name_or_path hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 --use_hpu_graphs --use_kv_cache --trim_logits --batch_size 1 --bf16 --attn_softmax_bf16 --bucket_size=128 --bucket_internal

Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_lm_eval.py", line 229, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_lm_eval.py", line 195, in main
    model, _, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 261, in setup_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3366, in from_pretrained
    config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py", line 161, in merge_quantization_configs
    quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py", line 91, in from_dict
    return target_cls.from_dict(quantization_config_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 97, in from_dict
    config = cls(**config_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 814, in __init__
    self.post_init()
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 821, in post_init
    raise ValueError("AWQ is only available on GPU")
ValueError: AWQ is only available on GPU

@endomorphosis
Author

endomorphosis commented Aug 11, 2024

SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_lm_eval.py -o acc_load_uint4_model.txt --model_name_or_path hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 --use_hpu_graphs --use_kv_cache --trim_logits --batch_size 1 --bf16 --attn_softmax_bf16 --bucket_size=128 --bucket_internal --load_quantized_model

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
08/11/2024 21:43:21 - INFO - __main__ - Single-device run.
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-08-11 21:43:23,310] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to hpu (auto detect)
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375276 KB
------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_lm_eval.py", line 229, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_lm_eval.py", line 195, in main
    model, _, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 250, in setup_model
    from neural_compressor.torch.quantization import load
ImportError: cannot import name 'load' from 'neural_compressor.torch.quantization' (/usr/local/lib/python3.10/dist-packages/neural_compressor/torch/quantization/__init__.py)
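This import error is consistent with the version mix mentioned at the top of the issue: the load entry point under neural_compressor.torch.quantization (used by --load_quantized_model) appears to exist only in Neural Compressor 3.x, while 2.4.1 is what is installed here. A quick sanity check (the PyPI package name neural-compressor is an assumption):

pip show neural-compressor
python -c "from neural_compressor.torch.quantization import load; print('load import ok')"
# if the import fails, upgrading to a 3.x release of neural-compressor may be required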

@regisss
Collaborator

regisss commented Dec 16, 2024

@endomorphosis Are we still having these issues?

@endomorphosis
Author

@endomorphosis Are we still having these issues?

I am now doing some work on OpenVINO for the AI PC and no longer have access to the Habana systems to test whether this works. I gave up in frustration trying to get the 405B FP8 model to fit on a single Gaudi node, which I was going to use for synthetic data generation to convert Wikipedia text into knowledge graph data.
