llama 3.1 8b not working on SRP #2887

Open
dhandhalyabhavik opened this issue Nov 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@dhandhalyabhavik

Describe the bug
While running the model from the OVMS server Docker image, it does not run properly. The logs suggest there was a problem with the model conversion, so I am providing all the logs for debugging.

To Reproduce
Steps to reproduce the behavior:

Model Conversion

model_server/demos/continuous_batching$ python3 ../../demos/common/export_models/export_model.py text_generation --source_model meta-llama/Llama-3.1-8B-Instruct --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
template params: {'task': 'text_generation', 'target_device': 'CPU', 'kv_cache_precision': 'u8', 'enable_prefix_caching': False, 'dynamic_split_fuse': True, 'max_num_batched_tokens': None, 'max_num_seqs': None, 'cache_size': 10}
Exporting LLM model to  models/meta-llama/Llama-3.1-8B-Instruct
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.25it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
/home/vpp/miniforge3/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:506: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 100% (226 / 226)            │ 100% (226 / 226)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 0:01:04 • 0:00:00
Exporting tokenizer to  models/meta-llama/Llama-3.1-8B-Instruct
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Saved OpenVINO Tokenizer: models/meta-llama/Llama-3.1-8B-Instruct/openvino_tokenizer.xml, models/meta-llama/Llama-3.1-8B-Instruct/openvino_tokenizer.bin
Saved OpenVINO Detokenizer: models/meta-llama/Llama-3.1-8B-Instruct/openvino_detokenizer.xml, models/meta-llama/Llama-3.1-8B-Instruct/openvino_detokenizer.bin
Created graph models/meta-llama/Llama-3.1-8B-Instruct/graph.pbtxt
models/config.json meta-llama/Llama-3.1-8B-Instruct meta-llama/Llama-3.1-8B-Instruct
Creating new config file
Added servable to config file models/config.json

It generated these files:

ls -lh Llama-3.1-8B-Instruct/
total 7.6G
-rw-rw-r-- 1 vpp vpp  910 Nov 29 21:52 config.json
-rw-rw-r-- 1 vpp vpp  184 Nov 29 21:52 generation_config.json
-rw-rw-r-- 1 vpp vpp  977 Nov 29 21:54 graph.pbtxt
-rw-rw-r-- 1 vpp vpp 1.3M Nov 29 21:54 openvino_detokenizer.bin
-rw-rw-r-- 1 vpp vpp  15K Nov 29 21:54 openvino_detokenizer.xml
-rw-rw-r-- 1 vpp vpp 7.5G Nov 29 21:53 openvino_model.bin
-rw-rw-r-- 1 vpp vpp 2.6M Nov 29 21:53 openvino_model.xml
-rw-rw-r-- 1 vpp vpp 5.4M Nov 29 21:54 openvino_tokenizer.bin
-rw-rw-r-- 1 vpp vpp  28K Nov 29 21:54 openvino_tokenizer.xml
-rw-rw-r-- 1 vpp vpp  296 Nov 29 21:52 special_tokens_map.json
-rw-rw-r-- 1 vpp vpp  55K Nov 29 21:52 tokenizer_config.json
-rw-rw-r-- 1 vpp vpp 8.7M Nov 29 21:52 tokenizer.json

Logs from the Docker image:

model_server/demos/continuous_batching$ docker run --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
[2024-11-29 16:27:54.840][1][serving][info][server.cpp:84] OpenVINO Model Server 2025.0.74acd710
[2024-11-29 16:27:54.840][1][serving][info][server.cpp:85] OpenVINO backend 6733cc320915
[2024-11-29 16:27:54.840][1][serving][info][pythoninterpretermodule.cpp:35] PythonInterpreterModule starting
[2024-11-29 16:27:54.901][1][serving][info][pythoninterpretermodule.cpp:46] PythonInterpreterModule started
[2024-11-29 16:27:54.902][1][serving][info][modelmanager.cpp:128] Loading tokenizer CPU extension from libopenvino_tokenizers.so
[2024-11-29 16:27:54.926][1][modelmanager][info][modelmanager.cpp:143] Available devices for Open VINO: CPU
[2024-11-29 16:27:54.926][1][serving][info][grpcservermodule.cpp:159] GRPCServerModule starting
[2024-11-29 16:27:54.926][1][serving][info][grpcservermodule.cpp:228] GRPCServerModule started
[2024-11-29 16:27:54.926][1][serving][info][grpcservermodule.cpp:229] Started gRPC server on port 9178
[2024-11-29 16:27:54.926][1][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-11-29 16:27:54.926][1][serving][info][httpservermodule.cpp:37] Will start 768 REST workers
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-11-29 16:27:54.965][1][serving][info][http_server.cpp:285] REST server listening on port 8000 with 768 threads
[2024-11-29 16:27:54.965][1][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-11-29 16:27:54.965][1][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-11-29 16:27:54.965][1][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-11-29 16:27:54.966][1][modelmanager][info][modelmanager.cpp:554] Configuration file doesn't have custom node libraries property.
[2024-11-29 16:27:54.966][1][modelmanager][info][modelmanager.cpp:597] Configuration file doesn't have pipelines property.
[2024-11-29 16:27:54.966][1][serving][info][mediapipegraphdefinition.cpp:425] MediapipeGraphDefinition initializing graph nodes
[2024-11-29 16:27:55.366][1][serving][error][llmnoderesources.cpp:168] Error during llm node initialization for models_path: /workspace/meta-llama/Llama-3.1-8B-Instruct/./ exception: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/plugins/intel_cpu/src/graph.cpp:1365:
Node VocabDecoder_122 of type Reference
Check 'inputs.size() == 4' failed at /openvino_tokenizers/src/vocab_decoder.cpp:33:
Too few inputs passed to VocabDecoder, it means it is not converted properly or it is not used in the supported pattern



[2024-11-29 16:27:55.366][1][serving][error][mediapipegraphdefinition.cpp:475] Failed to process LLM node graph meta-llama/Llama-3.1-8B-Instruct
[2024-11-29 16:27:55.366][1][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: meta-llama/Llama-3.1-8B-Instruct state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent: 
[2024-11-29 16:27:55.367][1][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-11-29 16:27:55.367][879][modelmanager][info][modelmanager.cpp:1097] Started model manager thread
[2024-11-29 16:27:55.367][880][modelmanager][info][modelmanager.cpp:1116] Started cleaner thread
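
The failing check above points to a mismatch between the tokenizer model produced at export time and the openvino_tokenizers runtime bundled in the serving image. A hedged way to compare the two sides (the library path inside the image is an assumption, not taken from the logs):

pip show openvino-tokenizers | grep -i version       # tokenizer package used by the export tool on the host
docker run --rm --entrypoint /bin/bash openvino/model_server:latest \
    -c "ls /ovms/lib | grep -i tokenizers"           # library shipped in the image; /ovms/lib is an assumed path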

Configuration
I cloned the main repo today (Nov 29) and built the Docker image (CPU) from scratch, and I hit the error above.

Please help me find the issue.

Additional Recommendation/Thoughts
I am wondering why these converted models are not pushed to Hugging Face so that users can simply pull them from there. This is not the first time I have run into issues converting a model to a quantized version.

@dhandhalyabhavik added the bug (Something isn't working) label on Nov 29, 2024
@dtrawins
Collaborator

@dhandhalyabhavik The fix has been pushed to the main branch today. There was an issue with the tokenizer, which was updated in commit d44aeb3.
The safest approach is to use the released version of the model server and to export the model with the export tool dependencies from the release branch.
Thank you for the recommendations. We'll work to avoid such compatibility issues and to make exporting or pulling models easier.
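
For reference, a minimal sketch of the release-branch export flow described above (the branch name is taken from the 2024.5 release link later in this thread; the requirements.txt location is an assumption):

git clone -b releases/2024/5 https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/continuous_batching
pip install -r ../../demos/common/export_models/requirements.txt   # assumed location of the export tool dependencies
python3 ../../demos/common/export_models/export_model.py text_generation \
    --source_model meta-llama/Llama-3.1-8B-Instruct \
    --weight-format int8 --kv_cache_precision u8 \
    --config_file_path models/config.json --model_repository_path models
docker run --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:2024.5 \
    --rest_port 8000 --config_path /workspace/config.json          # image tag assumed to match the release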

@dhandhalyabhavik
Author

It's fixed, thanks @dtrawins. Closing the issue now.

@dhandhalyabhavik
Author

Hi @dtrawins

Reopening the issue because I am not able to get it working on a NUC device.

I tried two things:

1

First, I tried the already released 2024.5 OVMS image, which failed with the same VocabDecoder error shown in the original logs above.

2

Then I tried building the image from scratch, which failed midway with this error:

[4,984 / 5,015] Compiling src/test/pythonnode_test.cpp; 43s local, remote-cache ... (24 actions running)
INFO: From Compiling src/test/pythonnode_test.cpp:
In file included from bazel-out/k8-opt/bin/external/linux_openvino/_virtual_includes/openvino_new_headers/openvino/runtime/intel_gpu/ocl/ocl_wrapper.hpp:50,
                 from bazel-out/k8-opt/bin/external/linux_openvino/_virtual_includes/openvino_new_headers/openvino/runtime/intel_gpu/ocl/ocl.hpp:17,
                 from src/test/../modelinstance.hpp:44,
                 from src/test/test_utils.hpp:52,
                 from src/test/c_api_test_utils.hpp:22,
                 from src/test/pythonnode_test.cpp:52:
/usr/include/CL/cl2.hpp:18:151: note: #pragma message: cl2.hpp has been renamed to opencl.hpp to make it clear that it supports all versions of OpenCL. Please include opencl.hpp directly.
   18 | #pragma message("cl2.hpp has been renamed to opencl.hpp to make it clear that it supports all versions of OpenCL. Please include opencl.hpp directly.")
      |                                                                                                                                                       ^
[4,985 / 5,015] Compiling src/test/http_openai_handler_test.cpp; 44s local, remote-cache ... (24 actions running)
INFO: From Compiling src/test/http_openai_handler_test.cpp:
In file included from bazel-out/k8-opt/bin/external/linux_openvino/_virtual_includes/openvino_new_headers/openvino/runtime/intel_gpu/ocl/ocl_wrapper.hpp:50,
                 from bazel-out/k8-opt/bin/external/linux_openvino/_virtual_includes/openvino_new_headers/openvino/runtime/intel_gpu/ocl/ocl.hpp:17,
                 from src/test/../modelinstance.hpp:44,
                 from src/test/test_utils.hpp:52,
                 from src/test/http_openai_handler_test.cpp:36:
/usr/include/CL/cl2.hpp:18:151: note: #pragma message: cl2.hpp has been renamed to opencl.hpp to make it clear that it supports all versions of OpenCL. Please include opencl.hpp directly.
   18 | #pragma message("cl2.hpp has been renamed to opencl.hpp to make it clear that it supports all versions of OpenCL. Please include opencl.hpp directly.")
      |                                                                                                                                                       ^
[4,986 / 5,015] Compiling src/test/prediction_service_test.cpp; 30s local, remote-cache ... (23 actions, 22 running)
[4,986 / 5,015] Compiling src/test/prediction_service_test.cpp; 31s local, remote-cache ... (24 actions running)
[4,986 / 5,015] Compiling src/test/prediction_service_test.cpp; 41s local, remote-cache ... (24 actions running)
[4,986 / 5,015] Compiling src/test/prediction_service_test.cpp; 72s local, remote-cache ... (24 actions running)
[4,986 / 5,015] Compiling src/test/prediction_service_test.cpp; 134s local, remote-cache ... (24 actions running)

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/root/.cache/bazel/_bazel_root/bc57d4817a53cab8c785464da57d1983/server/jvm.out')

The command '/bin/bash -xo pipefail -c if [ "$FUZZER_BUILD" == "0" ]; then bazel build --jobs=$JOBS ${debug_bazel_flags} ${minitrace_flags} //src:ovms //src:ovms_test; fi;' returned a non-zero code: 37
make: *** [Makefile:468: release_image] Error 37

My NUC details:

bhavik@nuc:~/model_server$ lscpu                                                                                                                  
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel 
  Model name:            12th Gen Intel(R) Core(TM) i7-1260P
    CPU family:          6      
    Model:               154                                                                                                                           
    Thread(s) per core:  2                                                                                                                             
    Core(s) per socket:  12                                                                                                                            
    Socket(s):           1                                                                                                                             
    Stepping:            3                                                                                                                             
    CPU max MHz:         4700.0000                                                                                                                     
    CPU min MHz:         400.0000                                                                                                                      
    BogoMIPS:            4992.00                                           
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                         l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_f
                         req pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_de
                         adline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow
                          vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_
                         pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_
                         epp hwp_pkg_req umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr flush_l1d
                          arch_capabilities
Virtualization features:  
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   448 KiB (12 instances)
  L1i:                   640 KiB (12 instances)
  L2:                    9 MiB (6 instances)
  L3:                    18 MiB (1 instance)
NUMA:                                                                      
  NUMA node(s):          1                                                 
  NUMA node0 CPU(s):     0-15

Please help me resolve this.

Also, if possible, please answer the earlier unanswered question:
"Why are these converted models not pushed to Hugging Face so that they can be pulled from there? This is not the first time I have faced issues converting a model to a quantized version."

@dhandhalyabhavik
Author

Hi @dtrawins, can you please help?

@dtrawins
Collaborator

dtrawins commented Dec 6, 2024

@dhandhalyabhavik If you are using the public 2024.5 image, you should also export the models using the export tool from that release branch: https://github.com/openvinotoolkit/model_server/tree/releases/2024/5/demos/common/export_models
There is no guarantee of compatibility between different versions of the execution runtime and the model export tool.
Serving models directly from the HF Hub is planned, but the schedule is not confirmed yet.
Regarding the build issue on your NUC, you probably don't have enough RAM, so the build container gets OOM-killed. You can reduce memory consumption by lowering the number of parallel build jobs: make release_image JOBS=1
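
A minimal sketch of the memory-constrained build suggested above (the JOBS variable comes from the Makefile usage quoted in this comment; the free check is just a generic sanity step):

free -h                      # check available RAM and swap before starting the build
make release_image JOBS=1    # a single parallel compile job keeps peak memory usage low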
