DataNode crash issue #38198
Replies: 5 comments · 41 replies
-
Do you have the mixcoord logs?
-
[2024/12/04 02:19:15.551 +00:00] [WARN] [grpcclient/client.go:248] ["failed to get client address"] [error="find no available querycoord, check querycoord state"]
-
[2024/12/04 02:19:15.990 +00:00] [WARN] [rootcoord/root_coord.go:1179] ["failed to describe collection"] [collectionName=] [dbName=] [id=454078395778423185] [ts=18446744073709551615] [allowUnavailable=false] [error="collection not found[collection=454078395778423185]"]
-
[2024/12/04 02:19:17.138 +00:00] [WARN] [proxyutil/proxy_client_manager.go:288] ["proxy client is empty, GetMetrics will not send to any client"]
-
[2024/12/04 02:19:21.994 +00:00] [WARN] [datacoord/handler.go:444] ["datacoord ServerHandler GetCollection finally failed"] [collectionID=454078395778423185] [error="collection not found[collection=454078395778423185]"]
-
[2024/12/04 02:19:21.995 +00:00] [WARN] [rootcoord/root_coord.go:1179] ["failed to describe collection"] [collectionName=] [dbName=] [id=454078395800104160] [ts=18446744073709551615] [allowUnavailable=false] [error="collection not found[collection=454078395800104160]"]
-
It looks like querycoord crashed during initialization. Please use this script to export all of the cluster's logs and send them to us, preferably attached as a file.
-
[2024/12/04 02:19:21.995 +00:00] [WARN] [datacoord/handler.go:439] ["failed to load collection from rootcoord"] [collectionID=454078395800104160] [error="collection not found[collection=454078395800104160]"]
There seems to be collection information missing.
-
@silear For now, a fix could be to clean all the meta on etcd for collection 454078395800104160 using birdwatcher.
-
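Under normal conditions, should the datanode's CPU utilization drop once a flush has finished? Roughly how long does a flush take, and is that proportional to the data volume? I noticed that CPU rises during insertion, and after I stop inserting it stays high for a long time. Is that because the flush has not finished, or is there some other cause?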
-
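Flush is driven by internal logic. By default, a growing segment is flushed once it reaches roughly 100 MB, and it is also flushed once it has existed for more than 10 minutes. A flush only writes to disk and involves almost no computation: the faster the disk, the faster the flush, and it takes a few seconds at most. In my tests the datanode's CPU utilization stays below 10%; the datanode really has very little to compute.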
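The two triggers described above can be sketched as a simple predicate. This is only an illustration of the rule as stated in the comment, not Milvus's actual code: the names `GrowingSegment`, `should_flush`, and both constants are hypothetical, and the thresholds are the approximate defaults quoted above.

```python
from dataclasses import dataclass

# Approximate defaults quoted in the comment above; names are hypothetical.
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # "around 100 MB"
MAX_AGE_SECONDS = 10 * 60                 # "more than 10 minutes"


@dataclass
class GrowingSegment:
    size_bytes: int    # bytes buffered in the segment so far
    created_at: float  # creation time, in seconds


def should_flush(seg: GrowingSegment, now: float) -> bool:
    """Return True if either the size trigger or the age trigger has fired."""
    too_big = seg.size_bytes >= SIZE_THRESHOLD_BYTES
    too_old = (now - seg.created_at) > MAX_AGE_SECONDS
    return too_big or too_old
```

Under this rule a segment that stops receiving data is still flushed within about 10 minutes, so a flush that is merely pending would not by itself explain CPU staying high for much longer than that.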
-
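Are you sure it is the datanode's CPU and not the indexnode's? Building an index takes the indexnode a long time, so it is normal for its CPU not to drop. If the datanode's CPU does not drop, that probably means consumption is backlogged: it cannot keep up with consuming from the stream.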
-
I am sure it is the datanode. What causes the consumption backlog? Is it related to my Kafka specs? How can I resolve it?
-
Could you connect pprof to port 9091 and analyze where the CPU overhead comes from?
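One way to act on this suggestion is to download a CPU profile over HTTP and inspect it offline. The sketch below assumes the datanode serves Go's standard `net/http/pprof` path `/debug/pprof/profile` on port 9091, which is inferred from the comment above rather than confirmed in this thread. The function names, the output file name, and the host are hypothetical.

```python
import urllib.request


def cpu_profile_url(host: str, port: int = 9091, seconds: int = 30) -> str:
    """Build the URL of a CPU profile sampled for `seconds` seconds."""
    return f"http://{host}:{port}/debug/pprof/profile?seconds={seconds}"


def fetch_cpu_profile(host: str, out_path: str = "datanode-cpu.pprof",
                      port: int = 9091, seconds: int = 30) -> str:
    """Download the profile to out_path and return that path."""
    url = cpu_profile_url(host, port, seconds)
    # The server responds only after sampling for `seconds`, so leave headroom.
    with urllib.request.urlopen(url, timeout=seconds + 30) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```

The saved file can then be opened with `go tool pprof -top datanode-cpu.pprof` to list the functions that consume the most CPU.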
-
1 CPU core with 4 GB of memory (1c4g) is usually not enough, especially for the indexnode. At least 2 cores and 8 GB (2c8g) is recommended.
-
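Hello, I deployed Milvus 2.4.14 on a cluster and it ran stably for a while. While testing the upper limit on data volume, I inserted 10 million records and the datanode crashed. After I restarted the service, the datanode kept restarting and reported OOMKilled. I increased the datanode's memory, and now mixcoord, proxy, and datanode all fail to start. The logs are as follows: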
Welcome to use Milvus!
Version: v2.4.14
Built: Tue Oct 29 09:50:17 UTC 2024
GoVersion: go version go1.21.11 linux/amd64
open pid file: /run/milvus/datanode.pid
lock pid file: /run/milvus/datanode.pid
[2024/12/04 02:19:12.317 +00:00] [INFO] [roles/roles.go:333] ["starting running Milvus components"]
[2024/12/04 02:19:12.317 +00:00] [INFO] [roles/roles.go:182] ["Enable Jemalloc"] ["Jemalloc Path"=/milvus/lib/libjemalloc.so]
[2024/12/04 02:19:12.325 +00:00] [DEBUG] [config/etcd_source.go:52] ["init etcd source"] [etcdInfo="{"UseEmbed":false,"EnableAuth":false,"UserName":"","PassWord":"","UseSSL":false,"Endpoints":["t-milvus-etcd-0.t-milvus-etcd-headless.default.svc.cluster.local:2379","t-milvus-etcd-1.t-milvus-etcd-headless.default.svc.cluster.local:2379","t-milvus-etcd-2.t-milvus-etcd-headless.default.svc.cluster.local:2379"],"KeyPrefix":"by-dev","CertFile":"/path/to/etcd-client.pem","KeyFile":"/path/to/etcd-client-key.pem","CaCertFile":"/path/to/ca.pem","MinVersion":"1.3","RefreshInterval":5000000000}"]
[2024/12/04 02:19:12.325 +00:00] [INFO] [etcd/etcd_util.go:47] ["create etcd client"] [useEmbedEtcd=false] [useSSL=false] [endpoints="[t-milvus-etcd-0.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-1.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-2.t-milvus-etcd-headless.default.svc.cluster.local:2379]"] [minVersion=1.3]
[2024/12/04 02:19:12.325 +00:00] [DEBUG] [config/refresher.go:67] ["start refreshing configurations"] [source=FileSource]
[2024/12/04 02:19:12.328 +00:00] [DEBUG] [config/etcd_source.go:92] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[t-milvus-etcd-0.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-1.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-2.t-milvus-etcd-headless.default.svc.cluster.local:2379]"]
[2024/12/04 02:19:12.329 +00:00] [DEBUG] [config/refresher.go:67] ["start refreshing configurations"] [source=EtcdSource]
[2024/12/04 02:19:12.330 +00:00] [INFO] [paramtable/component_param.go:4243] ["DeployModeEnv is not set, use default"] [default=0.5]
[2024/12/04 02:19:12.330 +00:00] [INFO] [paramtable/hook_config.go:21] ["hook config"] [hook={}]
[2024/12/04 02:19:12.331 +00:00] [INFO] [logutil/logutil.go:163] ["Log directory"] [configDir=]
[2024/12/04 02:19:12.331 +00:00] [INFO] [logutil/logutil.go:164] ["Set log file to "] [path=]
[2024/12/04 02:19:12.331 +00:00] [INFO] [roles/roles.go:282] [setupPrometheusHTTPServer]
[2024/12/04 02:19:12.331 +00:00] [INFO] [http/server.go:160] ["management listen"] [addr=:9091]
[2024/12/04 02:19:12.331 +00:00] [INFO] [gc/gc_tuner.go:137] ["GC Helper initialized."] ["Initial GoGC"=100] [minimumGOGC=30] [maximumGOGC=200] [memoryThreshold=7730941132]
[2024/12/04 02:19:12.331 +00:00] [INFO] [datanode/service.go:102] ["DataNode listen on"] [address="[::]:21124"] [port=21124]
[2024/12/04 02:19:12.332 +00:00] [INFO] [etcd/etcd_util.go:47] ["create etcd client"] [useEmbedEtcd=false] [useSSL=false] [endpoints="[t-milvus-etcd-0.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-1.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-2.t-milvus-etcd-headless.default.svc.cluster.local:2379]"] [minVersion=1.3]
[2024/12/04 02:19:12.334 +00:00] [INFO] [datanode/service.go:253] ["DataNode address"] [address=12.11.0.120:21124]
[2024/12/04 02:19:12.334 +00:00] [INFO] [datanode/service.go:254] ["DataNode serverID"] [serverID=0]
[2024/12/04 02:19:12.435 +00:00] [INFO] [datanode/service.go:263] ["initializing RootCoord client for DataNode"]
[2024/12/04 02:19:12.435 +00:00] [INFO] [etcd/etcd_util.go:47] ["create etcd client"] [useEmbedEtcd=false] [useSSL=false] [endpoints="[t-milvus-etcd-0.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-1.t-milvus-etcd-headless.default.svc.cluster.local:2379,t-milvus-etcd-2.t-milvus-etcd-headless.default.svc.cluster.local:2379]"] [minVersion=1.3]
[2024/12/04 02:19:12.442 +00:00] [WARN] [grpcclient/client.go:404] ["call received grpc error"] [clientRole=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:12.442 +00:00] [WARN] [grpcclient/client.go:488] ["start to reset connection because of specific reasons"] [client_role=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:12.443 +00:00] [INFO] [grpcclient/client.go:238] ["previous client closed"] [role=rootcoord] [addr=]
[2024/12/04 02:19:12.444 +00:00] [WARN] [retry/retry.go:106] ["retry func failed"] [retried=0] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:12.645 +00:00] [WARN] [grpcclient/client.go:404] ["call received grpc error"] [clientRole=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:12.645 +00:00] [WARN] [grpcclient/client.go:488] ["start to reset connection because of specific reasons"] [client_role=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:12.645 +00:00] [INFO] [grpcclient/client.go:238] ["previous client closed"] [role=rootcoord] [addr=]
[2024/12/04 02:19:13.047 +00:00] [WARN] [grpcclient/client.go:404] ["call received grpc error"] [clientRole=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:13.047 +00:00] [WARN] [grpcclient/client.go:488] ["start to reset connection because of specific reasons"] [client_role=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:13.047 +00:00] [INFO] [grpcclient/client.go:238] ["previous client closed"] [role=rootcoord] [addr=]
[2024/12/04 02:19:13.849 +00:00] [WARN] [grpcclient/client.go:404] ["call received grpc error"] [clientRole=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:13.849 +00:00] [WARN] [grpcclient/client.go:488] ["start to reset connection because of specific reasons"] [client_role=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:13.849 +00:00] [INFO] [grpcclient/client.go:238] ["previous client closed"] [role=rootcoord] [addr=]
[2024/12/04 02:19:15.451 +00:00] [WARN] [grpcclient/client.go:404] ["call received grpc error"] [clientRole=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:15.451 +00:00] [WARN] [grpcclient/client.go:488] ["start to reset connection because of specific reasons"] [client_role=rootcoord] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:15.451 +00:00] [INFO] [grpcclient/client.go:238] ["previous client closed"] [role=rootcoord] [addr=]
[2024/12/04 02:19:15.452 +00:00] [WARN] [retry/retry.go:106] ["retry func failed"] [retried=4] [error="rpc error: code = Unknown desc = node not match[expectedNodeID=1045][actualNodeID=1048]"]
[2024/12/04 02:19:18.653 +00:00] [INFO] [componentutil/componentutil.go:61] ["WaitForComponentStates success"] ["current state"=Healthy]
[2024/12/04 02:19:18.653 +00:00] [INFO] [datanode/service.go:274] ["RootCoord client is ready for DataNode"]
[2024/12/04 02:19:18.655 +00:00] [WARN] [grpcclient/client.go:248] ["failed to get client address"] [error="find no available datacoord, check datacoord state"]
[2024/12/04 02:19:18.655 +00:00] [WARN] [grpcclient/client.go:453] ["fail to get grpc client"] [client_role=datacoord] [error="find no available datacoord, check datacoord state"]
[2024/12/04 02:19:18.655 +00:00] [WARN] [grpcclient/client.go:474] ["grpc client is nil, maybe fail to get client in the retry state"] [client_role=datacoord] [error="empty grpc client: find no available datacoord, check datacoord state"] [errorVerbose="empty grpc client: find no available datacoord, check datacoord state\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call.func2\n | \t/workspace/source/internal/util/grpcclient/client.go:473\n | github.com/milvus-io/milvus/pkg/util/retry.Handle\n | \t/workspace/source/pkg/util/retry/retry.go:104\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\n | \t/workspace/source/internal/util/grpcclient/client.go:466\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/workspace/source/internal/util/grpcclient/client.go:553\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/workspace/source/internal/util/grpcclient/client.go:569\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/workspace/source/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetComponentStates\n | \t/workspace/source/internal/distributed/datacoord/client/client.go:121\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentStates[...].func1\n | \t/workspace/source/internal/util/componentutil/componentutil.go:39\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/workspace/source/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentStates[...]\n | \t/workspace/source/internal/util/componentutil/componentutil.go:64\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentInitOrHealthy[...]\n | \t/workspace/source/internal/util/componentutil/componentutil.go:71\n | 
github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).init\n | \t/workspace/source/internal/distributed/datanode/service.go:289\n | github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).Run\n | \t/workspace/source/internal/distributed/datanode/service.go:184\n | github.com/milvus-io/milvus/cmd/components.(*DataNode).Run\n | \t/workspace/source/cmd/components/data_node.go:59\n | github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n | \t/workspace/source/cmd/roles/roles.go:126\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) empty grpc client\nWraps: (3) find no available datacoord, check datacoord state\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]
[2024/12/04 02:19:18.655 +00:00] [WARN] [grpcclient/client.go:248] ["failed to get client address"] [error="find no available datacoord, check datacoord state"]
[2024/12/04 02:19:18.656 +00:00] [WARN] [grpcclient/client.go:460] ["fail to get grpc client in the retry state"] [client_role=datacoord] [error="find no available datacoord, check datacoord state"]
[2024/12/04 02:19:18.656 +00:00] [WARN] [retry/retry.go:106] ["retry func failed"] [retried=0] [error="empty grpc client: find no available datacoord, check datacoord state"] [errorVerbose="empty grpc client: find no available datacoord, check datacoord state\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call.func2\n | \t/workspace/source/internal/util/grpcclient/client.go:473\n | github.com/milvus-io/milvus/pkg/util/retry.Handle\n | \t/workspace/source/pkg/util/retry/retry.go:104\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\n | \t/workspace/source/internal/util/grpcclient/client.go:466\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/workspace/source/internal/util/grpcclient/client.go:553\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/workspace/source/internal/util/grpcclient/client.go:569\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/workspace/source/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetComponentStates\n | \t/workspace/source/internal/distributed/datacoord/client/client.go:121\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentStates[...].func1\n | \t/workspace/source/internal/util/componentutil/componentutil.go:39\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/workspace/source/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentStates[...]\n | \t/workspace/source/internal/util/componentutil/componentutil.go:64\n | github.com/milvus-io/milvus/internal/util/componentutil.WaitForComponentInitOrHealthy[...]\n | \t/workspace/source/internal/util/componentutil/componentutil.go:71\n | 
github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).init\n | \t/workspace/source/internal/distributed/datanode/service.go:289\n | github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).Run\n | \t/workspace/source/internal/distributed/datanode/service.go:184\n | github.com/milvus-io/milvus/cmd/components.(*DataNode).Run\n | \t/workspace/source/cmd/components/data_node.go:59\n | github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n | \t/workspace/source/cmd/roles/roles.go:126\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) empty grpc client\nWraps: (3) find no available datacoord, check datacoord state\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]
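A log like the one above repeats a handful of warnings many times, so grouping them by source location and message makes the distinct failures easier to see. The following is a minimal sketch that assumes every entry starts with `[timestamp] [LEVEL] [file:line] [message]`, as the lines above do. The names `LINE_RE` and `summarize` are hypothetical, and continuation lines that do not start with a timestamp are simply skipped.

```python
import re
from collections import Counter

# Matches: [timestamp] [LEVEL] [file:line] ["message"]  (message may be unquoted)
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[(?P<msg>"[^"]*"|[^\]]*)\]'
)


def summarize(lines, level="WARN"):
    """Count log entries of the given level, keyed by (source, message)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == level:
            counts[(m.group("src"), m.group("msg").strip('"'))] += 1
    return counts
```

Calling `summarize(open("datanode.log"))` on a saved copy of the log (the file name is only an example) returns a `Counter`, and its `most_common()` method lists the most frequent warnings first.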