[Bug]: A large number of delete operations have accumulated in Milvus. #38708

Open
Assignees
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@become-nice
Contributor

become-nice commented Dec 24, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.10
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.12
- OS(Ubuntu or CentOS): debian10
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When I ran a large number of batch delete and write operations, compaction could not keep up and a large number of deltalogs accumulated. Checking the logs, I found that delete operations issued half an hour earlier were still being applied. While I kept up the heavy delete and insert traffic, the instance quickly stopped accepting inserts.

[dep334-milvus-cxg-400-standalone-858b88f-dcfsm] {"log":"[2024/12/24 14:51:08.792 +08:00] [INFO] [datanode/flush_task.go:134] [\"running flush insert task\"] [\"segment ID\"=454803129838442013] [flushed=false] [dropped=false] [position=\"channel_name:\\\"cxg-400-rootcoord-dml_9_454803129824392717v0\\\" msgID:\\\"1\\\\225\\\\335i\\\\t\\\\311O\\\\006\\\" msgGroup:\\\"cxg-400-dataNode-248-cbg-40088-rootcoord-dml_9_454803129824392717v0\\\" timestamp:454825398530211840 \"] [PosTime=2024/12/24 14:20:04.110 +08:00]\n","stream":"stdout","time":"2024-12-24T06:51:08.792346734Z"}

Why can't Milvus restrict the user's insert and delete operations when compaction cannot keep up? Or has this been optimized in version 2.4?

Expected Behavior

When compaction cannot keep up, restrict users from inserting and deleting data to avoid the cluster becoming unavailable.

Steps To Reproduce

run this script:

import time

import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

fmt = "\n=== {:30} ===\n"
search_latency_fmt = "search latency = {:.4f}s"
num_entities, dim = 10000, 512

print(fmt.format("start connecting to Milvus"))
connections.connect("default", host="yyyyyy", port="19530", user="root", password="xxxx")

has = utility.has_collection("hello_milvus_jetis")
print(f"Does collection hello_milvus_jetis exist in Milvus: {has}")

if has:
    print(fmt.format("Drop collection `hello_milvus_jetis`"))
    utility.drop_collection("hello_milvus_jetis")

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="name", dtype=DataType.VARCHAR, is_primary=False, auto_id=False, max_length=100, is_partition_key=True),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]

schema = CollectionSchema(fields, "hello_milvus_jetis is the simplest demo to introduce the APIs")

print(fmt.format("Create collection `hello_milvus_jetis`"))
hello_milvus = Collection("hello_milvus_jetis", schema, consistency_level="Strong")

rng = np.random.default_rng(seed=19530)

print(fmt.format("Start Creating index IVF_FLAT"))
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

hello_milvus.create_index("embeddings", index)
from datetime import datetime
print(fmt.format("Start loading"))
hello_milvus.load()

print(fmt.format("Start inserting entities"))

for j in range(0, 500):
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"Insert batch {j} at {current_time}")
    
    entities = [
        [str("aaaaaaaaaaaaabbbaaaaaaacccccddddeeeeeeeeeeeeeeffffffff" + str(i + j * num_entities + 10000000)) for i in range(num_entities)],
        [str(i) for i in range(num_entities)],
        rng.random(num_entities).tolist(),
        rng.random((num_entities, dim)).tolist(),
    ]

    hello_milvus.insert(entities)

for j in range(0, 5000):
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Print the current time and which batch is being deleted
    print(f"Delete {j} at {current_time}")

    id_list = [str("aaaaaaaaaaaaabbbaaaaaaacccccddddeeeeeeeeeeeeeeffffffff" + str(i + j * 1000 + 10000000)) for i in range(1000)]
    quoted_ids = ['"{}"'.format(id) for id in id_list]
    expr = "pk in [" + ", ".join(quoted_ids) + "]"
    hello_milvus.delete(expr)
    # hello_milvus.flush()

    entities = [
        [str("aaaaaaaaaaaaabbbaaaaaaacccccddddeeeeeeeeeeeeeeffffffff" + str(i + j * 1000 + 10000000)) for i in range(1000)],
        [str(i + 100) for i in range(1000)],
        rng.random(1000).tolist(),
        rng.random((1000, dim)).tolist(),
    ]

    hello_milvus.insert(entities)


for j in range(0, 10000):
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    entities = [
        [str("aaaaaaaaaaaaabbbaaaaaaacccccddddeeeeeeeeeeeeeeffffffff" + str(i)) for i in range(1000)],
        [str(i + 100) for i in range(1000)],
        rng.random(1000).tolist(),
        rng.random((1000, dim)).tolist(),
    ]

    hello_milvus.insert(entities)

    # Print the current time and which batch is being deleted
    print(f"Delete {j} at {current_time}")

    id_list = [str("aaaaaaaaaaaaabbbaaaaaaacccccddddeeeeeeeeeeeeeeffffffff" + str(i)) for i in range(1000)]
    quoted_ids = ['"{}"'.format(id) for id in id_list]
    expr = "pk in [" + ", ".join(quoted_ids) + "]"
    hello_milvus.delete(expr)
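As a client-side mitigation, the deletes in the script above can be paced in smaller batches with a pause between them, so deltalogs do not pile up faster than compaction can absorb them. This is only an illustrative sketch: the helper name and pause interval are my own, not a pymilvus API.

```python
import time

# Illustrative helper (not part of pymilvus): delete the given primary
# keys in fixed-size batches, sleeping between batches to give
# compaction room to catch up.
def delete_in_batches(collection, ids, batch_size=1000, pause_s=0.1):
    deleted = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # build a pk-based delete expression, same shape as in the script above
        expr = "pk in [" + ", ".join('"{}"'.format(i) for i in batch) + "]"
        collection.delete(expr)
        deleted += len(batch)
        time.sleep(pause_s)  # back-pressure: let the server drain deltalogs
    return deleted
```

This does not remove the need for server-side throttling; it only smooths the delete load that a single client generates.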

Milvus Log

No response

Anything else?

No response

@become-nice become-nice added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 24, 2024
@ThreadDao
Contributor

ThreadDao commented Dec 25, 2024

@become-nice
As you wish, support for adjusting the DML rate based on the l0SegmentsRowCount was added in 2.4.18. Please upgrade your Milvus version and try it, and let me know if you encounter any problems.

Upgrade Milvus to v2.4.18 and configure your milvus.yaml as follows; you can start from these values and tune them for your workload:

  • deleteBufferXXXCountProtection: when the queryNode cannot keep up with deletes, the delete rate of DML requests is decreased
  • l0SegmentsRowCountProtection: when the dataNode cannot keep up with compaction, the delete rate of DML requests is decreased
quotaAndLimits:
  dml:
    deleteRate:
      max: 2
    enabled: true
    insertRate:
      max: 16
  limitWriting:
    deleteBufferRowCountProtection:
      enabled: true
      highWaterLevel: 25000000
      lowWaterLevel: 15000000
    deleteBufferSizeProtection:
      enabled: true
      highWaterLevel: 600MiB
      lowWaterLevel: 400MiB
    l0SegmentsRowCountProtection:
      enabled: true
      highWaterLevel: 25000000
      lowWaterLevel: 10000000
  limits:
    complexDeleteLimitEnable: true
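For intuition, low/high watermark protection of this kind typically works by allowing the full rate below the low watermark, blocking writes above the high watermark, and scaling the rate down in between. The sketch below is my own illustration of that general scheme (a linear back-off is assumed), not Milvus source code.

```python
# Illustrative sketch (not Milvus source) of watermark-based DML
# rate throttling: full rate below lowWaterLevel, zero rate above
# highWaterLevel, linear back-off in between.
def throttled_rate(max_rate: float, metric: float,
                   low_water: float, high_water: float) -> float:
    """Allowed rate given the current metric (e.g. L0 segments row count)."""
    if metric <= low_water:
        return max_rate          # healthy: no throttling
    if metric >= high_water:
        return 0.0               # overloaded: reject DML requests
    # linear back-off between the two watermarks
    fraction = (high_water - metric) / (high_water - low_water)
    return max_rate * fraction
```

With the l0SegmentsRowCountProtection values above (low 10,000,000, high 25,000,000), a count halfway between the watermarks would cut the configured rate roughly in half.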

@become-nice
Contributor Author

I want to know what complexDeleteLimitEnable means. I see that version 2.4 supports deleting by filter expression, so a request of just a few dozen bytes may delete a lot of data. In that case, does the deleteRate setting lose its original function? In version 2.2, deletion could only be performed by primary key id, so for a given deleteRate value we knew exactly the maximum number of entities that could be deleted per second. How is this calculated in 2.4?

@yanliang567
Contributor

In Milvus 2.4, you can delete with an expression, which is not limited to the primary key; you can still delete entities with a pk-only expression, which behaves the same as in 2.2.
As you mentioned, Milvus can now delete a lot of data with a request of just a few dozen bytes, so we added a delete limitation to protect the system from becoming unavailable. Please try the recommendations above, and keep us posted if you hit any issues.

/assign @become-nice

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 26, 2024
@become-nice
Contributor Author


I want to ask what the difference is between l0 compaction and mix compaction.

@become-nice
Contributor Author


If I set complexDeleteLimitEnable to false, can users still use filters to delete data?

@yanliang567
Contributor


If I set complexDeleteLimitEnable to false, can users still use filters to delete data?

Yes


3 participants