-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outlier detection support in DestinationRule for grpc proxyless #3000
base: master
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Large diffs are not rendered by default.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -962,6 +962,74 @@ message OutlierDetection { | |
// disabled by setting it to 0%. The default is 0% as it's not typically | ||
// applicable in k8s environments with few pods per service. | ||
int32 min_health_percent = 5; | ||
|
||
// FailurePercent Algorithm for Grpc-xds proxyless for deciding | ||
// if the host is an outlier or not. | ||
FailurePercentageEjection failure_percentage_ejection = 10; | ||
|
||
// SuccessRateEjection Algorithm for Grpc-xds proxyless for deciding | ||
// if the host is an outlier or not. | ||
SuccessRateEjection success_rate_ejection = 11; | ||
} | ||
|
||
// Parameters for the failure percentage algorithm. | ||
// This algorithm ejects individual endpoints whose failure rate is greater than | ||
// some threshold, independently of any other endpoint | ||
// Defaults to values declared in envoy.config.cluster.v3.OutlierDetection.SuccessRateEjection | ||
message FailurePercentageEjection { | ||
// The failure percentage to use when determining failure percentage-based outlier detection. If | ||
// the failure percentage of a given address is greater than or equal to this value, it will be | ||
// ejected. | ||
google.protobuf.UInt32Value threshold = 1; | ||
|
||
// The % chance that an address will be actually ejected when an outlier status is detected through | ||
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up | ||
// slowly. | ||
google.protobuf.UInt32Value enforcement_percentage = 2; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need this? I am struggling to see the use case There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. enforcement_percentage this is like what is the probability for a host to be removed from LB pool when detected as outlier. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know what it is, but I am not clear why we need to offer it or why a user would want to set it. "It exists in the Envoy API" is not a reason why, either -- Envoy's API has different design principals than Istio There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @howardjohn I think I understand the use-case here, lets say that there are 500 servers and 50 of them started showing ejection behaviour. Then the outlier detection algorithm will eject all these 50 servers and distribute its load on rest of the 450 servers, this might be a big traffic shift for other servers to handle. On the contrary, having this setting to lets say 50 will eliminate 25 and then 12 and so on, such that the traffic is ramped up in corresponding intervals. Although none of the documentations (both enovy / grpc-xds) clearly mentions in which scenario this is applicable. But from what I see it may help prevent issues with temporary spikes and help smoothen out the curve. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system. I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ). |
||
|
||
// The minimum number of addresses in order to perform failure percentage-based ejection. | ||
// If the total number of addresses is less than this value, failure percentage-based | ||
// ejection will not be performed. | ||
google.protobuf.UInt32Value minimum_hosts = 3; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really need this in the context of Istio? How do we expect users to set it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @howardjohn for this and the remaining fields, I am a little confused when we say in the context of istio, I would like to better understand how do we decide which properties do we need to translate to data plane like envoy / grpc-xds, because istio is itself might not be doing any action here excpet for providing the configurations and if a user would like to configure these things for their setup, how would one ensure that they are made available to them via the destination rule otherwise? I would say these are these parameters are more or less tuning parameters and may not be left as well left to default values (which is 5), because they would otherwise not work for the cases where the service is having less than 5 hosts. They are better tuned as per the traffic patterns and the use cases of the user. I would have given better answer if I would have used it on staging / production setup and tried and tested it, but to test it we would as well be needing these features in istio. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal. I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model. DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute. |
||
|
||
// The minimum number of total requests that must be collected in one interval (as defined by the | ||
// interval duration above) to perform failure percentage-based ejection for this address. If the | ||
// volume is lower than this setting, failure percentage-based ejection will not be performed for | ||
// this host. | ||
google.protobuf.UInt32Value request_volume = 4; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure if it is a sliding window or not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the request for the host should be atleast request_volume count within the interval period in order to be eligible for outlierDetection There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really need this in the context of Istio? How do we expect users to set it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @howardjohn make sense will check and update regarding the fields priority and usecase! |
||
} | ||
|
||
// Parameters for the success rate ejection algorithm. | ||
// This algorithm monitors the request success rate for all endpoints and | ||
// ejects individual endpoints whose success rates are statistical outliers. | ||
// Defaults to values declared in envoy.config.cluster.v3.OutlierDetection.SuccessRateEjection | ||
message SuccessRateEjection { | ||
// This factor is used to determine the ejection threshold for success rate | ||
// outlier ejection. The ejection threshold is the difference between the | ||
// mean success rate, and the product of this factor and the standard | ||
// deviation of the mean success rate: mean - (stdev * | ||
// success_rate_stdev_factor). This factor is divided by a thousand to get a | ||
// double. That is, if the desired factor is 1.9, the runtime value should | ||
// be 1900. | ||
google.protobuf.UInt32Value stdev_factor = 1; | ||
|
||
// The % chance that an address will be actually ejected when an outlier status | ||
// is detected through success rate statistics. This setting can be used to | ||
// disable ejection or to ramp it up slowly. | ||
google.protobuf.UInt32Value enforcement_percentage = 2; | ||
|
||
// The number of addresses that must have enough request volume to | ||
// detect success rate outliers. If the number of addresses is less than this | ||
// setting, outlier detection via success rate statistics is not performed | ||
// for any addresses. | ||
google.protobuf.UInt32Value minimum_hosts = 3; | ||
|
||
// The minimum number of total requests that must be collected in one | ||
// interval (as defined by the interval duration above) to include this address | ||
// in success rate based outlier detection. If the volume is lower than this | ||
// setting, outlier detection via success rate statistics is not performed | ||
// for that address. | ||
google.protobuf.UInt32Value request_volume = 4; | ||
} | ||
|
||
// SSL/TLS related settings for upstream connections. See Envoy's [TLS | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is only for grpc not sidecar.
How does it collaborate with the existing configs like
Consecutive_5Xx
? Does it have a limitation only some of these can be set.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently istio provides outlierDetection which is supported by envoy proxy only. But these FailurePercent and SuccessRate are not supported for both grpc-xds and envoy.
Does it have a limitation only some of these can be set?
regarding this , in order to trigger outlierDetection we need to set properties if either of these algorithm
reference for previous discussion:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does envoy not support the same ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Envoy has support envoy outlier detection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok but they do apply to envoy as well, right? if so, we shouldn't say its "for Grpc-xds proxyless", as its not. Its for any proxy type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot get the answer, can you reply positively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzxuzhonghu envoy proto for outlierDetection has those parameters for FailurePercent and SuccessRate algorithm, but currently we don't have a way to configure it from istio. Only way to configure those for application using envoy-proxy is through EnvoyFilter. But our use case is for grpc proxyless application(no envoy sidecar proxy) , and istio does not support a way to configure those Algorithm parameters. Also currently there is no way to configure outlierDetection for grpc-proxyless application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn @hzxuzhonghu let me answer these questions on @aditya-32 behalf.
Q. Does envoy support these parameters?
Yes
Q. How does it collaborate with the existing configs like Consecutive_5XX?
From Success Rate and Failure Percentage
There is no mention of how they are able to co-relate with the consecutive_5xx, only mention are of the configurations like
split_external_local_origin_errors
and few of the configurations.In this scenario, I believe that it is for the envoy / grpc to handle how they would want to process the configuration and istio should help with translation and leave the client and the user to understand what combination works, because there will always be few internal implementations which are not documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my suggestion is to make them explicit, what does each option control and how will they collaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzxuzhonghu can you please re-review, made the changes