
New feature for socket timeout and the fix for issue #494 #504

Open
wants to merge 4 commits into base: master

Conversation

@shgchg commented Aug 13, 2012

The goals of this fix are to:

  1. Add a socket timeout tracker that counts how many timeouts the Hector client has seen within a time window. The Hector connection manager drops all of the connections to a Cassandra host only if the number of socket timeouts within the window exceeds the allowance.

     In the current implementation of Hector 1.1, all of the connections to a Cassandra host are dropped as soon as a single socket timeout occurs.

  2. Fix issue #494.

…meouts.

- The trackers are disabled by default.
@zznate (Collaborator) commented Aug 14, 2012

Have you tested this in a live-fire situation yet? If so, can you give me an example of the problem this addressed?

In general, I really like the encapsulation, configuration and test cases though. Good work.

@shgchg (Author) commented Aug 15, 2012

Hi Nate,

Thanks for the quick response. The problem we are trying to address here is the current strategy for dealing with HectorTransportException in HConnectionManager.operateWithFailover; this patch provides a less aggressive option to replace it.

== Current behavior ==
A single HectorTransportException caused by a socket timeout shuts down the entire connection pool of the timed-out Cassandra host. This behavior eventually caused all 6 of our very busy but still-alive Cassandra hosts to be marked as down, leaving the host pool empty until the Quartz-triggered CassandraHostRetryService put them back. That process disconnects and reconnects all of the connections to the Cassandra servers, which is very expensive.

== After the fix ==
The fix adds a tracker that records how many socket timeouts have occurred within a time window. The pool is shut down only if the timeout count within the window exceeds the preset threshold. This not only lets users define a less aggressive strategy for evicting a Cassandra host from the host pool, but also ensures that a host which is really down is eventually evicted.

Here are some tips for setting up the socket-timeout-related properties.

Socket properties of cassandraHostConfigurator: if the calls are sequential for any reason, socketTimeoutWindow needs to be bigger than socketTimeoutCounter * cassandraThriftSocketTimeout, or it will be impossible to mark the host as down. Please keep this in mind when setting the socket-related parameters (see the configuration sketch after the property list below).

Property: useSocketTimeoutTracker
    Description: Track HectorTransportExceptions received from specific hosts over a period of time. Default is false.
Property: socketTimeoutWindow
    Description: Time window of the socket timeout counter. Default is 100ms.
Property: socketTimeoutCounter
    Description: Allowed timeout count within the socket timeout window. Default is 10.
Property: cassandraThriftSocketTimeout
    Description: The socket timeout to set on the underlying Thrift client transport layer. Default is 0ms.
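For illustration, a minimal configuration sketch: setCassandraThriftSocketTimeout already exists on CassandraHostConfigurator, while the other three setters are assumed to be the ones this patch adds for the properties above; the host name and values are made up. Note how the chosen numbers satisfy the constraint: 60000 > 10 * 5000.

```java
// Minimal sketch, assuming the patch adds setters named after the new
// properties; host and values are illustrative only.
CassandraHostConfigurator configurator = new CassandraHostConfigurator("cass-host1:9160");
configurator.setCassandraThriftSocketTimeout(5000); // 5s Thrift socket timeout per call
configurator.setUseSocketTimeoutTracker(true);      // trackers are disabled by default
configurator.setSocketTimeoutCounter(10);           // tolerate up to 10 timeouts...
configurator.setSocketTimeoutWindow(60000);         // ...per 60s window: 60000 > 10 * 5000
```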

== HostTimeoutTracker ==
While developing the SocketTimeoutTracker, I also fixed the HostTimeoutTracker, which did not work as its Javadoc describes.

== Test ==
This patch has been unit tested. I am currently testing it in my environment and will let you know the result.

Thanks,

@zznate (Collaborator) commented Aug 15, 2012

fix that 'mvn install' issue and we're good.

@patricioe (Collaborator) commented

After a few timeouts, do you reset the state in order to prevent the pool from shutting down?

For instance, if we don't reset it, the logic could decide that a node must be shut down after N timeouts that occurred over, say, 3 months, but very dispersed.

Does that make sense?

@shgchg (Author) commented Aug 20, 2012

Hi Patricio,

I understand your concern, but that should not happen.

Let's assume we set the max count to 3 and the window to 1000ms. The queue size will then be 3, and the queue will only keep the timestamps of the 3 most recent timed-out requests.

Let's assume that on 01/20/2012 three timestamps are stored in the queue, with no more requests after that. On 04/20/2012 one more request times out. That request will not shut down the node's pool, because (timestamp on 04/20/2012 - timestamp on 01/20/2012) > 1000ms.

Now the queue will contain two timestamps from 01/20/2012 and one timestamp from 04/20/2012.
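For illustration, here is a rough sketch of the bounded-queue bookkeeping described above; the class and method names are hypothetical, not necessarily what the patch uses. In the 04/20/2012 example, the check fails because the oldest retained timestamp is still from January, so the pool stays up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the bounded timestamp queue described above;
// names are hypothetical, not necessarily the patch's actual classes.
class TimeoutQueueSketch {
  private final int maxCount;       // e.g. 3
  private final long windowMillis;  // e.g. 1000
  private final Deque<Long> timestamps = new ArrayDeque<Long>();

  TimeoutQueueSketch(int maxCount, long windowMillis) {
    this.maxCount = maxCount;
    this.windowMillis = windowMillis;
  }

  // Record one timeout; returns true only when maxCount timeouts all
  // fall within windowMillis, i.e. the host's pool should be shut down.
  synchronized boolean trackTimeout(long nowMillis) {
    timestamps.addLast(nowMillis);
    if (timestamps.size() > maxCount) {
      timestamps.removeFirst(); // keep only the maxCount most recent
    }
    return timestamps.size() == maxCount
        && nowMillis - timestamps.peekFirst() <= windowMillis;
  }
}
```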

Hope this answered your question. Please let me know if you still think
it's a concern.

Thanks,


@patricioe (Collaborator) commented

It does. But I was thinking more of an expiring cache that can hold info for X seconds, for example 2 seconds.

If there are no more timeouts within 2 seconds, the cache expires the historic timeout info. Cassandra does that for all requests to decide when to fire a timeout exception; in fact, I did that work in Cassandra core for HintedHandoff.

Thoughts?


@shgchg (Author) commented Aug 21, 2012

I believe your strategy would definitely work, but it involves more management to expire the records in the cache: it would need a Quartz task that runs periodically and walks the cache, expiring the history records that fall outside the window. The advantage of that strategy is that it could save memory.

What I have implemented so far, on the other hand, wastes a bit more memory, but there is less management work involved.


@patricioe (Collaborator) commented

That is already solved in Guava; it's called Cache. If you already have it implemented your way, I'm cool with that. We can do another iteration and improve it.

My main concern was/is a memory leak.
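For reference, a minimal sketch of the expiring-cache approach using Guava's Cache, as suggested above; the per-host counter scheme and the threshold parameter are illustrative assumptions, not part of the patch. expireAfterAccess evicts a host's entry after 2 quiet seconds, so stale history cannot accumulate.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Hypothetical sketch of the expiring-cache alternative, not the patch's code.
class ExpiringTimeoutTracker {
  // Per-host timeout counters; Guava drops a host's entry once it goes
  // 2 seconds without being touched, so old history cannot leak memory.
  private final Cache<String, AtomicInteger> timeoutsPerHost = CacheBuilder.newBuilder()
      .expireAfterAccess(2, TimeUnit.SECONDS)
      .build();

  // Returns true once a host has timed out `threshold` times without a
  // 2-second quiet gap in between.
  boolean shouldShutDownPool(String host, int threshold) throws ExecutionException {
    AtomicInteger count = timeoutsPerHost.get(host, new Callable<AtomicInteger>() {
      public AtomicInteger call() {
        return new AtomicInteger();
      }
    });
    return count.incrementAndGet() >= threshold;
  }
}
```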

