
New feature for socket timeout and the fix for issue #494 #504

Open
wants to merge 4 commits into base: master

Conversation

@shgchg commented Aug 13, 2012

The goals of this fix are to:

  1. Add a socket timeout tracker that counts how many timeouts the Hector client has seen within a time window. The Hector connection manager drops all of the connections to a Cassandra host only if the number of socket timeouts within the window exceeds the allowance.

     In the current implementation of Hector 1.1, all of the connections to a Cassandra host are dropped as soon as a single socket timeout occurs.

  2. Fix issue #494.

…meouts.

- The trackers are disabled by default.
@zznate (Collaborator) commented Aug 14, 2012

Have you tested this in a live-fire situation yet? If so, can you give me an example of the problem this addressed?

In general, I really like the encapsulation, configuration and test cases though. Good work.

@shgchg (Author) commented Aug 15, 2012

Hi Nate,

Thanks for the quick response. The problem we are trying to address here is the current strategy for dealing with HectorTransportException in HConnectionManager.operateWithFailover; this patch provides a less aggressive option to replace it.

== Current behavior ==
A single HectorTransportException caused by a socket timeout shuts down the entire connection pool of the timed-out Cassandra host. This behavior eventually caused all 6 of our very busy but still-alive Cassandra hosts to be marked as down, leaving the host pool empty until the Quartz-triggered CassandraHostRetryService put them back. That process disconnects and reconnects all of the connections to the Cassandra servers, which is very expensive.

== After the fix ==
The fix adds a tracker that records how many socket timeouts have occurred within a time window. The pool is shut down only if the timeout count within the window exceeds the preset threshold. This not only lets users define a less aggressive strategy for evicting a Cassandra host from the host pool, but also ensures that a host which is really down is eventually evicted.

Here are some tips for setting up the socket-timeout-related properties.

Socket properties of cassandraHostConfigurator: if the calls are sequential for any reason, socketTimeoutWindow needs to be bigger than socketTimeoutCounter * cassandraThriftSocketTimeout, or it will be impossible to mark the host as down. Please keep this in mind when setting the socket-related parameters (see the configuration sketch after the property list below).

Property: useSocketTimeoutTracker
    Description: Track HectorTransportExceptions received from specific hosts over a period of time. Default is false.
Property: socketTimeoutWindow
    Description: Time window of the socket timeout counter. Default is 100ms.
Property: socketTimeoutCounter
    Description: Allowed timeout count within the socket timeout window. Default is 10.
Property: cassandraThriftSocketTimeout
    Description: The socket timeout to set on the underlying Thrift client transport layer. Default is 0ms.
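For illustration, a minimal configuration sketch: setCassandraThriftSocketTimeout already exists on CassandraHostConfigurator, while the other three setters are assumed to be the ones this patch adds for the properties above; the host name and values are made up. Note how the chosen numbers satisfy the constraint: 60000 > 10 * 5000.

```java
// Minimal sketch, assuming the patch adds setters named after the new
// properties; host and values are illustrative only.
CassandraHostConfigurator configurator = new CassandraHostConfigurator("cass-host1:9160");
configurator.setCassandraThriftSocketTimeout(5000); // 5s Thrift socket timeout per call
configurator.setUseSocketTimeoutTracker(true);      // trackers are disabled by default
configurator.setSocketTimeoutCounter(10);           // tolerate up to 10 timeouts...
configurator.setSocketTimeoutWindow(60000);         // ...per 60s window: 60000 > 10 * 5000
```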

== HostTimeoutTracker ==
While developing the SocketTimeoutTracker, I also fixed the HostTimeoutTracker, which did not work as its Javadoc describes.

== Test ==
This patch has been unit tested. I am currently testing it in my environment and will let you know the result.

Thanks,

@zznate (Collaborator) commented Aug 15, 2012

fix that 'mvn install' issue and we're good.

@patricioe (Collaborator) commented

After a few timeouts, do you reset the state in order to prevent the pool from shutting down?

For instance, if we don't reset it, the logic could decide that a node must be shut down after N timeouts that occurred over, say, 3 months, but very dispersed.

Does that make sense?

@shgchg (Author) commented Aug 20, 2012

Hi Patricio,

I understand your concern, but that should not happen.

Let's assume we set the max count to 3 and the window to 1000ms. The queue size will then be 3, and the queue will only keep the timestamps of the 3 most recent timed-out requests.

Let's assume that on 01/20/2012 three timestamps are stored in the queue, with no more requests after that. On 04/20/2012 one more request times out. That request will not shut down the node's pool, because (timestamp on 04/20/2012 - timestamp on 01/20/2012) > 1000ms.

Now the queue will contain two timestamps from 01/20/2012 and one timestamp from 04/20/2012.
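For illustration, here is a rough sketch of the bounded-queue bookkeeping described above; the class and method names are hypothetical, not necessarily what the patch uses. In the 04/20/2012 example, the check fails because the oldest retained timestamp is still from January, so the pool stays up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the bounded timestamp queue described above;
// names are hypothetical, not necessarily the patch's actual classes.
class TimeoutQueueSketch {
  private final int maxCount;       // e.g. 3
  private final long windowMillis;  // e.g. 1000
  private final Deque<Long> timestamps = new ArrayDeque<Long>();

  TimeoutQueueSketch(int maxCount, long windowMillis) {
    this.maxCount = maxCount;
    this.windowMillis = windowMillis;
  }

  // Record one timeout; returns true only when maxCount timeouts all
  // fall within windowMillis, i.e. the host's pool should be shut down.
  synchronized boolean trackTimeout(long nowMillis) {
    timestamps.addLast(nowMillis);
    if (timestamps.size() > maxCount) {
      timestamps.removeFirst(); // keep only the maxCount most recent
    }
    return timestamps.size() == maxCount
        && nowMillis - timestamps.peekFirst() <= windowMillis;
  }
}
```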

Hope this answered your question. Please let me know if you still think
it's a concern.

Thanks,


@patricioe (Collaborator) commented

It does. But I was thinking more of an expiring cache that can hold info for X seconds, for example 2 seconds.

If there are no more timeouts within 2 seconds, the cache expires the historic timeout info. Cassandra does that for all requests to decide when to fire a timeout exception; in fact, I did that work in Cassandra core for HintedHandoff.

Thoughts?


@shgchg (Author) commented Aug 21, 2012

I believe your strategy would definitely work, but it involves more management to expire the records in the cache: it would need a Quartz task that runs periodically and walks the cache, expiring the history records that fall outside the window. The advantage of that strategy is that it could save memory.

What I have implemented so far, on the other hand, wastes a bit more memory, but there is less management work involved.


@patricioe (Collaborator) commented

That is already solved in Guava; it's called Cache. If you already have it implemented your way, I'm cool with that. We can do another iteration and improve it.

My main concern was/is a memory leak.
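For reference, a minimal sketch of the expiring-cache approach using Guava's Cache, as suggested above; the per-host counter scheme and the threshold parameter are illustrative assumptions, not part of the patch. expireAfterAccess evicts a host's entry after 2 quiet seconds, so stale history cannot accumulate.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Hypothetical sketch of the expiring-cache alternative, not the patch's code.
class ExpiringTimeoutTracker {
  // Per-host timeout counters; Guava drops a host's entry once it goes
  // 2 seconds without being touched, so old history cannot leak memory.
  private final Cache<String, AtomicInteger> timeoutsPerHost = CacheBuilder.newBuilder()
      .expireAfterAccess(2, TimeUnit.SECONDS)
      .build();

  // Returns true once a host has timed out `threshold` times without a
  // 2-second quiet gap in between.
  boolean shouldShutDownPool(String host, int threshold) throws ExecutionException {
    AtomicInteger count = timeoutsPerHost.get(host, new Callable<AtomicInteger>() {
      public AtomicInteger call() {
        return new AtomicInteger();
      }
    });
    return count.incrementAndGet() >= threshold;
  }
}
```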

