DWR

DWR lags (blocks) threads while writing a message to scriptSession

Details

  • Type: Bug
  • Status: Open
  • Priority: Critical
  • Resolution: Unresolved
  • Affects Version/s: 2.0.6, 3.0.M1, 3.0.RC1
  • Fix Version/s: 3.0.RC3
  • Component/s: Reverse Ajax
  • Documentation Required:
    No
  • Description:
    I am using the latest build of DWR (Bamboo version). My server runs on Jetty 6 + the Jetty 7 continuations filter.
    While sending a message to a scriptSession via scriptSession.addScript(ScriptBuffer), the thread is sometimes blocked for a while.

    These lags range from 1 second to 60 seconds, sometimes more. I have tried running this code on Jetty 6, Jetty 7, and Tomcat, with Java 5, Java 6, and OpenJDK; it lagged everywhere.
    I am attaching a log file with the lags, so you can see how often they occur.

    Currently I am sending 0.65 messages/second and I have 250 users connected online. My logs show that long lags are rare, but every time one happens, the sending thread is unusable and cannot send messages to other script sessions.


    More information about the bug, the attempts to fix it, and the investigation into its cause can be found on the DWR forum: http://dwr.2114559.n2.nabble.com/Deadlock-on-reverse-ajax-running-on-Jetty6-1-26-td5850763.html


    The stack traces of the blocked threads are posted there as well.


  1. DefaultScriptSession.java
    (18 kB)
    libor havlicek
    09/Nov/12 8:03 AM
  2. jstack_16_11_12_2.txt
    (918 kB)
    libor havlicek
    16/Nov/12 8:58 AM
  3. jstack_7_11_12.txt
    (2.07 MB)
    libor havlicek
    09/Nov/12 8:03 AM
  4. pingLags.log
    (42 kB)
    libor havlicek
    15/Feb/11 5:41 AM

Activity

David Marginian added a comment - 15/Feb/11 6:04 AM

We are still researching this issue. It has not been determined that this is a DWR issue. The user has a complex environment (Amazon cloud, Terracotta clustering, etc.).

David Marginian added a comment - 15/Feb/11 6:06 AM

Libor, since the log file you attached only shows your method lagging, it may be helpful to attach your code, etc. to this JIRA issue.

libor havlicek added a comment - 15/Feb/11 6:21 AM

I can attach the stack trace where it happens. My code is not important, since I am just calling scriptSession.addScript(scriptBuffer), where scriptBuffer contains only Strings, so there are no locks or other objects that could block the thread while converting a POJO to JSON.

I have 2 kinds of lags detected:

Lag canceled at this stacktrace:
at java.lang.Object.wait(Native Method)
at org.mortbay.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:279)
at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:544)
at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:571)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:997)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:648)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:579)
at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:109)
at org.mortbay.jetty.AbstractGenerator$OutputWriter.write(AbstractGenerator.java:903)
at org.mortbay.jetty.AbstractGenerator$OutputWriter.write(AbstractGenerator.java:752)
at org.mortbay.jetty.AbstractGenerator$OutputWriter.write(AbstractGenerator.java:741)
at java.io.PrintWriter.write(PrintWriter.java:412)
at java.io.PrintWriter.write(PrintWriter.java:429)
at java.io.PrintWriter.print(PrintWriter.java:559)
at java.io.PrintWriter.println(PrintWriter.java:695)
at org.directwebremoting.dwrp.PlainScriptConduit.addScript(PlainScriptConduit.java:93)
at org.directwebremoting.impl.DefaultScriptSession.addScript(DefaultScriptSession.java:239)
at server.comunication.dwr.OneReverseDWRServer.sendLocalBuffer(OneReverseDWRServer.java:385)

and

---------------------------------------

Lag canceled at this stacktrace:
at org.directwebremoting.impl.DefaultScriptSession.addScript(DefaultScriptSession.java:190)
at server.comunication.dwr.OneReverseDWRServer.sendLocalBuffer(OneReverseDWRServer.java:385)
at server.comunication.dwr.OneReverseDWRServer.sendMessageLocal(OneReverseDWRServer.java:363)
at server.comunication.dwr.OneReverseDWRServer.sendMessage(OneReverseDWRServer.java:412)
at server.comunication.messaging.SendTask.call(SendTask.java:53)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

---------------------------------------

I also found that in the second stack trace, where the thread is waiting at org.directwebremoting.impl.DefaultScriptSession.addScript(DefaultScriptSession.java:190), the thread blocking it is itself waiting at the same place as my thread in the first stack trace. So in fact there is just one source of lags.

Erik Wiersma added a comment - 23/Apr/11 6:35 PM

Perhaps you can put http://code.google.com/p/javamelody/ in his application. It's a simple servlet filter that provides heaps of information (including memory usage, peaks, GC times, threads, etc.). If something is blocking, it should show up in the graphs.

libor havlicek added a comment - 09/Nov/12 8:01 AM

I waited more than 19 months for this fix. Since I had to do a lot of tricky thread killing to work around this problem outside DWR, and since those tricks do not help in all situations, I decided to fix this bug directly in DWR.

I am attaching a log file that shows what kind of disaster can happen if one thread hangs. (One hanging thread holds the write lock, blocks thousands of other threads, and slowly kills the whole Jetty server.) What I have found out is that the hanging thread is not caused by DWR. It doesn't matter why that thread hangs; the reasons can vary: bad browser behavior, bugs in NIO, bugs in Linux, etc. (You can see that in the log file. Writing to a buggy connection in one ScriptConduit makes the writing thread hang, and through the write lock it blocks all new connections that need to work with DefaultScriptSession.) As far as I understand the DWR code, the old ScriptConduit is not important once a new ScriptConduit is created. So there is no reason to wait on the write lock until the old conduit times out. All the ScriptBuffer messages can be sent to the new ScriptConduit.

I analysed DefaultScriptSession and rewrote it. The old code uses synchronized statements and locks the wrong parts of the code. The old code is fine if no threads hang, but it wastes thread time waiting on synchronized statements.

In my code, I use one read/write lock for access to the ScriptBuffer list and a separate read/write lock for the set of conduits.
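
The two-lock idea can be sketched roughly as follows. This is a minimal illustration and not the attached DefaultScriptSession: the class TwoLockSession and its methods are hypothetical, and scripts and conduits are reduced to plain strings. The point it shows is that each collection has its own lock and that callers work on copies, so slow I/O on a conduit never happens while a lock is held.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a script session; not DWR's real class. */
class TwoLockSession {
    // One lock per collection, so a writer on one never blocks users of the other.
    private final ReadWriteLock scriptsLock = new ReentrantReadWriteLock();
    private final ReadWriteLock conduitsLock = new ReentrantReadWriteLock();
    private final List<String> scripts = new ArrayList<>();
    private final Set<String> conduits = new HashSet<>();

    void addScript(String script) {
        scriptsLock.writeLock().lock();
        try {
            scripts.add(script);
        } finally {
            scriptsLock.writeLock().unlock();
        }
    }

    void addConduit(String conduitId) {
        conduitsLock.writeLock().lock();
        try {
            conduits.add(conduitId);
        } finally {
            conduitsLock.writeLock().unlock();
        }
    }

    /** Removes and returns all buffered scripts; the lock is held only for the copy. */
    List<String> drainScripts() {
        scriptsLock.writeLock().lock();
        try {
            List<String> copy = new ArrayList<>(scripts);
            scripts.clear();
            return copy;
        } finally {
            scriptsLock.writeLock().unlock();
        }
    }

    /** Returns a copy of the conduit set so callers can do I/O without holding a lock. */
    Set<String> snapshotConduits() {
        conduitsLock.readLock().lock();
        try {
            return new HashSet<>(conduits);
        } finally {
            conduitsLock.readLock().unlock();
        }
    }
}
```

A sender would call drainScripts() and snapshotConduits() first and only then write to the network, with no lock held during the write.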

My fix works well for me. I am going to test it on a real server with 1000 users online.

The only question I cannot answer yet is what happens if the same script message is written to 2 different conduits. Can the JavaScript client side detect this and use the message only once? I developed this check on the client side a long time ago, so it works for me. I am not sure it will work for other DWR users, because with my fix a situation could arise in which both the new conduit and the hanging conduit send the same message to the client. If this is not a problem, then my fix could be used in DWR. In my opinion it is better to occasionally send the same message twice, via 2 conduits, than to wait on the write lock and waste time.

If a message may be sent only once, there is an option to change DefaultScriptSession so that only 1 thread does the sending. If some thread is sending, a second thread can put its message into the buffer, or just set the new conduit, and let the first thread finish its job and then do the job of the second thread.

I am attaching: log file + new DefaultScriptSession

libor havlicek added a comment - 09/Nov/12 8:03 AM

Files related to my previous post.

David Marginian added a comment - 09/Nov/12 8:12 AM

Thanks Libor. This was a complicated issue and we never had time to fully investigate. I am curious: have you ever tried RC2? The reason I ask is that in RC2 we are using the Java concurrency APIs, which greatly reduce thread contention.

libor havlicek added a comment - 09/Nov/12 8:21 AM

Hi David,

I am using this version.

#Tue Jun 21 12:34:37 PDT 2011
bamboo.build.number=303
major=3
revision=0
minor=0
title=RC2-dev

The latest I can find is:

#Tue Jun 28 15:30:30 PDT 2011
bamboo.build.number=312
major=3
revision=0
minor=0
title=RC2-final

So I am using almost the latest version, 7 days older than the latest one. Is there a big difference?

David Marginian added a comment - 09/Nov/12 1:55 PM

Libor, there is not a big difference. I was incorrect earlier. That class could use some improvements; we will take a look at your changes.

libor havlicek added a comment - 09/Nov/12 3:19 PM

I just want to let you know that I am testing the fix on a real server with 1000 users online, and it seems to work correctly. But, as I said, I am not sure how DWR in general behaves when the same message is received multiple times on the client side. In my case, it doesn't cause problems. I am not sure whether the fix solved all the problems. I will see after a few days of real usage.

David Marginian added a comment - 10/Nov/12 5:49 PM

Thanks Libor. Let us know. I also wanted to link to a thread from the DWR mailing list that, I now realize, discovered the same issue:

http://dwr.2114559.n2.nabble.com/Problem-with-synchronization-in-DefaultScriptSession-td7579784.html

I have a few questions (1 brought up by the mailing list thread above):

1) It seems like we should be also setting a timeout on the lock. I am wondering what the best value for the timeout would be.
2) I know setting the lock up in fair mode hurts throughput but I am wondering if we should be using it or the default (non fair mode).
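
To make the two questions concrete, here is a minimal sketch of a lock acquired with a timeout, with fairness as a constructor switch. TimedLockDemo and its methods are hypothetical names for illustration and are not DWR code; ReentrantLock and its timed tryLock are the standard java.util.concurrent.locks API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper: runs work under a lock, giving up after a timeout. */
class TimedLockDemo {
    private final ReentrantLock lock;

    /** fair = true hands the lock to the longest waiter; false is the default mode. */
    TimedLockDemo(boolean fair) {
        this.lock = new ReentrantLock(fair);
    }

    /** Returns true if the work ran, false if the lock was not obtained in time. */
    boolean runWithTimeout(Runnable work, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (!acquired) {
            return false; // caller decides: buffer the script, retry, or drop it
        }
        try {
            work.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    boolean isFair() {
        return lock.isFair();
    }

    boolean isLocked() {
        return lock.isLocked();
    }
}
```

With a timeout, a caller that cannot get the lock is released after a bounded wait instead of piling up behind a hung writer. What the caller should then do with its script is a separate design decision that this sketch leaves open.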

libor havlicek added a comment - 11/Nov/12 2:58 AM

Hi David,

yes, the other reported problem is the same as this report. Completely the same issue.

Threads simply hang; the time can vary from milliseconds up to minutes or, very rarely, even hours.

In my opinion the best solution is to never let other threads hang there. DefaultScriptSession is used by every HTTP request, which leads to threads stacking up there with every new request sent. The same problem occurs with reverse ajax.

The fix that I made does not fix the problem in DWR. It works in my environment because I have created a system that simply does not allow multiple threads to call the addScript function. Once some thread is sending a message, other threads simply put the new script into a buffer and let the first thread send it. This way, my reverse ajax calls do not stack up on the write lock in DefaultScriptSession.

The easy solution for DWR could be tryLock() with a timeout in milliseconds. I am using 4 seconds, and then I kill the thread.

So my answer to your first question is that tryLock should help a lot. But I think it is much better not to allow a second thread to wait there. Simply put its message into a queue and let the first thread send it after it finishes its previous job. This way, there will be no hanging at all.
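
The hand-off described above can be sketched as follows. This is an illustration under assumptions, not my production code and not DWR code: SingleSenderBuffer is a hypothetical class, scripts are plain strings, and the transport callback stands in for the actual write to a conduit.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch: at most one thread sends; all others only enqueue. */
class SingleSenderBuffer {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean sending = new AtomicBoolean(false);
    private final Consumer<String> transport; // assumed callback that writes one script

    SingleSenderBuffer(Consumer<String> transport) {
        this.transport = transport;
    }

    /** Never blocks on another sender: either becomes the sender or just enqueues. */
    void addScript(String script) {
        pending.add(script);
        // Re-check the queue after giving up the sender role, so a script enqueued
        // just before the flag was cleared is not left behind.
        while (!pending.isEmpty() && sending.compareAndSet(false, true)) {
            try {
                String next;
                while ((next = pending.poll()) != null) {
                    transport.accept(next);
                }
            } finally {
                sending.set(false);
            }
        }
    }
}
```

A thread that loses the compareAndSet returns immediately, so a hung write stalls only the one thread that is currently sending. The hung sender itself still has to be dealt with separately.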

There are two functions where you have to make this fix. The first is addConduit, which blocks the threads used by HTTP requests; the second is addScript, where the problem affects the threads doing reverse ajax.

On your second question, I don't think the lock's fairness setting makes a big difference. Those locks are not used often: only each time a message is received from a client and each time the server pushes a message back to that client. So in my case it is at most 2 times per second.

Libor

libor havlicek added a comment - 11/Nov/12 3:02 AM

Oh, I forgot: also add killing of the hanging thread. If a thread has been hanging for more than 5 seconds, it usually hangs much longer. Kill it and let another thread finish the job.

Mike Wilson added a comment - 13/Nov/12 12:30 PM

Hey Libor, and thanks for the work you put in here. After looking at your stacktraces and the code, I agree with your analysis. Comments on your comments and questions:

Locking behaviour:
We certainly shouldn't call any blocking functions inside such wide-scoped synchronization blocks. We'll need to fix that.

Multiple conduits:
As you say, only one conduit would normally be active. Though, even if we only keep the last conduit, we still face the same issues of hanging connections and duplicate messages, so at the moment we may as well keep the handling of multiple conduits.

Duplicate messages:
We don't have any logic stopping us from transferring the same message twice in different conduits, but we probably should have.

Threading solution:
I think the correct way to solve all this is to let ScriptSession.addScript hand over the data to be written to a buffer, and then let the conduits' polling threads fetch this data and do the transfer.
I can look into this and see if it's possible to make a nice implementation. Could you help me test it out in your environment in that case?

Best regards
Mike

libor havlicek added a comment - 14/Nov/12 2:41 AM

Hi Mike,

of course, I can test your new version in my environment.

I have created a lot of hot fixes on my side to make DWR usable as a powerful reverse-ajax library. Those fixes live outside DWR, so my code will not help you. But I want to write down here what I think DWR should change to be a perfect library.

Problem 1. Threading:
You are right, there should be a separate thread pool used for sending messages into conduits: one fixed-size thread pool for all the ScriptSessions. While sending, exactly one thread from the pool should be doing the job for a given ScriptSession. Every time some other thread calls ScriptSession.addScript or ScriptSession.addConduit, it just saves the data into the ScriptSession, and the thread pool does the sending. Threads in the thread pool should have a maximum sending time. After they hang for, let's say, 5 seconds, they should be killed.
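
A rough sketch of a send with a maximum sending time, assuming a fixed-size pool. BoundedSender is a hypothetical class, not a DWR API. Note that Java cannot safely kill a thread; the closest standard tool is Future.cancel(true), which interrupts the worker, and that only helps if the blocked call reacts to interruption. For brevity the caller here waits for the result; a real design would submit the job and return immediately.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: send jobs run on a fixed pool with a time limit. */
class BoundedSender {
    private final ExecutorService pool;

    BoundedSender(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns true if the job finished in time; otherwise interrupts it and returns false. */
    boolean send(Runnable sendJob, long timeoutMillis) {
        Future<?> result = pool.submit(sendJob);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; this is not a forced kill
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the send job itself threw an exception
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A worker parked in Object.wait(), as in the first stack trace in this issue, does wake up with an InterruptedException when interrupted, but whether the surrounding server code then releases the connection cleanly is something that would need testing.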

Problem 2. Message counting
I have implemented message counting per ScriptSession. Once a ScriptSession is created, it has an internal counter, and every script buffer sent gets a unique identifier added to the message. I use it to fix the problem of lost messages and the ordering problem. Since every message has an identifier, my client knows if messages were delivered in the wrong order, or if some message is missing or was received twice. The client then pushes those messages in the correct order for execution in JavaScript. The client sends back an array of indexes of the messages, so my server knows which messages were delivered. The server then decides whether some message should be sent again, and deletes the messages confirmed by the client from its history stack.

This way, I never lose an incoming message from the server, and I always push messages into the browser in the correct order. For example, say I receive messages 1, 2, 3, 5 and the 4th message is lost. The client executes messages 1, 2, 3, then sends the array of indexes [1, 2, 3, 5] to the server; the server resends no. 4 and deletes 1, 2, 3, 5 from its history stack. The client waits for no. 4. When no. 4 is received, the client executes messages 4, 5. It works perfectly.
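
The receiving side of this scheme can be sketched as a small reorder buffer. My real client is JavaScript; this Java version only illustrates the logic, and ReorderBuffer is a hypothetical name. It assumes identifiers start at 1 and increase by 1 per message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: releases messages in id order and ignores duplicates. */
class ReorderBuffer {
    private final TreeMap<Long, String> held = new TreeMap<>();
    private long nextExpected = 1; // assumption: ids start at 1

    /** Accepts one message and returns every message that is now ready, in order. */
    List<String> receive(long id, String script) {
        List<String> ready = new ArrayList<>();
        if (id < nextExpected || held.containsKey(id)) {
            return ready; // duplicate: already executed or already waiting
        }
        held.put(id, script);
        // Release the contiguous run starting at the next expected id.
        while (held.containsKey(nextExpected)) {
            ready.add(held.remove(nextExpected));
            nextExpected++;
        }
        return ready;
    }
}
```

With the example above, messages 1, 2, 3 are released as they arrive, 5 is held back, and the arrival of 4 releases 4 and 5 together. The acknowledgement array and the server-side history stack are not shown.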

The only downside is that I have to send an extra message from the client with the array of indexes of received messages. It would be much better if DWR did this silently, by attaching this data to the regular HTTP requests.

Problem 3. Lost HttpRequests + ordering
My users sometimes complain that their client gets stuck and stops working. This is caused by requests that were not delivered correctly to the server. Of course, the client can detect that a request delivery failed. What I am missing is an option for guaranteed delivery and preserved order. I mean that if delivery fails, the request should be sent again and again. I am not sure how DWR handles this; I am partly solving it on my side. And ordering is certainly not solved in DWR.

I have not yet fixed this last problem correctly on my side, but I miss it.

I know that there are web pages that do not care about message ordering, or even about losing some messages. I think DWR should have an option to set mustDeliver = true, keepOrder = true. In fact, once you want to solve the problem of duplicate messages, you have to add an identifier to the messages. And then the ordering and guaranteed-delivery options are not a big problem to implement.

Problem 4. WebSocket

WebSocket is the future of the web. Why use tricky hacks if it can be done easily with WebSocket? DWR is awesome because of its Js2Java and Java2Js object mapping. But the future is in WebSocket. I use DWR for IE and Firefox, and in Chrome I use WebSocket. Since DWR is so cool, I use DWR for client->server communication. But where WebSocket is available, I use it for server->client communication: I simply convert my Java message into a String with the DWR library, and then I send that string via WebSocket. WebSockets are easy to implement, so why not use them when the option is there? If you make WebSocket native in DWR, you can use it for communication in both directions, client->server and server->client. This way, DWR would be very attractive and powerful.

Best regards

Libor

Show
libor havlicek added a comment - 14/Nov/12 2:41 AM

Hi Mike,
of course, I can test your new version in my environment. I have created a lot of hot fixes on my side to make DWR usable as a powerful reverse-ajax library. Those fixes are outside DWR, so my code will not help you. But I want to write here what I think DWR should change to be a perfect library.

Problem 1. Threading:
You are right, there should be a separate thread pool that is used for sending messages into conduits: one fixed-size thread pool for all the ScriptSessions. While sending, exactly one thread from the pool should be doing the job for one ScriptSession. Every time some other thread calls ScriptSession.addScript or ScriptSession.addConduit, it just saves the data into the ScriptSession, and the thread pool does the send job. Threads in the thread pool should have a maximum sending time. After they hang for, let's say, 5 seconds, they should be killed.

Problem 2. Message counting:
I have implemented counting of messages per ScriptSession. Once a ScriptSession is created, it has an internal counter, and every script buffer sent gets a unique identifier added to the message. I am using it to fix the problem with lost messages and the ordering problem. Since every message has an identifier, my client knows if messages were delivered in the wrong order, or if some message is missing or was received twice. The client then pushes the messages to JavaScript for execution in the correct order. The client sends back an array of indexes of the received messages, so my server knows which messages were delivered. The server then decides whether some message should be sent again, and deletes the messages that were confirmed by the client from its history stack. This way I never lose a message from the server, and I always push messages into the browser in the correct order.
For example, say I receive messages 1,2,3,5 and the 4th message is lost. The client executes messages 1,2,3, then sends the array of indexes [1,2,3,5] to the server. The server sends no. 4 again and deletes 1,2,3,5 from its history stack. The client waits for no. 4. When no. 4 is received, the client executes messages 4,5. It works perfectly. The only downside is that I have to send an extra message from the client with the array of indexes of received messages. It would be much better if DWR did this silently, by adding this data to commonly used HttpRequests.

Problem 3. Lost HttpRequests + ordering:
My users sometimes complain that their client is stuck and has stopped working. This is caused by requests that were not delivered correctly to the server. Of course, the client can detect that request delivery failed. What I am missing is an option for "must deliver" and "keep order". I mean that if a request fails, it should be sent again and again and again. I am not sure how this works in DWR; I am partly solving this on my side. Ordering is certainly not solved in DWR. I have not yet fixed this last problem correctly on my side, but I miss it. I know that there are web pages that do not care about the ordering of messages, or even about losing some messages. I think there should be an option in DWR to set mustDeliver=true, keepOrder=true. In fact, once you want to solve the problem with duplicate messages being received, you have to add an identifier to the messages. And then the ordering and must-deliver options are not a big problem to implement.

Problem 4. WebSocket:
WebSocket is the future of the web. Why use tricky hacks if it can easily be done with WebSocket? DWR is awesome because of its Js2Java and Java2Js object mapping. But the future is in WebSocket. I am using DWR for IE and Firefox, and in Chrome I am using WebSocket. Since DWR is so cool, I am using DWR for client->server communication. But when WebSocket is an option, I am using it for server->client communication. I simply convert my Java message into a String with the DWR library, and then I send that string via WebSocket.
WebSockets are easy to implement, so why not use them when the option is there? If you make WebSocket native in DWR, you can use it for both directions of communication, client->server and server->client. This way, DWR would be very attractive and powerful.

Best regards
Libor
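
The message-counting scheme from Problem 2 above can be sketched roughly as follows. This is only an illustration of the idea as described in the comment, not Libor's actual code and not part of DWR: the class names `SenderHistory` and `OrderedReceiver` and all their methods are hypothetical, and the receiving side (which would be JavaScript in the browser) is written in Java here only to keep the example in one language.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

/** Server side (hypothetical): numbers outgoing scripts and keeps them until confirmed. */
class SenderHistory {
    private final TreeMap<Long, String> history = new TreeMap<>();
    private long lastId = 0;

    /** Stores the script under the next identifier and returns that identifier. */
    long send(String script) {
        history.put(++lastId, script);
        return lastId;
    }

    /**
     * Drops every confirmed message from the history and returns the ids that
     * look lost: still unconfirmed, but older than the newest confirmed id.
     */
    List<Long> confirm(Collection<Long> receivedIds) {
        long highest = 0;
        for (long id : receivedIds) {
            history.remove(id);
            highest = Math.max(highest, id);
        }
        return new ArrayList<>(history.headMap(highest).keySet());
    }

    String get(long id) {
        return history.get(id);
    }
}

/** Receiving side (hypothetical): executes strictly in id order, buffering early arrivals. */
class OrderedReceiver {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final List<String> executed = new ArrayList<>();
    private final List<Long> unreported = new ArrayList<>();
    private long nextToExecute = 1;

    void receive(long id, String script) {
        unreported.add(id); // reported even if duplicate, so the server stops resending
        if (id < nextToExecute || pending.containsKey(id)) {
            return; // duplicate: already executed or already buffered
        }
        pending.put(id, script);
        // Execute every message that is now contiguous with what was executed before.
        while (pending.containsKey(nextToExecute)) {
            executed.add(pending.remove(nextToExecute));
            nextToExecute++;
        }
    }

    /** Returns the ids received since the last report and clears the list. */
    List<Long> drainReceivedIds() {
        List<Long> ids = new ArrayList<>(unreported);
        unreported.clear();
        return ids;
    }

    List<String> executed() {
        return executed;
    }
}
```

With the 1,2,3,5 example from the comment, the receiver executes 1,2,3 and buffers 5; `confirm` on the server then returns id 4 as the one to resend, and once 4 arrives the receiver executes 4 and 5.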
Mike Wilson added a comment - 14/Nov/12 3:08 AM

Thanks for taking the time to write down your thoughts. Replying in order:

Problem 1. Threading:
I'm planning to try to reuse the request threads we already have, but are not currently using, so no need for thread pools or timeouts, and no waiting.

Problem 2. Message counting and Problem 3. Lost HttpRequests + ordering:
Duplicate messages: I can verify that this is a weakness in the current code base.
Lost messages: we fixed a bug causing lost messages in DWR-584, are you sure this is still a problem?
Messages in wrong order: hm, I don't see how this would occur, did you find out how this happened?

Problem 4. WebSocket:
Yes, this is a common request, but with our limited resources we haven't yet been able to attack this functionality.

libor havlicek added a comment - 14/Nov/12 4:13 AM

I am not sure about current DWR bugs. I made my hacks over the last 3 years.

Threading: Ok, I don't know which free threads DWR has available. You know that best.

Lost messages: I am not sure whether messages are still being lost. I made my own fix to solve it. Together with that fix I also fixed the ordering, because lost messages were sent again and then received in the wrong order. But I am really curious how you want to achieve that you receive messages in good order, if you plan to send them to multiple conduits. In my opinion you can't be sure what the client received and when. As I said, I am killing hanging threads, so I really don't know whether a message was sent correctly or not. Only the client knows. So maybe I made my problem with losing messages and wrong order. I had to solve the problem with hanging threads, and killing the threads could have caused the lost-message problem.

Mike Wilson added a comment - 14/Nov/12 4:25 AM

But I am really curious how you want to achieve that you receive messages in good order, if you plan to send them to multiple conduits.

The way I'm thinking about it currently is to push all messages into all conduits in order. As long as a conduit is alive TCP will guarantee that messages are delivered in order. A new conduit will start transferring from the first unacknowledged message, so we will risk transferring duplicate messages but not unordered ones.
To avoid extra ack requests I'm thinking about sending the ack count each time we renew a poll request, normally every 60 secs. A ScriptSession will have to hold on to transferred messages until that time, but I think that is ok.
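
One way to picture the plan above (replay from the first unacknowledged message, with a cumulative ack count piggybacked on the poll renewal) is the sketch below. It is an assumption-laden illustration, not DWR code: `AckedOutbox`, `DedupingClient` and their methods are hypothetical names, and the client is modelled in Java only for the sake of a single-language example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Server side (hypothetical): holds scripts until the client's ack count covers them. */
class AckedOutbox {
    private final Deque<String> unacked = new ArrayDeque<>();
    private long ackedCount = 0; // how many messages the client has confirmed so far

    synchronized void add(String script) {
        unacked.addLast(script);
    }

    /** Called when a renewed poll carries the client's cumulative received count. */
    synchronized void acknowledge(long clientReceivedCount) {
        while (ackedCount < clientReceivedCount && !unacked.isEmpty()) {
            unacked.removeFirst();
            ackedCount++;
        }
    }

    /** Sequence number of the first message a new conduit would transfer. */
    synchronized long firstUnackedSequence() {
        return ackedCount + 1;
    }

    /** Everything a newly opened conduit would transfer, oldest first. */
    synchronized List<String> replayForNewConduit() {
        return new ArrayList<>(unacked);
    }
}

/** Client side (hypothetical): drops duplicates by comparing sequence numbers. */
class DedupingClient {
    private final List<String> executed = new ArrayList<>();
    private long receivedCount = 0;

    void onBatch(long firstSequence, List<String> scripts) {
        long sequence = firstSequence;
        for (String script : scripts) {
            if (sequence == receivedCount + 1) { // exactly the next expected message
                executed.add(script);
                receivedCount++;
            }
            sequence++; // older sequences are duplicates and are skipped
        }
    }

    long receivedCount() {
        return receivedCount;
    }

    List<String> executed() {
        return executed;
    }
}
```

If a conduit dies before the ack arrives, the next conduit replays from `firstUnackedSequence()`, so the client may see duplicates but, as the comment argues, never messages out of order.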

So maybe I made my problem with losing messages and wrong order.

Yes, that was what came to mind for me too, so I wanted to check if there was something else that I was maybe missing.

libor havlicek added a comment - 14/Nov/12 4:54 AM

Ok, if you send all the scripts to all the conduits, and if you add an identifier to each script so that copies of the same message received later can be dropped, then the order should be fine. Thank you for the explanation.

I just don't know how you want to solve the problem with the hanging thread. It hangs in NIO, so if you don't kill it, it can hang for minutes or even hours. When it hangs, it always hangs while writing to a dead ScriptConduit. That's why I am killing it.
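
A bounded-time send along the lines discussed here (and in "Problem 1. Threading" earlier in the thread) could look roughly like the following sketch. It is not how DWR or Libor's hot fix works; `TimedSender` is a hypothetical class built only on `java.util.concurrent`. Note the important caveat that `Future.cancel(true)` merely interrupts the worker: a write that is blocked in a way that ignores interruption will keep occupying its pool thread.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs each write on a fixed pool and gives up after a timeout. */
class TimedSender {
    private final ExecutorService pool;

    TimedSender(int threads) {
        // Daemon threads so that a stuck write cannot keep the JVM alive on shutdown.
        pool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread thread = new Thread(runnable, "timed-sender");
            thread.setDaemon(true);
            return thread;
        });
    }

    /** Returns true if the write finished in time, false if it failed or timed out. */
    boolean send(Runnable write, long timeoutMillis) {
        Future<?> future = pool.submit(write);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; only helps if the write is interruptible
            return false;
        } catch (ExecutionException e) {
            return false; // the write itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return false;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The calling thread is released after the timeout either way, so it can carry on sending to other script sessions even if one write never returns.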

I hope your plans for changing it will work well then. Let me know when you change it, so I can test it.

Best Regards

Libor

David Marginian added a comment - 15/Nov/12 7:11 PM

Good discussion here, guys. Libor, with respect to WebSockets, I have looked into it. The resource limitation is not the only reason we have not tackled it. I find it interesting that CometD disables it by default, as they say browser support is still too buggy:

http://cometd.org/documentation/2.x/howtos/websocket

libor havlicek added a comment - 16/Nov/12 8:58 AM

Hi David,
you are right about WebSocket bugs. I had multiple issues with that. Most problems occurred when Chrome updated to a more secure standard and there were no libraries for it yet. But sooner or later, WebSocket will be fine.

But I have another problem with DWR. I am attaching a log file. It is connected with threads hanging on HashMap operations in DefaultScriptSessionManager. Once this problem starts, Jetty slowly keeps creating new threads, and in the end it crashes because of too many threads. I don't know why this happens; the only explanation I can think of is that 2 threads were working with 1 HashMap at the same time. Since that operation is not thread safe, it broke the HashMap and threads are hanging there. Can you check that problem? It happens to me every few days, and it results in a server restart.

Thank you Libor

David Marginian added a comment - 16/Nov/12 9:15 AM

Yes, sooner or later the WebSocket support will be good. So we will continue to look into it.

Regarding the new issue, we will look into it. Newer versions of DefaultScriptSessionManager use only ConcurrentMaps. Why do you believe it is an issue in DefaultScriptSessionManager?

libor havlicek added a comment - 16/Nov/12 9:30 AM

I searched via Google to find out the possible causes. The only explanation I found was that there is absolutely no reason for any threads to hang in HashMap. If it happens, then the HashMap has to be corrupted. And a lot of people wrote that this happens when 2 threads work with one HashMap. If Map is concurrent, like ConcurrentMap, it doesn't mean that it solves all the concurrency problems. Simple iteration by one thread and writing by another thread can damage ConcurrentMap. As far as I know, iteration is not thread safe on any concurrent collection. I always use locks for iterations (or, in special cases, make copies for iterating). Just check the log file and give me your opinion.

Thank you

Libor

David Marginian added a comment - 16/Nov/12 9:41 AM

"If Map is concurrent, like ConcurrentMap, it doesn't mean that it solves all the concurrency problems. "

Correct.

"Simple iteration by one thread and writing by another thread can damage ConcurrentMap."

Iterators returned by ConcurrentHashMap are weakly consistent, not fail-fast. Weakly consistent iterators can tolerate concurrent modifications. Per Java Concurrency in Practice:

"ConcurrentHashMap, along with the other concurrent collections, further improve on the synchronized collection classes by providing iterators that do not throw ConcurrentModificationException, thus eliminating the need to lock the collection during iteration."

Are you sure you attached the correct thread-dump? I scanned it quickly but may be missing the exact section you are talking about.
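
The difference between fail-fast and weakly consistent iterators described above can be seen in a small experiment. This is a single-threaded sketch using only standard JDK classes (the class name `IteratorBehaviour` is made up for the example); it shows the iterator contract only and does not reproduce the multi-threaded hang from the thread dump.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IteratorBehaviour {
    /**
     * Fills the map, then adds a new key while iterating over it.
     * Returns true if the iterator reacted with ConcurrentModificationException.
     */
    static boolean failsFast(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        try {
            for (String key : map.keySet()) {
                // The first pass adds a new entry (a structural modification);
                // later passes would only overwrite its value.
                map.put("extra", 0);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Convenience wrappers for the two map types discussed in the thread. */
    static boolean hashMapFailsFast() {
        return failsFast(new HashMap<String, Integer>());
    }

    static boolean concurrentHashMapFailsFast() {
        return failsFast(new ConcurrentHashMap<String, Integer>());
    }
}
```

The `HashMap` iterator detects the modification and throws, while the `ConcurrentHashMap` iterator simply carries on. Neither outcome says anything about a plain `HashMap` shared between threads without synchronization, which is the case the stack traces later in this thread point to.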

libor havlicek added a comment - 16/Nov/12 10:02 AM

I am not saying that it is throwing any exception. I just don't know why it is hanging at these stack traces:

In the file that I attached, you can find about 50 such stack traces. That means 50 threads are not doing any work. If it were 2-3 threads, ok, it could be just bad luck, but when I take a thread dump of Jetty while everything is fine, there are never such stack traces. But when my users get big lags, it is always this stack trace. It ends up with hundreds of such threads, and then I have to restart the server.

"qtp1915810907-14705" prio=10 tid=0x00007fa9fc7a1000 nid=0x4879 runnable [0x00007fa9bcbc9000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.put(HashMap.java:391)
at java.util.HashSet.add(HashSet.java:217)
at org.directwebremoting.impl.DefaultScriptSessionManager.associateScriptSessionAndPage(DefaultScriptSessionManager.java:242)
at org.directwebremoting.impl.DefaultScriptSessionManager.getScriptSession(DefaultScriptSessionManager.java:125)
at org.directwebremoting.impl.DefaultWebContext.checkPageInformation(DefaultWebContext.java:87)
at org.directwebremoting.dwrp.BasePollHandler.handle(BasePollHandler.java:116)

and

"qtp1915810907-9678" prio=10 tid=0x00007fa9fc517800 nid=0x3167 runnable [0x00007fa9bf9f7000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.removeEntryForKey(HashMap.java:582)
at java.util.HashMap.remove(HashMap.java:551)
at java.util.HashSet.remove(HashSet.java:233)
at org.directwebremoting.impl.DefaultScriptSessionManager.disassociateScriptSessionAndPage(DefaultScriptSessionManager.java:255)
at org.directwebremoting.impl.DefaultScriptSessionManager.invalidate(DefaultScriptSessionManager.java:309)
at org.directwebremoting.impl.DefaultScriptSession.invalidate(DefaultScriptSession.java:140)
at org.directwebremoting.impl.DefaultScriptSessionManager.checkTimeouts(DefaultScriptSessionManager.java:359)
at org.directwebremoting.impl.DefaultScriptSessionManager.maybeCheckTimeouts(DefaultScriptSessionManager.java:328)
at org.directwebremoting.impl.DefaultScriptSessionManager.getScriptSession(DefaultScriptSessionManager.java:90)
at org.directwebremoting.impl.DefaultWebContext.checkPageInformation(DefaultWebContext.java:87)
at org.directwebremoting.dwrp.BasePollHandler.handle(BasePollHandler.java:116)
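
Both traces end in `HashSet.add` / `HashSet.remove` reached from request threads. Threads that stay RUNNABLE inside `java.util.HashMap` are a commonly reported symptom of a plain `HashMap` (or `HashSet`) being modified by several threads without synchronization. A thread-safe way to keep such a page-to-sessions association is sketched below, under the assumption that the sets are shared between request threads. `PageSessionIndex` is a hypothetical class for illustration; it is not DWR's `DefaultScriptSessionManager` and not necessarily the fix DWR adopted.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical index from page URL to the ids of the script sessions on that page. */
class PageSessionIndex {
    private final ConcurrentMap<String, Set<String>> sessionsByPage =
            new ConcurrentHashMap<>();

    void associate(String page, String sessionId) {
        Set<String> sessions = sessionsByPage.get(page);
        if (sessions == null) {
            // A concurrent set backed by ConcurrentHashMap instead of a plain HashSet.
            Set<String> created =
                    Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
            // putIfAbsent makes sure that racing threads end up sharing one set.
            Set<String> existing = sessionsByPage.putIfAbsent(page, created);
            sessions = (existing != null) ? existing : created;
        }
        sessions.add(sessionId);
    }

    void disassociate(String page, String sessionId) {
        Set<String> sessions = sessionsByPage.get(page);
        if (sessions != null) {
            sessions.remove(sessionId);
        }
    }

    /** Read-only view of the sessions currently associated with the page. */
    Set<String> sessionsOn(String page) {
        Set<String> sessions = sessionsByPage.get(page);
        return (sessions == null)
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(sessions);
    }
}
```

Empty sets are deliberately left in the map here; removing them safely would need extra coordination that this sketch does not attempt.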

Thanks

Libor

David Marginian added a comment - 16/Nov/12 10:02 AM

Libor, I am not saying we don't have a problem (we may) just that I don't see anything in the dump that specifically indicates DefaultScriptSessionManager - is this just a hunch on your part?

David Marginian added a comment - 16/Nov/12 10:14 AM

Ok, I see what you are talking about now. We will take a look.

David Marginian added a comment - 16/Nov/12 10:25 AM

The issues in the thread dump could very well be related to:
http://bugs.directwebremoting.org/jira/browse/DWR-574

libor havlicek added a comment - 16/Nov/12 10:33 AM

David, you are right. It is the same issue.

I am now using:
#Tue Jun 21 12:34:37 PDT 2011
bamboo.build.number=303
major=3
revision=0
minor=0
title=RC2-dev

It seems you already solved that problem this year. I will download the latest Bamboo build to test it.

David Marginian added a comment - 16/Nov/12 10:41 AM

Great! We really appreciate all your help.


People

Dates

  • Created:
    15/Feb/11 5:41 AM
    Updated:
    29/Nov/12 3:59 PM