Allow sharing assigns between live navigation #3482

Draft · wants to merge 2 commits into base: main
Conversation

SteffenDE
Collaborator

SteffenDE commented Oct 27, 2024

Considerations

To support sharing assigns between live navigations, there are multiple ways we could implement this:

  1. Add a way to store values in the transport

We could have the transport own an ETS table or provide an API to assign values to it. Then LV could store assigns that use assign_new in this storage. Drawbacks: assign_new would have a messaging overhead, sending (potentially big) data to another process, even if it is not needed again. If an assign is updated later, we'd either need to sync those updates (keeping track of which assigned keys used assign_new) or live with potentially old data.

  2. Add a way for the old LV to be kept alive on navigation

This is the solution that this PR implements (together with a PR in Phoenix). It avoids all the drawbacks mentioned above: it only copies the data that is actually used for the optimization, and it always gets the most recent data from the old LV's assigns. The only new drawback that I can come up with is that the old LV needs to stay in memory just a little bit longer than usual, but I think that's fine.

Changes

This PR relies on a yet to be finalized feature in Phoenix Channels (phoenixframework/phoenix#5959) to perform a custom handover between channel rejoins.

It changes the LV code to not leave the old channel before rejoining and also instructs Phoenix to not kill the old process before starting the new channel process. After the new channel is joined, the old one is killed. Any assign_new calls in the LV mount will try to fetch assigns from the old LV.

This is a backwards compatible optimization, so if a version of Phoenix is used that does not support handover, it just falls back to calling the function supplied to assign_new as usual.
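
To make the optimization concrete, here is a minimal sketch of the kind of LiveView that benefits, assuming a hypothetical MyAppWeb.ProfileLive module, a MyApp.Accounts.get_user!/1 function, and a user id stored in the session (none of these are part of this PR):

```elixir
defmodule MyAppWeb.ProfileLive do
  use MyAppWeb, :live_view

  def mount(_params, %{"user_id" => user_id}, socket) do
    # assign_new/3 only invokes the function when :current_user is not
    # already available; with the handover in place, a live navigation
    # within the same live session can reuse the previous LiveView's
    # :current_user assign instead of hitting the database again.
    socket = assign_new(socket, :current_user, fn -> MyApp.Accounts.get_user!(user_id) end)
    {:ok, socket}
  end

  def render(assigns) do
    ~H"""
    <p>Hello, <%= @current_user.name %>!</p>
    """
  end
end
```

Without handover support, the anonymous function simply runs on every connected mount, exactly as it does today.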

Closes #3357.

cc @josevalim

@SteffenDE force-pushed the sd-handover-assigns branch from 5d0b028 to 8518e61 on October 27, 2024
@josevalim
Member

@SteffenDE I originally thought about another way to implement this, which is to let the server say "I have spawned up a new channel" and the client assumes it has taken over control. The benefit of doing it this way is that you don't need to keep the process around. You immediately spawn the new process passing the assigns it needs and that's it. Do you have thoughts about this route?

@SteffenDE
Collaborator Author

I'll actually need to handle push_navigate as well. Forgot about that.

I need to think about the consequences of your proposal (e.g. we'll still want to start the new process under the channel's supervisor). Currently, I cannot really imagine what an API for this would look like.

We cannot really shorten the time that the process needs to wait though, I think, as we don't know the keys from assign_new upfront - at least without looking through the LV mount's AST or just passing all assigns, which could be rather expensive. So we need to at least wait for the initial mount of the new LV, don't we?

@josevalim
Member

We will wait for the initial mount but, because we are also the one who called the initial mount, we will know if the initial mount either succeeded or failed. We don't need to have a timeout that says "in case I am not pinged in 15 seconds, terminate".

@SteffenDE
Collaborator Author

We don’t wait in the current implementation :)

@SteffenDE
Collaborator Author

So yeah I think the current implementation is fine for client-initiated live navigation. What you described is what we need for push_navigate.

@josevalim
Member

Oh, I see, we don't wait because it is client initiated!

@SteffenDE if we need the version I proposed for push_navigate, it can then be used for both push_navigate and live_navigate, right? 🤔 So it should be more general?

@josevalim
Member

Another downside of the current implementation is that it only works if you rejoin the same channel. The other version should be a bit more general... but it may also be (much?) harder to implement. :(

@greven

greven commented Oct 29, 2024

Sorry to disrupt the conversation, but I couldn't understand the end goal of the issue.
Is the idea of sharing assigns to reduce costly queries, for example?
And would this only work inside a live session or between any LiveViews?

@SteffenDE
Collaborator Author

SteffenDE commented Oct 29, 2024

@greven yes, the end goal is to reduce things like querying the current user from the DB on each live navigation by expanding assign_new to access previous assigns. It will only work inside a live session.

TODO: check flash handling
TODO: refactor js to remove duplication between replaceMain and handover
TODO: check if we should keep the existing view class or replace it completely,
      passing the old channel to the new one?
@SteffenDE
Collaborator Author

@josevalim I implemented a proof of concept for server initiated handovers in 2ec6c6a and phoenixframework/phoenix@cd99fcf. It breaks a couple of tests, so very much not finished, but it shows the concept. I did not change the client handovers to use this yet, but we could do it by changing the client from issuing a rejoin to instead sending a message and triggering the same path.

@SteffenDE
Collaborator Author

It would be amazing if assign_async could be started during a dead render then carried over to the live render saving time for actual data loads

@bcardarella it would, but it's imho probably nothing LiveView will ever try to do itself. There have been lots of discussions around the double mount in the past. The main issue is that you simply cannot assume that a live render will happen at all, as a dead render can be triggered by any crawler, bot, etc. without ever connecting to the WebSocket. Therefore, you'd always need a timeout on how long things would be kept in memory.

If people want to do this, it should be possible to implement something like this right now by spawning a globally registered process under a key that is put into the session on the dead render, requesting the data from it during the live render. This could be implemented as a separate library and shared with the community if there is lots of demand for it. I'm not sure if it is really worth it though.

Anyways, it's definitely out of scope for this issue.

@elliottneilclark
Contributor

elliottneilclark commented Dec 2, 2024

You immediately spawn the new process passing the assigns it needs and that's it.

I strongly vote in favor of this, for different reasons; I think LV should always spawn the new process waiting for the connect.

Time To First Interaction

The average web visitor leaves within a few seconds of opening a page or site. It's normal to see average session times measured in seconds even for the most-read websites worldwide (yes, even the sites you read for hours sometimes have average session times that would surprise you). Users open a tab and then put the phone down, click a link in another tab and never return, etc. All this means that users who are just connecting on a fresh browser tab make up a large proportion of the user experience. As such, LiveView should optimize for the first few interactions.

By sharing all the assigns without needing any code changes, perceived load times would decrease. I would love to see all existing code speed up on time to first interaction (TTI) because the mount assigns are already shared by default when connecting.

Even if the assigns take <50 ms to compute on average, the long tail will be a noticeable lag on some percentage of socket connections. This means I would like to give developers a default where they don't have to think about bucketing assigns into those that need to be cached or shared and those that don't. Methods that start out fast have a way of adding more complexity as a code base matures.

Why LV Users Care

If you are a web shop, you care greatly about the time to the first interaction since you hope that the first interaction is placing something in the cart.

Heavy web applications with long-running sessions will consider the time to first interaction for sign-up. I don't want my user to bounce because the page froze. Every extra ms of input latency is painful here.

Fewer Foot Cannons

Treating all assigns the same and sharing them means that developers new to LiveView applications won't use assign and then get confused about why it's slowing down for some users but not when they try it on a long-running tab. Every time someone has to change from assign to assign_new or similar is a usability wart.

Assigns Caching Tuning

Unifying assigns for dead view, connect, and reconnect has a nice operational property: we can tune how long to keep processes and their assigns alive. This gives a very nice knob.

For example, if I operate a website and I see that my users are having latency issues that last between 500ms and 1.5s, I can tune the LV processes to keep the assigns around for 2 seconds. This means that almost all cell tower-related latency blips will not require recomputing assigns, and I will also get an upper limit on the memory required.

As an SRE operating a system, I want to have a way to trade off memory for perceived latency. If I have a site where almost no one reconnects (dead view only), then I can turn off speculatively spawning the process or reduce the wait time significantly. Currently, I don't have an easy way to say I am okay with using more memory for a better user experience by default.

Source: I have been the SRE/PE in charge of a large web property that made live connections for processing/rendering, and I have been building/using LV for a while now.

  • I refer to load times here as the time the UI waits on the server to compute assigns an unneeded number of times. I don't mean CTE or page load here.

@SteffenDE
Collaborator Author

@elliottneilclark I'm sorry to disappoint you, but this PR as well as the corresponding issue are not about sharing assigns between the dead and connected mount. We're only experimenting with sharing assigns between navigations that happen over the existing websocket connection. If you want to optimize loading times for the connected mount, you can do so by having important data cached in memory. There are a lot of caveats with doing such things, especially when trying to do them by default, and I'm not convinced that this is something we want to focus on - at least right now.

That being said, I can imagine an external library you might plug into your LiveView application that does most of the work for you, so if that's something you really think we should have, feel free to explore this. The important bits to do it are already there. I could imagine something like this working:

  1. you add a plug into your router pipeline that stores the assigns just before sending the response (Plug.Conn.register_before_send)
  2. you store the assigns in a short-lived process that terminates after a configurable timeout
  3. you encode the pid of this process and render it in the template
  4. you send the pid as parameter on connected mount
  5. finally, you extract the parameter in an on_mount hook and try to fetch the assigns from the pid, if it is still alive
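
Here is a rough sketch of what steps 2-5 could look like, assuming hypothetical AssignsCache and MyAppWeb.DeadRenderAssigns modules and a client LiveSocket configured to send the encoded pid as an "assigns_cache" connect param; the before_send plug from step 1 is omitted and none of this is part of this PR:

```elixir
defmodule AssignsCache do
  use GenServer

  # Step 2: a short-lived process that holds the dead-render assigns until
  # they are fetched or a timeout elapses.
  def start(assigns, timeout \\ :timer.seconds(5)) do
    GenServer.start(__MODULE__, {assigns, timeout})
  end

  def fetch(pid), do: GenServer.call(pid, :fetch)

  @impl true
  def init({assigns, timeout}), do: {:ok, assigns, timeout}

  @impl true
  def handle_call(:fetch, _from, assigns), do: {:stop, :normal, {:ok, assigns}, assigns}

  @impl true
  def handle_info(:timeout, assigns), do: {:stop, :normal, assigns}
end

defmodule MyAppWeb.DeadRenderAssigns do
  import Phoenix.Component, only: [assign: 2]
  import Phoenix.LiveView, only: [connected?: 1, get_connect_params: 1]

  # Step 5: an on_mount hook that tries to reuse the dead-render assigns.
  # Steps 3 and 4 are assumed: the encoded pid was rendered into the page
  # and is sent back as the "assigns_cache" connect param.
  def on_mount(:default, _params, _session, socket) do
    with true <- connected?(socket),
         %{"assigns_cache" => encoded} <- get_connect_params(socket),
         {:ok, pid} <- decode_pid(encoded),
         {:ok, assigns} <- safe_fetch(pid) do
      {:cont, assign(socket, assigns)}
    else
      _ -> {:cont, socket}
    end
  end

  defp decode_pid(encoded) do
    with {:ok, bin} <- Base.url_decode64(encoded), do: {:ok, :erlang.binary_to_term(bin)}
  rescue
    _ -> :error
  end

  defp safe_fetch(pid) do
    if Process.alive?(pid), do: AssignsCache.fetch(pid), else: :error
  catch
    :exit, _ -> :error
  end
end
```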

@elliottneilclark
Contributor

elliottneilclark commented Dec 3, 2024

PR as well as the corresponding issue are not about sharing assigns between the dead and connected mount. We're only experimenting with sharing assigns between navigations that happen over the existing websocket connection.

I know, and that's fine; I appreciate any progress on LV. However, before LV goes down the rabbit hole of optimizing assigns for different scenarios, it's important to say that I have seen the more general solution work at scale and that it has benefits. Hyrum's Law means it's worth taking the time before settling on assigns behavior that will become the API.

The time to first interaction and reducing UI jankiness are essential to the end-user experience. Most web properties try to estimate the time to first interaction with FCP. However, from experience, that's not a great indication of actual user perception of the UI for reactive web properties. Reactive web pages often have an FCP that's good to look at and solves half the problem of slow rendering. The other important metric is when all UI elements on the screen are fully alive/inflated/running, and how long the UI isn't fully live/running (reconnect, or live navigation).

Sharing assigns between navigations, initial mounts, and reconnect seems like a general transition of state that can be improved without the need for different APIs.

If you want to optimize loading times for the connected mount, you can do so by having important data cached in memory.

Batteries Included did it differently; by having long-running GenServers that cache the data, we made getting assigns a call to a live GenServer with cached data. However, the fact that we saw the UI jankiness so often while developing and that we solved it differently suggests this is an issue that projects using LV for long enough will run into. I saw UI jank on connect often and on live navigation less frequently.

That being said, I can imagine an external library you might plug into your LiveView application that does most of the work for you, so if that's something you really think we should have, feel free to explore this.

Thank you for the suggestion, we might explore that in the future.

@josevalim
Member

josevalim commented Dec 4, 2024

Note: this is a reply to a now deleted comment.

  1. If your page is not public (i.e. you don't care about SEO), you can skip the first render altogether by checking connected? and using render_with to render only the layout on dead render and then have only the live render hit the data layer (see the sketch after this list). This would be similar to the Apple Music example above where the page does not have any meaningful content before JavaScript runs.

  2. This issue is not related to the dead-live render, and trying to hijack it for a separate discussion, which requires a completely different trade-off analysis, makes it very hard for us to discuss this issue in particular. Please let's keep this focused.
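
For point 1, a minimal sketch of the pattern, assuming a standard MyAppWeb setup and a hypothetical MyApp.Metrics.load_stats/0 function:

```elixir
defmodule MyAppWeb.DashboardLive do
  use MyAppWeb, :live_view

  def mount(_params, _session, socket) do
    if connected?(socket) do
      # Only the live render hits the data layer.
      {:ok, assign(socket, :stats, MyApp.Metrics.load_stats())}
    else
      # Dead render: skip the data layer and render a lightweight shell.
      {:ok, render_with(socket, &loading/1)}
    end
  end

  defp loading(assigns) do
    ~H"""
    <div class="loading">Loading…</div>
    """
  end

  def render(assigns) do
    ~H"""
    <div id="dashboard">Stats: <%= inspect(@stats) %></div>
    """
  end
end
```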

@josevalim
Member

josevalim commented Dec 4, 2024

Note: this is a reply to a now deleted comment.

Also, fwiw, "I don't understand why a first-class solution is off the table for discussion" is completely inaccurate. We have discussed this extensively. As said early on, it is just out of scope for this issue.

When the WS connects send just the diffs over not the full page and hydrate the dead render.

This is very hard to achieve in practice, because how do we know what changed between the dead and the live render if there is no live process? We would need to store the data on the server over some period of time and then transfer it to the node that accepts the websocket mount, which comes with large implications in terms of application design. You now need to plan for your whole "live state" to be transferred across nodes, it may be more expensive than the database transfer (because the data is structured), and things like resources won't work at all. Perhaps this is easier to achieve by doing WebSockets over HTTP2, as the HTTP2 connection typically sticks to the same server.

EDIT: alternatively, we keep the process in another node and route all LiveView requests to that node, but now you are paying an in-cluster round-trip for every single LiveView interaction.

Anyway, hijacking an unrelated issue and saying we don't want to discuss something unrelated to the issue is not nice. :)

@elliottneilclark
Contributor

If your page is not public (i.e. you don't care about SEO)

That defeats the purpose of a full FCP, and the application that we're creating is more complex than that. I think we are the largest source-available Phoenix LiveView web application, with public-facing and internal parts and a three-year history of building with LV.

Please don't be defensive. I have been trying to keep the discussion around what the average user of LiveView would experience. Not single solutions that only make things better in narrow times.

This issue is not related to the dead-live render, and trying to hijack it for a separate discussion, which requires a completely different trade-off analysis, makes it very hard for us to discuss this issue in particular. Please let's keep this focused.

I discussed why it's better to avoid making a piecemeal solution of assign behavior. Hyrum's Law will make any of LiveView's behaviors harder to change after developers depend on them. So it will be calcified once you put in new behavior for assign_new and the other things discussed in this PR.

The top comments on HN echo the same issues that I am raising here.

There are 19 different mentions of connection/reconnection in the HN threads.

Trying to apply this discussion to a much larger scope requires completely different solutions.

From experience, with correct planning and fallbacks these can be the same behaviors.

Anyway, hijacking an unrelated issue and saying we don't want to discuss something unrelated to the issue is not nice. :)

Ok.

@josevalim
Member

josevalim commented Dec 4, 2024

Once again, there are two unrelated topics being mixed up. This is about a same-node optimization. Anything multi-node is by definition out of scope. The way this thread has been going is akin to:

  1. Someone designs an algorithm that improves the performance of a concurrent system
  2. Someone points out that the distributed system could be improved too
  3. We point out that concurrent vs distribution requires different trade-offs and different optimizations will apply

Are there general solutions that would solve both the distributed and concurrent issues outlined here? Yes, there are. But they won't be as good as an optimization that is specific to the concurrent same-node one.

In more concrete terms, when we optimize live_navigation, we have complete control over the life-cycle, so we can fully optimize the number of round trips, latency, and how long the shared data lives. The dead render and reconnections do not have any of these properties! But it would be very silly of us to not leverage these properties for live_navigation due to an eventual distributed but more general solution.

There is no one dismissing the reconnection or dead render issues. They are important. But they are orthogonal. We should leverage concurrency and same-node properties when we can. Treating everything as a distributed problem would be a recipe for performance woes and hard-to-debug issues (since pids, ports, and references cannot be transparently shared across nodes).

@josevalim
Member

josevalim commented Dec 4, 2024

And, fwiw, @elliottneilclark, my previous replies were not directed at you. The comments I have replied to have been removed, so my comments now probably look out of context.

@josevalim
Member

josevalim commented Dec 4, 2024

Btw, here are some examples to drive the point home:

  1. PubSub special-cases local messaging compared to distributed messaging - the simplest would actually be to treat them all the same, but we optimized to avoid double copying 10 years ago

  2. LiveView optimizes assigns sharing on dead render to avoid any copying - could we shove them all into a GenServer and share it there, including with live_navigation and perhaps across nodes? Yes, but that would copy the data twice (once in, once out). This specialized solution means no copying whatsoever.

There are more examples where we leverage the fact we control the life-cycle to simplify operations and avoid data copying, such as fastlaning. This issue is, in a nutshell, a similar optimization: it leverages the existing processes to copy data once. Any shared solution would do two copies. If you believe those are single solutions that only make things better in narrow times, I recommend disabling the similar optimizations we have done in the past and assessing the performance impact on your application.

@elliottneilclark
Contributor

Anything multi-node is by definition out of scope.

The single-node, more general solution makes the routing of multi-node solutions simply an optimization.

It's all about the percentage chance that a subsequent request hits a node with a process already running. If we always start the process after the initial dead view, then on a single node with no delay between HTTP and WebSocket, there's a 100% chance that assigns don't have to be recomputed. The same goes for push navigation. If we push navigate while using an aging/TTL for old mounted LiveViews, then it's all about the chance that the node hasn't had memory pressure or had to evict. That's why the same solution can help all the jank; it becomes an optimization to get an already running process with filled-in assigns.

In a multi-node setup, it's similar. If the initial HTTP and WebSocket requests are routed to the same node (sticky session), then assigns can be reused. If they don't go to the same node, then we have to fall back and recompute like the behavior today. You always need this fallback for misrouted requests if your load balancer isn't 100% reliable or if your application nodes can change (Layer 4 and Layer 7 LBs change their quorum frequently in the cloud).

@josevalim
Member

@elliottneilclark you are confusing two things: single-node is not the same as same-node. My comments above are not assuming it is a single node. They assume you have multiple nodes, but we know that live navigation will always target the same node (because that's how websockets work). The optimizations we are describing apply 100% of the time.

@elliottneilclark
Contributor

We're talking past each other. I do not want to derail this anymore.

@josevalim
Member

josevalim commented Dec 4, 2024

You are saying that you have a local cache that you use if you can, or you recompute. The odds of hitting the cache are going to be 1/number_of_nodes. But these odds do not matter here, because live_navigation always hits the same node.

When it comes to live_navigation, your solution has the following downsides:

  1. You have to move data into the cache and then out of the cache, which requires 2 copies
  2. Memory grows unbounded, unless you apply cache purging, LRU, and TTL/sizing strategies

The benefit of your solution is that it could be used for operations that can be routed across nodes, such as dead render and reconnects, with the 1/number_of_nodes probability of hitting it.

The solution proposed here has the new LiveView communicate with the old one:

  1. Data is copied once and only the data that is needed

  2. There is no storage to manage or unbounded growth. There is only a brief period where the old LiveView is alive transferring data to the new one, but we can manage all of this, because we fully control the lifecycle

If you believe you have a better solution for live_navigation that does fewer data copies and has a better memory profile, then I am all ears.

@SteffenDE
Collaborator Author

Please note that we're happy to discuss concrete improvements to the dead/live lifecycle in the proposals section of the forum :)

@elliottneilclark
Contributor

You are saying that you have a local cache that you use if you can, or you recompute. The odds of hitting the cache are going to be 1/number_of_nodes. But these odds do not matter here, because live_navigation always hits the same node.

No. I am not saying that. I am saying that the more general optimization also solves what this PR is attempting, so rather than separate one-off solutions for push_navigate, dead view -> live view, and reconnect, one single elegant solution works at scale with 1 node and no load balancer, with a single load balancer, and with L4 + L7 balancers and XXX,XXX nodes.

Yes, but that would copy the data twice (once in, once out). This specialized solution means no copying whatsoever.

Memcopy is several orders of magnitude faster than any remote Ecto or HTTP API call: ~500k nanoseconds for a 1 MB memcopy. A single round-trip packet in a DC is likely around that as well, meaning that a full query is at least 2 orders of magnitude more work.

@josevalim
Member

josevalim commented Dec 4, 2024

So we definitely understand each other, we just don't agree on the trade-offs. :)

Your single elegant solution will be quite worse for dead render and worse for live navigation. And I believe the dead render and live navigation are common enough to warrant specialized solutions. This is speaking from experience optimizing stuff like channels fastlaning and pubsub, where avoiding the extra copy showed benefits in both benchmarks and production apps.

@josevalim
Member

josevalim commented Dec 4, 2024

Most of the time, the improvements in raw throughput from avoiding data copying were limited, usually between 1-2x, but it consistently led to meaningfully better performance in the memory department: fewer memory allocations, fewer GC runs, and you avoid unnecessarily sharing large binaries too.

@josevalim
Member

josevalim commented Dec 4, 2024

I deleted my previous comment (as I was also actively participating in the sidetracking of this issue :D) and I have opened #3551 to continue the other discussion. The proposal there should be strictly better than a GenServer as a cache layer, as it does zero or one copy within the same node, and has reduced payloads on the first connected render and reconnects.

It is also orthogonal to this issue. It optimizes reconnects and first connected render by adopting LiveViews, but it doesn't apply to live_navigation at all (which requires two LiveViews by definition).
