-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement token-based sessions. #4690
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for working on this! I'm quite happy to see it happen.
I'm not a contributor on crates.io, just an interested security engineer with some experience working on sessions and databases, so take the below with an appropriate grain of salt.
Session tables can be expensive (in DB capacity), because they need to be read on every request. Even more expensive if they need to be written on every request. We can reduce the expense if we can keep all, or a significant chunk, of the sessions table in memory. Some recommendations to reduce the size of the table:
Remove the index on hashed_token (indexes can be surprisingly expensive, bytes-wise). Instead, have the cookie assert a session id alongside the token, like scio:12345:77af9bea05b66df6422ffcf1bf97b7e62e2ca4e5
. Then you can look up by the primary key, which is shorter and cheaper anyhow.
Move the last_used_at, last_ip_address, and last_user_agent into their own table. If the load from the writes gets too big, crates.io could disable writes to that table (at the cost of not having updated session usage data) to regain some capacity. Alternately, we could produce the "last accessed" data by post-processing logs instead of doing live writes to the DB. I realize I proposed in the original issue that the sessions table should store this, but I'm having minor second thoughts. :-)
Session tables have the property that you typically want to delete old data to save space. Deleting data row-by-row can be expensive. It's probably a good idea to partition the session table by ranges of primary key: https://www.postgresql.org/docs/current/ddl-partitioning.html
Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that the heavily-used parts of the indexes fit in memory.
Bulk loads and deletes can be accomplished by adding or removing partitions, if the usage pattern is accounted for in the partitioning design. Dropping an individual partition using DROP TABLE, or doing ALTER TABLE DETACH PARTITION, is far faster than a bulk operation.
Hope this is helpful!
src/models/persistent_session.rs
Outdated
// TODO: Do we want to check if the user agent or IP address don't match? What about the | ||
// created_at/last_user_at times, do we want to expire the tokens? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't disqualify a session if the user agent or IP address change. IP addressed change regularly due to DHCP leases or moving to a different network (e.g. a cafe). User agent strings change as a result of browser upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. At what point do we revoke sessions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of basic options:
- If the user explicitly logs out of any session, invalidate all sessions everywhere (simplest, probably good enough)
- If the user logs of a session, invalidate that session, AND have a separate "log out other sessions" button
Another maybe interesting option (probably not for this PR):
- Have a background job that revalidates the GitHub token periodically. If that token has been revoked, invalidate all sessions.
☔ The latest upstream changes (presumably #4678) made this pull request unmergeable. Please resolve the merge conflicts. |
Made some updates. Still need to:
|
@jsha I got a couple of questions for you :)
|
I hadn't noticed that there's no creation_date column. That's probably a good idea. Since the primary key is SERIAL, we expect that the primary key id maps linearly to creation dates. To check if a partition in pruneable, we should be able to look at the creation date of the highest id in that partition. If that's past the expiration time, the whole partition can be dropped. The best practices for partitioning choices are documented here: https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES
Since we're now looking up by the primary key, partitioning by that key seems like the best choice. This also saves us from needing another index.
I haven't fully solidified my thinking, but in broad strokes I'm thinking about how much of the session table we can afford to fit in the DB's memory, and also about locking and update rates. If the non-essential (and potentially large) fields are in another table, we wouldn't necessarily need to keep them in memory. On the other hand, writing to them constantly might wind up keeping them in memory anyhow (to the detriment of other tables). I'm also slightly concerned that constantly writing to session rows could cause contention on reads of those session rows. In general, I'm having second thoughts on the idea of recording user-agent and IP on every access, because I suspect it will be very expensive. It's a good feature to have, but I want to maximize the chances of success for the core feature of revocable sessions. What do you think of omitting that part for now? Also, fair warning: I am no Postgres expert. I'm taking what I've learned on MariaDB/MySQL and extrapolating/reading the docs. |
Here's another way of thinking about the access IP and user-agent: For the attack we're most worried about (compromised GitHub account), the IP and user-agent of the login are just about as useful as the IP and user-agent of the last access. So we could store that info at login to help people figure out if they have any inappropriate sessions. |
We don't have I agree with starting with the basics for now. I'm thinking of keeping the |
@jsha I'm having trouble figuring out how partitioning will fit alongside the rest of the infrastructure. I can define the table to be partitioned by primary key but the partitions themselves need to be setup with SQL code (not from Rust). I have some ideas how this could work, so let me know what you think. Assuming we can have some sort of cronjob in the server, I could set one up daily that reads the latest session (not sure how yet, maybe just select all and sort descending or something), and create a partition from the highest ID in the previous partition and the current highest. Then, we can have another cronjob that looks at the partitions starting from the oldest and deletes the ones that are stale (by whatever definition of stale we decide upon). I'm not sure that we can even know what partitions currently exist with SQL. However, I think it might be worth considering having the partitions be ranged by creation_date and here's why: we don't need an index as this will be done with a cronjob once a day so efficiency isn't necessarily a priority (do you agree with that reasoning? Maybe DB locks will be a problem). The advantage of doing the partition by creation_date is that it's very easy to create and delete the older partitions without having to do any sorting to figure out if the partition should be deleted. Additionally, it would be easy to define such cronjob without having to figure out what partitions currently exist (we can just say look at the partition with What do you think? |
☔ The latest upstream changes (presumably ca82d24) made this pull request unmergeable. Please resolve the merge conflicts. |
Unfortunately, due to diesel-rs/diesel#2411, the table can't be named `sessions` as it causes patch conflicts with recent_crate_downloads.
@bjorn3 or @pietroalbini do either of you know who I should contact in order to setup automatic creation/deletion of the partitions as a cron job within the server? |
☔ The latest upstream changes (presumably 64feb48) made this pull request unmergeable. Please resolve the merge conflicts. |
@adsnaider there is already a daily cronjob whose source code is https://github.com/rust-lang/crates.io/blob/master/src/worker/daily_db_maintenance.rs. I guess you can probably add it there? Also, note that I'm not on the crates.io team, so it's up to them to review the change 🙂 |
This PR sets up the session table (called persistent_session because diesel... diesel-rs/diesel#3055). Every time a user is logged in, a session will be created alongside the cookie that the user will use for authentication. This PR is the initial step for fixing #2630