Improve statistics for downloads #4642
Comments
See also https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/ for how PyPI handles this.
👋🏻 Heyo, I'm Colby. I maintain the infrastructure for rubygems.org and wanted to jump in to help get this done. I want to ask some questions to better understand what changes introducing TimescaleDB will bring. I appreciate Timescale putting their hand up to help us here; it's much appreciated by everyone.
My big takeaway from this proposal is that it introduces a new runtime dependency to rubygems.org. We already have some, e.g. Fastly, but we try to limit them where possible. What is the benefit of running a Timescale Cloud instance, versus our use case being simple enough for the Timescale Postgres extension to handle relatively easily? I also heard that a Timescale DB instance inside AWS may be in active development; is this far away?
Our download logs only go as far back as 2015, when we moved to Fastly, so you'll probably need to add a step to backfill gem versions created before that date. Which you can probably backfill up to
@colby-swandale What data could be used to backfill pre-Fastly gems? If there is none, we can just mark those versions as having incomplete statistics.
Hello Colby! Thanks for reaching out!
A cloud instance lets us use elastic computing and storage, high availability, replicas, etc. It would also be great marketing for our product, but the open source version works just as well.
I don't have enough details to share any estimates, but I will check with the team.
I totally agree, and I was even thinking about how these statistics could be a separate service, like So, I'm also happy to move it to an independent process to isolate the entire scenario. If you agree, I can first bring a POC that runs totally independently.
@colby-swandale On the other hand, a new isolated app will add maintenance burden. 🤔 @jonatas do you have any idea or estimate of what kind of response time we can get for the most complex queries planned?
I don't think we'll have anything over a second. Everything will be pre-processed, so I imagine the average query will be under 300ms.
Hi folks, I just created this POC with the basic code to collect hourly statistics from the raw data. We can run it over all the available logs and pre-load the data into some instance, but I still don't have access to run it.
@simi raised the question of making it an isolated service versus running it on the existing infrastructure, and I'd love if we could
I see a lot of positive impact in building an isolated server that just tracks downloads. I don't think this type of feature needs to be part of the server, and the extra database would add a new layer of complexity on top of ActiveRecord, since it uses a different connection. On an isolated server we'd need to mimic LogTickets or just have access to the S3 API to list and consume all the files:
I'm very open to going either way. I can also integrate at the point that @segiddins reached before. I just explored this as a POC and am looking for more feedback before we proceed to the production implementation. I think that, as an isolated server, we have a better chance of developing other types of analysis and even detecting patterns.
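To make the "hourly statistics from the raw data" idea concrete, here is a minimal sketch in plain Ruby. It is not the POC code from this thread. The log line format (`<ISO8601 timestamp> <path> <status>`), the `GEM_PATH` pattern, and the method names `hour_bucket` and `hourly_downloads` are all assumptions made for illustration; the real Fastly log format is different and richer.

```ruby
require "time"

# Hypothetical, simplified log format (NOT the real Fastly format):
#   "<ISO8601 timestamp> <path> <status>"
# The pattern only handles plain "name-version.gem" paths; gems with a
# platform suffix (e.g. "-x86_64-linux") are skipped by this sketch.
GEM_PATH = %r{\A/gems/(?<name>.+)-(?<version>\d[^-]*)\.gem\z}

# Truncate a timestamp to the start of its hour, in UTC.
def hour_bucket(time)
  t = time.getutc
  Time.utc(t.year, t.month, t.day, t.hour)
end

# Count successful gem downloads per [hour, gem name, version].
def hourly_downloads(lines)
  counts = Hash.new(0)
  lines.each do |line|
    timestamp, path, status = line.split(" ")
    next unless status == "200"

    match = GEM_PATH.match(path.to_s)
    next unless match

    key = [hour_bucket(Time.iso8601(timestamp)), match[:name], match[:version]]
    counts[key] += 1
  end
  counts
end
```

Each key of the resulting hash corresponds to one row of a pre-aggregated hourly table, which is the kind of data a time-series database would roll up further.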
This was actually raised by @colby-swandale. We need to ensure that the health of the Timescale service does not affect the health of the rest of the service. I thought we did something special for OpenSearch, but it seems we don't. 🤔 @colby-swandale would you mind deciding whether it is OK to start with a built-in API with some reasonable timeouts, or whether we should start with an isolated service?
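As a rough illustration of the "built-in API with some reasonable timeouts" option, the sketch below wraps a statistics lookup so that a slow or failing backend degrades to a fallback value instead of failing the whole request. The method name `with_stats_fallback` and its parameters are hypothetical and not part of rubygems.org; only Ruby's standard `Timeout` module is used.

```ruby
require "timeout"

# Hypothetical helper: returns the block's value, or `fallback` when the
# statistics backend is too slow or raises an error.
def with_stats_fallback(timeout_seconds:, fallback: nil)
  Timeout.timeout(timeout_seconds) { yield }
rescue StandardError
  # Timeout::Error is a StandardError, so timeouts and backend errors
  # (connection refused, query failure, ...) both end up here.
  fallback
end
```

A page could then render something like `with_stats_fallback(timeout_seconds: 0.3, fallback: nil) { fetch_hourly_stats }` (where `fetch_hourly_stats` stands in for the real query) and simply hide the chart when the result is `nil`. In practice a database-level limit such as PostgreSQL's `statement_timeout` is usually a safer choice than `Timeout.timeout`, which can interrupt code at arbitrary points.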
Is your feature request related to a problem?
I had a meeting with @simi to follow up and continue the draft that @segiddins started in #3560. Here, let's break down the problem.
Problem: The current DownloadGem does not offer granularity or insights to the team creating the gem. The idea is to improve this by giving more granularity and detail about user behavior when installing gems.
Describe the solution you'd like
Introduce a new, more granular tracking of downloads, allowing users to know in more detail when gems are installed, and publicly exposing more statistics about gems being downloaded.
The gem page can present daily, weekly, and monthly totals. The public view can also show hourly downloads for "Today".
The ideal scenario would also include the location the downloads come from, but I haven't investigated enough to know whether information at that level of granularity is available.
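The daily, weekly, and monthly totals described above can all be derived from hourly counts. The following is a minimal plain-Ruby sketch of that rollup, assuming the hourly data is available as a hash of UTC hour to download count; the method name `totals_by` and the input shape are assumptions for illustration. In the proposed setup this aggregation would more likely happen inside the database.

```ruby
require "date"

# hourly: Hash of Time (start of an hour) => download count.
# period: :day, :week (ISO week, keyed by its Monday) or :month.
def totals_by(hourly, period)
  hourly.each_with_object(Hash.new(0)) do |(hour, count), totals|
    date = hour.getutc.to_date # Time#to_date is provided by require "date"
    key =
      case period
      when :day   then date
      when :week  then date - (date.cwday - 1) # Monday of that ISO week
      when :month then Date.new(date.year, date.month, 1)
      else raise ArgumentError, "unknown period: #{period}"
      end
    totals[key] += count
  end
end
```

Because every coarser total is a sum of hourly buckets, only the hourly level needs to be computed from raw logs; the rest can be recomputed or cached cheaply.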
Describe alternatives you've considered
I haven't checked alternatives, as PostgreSQL is already in the stack and TimescaleDB was already the suggestion.
Additional context
I'm very glad to work on and support RubyGems. I've been a Rubyist for almost two decades, and for the last 3 years I've worked as a Developer Advocate at Timescale, the company behind the TimescaleDB extension. I also created the timescaledb gem. So, my plan is to break it down into a few PRs: