Notebook template for transforms to run on Google Colab #851

Open: wants to merge 2 commits into base: dev

Ryan-Gordon-314159

Why are these changes needed?

This PR adds a notebook that can serve as a guide for porting transforms to run on Google Colab.

Related issue number (if any).

Issue #844

@shahrokhDaijavad shahrokhDaijavad self-assigned this Dec 3, 2024
@shahrokhDaijavad shahrokhDaijavad self-requested a review December 3, 2024 20:33
@shahrokhDaijavad shahrokhDaijavad removed their assignment Dec 3, 2024
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Thanks, @Ryan-Gordon-314159. As we discussed, the "Open in Colab" icon will work only once this PR is merged. Other than that, I have tested the notebook by running it manually on Google Colab.

Collaborator

@touma-I touma-I left a comment

Please see my comments. We need to make this notebook tell a story, not just copy from other notebooks.

Collaborator

We need to change the notebook to tell a story that would be interesting to the end user. What do you think of this?

  • Step 1: Enable Colab
  • Step 2: Pip install
  • Step 3: Identify interesting PDF files and use the web crawler to fetch them. If possible, identify PDF content that requires OCR
  • Step 4: Configure the transform to generate MD and run the transform
  • Step 5: Show the output from the transform: not only the new content, but also any other fields the transform may have produced that may be of interest based on the content
  • Step 6: Repeat steps 4 and 5 with a different configuration of the transform, for example generating JSON instead of MD, or extracting OCR, etc.
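
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration, not code from the PR: it assumes that a Colab kernel already has the `google.colab` module loaded, and `example-transform-package` is a hypothetical placeholder rather than the real package name.

```python
import sys


def running_in_colab() -> bool:
    """Best-effort check: Colab kernels normally have google.colab loaded."""
    return "google.colab" in sys.modules


def pip_install_command(packages):
    """Build the pip command for the current interpreter (nothing is executed here)."""
    return [sys.executable, "-m", "pip", "install", "--quiet", *packages]


# "example-transform-package" is a hypothetical placeholder, not a real package name.
command = pip_install_command(["example-transform-package"])
print("Colab detected:", running_in_colab())
print("Would run:", " ".join(command))
```

In a notebook cell the command list could be passed to `subprocess.run(command, check=True)`, or replaced with the equivalent `%pip install` line.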

Member

We should consider getting rid of the my_utils.py file, if possible.

Collaborator

Please do not download my_utils.py. I won't be able to merge this PR while it does.

Collaborator

@touma-I touma-I left a comment

We should not be downloading utils.py?

3 participants