-
Notifications
You must be signed in to change notification settings - Fork 257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UniPDF v4 - Proposals #337
Comments
type ContentStreamOperations []*ContentStreamOperation it's not fun to work with those typed slices, since iterating through an arbitrary type is kinda messy. Better to have like |
With support for 1 character code <-> multiple runes (string) in CMaps, it makes sense to update our text encoder interfaces in the future. Currently we have // TextEncoder defines the common methods that a text encoder implementation must have in UniDoc.
type TextEncoder interface {
// String returns a string that describes the TextEncoder instance.
String() string
// Encode converts the Go unicode string to a PDF encoded string.
Encode(str string) []byte
// Decode converts PDF encoded string to a Go unicode string.
Decode(raw []byte) string
// RuneToCharcode returns the PDF character code corresponding to rune `r`.
// The bool return flag is true if there was a match, and false otherwise.
// This is usually implemented as RuneToGlyph->GlyphToCharcode
RuneToCharcode(r rune) (CharCode, bool)
// CharcodeToRune returns the rune corresponding to character code `code`.
// The bool return flag is true if there was a match, and false otherwise.
// This is usually implemented as CharcodeToGlyph->GlyphToRune
CharcodeToRune(code CharCode) (rune, bool)
// ToPdfObject returns a PDF Object that represents the encoding.
ToPdfObject() core.PdfObject
} It would make sense to have charcode <-> string, and charcode <-> string. or maybe ones that process multiples instead of single ones. |
|
Text extraction is now aware of paragraph and line structure. We can therefore write a search function that returns bounding boxes of the line fragments of the matching text when the match spans multiple lines or multiple paragraphs |
support create and manage PDF/A3 with file attachment |
@progamer71 That is in our radar but that is not what this ticket about. This is about API compatibility and possible major changes in upcoming v4. PDF/A3 is not part of our API yet, so it is not a concern here. It would make sense to create a new issue for that, if there is not one already. And with more details as well. |
|
In V4: We should change content stream processing.
which returns a string. The problem with this is that the content streams can get very big, and working with it as string leads to copying which is inefficient and memory intensive. Creating a new type in |
Deprecate |
Remove model.ImageHandling or make internal. Alternatively it could be redesigned such that it would be actually usable for providing handlers for loading images. At the moment this functionality is not well maintained and would need more testing. StreamEncoders could be designed such that they can be registered, such that an external handler could be registered (in particular for image handling). The Decode output for images as a []byte stream (data) may not be ideal and sometimes we are loading an image and converting between models multiple times which is not efficient. |
Text extraction should have options. Possible options:
|
Unify ContentStreamProcessor based on usage in render and extractor packages. Should be able to keep track of graphics and text state there in one place. |
Introduction
The idea of this ticket is to collect all ideas that would make sense for next major version (v4). This is a chance for considering major updates.
Ideas related to performance improvements should include benchmarks that clearly show potential advantages.
Ideas for refactoring need to clearly state the advantages and also what it means for the user. If there is a breaking change, there should be an easy way to update code.
Ideas for reducing binary sizes are also welcomed. Although its important that the API remains easy to use.
Related issues
There are already a few relevant issues:
The text was updated successfully, but these errors were encountered: