updates for schemas #123
Comments
Instead of manually adding all UCSC table schemas, we should consider building the schema "database" automatically by sourcing table schemas from UCSC and translating them into something Python/user-friendly. An alternative would be to grab data and schemas directly from UCSC's SQL database dynamically: see cruzdb. Since the UCSC schemas are all typed SQL schemas, we could consider optionally adding things like:
But here we risk doing a lot of manual curation again.
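As a rough illustration of the automatic route, the sketch below parses a MySQL-style CREATE TABLE statement into an ordered mapping of column names to pandas-style dtype names, with an optional renaming dictionary. The names `parse_sql_schema` and `SQL_TO_DTYPE`, the type mapping itself, and the abbreviated example statement are all assumptions made for illustration; none of them come from bioframe or from an actual UCSC dump.

```python
import re

# Hypothetical mapping from MySQL column types to pandas-style dtype names.
# All integer types are widened to int64 to sidestep signed/unsigned issues.
SQL_TO_DTYPE = {
    "tinyint": "int64",
    "smallint": "int64",
    "int": "int64",
    "bigint": "int64",
    "float": "float64",
    "double": "float64",
    "char": "object",
    "varchar": "object",
    "longblob": "object",
}

# Matches lines such as:  `genoStart` int(10) unsigned NOT NULL,
COLUMN_RE = re.compile(r"^\s*`?(\w+)`?\s+([a-zA-Z]+)")


def parse_sql_schema(create_table_sql, rename=None):
    """Return an ordered {column_name: dtype_name} dict from CREATE TABLE text."""
    rename = rename or {}
    schema = {}
    for line in create_table_sql.splitlines():
        match = COLUMN_RE.match(line)
        if match is None:
            continue
        name, sql_type = match.group(1), match.group(2).lower()
        if sql_type not in SQL_TO_DTYPE:
            # Skips non-column lines such as CREATE TABLE or PRIMARY KEY.
            continue
        schema[rename.get(name, name)] = SQL_TO_DTYPE[sql_type]
    return schema


# Abbreviated, illustrative statement; the column types here are assumed.
EXAMPLE_SQL = """
CREATE TABLE `rmsk` (
  `bin` smallint(5) unsigned NOT NULL,
  `swScore` int(10) unsigned NOT NULL,
  `genoName` varchar(255) NOT NULL,
  `genoStart` int(10) unsigned NOT NULL,
  `genoEnd` int(10) unsigned NOT NULL,
  `strand` char(1) NOT NULL,
  KEY `genoName` (`genoName`(14),`bin`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
"""

schema = parse_sql_schema(
    EXAMPLE_SQL,
    rename={"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
)
print(schema)
```

The renaming dictionary is the part that would still need manual curation; the type translation is mechanical once the mapping is agreed on.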
|
Another option for implementing a renaming dictionary would be to follow the format for column-name remapping in core.specs (bioframe/bioframe/core/specs.py, line 13 in ccb8e70).
|
GFF/GTF reader proposal: parsing GFF/GTF files is usually painful because of the "attributes" column, which is a key-value dictionary stored as a string.
Does it make sense to add an option to do a similar expansion on any column with a user-specified regex, e.g. here (bioframe/bioframe/io/fileops.py, line 43 in fbd129c)?
GTF/GFF defaults can be added as examples. Pros:
Cons:
Maybe it's already in UCSC @nvictus and it's simple to re-use? |
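To make the regex idea concrete, here is a minimal sketch of expanding one attributes string into a dictionary with a user-supplied pattern. The function `expand_attributes` and the two default patterns are hypothetical and are not part of bioframe; the patterns assume that the first capture group is the key and the second is the value, and a repeated key keeps only its last value.

```python
import re

# Assumed default for GTF-style attributes: key "value"; key value;
GTF_PATTERN = r'(\w+)\s+"?([^";]*)"?\s*(?:;|$)'
# Assumed default for GFF3-style attributes: key=value;key=value
GFF3_PATTERN = r"(\w+)=([^;]*)"


def expand_attributes(text, pattern=GTF_PATTERN):
    """Parse one attributes string into a {key: value} dict.

    `pattern` must define two capture groups: key first, value second.
    """
    compiled = re.compile(pattern)
    return {m.group(1): m.group(2).strip() for m in compiled.finditer(text)}


gtf_attrs = expand_attributes('gene_id "ENSG1"; gene_name "DDX11L1"; exon_number 2;')
gff3_attrs = expand_attributes("ID=gene1;Name=abc", pattern=GFF3_PATTERN)
print(gtf_attrs)
print(gff3_attrs)
```

On a DataFrame, something like `pd.DataFrame(df["attributes"].map(expand_attributes).tolist(), index=df.index)` would produce one column per key. This is a row-by-row Python loop, so it will likely be slow on large annotation files.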
Btw, there's this which got sandboxed. It's slow, as you might expect: |
Ah, I did not notice that one. The disadvantage of this one is that it's not generalized and cannot be easily customized if there are no spaces in the annotation or there's a mix of quote chars / item separators. I would upvote un-sandboxing it. |
Agreed that there's plenty of room for improvement. I would advocate for now keeping it as a function downstream from |
there's also |
I was unable to parse some public GTF files with it; only a custom solution worked. |
https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=rep&hgta_track=rmsk&hgta_table=rmsk&hgta_doSchema=describe+table+schema
"""bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id"""
bioframe/bioframe/io/schemas.py
Line 171 in ccb8e70
read_table(), since we would often do the following:
Consider adding a set of columns to be dropped, e.g. when they come from a database they sometimes include indexing columns like bin that are not very useful.
Think how to add dtypes to schemas, such as was done in cooltools.cli.pileup: https://github.com/open2c/cooltools/blob/1212cf0757741951a6be15bb7351cf35240493a0/cooltools/cli/pileup.py#L138
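One possible shape for a schema that carries both dtypes and a set of columns to drop is sketched below. The column names are the rmsk ones listed above; the dtypes, the `RMSK_SCHEMA` dictionary, and the helper `read_kwargs_from_schema` are assumptions for illustration and do not reflect how bioframe currently stores schemas. The helper only builds keyword arguments (`names`, `usecols`, `dtype`) that `pandas.read_csv` accepts.

```python
# Column order follows the rmsk table; the dtypes are assumed, not sourced.
RMSK_SCHEMA = {
    "bin": "int64",
    "swScore": "int64",
    "milliDiv": "int64",
    "milliDel": "int64",
    "milliIns": "int64",
    "genoName": "object",
    "genoStart": "int64",
    "genoEnd": "int64",
    "genoLeft": "int64",
    "strand": "object",
    "repName": "object",
    "repClass": "object",
    "repFamily": "object",
    "repStart": "int64",
    "repEnd": "int64",
    "repLeft": "int64",
    "id": "int64",
}


def read_kwargs_from_schema(schema, drop=()):
    """Build read_csv keyword arguments from a {name: dtype} schema.

    All names are passed so file columns line up positionally, while
    `usecols` and `dtype` are restricted to the columns that are kept.
    """
    dropped = set(drop)
    names = list(schema)
    keep = [name for name in names if name not in dropped]
    return {
        "names": names,
        "usecols": keep,
        "dtype": {name: schema[name] for name in keep},
    }


kwargs = read_kwargs_from_schema(RMSK_SCHEMA, drop=["bin"])
# Possible use, assuming a headerless tab-separated dump at a hypothetical path:
# df = pandas.read_csv("rmsk.txt.gz", sep="\t", header=None, **kwargs)
print(kwargs["usecols"])
```

Dropping at read time via `usecols` avoids loading the indexing column at all, which seems preferable to dropping it from the DataFrame afterwards.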