UTF-8 column names written as ascii #63
I downloaded one of the shapefiles from that website, and it looks like the DBF file holding the column names is encoded in BIG5, not UTF-8. DBFTables.jl assumes UTF-8: it converts the raw bytes to a String, which Julia treats as UTF-8. If the same bytes are decoded as BIG5, however, the result looks fine:

    using StringEncodings
    bytes = UInt8[0xb9, 0xcf, 0xa6, 0x57, 0xa5, 0x4e, 0xbd, 0x58]
    decode(bytes, "BIG5") # -> "圖名代碼"

This package does not support it, but often a .cpg file is added next to the shapefile that specifies the encoding.
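As a rough sketch of what .cpg support could look like, the helper below reads the encoding name from a sidecar file that shares the DBF file's base name. The function name `cpg_encoding` is hypothetical and not part of DBFTables.jl, and the sketch assumes a lowercase `.cpg` extension and a file that contains nothing but the encoding name.

```julia
# Hypothetical helper, not part of DBFTables.jl: return the encoding name
# stored in the .cpg sidecar next to a .dbf file, or `nothing` if there is none.
function cpg_encoding(dbf_path::AbstractString)
    # Assumption: the sidecar has the same base name and a lowercase extension.
    cpg_path = first(splitext(dbf_path)) * ".cpg"
    isfile(cpg_path) || return nothing
    name = strip(read(cpg_path, String))
    return isempty(name) ? nothing : String(name)
end

# Usage with a throwaway sidecar file:
dir = mktempdir()
write(joinpath(dir, "demo.cpg"), "BIG5\n")
encoding = cpg_encoding(joinpath(dir, "demo.dbf"))
```

The returned name could then be handed to `decode` from StringEncodings, as in the snippet above. The contents of a .cpg file are not guaranteed to be a name that the decoder accepts, so a real implementation would probably need a fallback.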
I'm impressed that GDAL (used by GeoDataFrames and QGIS) seems to guess the encoding correctly here. If I export the file from QGIS, it encodes it in UTF-8 and writes a .cpg file alongside it.

From what I have read, it seems there is a Language Driver ID in some DBF headers that can be used for this as well. I suppose that is how GDAL figured out the encoding. DBFTables could take a dependency on StringEncodings and add support for both CPG files and Language Driver IDs, though that would require some effort.

Probably your best bet for now is to just use GeoDataFrames.jl. GDAL in general is better at reading a wide variety of shapefiles than this package. Another option is to pull this file through GDAL's |
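To illustrate the Language Driver ID idea mentioned in this comment, here is a minimal sketch that reads the ID from byte offset 29 (0-based) of the DBF header and looks it up in a small table. The helper `ldid_encoding` and the `LDID_CODEPAGES` table are hypothetical, the table covers only a handful of IDs, and the mapping should be checked against a full reference before being relied on.

```julia
# Assumed subset of the Language Driver ID -> code page mapping.
const LDID_CODEPAGES = Dict{UInt8,String}(
    0x01 => "CP437",
    0x02 => "CP850",
    0x03 => "CP1252",
    0x57 => "CP1252",
    0x78 => "CP950",   # Traditional Chinese (Taiwan), closely related to BIG5
    0x7a => "CP936",   # Simplified Chinese
)

# Hypothetical helper: read the Language Driver ID from the DBF header and
# return the matching code page name, or `nothing` if the ID is unknown.
function ldid_encoding(io::IO)
    seek(io, 29)              # 0-based offset of the Language Driver ID byte
    ldid = read(io, UInt8)
    return get(LDID_CODEPAGES, ldid, nothing)
end

# Usage with a fake 32-byte header whose Language Driver ID is 0x78:
header = zeros(UInt8, 32)
header[30] = 0x78             # Julia arrays are 1-based, so offset 29 is index 30
codepage = ldid_encoding(IOBuffer(header))
```

Many files leave this byte at zero, which is presumably why a .cpg file is the more common way to declare the encoding.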
I tried to open ESRI shapefiles offered by Taiwan's open data website (https://data.gov.tw/en/datasets/all), for example a smaller file for Taichung City. Column names that are saved as UTF-8 characters and display correctly in QGIS (my version is 3.20) are shown as some sort of 8-bit ASCII(?) characters when I open the files with GeoDataFrames.jl.
圖名代碼 -> \xb9ϦW\xa5N\xbdX
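The garbled name can be reproduced in plain Julia by treating the BIG5 bytes of the column name as if they were UTF-8. This is only an illustration of the symptom using the byte values quoted in the reply, not code taken from either package.

```julia
# BIG5 bytes of the column name, as quoted in the maintainer's reply.
bytes = UInt8[0xb9, 0xcf, 0xa6, 0x57, 0xa5, 0x4e, 0xbd, 0x58]

# String(...) does not transcode; it reinterprets the bytes as UTF-8.
# `copy` is used because String takes ownership of the vector it is given.
s = String(copy(bytes))

# The result is not valid UTF-8, which is why escapes such as \xb9 show up
# when the string is printed.
valid = isvalid(s)
```

Bytes that happen to form valid UTF-8 sequences or plain ASCII (such as 0x57, `W`) are shown as characters, while the rest are printed as `\x` escapes.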