add code-mixed language identifier #77
Thanks a lot @tathagata-raha for your contribution. Your work looks great, I just had a few comments:
Again, thanks for the great work and apologies for the delay in replying. Let me know once you've clarified the above; I'll do a final round of testing and then we should be good to release.
@tathagata-raha
As you can see, it's a reference to your local home directory somewhere (although not in the code). Any ideas how to fix it?
@tathagata-raha Saving:
Loading:
Maybe there is a more elegant way to do it, but at least it works.
I tried it, and transformers works fine with iNLTK's torch dependency as long as I use transformers==3.5.1 rather than a 4+ version.
What?
This PR adds support for identifying code-mixed text and Indian languages written in Roman script. Currently, it can detect Hinglish, Tanglish, and Manglish, as well as Hindi, Tamil, and Malayalam written in Roman script.
Related issues
Solves #76, #54
Why?
This toolkit already supports identifying languages written in their native script. However, if an Indian language is written in Roman script, it predicts 'en' (English), which is why this feature might be helpful. Moreover, the toolkit currently has no way to detect whether text is code-mixed or to identify which code-mixed language it is.
How?
The learner is downloaded through the download_assets.py file and saved in the codemixed folder within models.
codemixed_util.py contains the classes the learner needs to run. These classes must be imported when running the code.
A check_codemixed parameter has been added to the identify_language function. When set to False, the function returns en (English) if the input is in Latin script. When set to True, it executes the identify_codemixed function to detect code-mixed instances in the input.
The feature relies on the transformers library.
Testing?
I have written some unit tests for this functionality. You can check the unit tests and their output in this GitHub Gist. Apart from that, I also ran other tests to make sure that no dependencies break and no other functionality fails.
Example code
Refer to this gist for example code of this functionality.
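For readers without the gist at hand, the check_codemixed switch described under How? can be pictured with a minimal, self-contained sketch. It is a toy stand-in and not iNLTK's actual code: is_latin_script is a hypothetical helper, identify_codemixed is stubbed out instead of calling the real classifier, and both placeholder return values are invented for illustration.

```python
def is_latin_script(text):
    """Crude check: True if every alphabetic character is an ASCII letter."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)


def identify_codemixed(text):
    """Hypothetical placeholder; the PR's real function runs a trained classifier."""
    return "codemixed-placeholder"


def identify_language(text, check_codemixed=False):
    """Toy version of the switch described in this PR, not the library's code."""
    if is_latin_script(text):
        if check_codemixed:
            # Latin-script input is handed to the code-mixed identifier.
            return identify_codemixed(text)
        # Without the flag, Latin-script input is reported as English.
        return "en"
    # Native-script detection is out of scope for this sketch.
    return "native-script-placeholder"


print(identify_language("main ghar ja raha hoon"))
print(identify_language("main ghar ja raha hoon", check_codemixed=True))
```

Keeping the flag False by default, as in this sketch, would leave the toolkit's existing behaviour unchanged for callers who do not opt in.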
Concerns
The classes in codemixed_util.py need to be imported before running the code-mixed identifier. Otherwise it will raise AttributeError.
Anything Else?
For more insight into the dataset creation and classification model training, check this repository
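On the import concern raised under Concerns: the sketch below shows why a missing class can surface as AttributeError at load time. It assumes the learner is restored through Python's pickle machinery, which this PR does not state explicitly. The module name fake_codemixed_util and the class CustomModel are hypothetical stand-ins, not the contents of codemixed_util.py.

```python
import pickle
import sys
import types

# Hypothetical stand-in module; the real codemixed_util.py is not shown in this PR.
MODULE_NAME = "fake_codemixed_util"
util = types.ModuleType(MODULE_NAME)
sys.modules[MODULE_NAME] = util


class CustomModel:
    """Toy class playing the role of a helper class the saved learner refers to."""

    def __init__(self, label):
        self.label = label


# Make pickle record the class as fake_codemixed_util.CustomModel.
CustomModel.__module__ = MODULE_NAME
CustomModel.__qualname__ = "CustomModel"
util.CustomModel = CustomModel

# Pickle stores a reference to the class (module and name), not its code.
payload = pickle.dumps(CustomModel("hi-en"))


def try_load(data):
    """Return the restored label, or the name of the error raised while loading."""
    try:
        return pickle.loads(data).label
    except AttributeError as exc:
        return type(exc).__name__


# Simulate loading before the class is available in its module.
del util.CustomModel
result_without_class = try_load(payload)

# Simulate making the class available first, then loading.
util.CustomModel = CustomModel
result_with_class = try_load(payload)

print(result_without_class, result_with_class)
```

Because pickle resolves classes by name at load time, making the classes resolvable before loading is enough to avoid the error in this toy setup.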