Tok Pisin #7

Open
evali1 opened this issue Jan 16, 2019 · 1 comment


evali1 commented Jan 16, 2019

Worthwhile project, but the corpus has lots of plain English (both Am & Br/Aus). The source materials probably contain texts in English as well as Tok Pisin, and the former need to be excluded before data is collected. Also consider purging items with a stop in the middle, which appear to be omitted spaces and not bona fide forms. Further, there are many, many proper names, which are presumably not very interesting for the purposes of the endeavor.
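For whoever picks up the clean-up, the three checks above could be sketched roughly as follows. This is only an illustration, not the project's actual pipeline: the function names, the tiny `ENGLISH_WORDS` set, and the sample tokens are all made up, and treating every capitalised token as a proper name is a crude assumption that a reviewer would still need to confirm.

```python
import re

# Hypothetical stand-in; a real run would load a full English lexicon.
ENGLISH_WORDS = {"the", "and", "government", "because"}

def suspect_reason(token, english_words=ENGLISH_WORDS):
    """Return why a token looks like noise, or None if it passes."""
    # A stop between two word characters usually means a missing space.
    if re.search(r"\w\.\w", token):
        return "internal stop"
    # Assumption: a leading capital marks a likely proper name.
    if token[:1].isupper():
        return "proper name?"
    if token.lower() in english_words:
        return "english"
    return None

def split_wordlist(tokens):
    """Separate tokens into (kept, flagged-with-reason) for manual review."""
    kept, flagged = [], []
    for token in tokens:
        reason = suspect_reason(token)
        if reason is None:
            kept.append(token)
        else:
            flagged.append((token, reason))
    return kept, flagged

# Made-up sample input, for illustration only.
kept, flagged = split_wordlist(["bek", "tru.na", "Rabaul", "government", "abrusim"])
```

Flagging rather than deleting keeps the final decision with a human reviewer, which matters because some English-looking forms may be legitimate loans.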

The trickier bit is the spelling variation of words which are actually the same, depending on regionally varying pronunciation as well as varying degrees of influence from English writing; thus e.g. avris and abrus are the same thing ('avoid'; at least one more variant is in there), and the forms with -im at the end are the transitive versions of the same; and I expect that 'bek' and 'beck' are the same too.
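The -im relationship at least is mechanical enough to check automatically. A minimal sketch, assuming the word list is a plain list of lowercase strings; `pair_transitives` is a hypothetical helper and the sample words are illustrative. It does not attempt to merge regional spelling variants such as avris and abrus, which would still need someone who knows the dialects.

```python
def pair_transitives(words):
    """Map each base form to its -im form when both occur in the list."""
    vocab = set(words)
    pairs = {}
    for word in sorted(vocab):
        # Skip very short words so that e.g. "tim" is not split into "t" + "im".
        if word.endswith("im") and len(word) > 3:
            base = word[:-2]
            if base in vocab:
                pairs[base] = word
    return pairs

# Illustrative input only.
pairs = pair_transitives(["abrus", "abrusim", "bek", "bekim", "tim"])
```

Only pairs where both forms are attested get linked, so a word that merely happens to end in -im is left alone.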

I am not aware enough of the styles of all the regions to do a clean-up of the corpus but I wanted to point out the problems that I was able to spot.

Regards,

Eva (Swedish but fluent speaker of the New Ireland dialect after some three years in a village there)

Contributor

hugolpz commented Mar 22, 2021

Hello @evali1,

Following your ticket, I created a curation process that should let a volunteer review the list and exclude words quite quickly. I expect 1,000 to 2,000 words can be reviewed per hour, so the non-native words in these items can be excluded.

I am looking for a first user to try the process.
Please follow this link. This project is derived from and partners with UNILEX, which provides the raw data.
