What Is Corpus Linguistics?

Only a few decades ago, scientists could do no more than dream of automating research on language. The work was done by hand and drew in large numbers of students, the risk of careless errors was high, and, most importantly, all of it took a very long time.

With the development of computer technology it became possible to carry out research orders of magnitude faster, and today one of the most promising directions in the study of language is corpus linguistics. Its main feature is the use of a large amount of textual information that is stored as a single collection, marked up in a special way, and called a corpus.

To date, many corpora have been created for different purposes and on the basis of varied linguistic material, ranging in size from millions to billions of lexical units. The field is recognized as promising and shows steady progress in both applied and research tasks. Specialists who deal with natural language in one way or another are advised to become familiar with text corpora at least at a basic level.

The history of corpus linguistics

The emergence of the field is associated with the creation of the Brown Corpus in the United States in the early 1960s. The collection contained texts totaling 1 million word forms, and today a corpus of that size would be entirely uncompetitive. This is due to the pace at which computer technology has developed and to the growing demand for new research resources.

In the 1990s corpus linguistics became a fully fledged, independent discipline, and text collections were compiled and annotated for many languages. The British National Corpus, with 100 million tokens, was created in this period, for example.

As this branch of linguistics developed, text volumes kept growing (now reaching billions of lexical units) and annotation became more varied. Today the Internet offers corpora of spoken language, multilingual corpora, corpora of fiction or academic writing, and many other types.

Types of corpora

Corpora can be classified on several grounds. Intuitively, the basis for classification can be the language of the texts (Russian, German), the access mode (open, closed, commercial), or the genre of the source material (fiction, documentary, academic, journalism).

The way spoken-language material is collected is particularly interesting. Deliberately recording such speech puts respondents in an artificial setting, and the resulting material cannot be called spontaneous, so modern corpus linguistics has taken a different route. A volunteer is fitted with a microphone, and every conversation he or she takes part in during the day is recorded. The people around the volunteer may, of course, be unaware that their everyday talk is contributing to the progress of science.

The recordings are then stored in a database together with a typed transcript. This makes it possible to produce the annotation needed to build a corpus of everyday spoken language.

Applications

Wherever language is used, text corpora can probably be used as well. Ways of applying this branch of linguistics include:

  • Building sentiment analysis systems, which are widely used in politics and business to track the positive and negative reactions of voters and customers, respectively.
  • Connecting corpus data to dictionaries and translation systems in order to improve their performance.
  • A variety of research tasks that contribute to understanding the structure of a language, the history of its development, and the forecasting of changes in the near future.
  • Developing information retrieval systems based on morphological, syntactic, semantic, and other features.
  • Improving the performance of various linguistic systems, and others.

Using corpora

The interface resembles that of an ordinary search engine and prompts the user to enter a word or a combination of words to search the database. In addition to an exact query, an advanced version is available that makes it possible to find textual information by almost any linguistic criterion.

The search criteria can be:

  • membership in a particular part-of-speech group;
  • grammatical features;
  • semantics;
  • stylistic and emotional coloring.

Searches can also be combined over a sequence of words: for example, to find every occurrence of a verb in the present tense, first person singular, followed by the word "e" and a noun in the accusative case. Solving such a task takes the user a few seconds and only a few mouse clicks in the designated fields.
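A combined query of this kind can be sketched in Python over a toy annotated sentence. The `Token` record, the tag names, and the transliterated sample data are assumptions made for illustration, not the format of any real corpus; the middle slot is modeled simply as "any preposition".

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                   # surface word form
    pos: str                                    # part-of-speech tag
    feats: dict = field(default_factory=dict)   # grammatical features


def matches(token, pattern):
    """Return True if the token satisfies every constraint in the pattern."""
    for key, wanted in pattern.items():
        if key == "form":
            actual = token.form.lower()
        elif key == "pos":
            actual = token.pos
        else:
            actual = token.feats.get(key)
        if actual != wanted:
            return False
    return True


def find_sequences(tokens, patterns):
    """Return start indices where consecutive tokens match the patterns in order."""
    hits = []
    for start in range(len(tokens) - len(patterns) + 1):
        window = tokens[start:start + len(patterns)]
        if all(matches(t, p) for t, p in zip(window, patterns)):
            hits.append(start)
    return hits


# Hypothetical, hand-annotated sample (transliterated Russian).
SAMPLE = [
    Token("ya", "PRON", {"Person": "1", "Number": "Sing"}),
    Token("idu", "VERB", {"Tense": "Pres", "Person": "1", "Number": "Sing"}),
    Token("v", "ADP"),
    Token("shkolu", "NOUN", {"Case": "Acc", "Number": "Sing"}),
    Token("on", "PRON", {"Person": "3", "Number": "Sing"}),
    Token("idyot", "VERB", {"Tense": "Pres", "Person": "3", "Number": "Sing"}),
    Token("v", "ADP"),
    Token("shkolu", "NOUN", {"Case": "Acc", "Number": "Sing"}),
]

# Verb (present, 1st person singular) + preposition + noun in the accusative.
QUERY = [
    {"pos": "VERB", "Tense": "Pres", "Person": "1", "Number": "Sing"},
    {"pos": "ADP"},
    {"pos": "NOUN", "Case": "Acc"},
]

print(find_sequences(SAMPLE, QUERY))
```

Only the first clause matches here, because the second verb is in the third person. Real corpus interfaces run such queries against indexes rather than a linear scan, so this shows only the logic of the match.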

The creation process

The search itself can be run over the entire corpus or over a specially selected subcorpus, depending on what a particular goal requires. Building a corpus involves the following steps:

  1. The first step is to decide which texts will form the basis of the corpus. For practical purposes, journalism, news reports, and Internet comments are used most often. Research projects draw on a wide range of text types, but the texts must be selected according to some common principle.
  2. The resulting collection of texts is preprocessed: errors, if any, are corrected, and a bibliographic and extralinguistic description of each text is prepared.
  3. All non-textual information is removed: graphics, images, tables.
  4. The text is tokenized, that is, split into tokens, which are usually words.
  5. Finally, morphological, syntactic, and other markup is applied to the resulting set of elements.

The result of all these operations is a syntactic structure with a set of elements distributed within it, each of which is assigned a part of speech, grammatical features, and, in some cases, semantic annotations.
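Steps 3 to 5 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: non-textual residue is approximated as anything in angle brackets, the tokenizer is a simple regular expression, and `TOY_LEXICON` is a hypothetical three-word dictionary standing in for a real morphological analyzer.

```python
import re

# Hypothetical toy lexicon: word form -> (lemma, part of speech).
TOY_LEXICON = {
    "the": ("the", "DET"),
    "cats": ("cat", "NOUN"),
    "sleep": ("sleep", "VERB"),
}


def strip_non_text(raw):
    """Step 3: drop markup-like residue (here: anything in angle brackets)."""
    return re.sub(r"<[^>]*>", " ", raw)


def tokenize(text):
    """Step 4: split into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(tokens):
    """Step 5: attach lemma and part of speech; unknown forms
    (including punctuation) get the placeholder tag 'X'."""
    annotated = []
    for tok in tokens:
        lemma, pos = TOY_LEXICON.get(tok.lower(), (tok.lower(), "X"))
        annotated.append({"form": tok, "lemma": lemma, "pos": pos})
    return annotated


def build_corpus_entry(raw, meta):
    """Run steps 3-5 and keep the step-2 description next to the tokens."""
    return {"meta": meta, "tokens": annotate(tokenize(strip_non_text(raw)))}


entry = build_corpus_entry("<p>The cats sleep.</p>", {"genre": "fiction"})
for token in entry["tokens"]:
    print(token["form"], token["lemma"], token["pos"])
```

Keeping the metadata in the same record as the tokens is what later allows a search to be restricted to a subcorpus, for example to fiction only.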

Difficulties in creating corpora

It is important to understand that gathering a set of words or sentences is not enough to make a corpus. On the one hand, the collection of texts must be balanced, that is, it must represent different types of texts in certain proportions. On the other hand, the content must be marked up in a special way.

The first problem is solved by agreement: for example, the collection includes 60% fiction, 20% documentary texts, and certain percentages are given to written representations of spoken language, legal texts, scientific works, and so on. No ideal recipe for a balanced corpus exists today.
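The idea of balancing by agreed proportions can be sketched as follows. The 60% and 20% shares come from the example above; lumping the remaining 20% into a single "other" category, the function names, and the tolerance value are assumptions for illustration.

```python
def token_budget(total_tokens, shares_pct):
    """Split a target corpus size into per-genre token budgets.

    shares_pct maps genre -> whole-number percentage; values must sum to 100.
    """
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {genre: total_tokens * pct // 100 for genre, pct in shares_pct.items()}


def is_balanced(token_counts, shares_pct, tolerance_pct=2.0):
    """Check that each genre's actual share is within the tolerance of its target."""
    total = sum(token_counts.values())
    for genre, target in shares_pct.items():
        actual = 100.0 * token_counts.get(genre, 0) / total
        if abs(actual - target) > tolerance_pct:
            return False
    return True


# 60% fiction and 20% documentary as in the text; "other" is an assumed bucket.
SHARES = {"fiction": 60, "documentary": 20, "other": 20}

print(token_budget(1_000_000, SHARES))
print(is_balanced({"fiction": 590, "documentary": 210, "other": 200}, SHARES))
```

A check like `is_balanced` is useful while texts are still being collected, since it shows which genres are over- or under-represented relative to the agreed proportions.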

The second issue, which concerns the markup of the content, is harder to resolve. There are special programs and algorithms for marking up texts automatically, but they do not give perfect results, can introduce errors, and require manual correction. The possibilities and difficulties of solving this problem are described in detail in V. P. Zakharov's work on corpus linguistics.

Text markup is carried out at several levels, which we list below.

Morphological markup

From school we remember that Russian has different parts of speech and that each has its own categories. A verb, for example, has the categories of mood and tense, which a noun lacks. A native speaker declines nouns and conjugates verbs without hesitation, but marking up a corpus of 100 million tokens by hand is not feasible. A computer can perform all the necessary operations, but it first has to be taught.

For morphological markup, the computer has to "understand" each word as a particular part of speech with particular grammatical features. Since Russian (like any other language) follows many regular rules, an automatic morphological analysis procedure can be built by loading a set of algorithms into the machine. There are, however, exceptions and various complicating factors. As a result, purely automatic analysis is still far from ideal: even a 4% error rate means 4 million words in a corpus of 100 million units that require manual correction.

The problem is described in detail in V. P. Zakharov's book "Corpus Linguistics".

Syntactic markup

Syntactic markup, or parsing, is the procedure that determines the relations between words in a sentence. Using a set of algorithms, it is possible to identify the subject, predicate, and objects in a text and to distinguish various constructions. Once we know which words in a sequence are main and which are dependent, we can extract information from the text efficiently and so teach the machine to return, in response to a search query, only the information that interests us.
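The relation between main and dependent words can be pictured as a list of head-dependent links. The sketch below does not parse anything itself: it assumes a hand-annotated English sentence whose relation labels follow the Universal Dependencies naming style, and shows how the core roles are read off such markup.

```python
# Each token: (index, form, head index, relation); head 0 marks the root.
# The annotation is written by hand for illustration.
PARSE = [
    (1, "The", 2, "det"),
    (2, "student", 3, "nsubj"),
    (3, "reads", 0, "root"),
    (4, "a", 5, "det"),
    (5, "book", 3, "obj"),
]


def core_roles(parse):
    """Return the predicate (root) with its subject and object, if present."""
    roles = {"predicate": None, "subject": None, "object": None}
    root_idx = None
    # First pass: find the root, which serves as the predicate here.
    for idx, form, head, rel in parse:
        if head == 0:
            root_idx = idx
            roles["predicate"] = form
    # Second pass: collect direct dependents of the root by relation.
    for idx, form, head, rel in parse:
        if head == root_idx and rel == "nsubj":
            roles["subject"] = form
        elif head == root_idx and rel == "obj":
            roles["object"] = form
    return roles


print(core_roles(PARSE))
```

With the roles extracted, a query such as "who reads what" reduces to a lookup in this structure instead of a scan of the raw text.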

Incidentally, modern search engines use this to return specific numbers rather than lengthy texts in response to queries such as "how many calories in an apple" or "distance from Moscow to Petersburg." To understand even the basics of the process described, however, one needs to consult "Introduction to Corpus Linguistics" or other introductory textbooks.

Semantic markup

The semantics of a word is, in simple terms, its meaning. A widely used approach to semantic analysis is to assign semantic tags to a word, indicating that it belongs to a set of semantic categories and subcategories. Such information is valuable for improving algorithms for sentiment analysis, automatic summarization, and other tasks that rely on corpus linguistics.

The tree has several "roots", which are abstract words with very broad semantics. As the tree branches, nodes are formed that contain increasingly specific lexical items. For example, the word "creature" can be linked to concepts such as "human" and "animal". The first goes on to branch into various professions, kinship terms, and nationalities, and the second into classes and species of animals.
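Such a tree can be modeled as a mapping from each item to its parent category. The "creature", "human", and "animal" nodes follow the example above; the lower-level entries ("teacher", "mother", "mammal", "dog") and the function names are assumptions added for illustration.

```python
# Child -> parent links of a toy semantic tree; "creature" is a root.
PARENT = {
    "human": "creature",
    "animal": "creature",
    "profession": "human",
    "kinship term": "human",
    "teacher": "profession",
    "mother": "kinship term",
    "mammal": "animal",
    "dog": "mammal",
}


def semantic_path(word):
    """Return the chain of categories from the word up to its root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_a(word, category):
    """True if the category is the word itself or lies on its path to the root."""
    return category in semantic_path(word)


print(semantic_path("dog"))
print(is_a("teacher", "human"), is_a("dog", "human"))
```

The tags on a corpus token would then be the categories on its path, which lets a search for "animal" also return every occurrence of "dog".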

Use in information retrieval systems

Corpus linguistics is applied across a wide range of activities. Corpora are used to compile and revise dictionaries and to build systems for automatic translation, summarization, fact extraction, sentiment detection, and other kinds of text processing.

In addition, such resources are actively used to study the world's languages and the mechanisms by which language works in general. Access to large volumes of pre-prepared data makes it possible to study, quickly and thoroughly, trends in language development, the formation of neologisms and set expressions, changes in the meaning of lexical units, and so on.

Since working with such large volumes of data requires automation, computational linguistics and corpus linguistics are now closely linked.

Russian National Corpus

This corpus (abbreviated NKRYA) includes many subcorpora, which allows the resource to be used for a variety of tasks.

The materials in the NKRYA database are divided into:

  • media publications of the 1990s and 2000s, both domestic and foreign;
  • recordings of spoken language;
  • accentologically annotated texts (that is, texts with stress marks);
  • dialect speech;
  • poetry;
  • materials with syntactic markup, and others.

The system also includes subcorpora with parallel translations of works from Russian into English, German, French, and many other languages (and vice versa).

The database also has a section of historical texts representing written Russian of various periods. There is an educational corpus as well, which can be useful to foreigners learning Russian.

The Russian National Corpus comprises 400 million lexical units and in many respects is ahead of most corpora of European languages.

Prospects

The existence of corpus linguistics laboratories at Russian and foreign universities speaks in favor of recognizing the field as promising. Applied and research work on these information retrieval resources is tied to the development of certain areas of high technology, such as the question-answering systems discussed above.

Corpus linguistics is expected to develop at every level: from the technical, through the introduction of new algorithms that optimize search and information processing, more powerful computers, and more RAM, to the consumer level, as users find more and more ways to use this kind of resource in their everyday life and work.

In conclusion

In the middle of the last century, 2017 seemed a distant future in which spaceships would cross the universe and robots would do all the work for people. In reality, science is still full of "blank spots" and is making strenuous efforts to answer questions that have troubled humanity for centuries. Questions about how language works hold a place of honor here, and corpus and computational linguistics can help us answer them.

Processing large data sets makes it possible to detect patterns that were previously out of reach, to predict how particular features of a language will develop, and to track word formation in near real time.

On a practical, global level, corpora can be seen, for example, as a potential tool for assessing public sentiment: the Internet is a base, updated daily, of diverse texts produced by real users, including comments, reviews, articles, and many other forms of speech.

In addition, work on corpora contributes to the development of the same technologies that are involved in information retrieval, familiar to us from the "Google" and "Yandex" services, in machine translation, and in electronic dictionaries.

We can say with confidence that corpus linguistics is only taking its first steps and will flourish in the near future.

Copyright © 2018 zu.birmiss.com. Theme powered by WordPress.