

Investigating the relevance of Arabic text classification datasets based on supervised learning
Authors:Ahmad Hussein Ababneh
Institution:Computer Science Department, American University of Madaba, Madaba, 2882, Jordan
Abstract:Training and testing models for text classification depend mainly on pre-classified text document datasets. Recently, seven datasets have emerged for Arabic text classification: the Single-Label Arabic News Articles Dataset (SANAD), Khaleej, Arabiya, Akhbarona, KALIMAT, Waten2004, and Khaleej2004. This study investigates which of these datasets can provide significant training and fair evaluation for text classification. The investigation uses well-known and accurate learning models: naive Bayes, random forest, K-nearest neighbor, support vector machine, and logistic regression. We present relevance and time measures for training the models on these datasets so that Arabic language researchers can select an appropriate dataset on a solid basis of comparison. The performance of the five learning models across the seven datasets is measured and compared with the performance of the same models trained on a well-known English language dataset. The analysis of the relevance and time scores shows that training the support vector machine model on Khaleej and Arabiya obtained the most significant results in the shortest amount of time, with an accuracy of 82%.
Keywords:K-nearest neighbor (KNN); Logistic regression (LR); Machine learning (ML); Naive Bayes (NB); Random forest (RF); Support vector machine (SVM); Text classification (TC)
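
The abstract reports two kinds of measurement per model and dataset: a relevance score (accuracy is the one quoted) and training time. A minimal sketch of how such a pair might be collected is shown below. It is not the author's code: `TinyNaiveBayes`, `evaluate`, and the toy English corpus are hypothetical names and placeholder data introduced purely for illustration, and a simple multinomial naive Bayes with add-one smoothing stands in for the five models named in the abstract.

```python
import math
import time
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, docs):
        return [self._predict_one(doc) for doc in docs]

    def _predict_one(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in doc.split():
                if word in self.vocab:  # ignore words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def evaluate(model, train, test):
    """Return (accuracy, training_seconds) for one model on one split."""
    train_docs, train_labels = zip(*train)
    test_docs, test_labels = zip(*test)
    start = time.perf_counter()
    model.fit(train_docs, train_labels)
    train_seconds = time.perf_counter() - start  # time only the fit step
    predictions = model.predict(test_docs)
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return correct / len(test_labels), train_seconds


# Toy placeholder corpus (NOT one of the datasets named in the abstract).
TRAIN = [
    ("team wins match goal", "sports"),
    ("player scores goal league", "sports"),
    ("market stocks bank profit", "finance"),
    ("bank raises interest rates", "finance"),
]
TEST = [
    ("league match player", "sports"),
    ("stocks profit rates", "finance"),
]

accuracy, seconds = evaluate(TinyNaiveBayes(), TRAIN, TEST)
print(f"accuracy={accuracy:.2f} train_time={seconds:.6f}s")
```

Under these assumptions, any object exposing `fit` and `predict` could be passed to `evaluate`, so the same loop could in principle be repeated for each model and dataset pair to build a comparison table of the kind the abstract describes.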
Journal:Journal of Electronic Science and Technology (English edition); indexed in ScienceDirect and other databases.