Sakhr offers the broadest Arabic corpora in terms of coverage and size, and continues to enhance and develop Arabic corpora that are essential for statistical analysis, NLP research and new product development. Sakhr’s research centers in Egypt and the US are staffed by some of the leading scientists in Arabic linguistics and engineering. This team has deep experience in corpora generation, and its modeling and development expertise is tapped by institutions worldwide seeking research collaboration.
Examples of Sakhr’s Arabic resources include:
Part-of-Speech Tagged Corpus
A text corpus is a large set of texts used for statistical analysis. For linguistic research, a corpus is often tagged or annotated. Sakhr has built the largest and most accurate part-of-speech (POS) tagged corpus, with over 7 million words. Sakhr’s POS-tagged corpus represents the variety of syntactic, semantic and pragmatic features of Modern Standard Arabic (MSA).
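As a rough illustration of the kind of statistical analysis a POS-tagged corpus supports, the sketch below counts tag frequencies over a tiny hand-made sample. The sample sentences, the tag names (VERB, NOUN, PREP) and the function names are assumptions made for this example; they are not Sakhr’s data, tag set or API.

```python
from collections import Counter

# Hypothetical toy sample: each sentence is a list of (token, tag) pairs.
# The tag names are illustrative and are not Sakhr's actual tag set.
tagged_corpus = [
    [("ذهب", "VERB"), ("الولد", "NOUN"), ("إلى", "PREP"), ("المدرسة", "NOUN")],
    [("قرأ", "VERB"), ("الطالب", "NOUN"), ("الكتاب", "NOUN")],
]


def tag_frequencies(corpus):
    """Count how often each tag occurs across all sentences."""
    return Counter(tag for sentence in corpus for _, tag in sentence)


def relative_frequencies(corpus):
    """Return each tag's share of all tokens in the corpus."""
    counts = tag_frequencies(corpus)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


freqs = tag_frequencies(tagged_corpus)
shares = relative_frequencies(tagged_corpus)
```

On a real corpus of millions of words, the same counting step would be the starting point for estimating tag distributions or training a statistical tagger.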
Sakhr’s full-fledged Arabic lexicon provides complete coverage of MSA and Classical Arabic stems. It features roots, morphological patterns (MPs), parts of speech (POSs), applicability information between stems, roots, POSs and MPs, applicability information between affixes and stems, derivational information for stems, senses for stems, morphosyntactic, lexical, semantic and lexico-syntactic information for stems and senses, syntactic valence patterns for senses, etc.
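To make the relationship between stems, roots, morphological patterns and senses concrete, here is a minimal sketch of how such lexicon entries could be modeled and grouped by root. The LexiconEntry class, its field names, the POS labels and the three sample entries are hypothetical and chosen only for illustration; they do not describe Sakhr’s actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    """Hypothetical entry: one stem with its root, pattern, POS and senses."""
    stem: str
    root: str
    pattern: str  # morphological pattern (MP)
    pos: str      # illustrative POS label, not Sakhr's tag set
    senses: List[str] = field(default_factory=list)


def index_by_root(entries: List[LexiconEntry]) -> Dict[str, List[LexiconEntry]]:
    """Group entries by root so all stems derived from a root are together."""
    by_root = defaultdict(list)
    for entry in entries:
        by_root[entry.root].append(entry)
    return dict(by_root)


# Illustrative sample entries only.
entries = [
    LexiconEntry("كاتب", "كتب", "فاعل", "NOUN", ["writer"]),
    LexiconEntry("مكتوب", "كتب", "مفعول", "NOUN", ["written", "letter"]),
    LexiconEntry("دارس", "درس", "فاعل", "NOUN", ["student", "researcher"]),
]

by_root = index_by_root(entries)
stems_of_ktb = [entry.stem for entry in by_root["كتب"]]
```

Grouping by root is only one possible index; a fuller model would also need the affix and valence information listed above, which this sketch leaves out.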
Arabic World Knowledge
Sakhr is building a database of contemporary Arabic named entities with their English equivalents. Entities are classified according to a tree of 30 subjects, including names of humans, locations, creative productions, organizations, etc., and are tagged with some morphological information. This database powers Sakhr’s Name Transliteration solution, which MENA governmental entities use to manage visas and which financial institutions use in anti-fraud processes.