Stylometric Online Authorship Identification: An Exploratory Study
Online communication media such as email, web sites, newsgroups, online forums, and chat rooms have become ubiquitous in our everyday lives. Unfortunately, online channels are also being misused to distribute unsolicited and inappropriate information (e.g., extremist propaganda, online pornography, online gambling). The anonymous nature of these channels makes them an ideal means of communication for criminal groups and extremist organizations. Additionally, the evolution of the Internet into a major international communication medium has added a multilingual dimension.
Authorship analysis has traditionally been applied to long, carefully written English texts, such as the plays of Shakespeare (authorship identification) or students' class papers (plagiarism detection). Few past studies have addressed the multilingual issues of online communications. The language-specific stylistic characteristics and the informal nature of online communications present unique research challenges. To address these challenges, we aim to develop a comprehensive framework and associated text mining techniques for multilingual online stylometric feature extraction and authorship classification.
We plan to focus this exploratory study on two languages, English and Arabic. The linguistic differences between these two languages will allow us to evaluate common stylistic representations and explore other language-specific problems. We plan to develop comprehensive English and Arabic lexical, syntactic, structural, and content-based features that are suited for identifying online writing styles. We propose to evaluate these features using several large-scale public extremist forums (in English and Arabic) collected from the Web.
We also plan to develop a scalable feature reduction technique, based on principal component analysis, for authorship classification. Previous authorship analysis research was able to analyze only a limited number of authors (typically 5-20). We aim to develop scalable online authorship analysis techniques that can handle hundreds to thousands of anonymous authors (a common scenario for web communications). Feature (subset) selection techniques will be developed to help reduce the high dimensionality of online writing features. Lab experiments will be conducted to verify the classification accuracy and scalability (speed and efficiency) of our approach.
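As a rough illustration of the kind of lexical and structural style features described above, the following Python sketch computes a handful of them for a single message. It is not the project's actual feature set: the function name `extract_features`, the feature names, and the tiny English function-word list are all hypothetical, chosen only for this example. Syntactic and content-based features, Arabic-specific handling, and the feature reduction step are left out.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny English function-word list (illustration only).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "is", "that")


def extract_features(message: str) -> dict:
    """Return a small dictionary of lexical and structural style features."""
    # Runs of letters only (digits and underscores excluded), lowercased.
    words = re.findall(r"[^\W\d_]+", message.lower())
    counts = Counter(words)
    n_chars = len(message)
    n_words = len(words)
    lines = message.splitlines()

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against empty messages.
        return numerator / denominator if denominator else 0.0

    features = {
        # Lexical, character-level
        "char_count": n_chars,
        "digit_ratio": ratio(sum(c.isdigit() for c in message), n_chars),
        "uppercase_ratio": ratio(sum(c.isupper() for c in message), n_chars),
        "punctuation_ratio": ratio(
            sum(c in string.punctuation for c in message), n_chars
        ),
        # Lexical, word-level
        "word_count": n_words,
        "avg_word_length": ratio(sum(len(w) for w in words), n_words),
        "vocabulary_richness": ratio(len(counts), n_words),
        # Structural
        "line_count": len(lines),
        "blank_line_count": sum(1 for ln in lines if not ln.strip()),
    }
    # Relative frequency of each function word in the message.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = ratio(counts[fw], n_words)
    return features


if __name__ == "__main__":
    sample = "The cat sat.\n\nThe dog ran to the park in 2007!"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value}")
```

Each message becomes one fixed-length feature vector; with a realistic feature set these vectors become very high-dimensional, which is the motivation for the feature reduction and selection steps mentioned above.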
We believe that our unique combination of comprehensive multilingual online stylistic features and scalable feature classification techniques, although high-risk, can provide a potentially high-payoff solution to the challenging problem of multilingual online stylometric authorship identification. Online Arabic authorship analysis in particular is extremely difficult and high-risk. Upon successful completion of this SGER project, we anticipate a strong foundation for our future "cyber trust" research. The findings can also provide important insights to several computational and social sciences communities.
The primary intellectual contribution of our research is threefold: (a) develop and examine new text mining techniques that may be suitable for identity tracing in cyberspace; (b) create new representations of people's identities using online "Writeprints" (i.e., the representation of people's key online writing style features); and (c) evaluate the effectiveness of different multilingual stylistic features and classification techniques for improving identification scalability and robustness.
The broader impact of this research includes: (a) creating a new representation of people's identities for classification of cyber criminals and potential extremists in online communities; (b) improving intelligence and law enforcement agencies' abilities to detect, prevent, and respond to cyber crimes and terrorist events via the Internet; and (c) providing a large-scale research corpus and feature extraction resources for information scientists, political scientists, and terrorism researchers.
Funded by the National Science Foundation under award number 0646942, "SGER: Multilingual Online Stylometric Authorship Identification: An Exploratory Study." See the NSF Award abstract.
Dr. Jay Nunamaker
Dr. Hsinchun Chen
V. A. Benjamin, W. Chung, A. Abbasi, J. Chuang, C. A. Larson, and H. Chen, "Evaluating text visualization: An experiment in authorship analysis," ISI 2013: 16-20, Proceedings of 2013 IEEE International Conference on Intelligence and Security Informatics, Seattle, Washington, June 2013.
Ahmed Abbasi and Hsinchun Chen, "Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace," ACM Transactions on Information Systems (ACM TOIS), 26:2 (March 2008), 29 pgs.
Ahmed Abbasi and Hsinchun Chen, "CyberGate: A System and Design Framework for Text Analysis of Computer Mediated Communication." MIS Quarterly (MISQ), 32:4 (December 2008, Special Issue on Design Science Research), pgs. 811-837.
Ahmed Abbasi, Hsinchun Chen, and Jay Nunamaker. "Stylometric Identification in Electronic Markets: Scalability and Robustness." Journal of Management Information Systems (JMIS), 25:1 (Summer 2008), pgs. 49-78.
Abbasi, A., and Chen, H. (2007). "Categorization and analysis of text in computer mediated communication archives using visualization," in Ray Larson, Edie Rasmussen, Shigeo Sugimoto and Elaine Toms, eds., Proceedings of the 2007 Joint Conference on Digital Libraries (JCDL), Vancouver, BC, Canada, June 18-23, 2007, p. 11-18.
Fu, T.; Abbasi, A.; and Chen, H. (2007). "Interaction coherence analysis for Dark Web forums," in Gheorghe Muresan, Tayfur Altiok, Benjamin Melamed, and Daniel Zeng, eds., Proceedings of the 2007 IEEE Intelligence and Security Informatics Conference, New Brunswick, NJ, May 23-24, p. 342-349.