Big Data Research for Hacker Communities

 Cybersecurity Big Data Research for Hacker Communities

NSF SaTC: CORE: Small: Cybersecurity Big Data Research for Hacker Communities: A Topic and Language Modeling Approach: $500,000



  • PI: Dr. Hsinchun Chen, Regents' Professor, ACM/IEEE Fellow, U. of Arizona (UA), AI Lab Director
  • Co-PI: Dr. Weifeng Li, U. of Georgia (UGA)


Cybercrime is estimated to cost the global economy around $6 trillion by 2021, particularly through intellectual property theft and financial fraud using stolen consumer data. Large-scale hacking and data theft occur regularly, and many cyberattacks result in the theft of sensitive personal information or intellectual property. Cybersecurity will remain a critical problem for the foreseeable future, necessitating more research on a large, diverse, covert, and evolving international hacker community. Computer science and social science researchers face non-trivial challenges, however: the technical difficulty of data collection and analytics, the massive volume of data to be collected, the heterogeneity and covert nature of data elements, and the difficulty of comprehending common hacker terms and concepts across regions.

To address these challenges, this project has two research goals:

  1. Advance current capabilities for scalable identification, collection, and analysis of international hacker community content.
  2. Contribute to the cybersecurity community by developing new big data techniques that enable researchers to analyze hacker content and content from other related domains.

The UA’s National Security Agency-designated Center of Academic Excellence in Cyber Defense, Research, and Operations, NSF Scholarship-for-Service (SFS) CyberCorps, and top-ranked Master’s in Cybersecurity programs position the project for excellent synergy with teaching and research. Techniques developed in this project advance not only cyber threat intelligence (CTI) knowledge but also deep transfer learning, deep generative modeling, supervised topic modeling, dynamic topic modeling, neural variational inference, and numerous other important domains. Results from this research will be disseminated through various academic and cybersecurity industry channels, such as undergraduate and graduate curricula, the IEEE Intelligence and Security Informatics conference, the National Cyber-Forensics Training Alliance (NCFTA), the Society for the Policing of Cyberspace (POLCYB), and NSF CyberCorps SFS.