PI: Dr. Hsinchun Chen, Artificial Intelligence Lab, The University of Arizona.
This project is intended to create a large archive, known as the Dark Web archive, and a research infrastructure for use by computer and information scientists as well as social scientists studying a wide range of computational problems and social and organizational phenomena. The archive will ultimately comprise test bed data containing thousands of multilingual websites, including millions of web pages and thousands of multimedia files, by U.S. domestic, Middle Eastern, and Latin American terrorist and extremist groups. A methodology and spidering (collection building) tools for time-based automated capture of terrorist groups’ websites and multimedia resources will be extended and enhanced from previous work through this project; this approach will then support monthly updates of the entire collection. In addition, the infrastructure will include tools supporting search, browse, and analysis capabilities.
Researchers all over the world in a variety of disciplines are working on developing the means for understanding extremist groups, terrorism and terrorists: their effects on the world; how they communicate, organize and propagate themselves; how they are funded; who they connect with and why, etc. As a prototype, the data in the Dark Web has been much sought after by numerous researchers working on these problems: not only social scientists and analysts struggling to understand the phenomenon of terrorism, but also computer and information scientists who work in knowledge discovery and dissemination (KDD), in data and text mining including entity extraction, and in many other fields of endeavor. However, as a prototype, the Dark Web archive has not been accessible or usable except by those few able and willing to build their own interfaces and tools; it is not readily updated, as the spidering process still needs enhancements; and it does not support analysis. At various stages throughout the project, input and evaluation will be sought from the community to be served, including computer and information science (CIS) researchers, social scientists, terrorism researchers and analysts, and others. Dissemination and distribution will also be an important component: existing conferences, workshops, and other venues will be leveraged to ensure that knowledge about the availability of the Dark Web archive and infrastructure is widely disseminated.
Intellectual merit: CIS researchers will be able to utilize the Dark Web archive for a wide range of exercises: to develop video and voice recognition technologies, advance information retrieval techniques whether mono- or multi-lingual, and improve methodologies in data and text mining as well as machine learning and artificial intelligence. Social scientists will be able to use the archive to study dynamic “dark” networks and the linkages or relationships between organizations, verify hypotheses about the use of the web by extremist/terrorist groups, and study the inter-relationship of culture, religion and politics. The Dark Web archive will support the comparison of current and historical data, minimize manual analysis by researchers in the social sciences, and enable the replication of experiments by researchers.
Broader impacts: In addition to supporting researchers in information, computer and social sciences, this project will also have some utility for the national security sector, including law enforcement and the intelligence community, although that is not its primary purpose. The letters of support accompanying this proposal amply demonstrate the breadth and depth of the proposed work, and its potential impact on researchers in both computational and social sciences.
Submitted to the National Science Foundation under proposal number 0709338. See the NSF project page.
Acknowledgement and Disclaimer: This work has been supported in part by the National Science Foundation under Grant Number #CNS-0709338. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
In addition to continuing to add data to keep the portal up to date (something that our users indicated was very important), we have been working on improving the searching, browsing, and translation capabilities of the portal. As outlined in our initial proposal, we have conducted periodic evaluations through user testing, and funneled evaluation results to our system development efforts. Users rated our searching, browsing and translation capabilities as important functions, but found them somewhat 'clunky' and not as easy to use as they would like. Specifically, the issues have included inconsistency in display, lack of sophisticated search functionality (such as Boolean searching), and very slow translation.
To address these major issues, the portal was rebuilt and released as beta version 2.0; it adopted a new design for the user interface based on user feedback. With the new design, the system functionalities were grouped into three categories: Searching, Browsing, and Social Network analysis. The consolidated searching/browsing made navigating the forums much easier from a user perspective, and the introduction of filters, placed conveniently on the results screens, made it even easier for users to 'zero in' on very specific forums, messages, or threads of interest.
However, while the searching and browsing functions were improved, later testing found that the actual information retrieval was still unsatisfactory in the following respects:
- Query parsing: while version 2.0 added some Boolean searching capability, it did not support complex, sophisticated queries.
- Search ranking: the search ranking was problematic when multiple keywords with the 'OR' relationship were entered by users, thus weakening the accuracy of the matches.
- Hit highlighting: matched keywords were not always correctly highlighted; some highlighted words did not match the input search terms.
For the past project period, we have investigated fixes to these important issues. We have chosen an implementation of Lucene to address these major concerns and improve the accuracy of the information delivered to users. We have recently finished an initial implementation of Lucene. However, it has caused some instability in the portal, so further work and testing are needed before it can be integrated into the portal.
In our original proposal, we also outlined possible analysis tools to include with the portal, such as: multilingual text processing and translation; online forum to database conversion; and multimedia content classification. The first two tools have been completed and integrated. In the coming year, our plan for multimedia content classification includes the development and integration of a sentiment analysis tool, and successfully transitioning the existing Social Network Analysis tool (a function requested by more advanced users of the portal) to later versions of the portal.
Partnerships with the Naval Postgraduate School, the Central Police University in Taiwan, and the ACM KDD community have been mutually beneficial.
This past project year has seen a great expansion of the Dark Web Forum Portal and a completely re-worked infrastructure. Preliminary user testing enabled us to focus in on portal issues that hindered users' abilities to identify, receive, and translate relevant matches. Users were facing difficulties with some inconsistencies in display, for example, and very slow translation.
As a result, the framework was first re-architected, and then Lucene was integrated. Lucene allows more sophisticated Boolean searching, improves the ranking of matching hits, and is more correct in its hit highlighting. In the coming months, once the portal has been restabilized and additional new storage is completely integrated, we will fully test its new functionality with users.
With additional funding provided by other agencies, we have created a video portal module which makes accessible a whole new class of materials. We have also reached out to other potential user communities. To see summaries of those efforts, see the paper presented at the Intelligence and Security Informatics 2011 Conference (Beijing, China), titled, "The Dark Web Forum Portal: From Multi-lingual to Video."
As outlined in our previous reports, this project is aimed at designing and implementing a general framework for Web forum data integration. Specifically, a Web-based knowledge portal, the Dark Web Forums Portal, has been built based on the framework. The portal incorporates the data collected from different international Jihadist forums and provides several important analysis functions, including forum browsing and searching (in single forum and across multiple forums), forum statistics analysis, multilingual translation, and social network visualization.
As a major type of social media in Web 2.0, Web forums facilitate intensive interactions among participants. International Jihadist groups often use Web forums to promote violence and distribute propaganda materials. These Dark Web forums are heterogeneous and widely distributed. Therefore, the ability to access and analyze the forum messages and interactions among participants has become important to researchers and others studying terrorism and extremism.
Update for 2009-2010: In this past year, we accomplished significant extensions to our previous work. These extensions include: greatly increasing the scope of data collection; adding an incremental spidering component for regular data updates; enhancing the searching and browsing functions; enhancing multilingual machine-translation for Arabic, French, German and Russian; and adding advanced Social Network Analysis.
Update for 2009-2010: Currently, the portal contains 29 Jihadist forums, among which 17 are Arabic forums, 7 are English forums, 3 are French forums and the other 2 are in German and Russian, respectively (compared to seven forums total at last report).
The additional forums have been carefully selected with significant input from terrorism researchers, security and military educators, and other experts. The Arabic-language forums selected include major jihadist websites. The English-language forums represent both extremist and more moderate groups. The French, German, and Russian forums provide representative content for extremist groups communicating in these languages, and provide additional opportunity to evaluate multilingual translation.
This brings the total number of messages available to about 13M; approximately 3M postings will be added annually through incremental spidering.
Different functions have been developed and incorporated into the system as real-time services, including single and multiple forum browsing and searching, forum statistics analysis, multilingual translation, and social network visualization. For forum statistics analysis, Java applet-based charts were created to show the trends based on the numbers of messages produced over time. The multilingual translation function has been implemented using Google Translation API (http://code.google.com/apis/ajaxlanguage/documentation/#Translation). The social network visualization function provides dynamic, user-interactive networks implemented using JUNG (http://jung.sourceforge.net/) to visualize the interactions among forum members. The portal provides functions to browse and search information in a particular forum, as well as across all the forums.
Update for 2009-2010: During the past year, we have also accomplished a number of bug fixes to ensure that the portal operates correctly.
In this study, we developed an integrated approach to search and analyze international Jihadist forums. A Web-based multilingual portal, the Dark Web Forums Portal, was developed based on the approach. The portal initially integrated forum data from seven major active international Jihadist forums identified by domain experts.
The functions provided by the portal include single and multiple forum browsing and searching, forum statistics analysis, multilingual translation, and social network visualization. These functions were designed to help users quickly and easily locate, understand, and eventually use the information they want. The Dark Web Forums Portal is an infrastructure to integrate heterogeneous forum data, and will serve as a strong complement to the current databases, news reports and other sources available to the research community.
Update for 2009-2010: In the following sections, we will describe the significant extensions to the previous work, which included greatly increasing the scope of data collection; adding an incremental spidering component for regular data updates; enhancing the searching and browsing functions; enhancing multilingual machine-translation for Arabic, French, German and Russian; and adding advanced Social Network Analysis.
Update for 2009-2010: In this component, spidering programs have been developed to collect the Web pages from online forums that contain Jihadist related content identified by domain experts. The spidering component is composed of complete spidering and incremental spidering (Figure 2), with incremental spidering being a main addition for this past year.
Complete spidering is applied to forums the first time they are added to our collection, while incremental spidering is adopted if the forums already exist in the collection. When a forum is first added to our collection, complete spidering is applied to collect all available postings. Incremental spiders are designed to identify and collect only the postings made after the forum's last update time, so that only a small portion of the forum data is collected, which makes the spidering process much more efficient. To achieve this goal, an incremental spider is developed for each forum in the collection.
The incremental spidering consists of three main steps:
- Sub-Forum List Page Spidering: Forums generally contain one or more sub-forums representing different discussion themes. In this step, incremental spiders first spider and parse the sub-forum list pages of a forum and identify the URLs of the sub-forums.
- Thread List Page Spidering: Thread list pages contain the metadata of discussion threads (such as title, date of the last update, and author name), sorted by date of last update in descending order. For each sub-forum, the incremental spider starts by downloading the first thread list page of the sub-forum, and the dates of the last update of each discussion thread are then extracted. Threads updated later than the date of the latest posting in the database are considered to be new threads and their URLs are collected. If every thread listed on the current thread list page is a new thread, the spidering moves to the next thread list page. Otherwise, the spidering of this sub-forum is complete.
- Incremental Spidering: After collecting all the URLs of new threads, the incremental spider begins to download all of the postings within the new threads.
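The thread list page step can be sketched as follows. This is a minimal illustration, not the portal's actual spider: `ThreadMeta`, `collect_new_thread_urls`, and the `fetch_page` callable (which stands in for downloading and parsing one thread list page) are hypothetical names, and the sketch assumes each page is already sorted with the most recently updated thread first.

```python
from datetime import datetime
from typing import Callable, List, NamedTuple


class ThreadMeta(NamedTuple):
    """Metadata parsed from one row of a thread list page (hypothetical type)."""
    url: str
    last_update: datetime


def collect_new_thread_urls(
    fetch_page: Callable[[int], List[ThreadMeta]],
    latest_in_db: datetime,
) -> List[str]:
    """Collect URLs of threads updated after the latest posting in the database.

    fetch_page(n) is an assumed helper returning the threads on page n
    (0-based), newest first, or an empty list when there are no more pages.
    """
    new_urls: List[str] = []
    page_no = 0
    while True:
        threads = fetch_page(page_no)
        if not threads:
            break  # ran out of thread list pages
        fresh = [t for t in threads if t.last_update > latest_in_db]
        new_urls.extend(t.url for t in fresh)
        if len(fresh) < len(threads):
            break  # an already-collected thread appeared: sub-forum is done
        page_no += 1  # every thread on this page was new, so keep going
    return new_urls
```

Stopping at the first page that contains an already-collected thread is what keeps the number of downloaded pages small relative to a complete spidering run.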
The first time a forum is collected, a parsing program must be developed to extract the detailed forum data from the raw HTML Web pages and store it in a local database. For each forum, the structured, detailed forum data extracted include thread names, main message bodies, member names, and post dates. Another set of data (i.e., the data needed to determine the significance of each forum member and the link weight between two different members) used for social network analysis is created by aggregating the detailed forum data.
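As a rough illustration of this aggregation step, the sketch below derives a posting count per member and a link weight per pair of members from detailed posting records. The weighting rule is an assumption made only for this example (two members are linked once for every thread in which both posted); the text does not state how the portal actually computes link weights, and `aggregate_for_sna` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def aggregate_for_sna(
    posts: Iterable[Tuple[str, str]],
) -> Tuple[Counter, Counter]:
    """Aggregate (thread_name, member_name) records for network analysis.

    Returns (post_counts, link_weights). link_weights maps an alphabetically
    ordered (member_a, member_b) pair to the number of threads both posted in.
    This co-participation rule is an assumption, not the portal's definition.
    """
    post_counts: Counter = Counter()
    members_by_thread: Dict[str, Set[str]] = {}
    for thread, member in posts:
        post_counts[member] += 1
        members_by_thread.setdefault(thread, set()).add(member)

    link_weights: Counter = Counter()
    for members in members_by_thread.values():
        for a, b in combinations(sorted(members), 2):
            link_weights[(a, b)] += 1
    return post_counts, link_weights
```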
Update for 2009-2010: A significant number of new forums have been collected and added to the forum portal.
Update for 2009-2010: This section has been updated with screenshots from the most recent version of the portal showing new content and improved functioning.
Different functions have been developed and incorporated into the system as real-time services, including single and multiple forum browsing and searching, forum statistics analysis, multilingual translation, and social network visualization. The Dark Web Forum portal is implemented using Apache Tomcat and the database is implemented using Microsoft SQL Server 2008. For forum statistics analysis, Java applet-based charts are created to show the trends based on the numbers of messages produced over time. The multilingual translation function has been implemented using Google Translation Service, which can automatically detect non-English texts and translate them into English. The social network visualization function provides dynamic, user-interactive networks implemented using JUNG to visualize the interactions among forum members.
The search function allows users to search message titles using multiple keywords. Users can choose whether the keywords are combined with the Boolean operation “AND” or “OR.”
In addition to browsing and searching information in a particular forum, our portal also supports multiple forum searching across all forums in the portal. For example, a total of 227 threads (Fig. 3) are retrieved across all forums that contain keywords “bomb,” “Iraq,” and “kill” (AND operation) in the thread title. Among them, 159 are from the forum “Gawaher,” 56 are from forum “Ansar1,” 5 are from forum “Ummah,” etc. “Gawaher” has more discussions on this topic than any of the other forums. Detailed searching results for each forum on these keywords can be found by clicking the row corresponding to a particular forum.
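The multi-forum title search and the per-forum hit counts can be sketched as below. This is a simplified illustration under stated assumptions: `search_titles` and `hits_per_forum` are hypothetical names, matching is done by case-insensitive substring test, and the records are held in memory rather than queried from the portal's database.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def search_titles(
    threads: Iterable[Tuple[str, str]],
    keywords: Sequence[str],
    mode: str = "AND",
) -> List[Tuple[str, str]]:
    """Return the (forum, title) records whose title matches the keywords.

    mode "AND" requires every keyword to appear in the title; "OR" requires
    at least one. Substring matching is an assumption of this sketch.
    """
    mode = mode.upper()
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    terms = [k.lower() for k in keywords]
    return [
        (forum, title)
        for forum, title in threads
        if combine(term in title.lower() for term in terms)
    ]


def hits_per_forum(matches: Iterable[Tuple[str, str]]) -> Counter:
    """Count matching threads per forum, as in the per-forum result rows."""
    return Counter(forum for forum, _ in matches)
```

Calling `hits_per_forum(search_titles(records, keywords, "AND"))` gives the kind of per-forum breakdown described above, with one count per forum that had at least one matching thread.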
For each forum, statistical data such as the number of members, number of threads, number of messages, start date and end date of the forum is provided. In addition, a Java applet-based chart is created to show the trend based on the number of messages produced in different time periods, which can help users to understand the traffic of discussions over time.
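The data behind such a trend chart amounts to counting messages per time period. A minimal sketch, assuming monthly periods and a hypothetical helper name `messages_per_month`:

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def messages_per_month(post_dates: Iterable[date]) -> List[Tuple[Tuple[int, int], int]]:
    """Count messages per (year, month), returned in chronological order.

    Monthly granularity is an assumption; the chart's period is not stated.
    """
    counts = Counter((d.year, d.month) for d in post_dates)
    return sorted(counts.items())
```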
The multilingual translation function can automatically translate the returned browsing and searching results from non-English to English. The function is implemented using Google Translation API (http://code.google.com/apis/ajaxlanguage/documentation/#Translation). To conduct automatic translation, the multilingual translation function first checks whether a returned browsing or searching result is in English or not. If not, it will then send the textual data to the remote server to conduct the translation and receive the translated data once the server is done.
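The check-then-translate flow described above can be sketched as follows. Both callables are placeholders supplied by the caller: `detect_language` and `translate` are hypothetical stand-ins for the language check and for the call to the remote translation service, and no real translation API signature is implied.

```python
from typing import Callable


def to_english(
    text: str,
    detect_language: Callable[[str], str],
    translate: Callable[[str], str],
) -> str:
    """Return text unchanged if it is already English, otherwise translate it.

    detect_language is assumed to return a language code such as "en";
    translate is assumed to send the text to a remote service and return
    the English result. Both are injected so that no API is hard-coded.
    """
    if detect_language(text) == "en":
        return text  # already English: no remote call is made
    return translate(text)
```

Skipping the remote call for English text avoids unnecessary round trips to the translation server.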
The interface of the SNA function consists of three parts: the search panel (top box), analysis panel (middle box), and visualization panel (bottom box). The search panel allows the user to choose three search criteria: forum, keyword and time period. The threads that meet these search criteria are identified as “related threads” and are used to construct the social network. Any of the forums listed in the portal can be selected to perform SNA. The keywords are entered by the user, in any language, separated by spaces or commas. Thread names, user names, and postings are searched using these keywords, and a thread is identified as a related thread if the thread name, at least one posting, or at least one user name contains any of the keywords. The start date and the end date are used to constrain the postings in the search result. When related threads are returned, the social network is constructed based on the structure of these threads.
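The related-thread rule can be sketched as below. This is an illustrative reading of the description, not the portal's code: the `Post` record and the function names are hypothetical, matching is by case-insensitive substring, and the date range is treated as inclusive.

```python
import re
from datetime import date
from typing import Iterable, List, NamedTuple, Set


class Post(NamedTuple):
    """One posting with its thread name, author name, body, and date (hypothetical type)."""
    thread: str
    author: str
    body: str
    posted: date


def parse_keywords(raw: str) -> List[str]:
    """Split user input on spaces and/or commas, dropping empty pieces."""
    return [k for k in re.split(r"[,\s]+", raw.strip()) if k]


def related_threads(
    posts: Iterable[Post], raw_keywords: str, start: date, end: date
) -> Set[str]:
    """Return names of threads matched by any keyword within the date range.

    A thread is related if its name, a posting body, or a user name
    contains at least one keyword.
    """
    terms = [k.lower() for k in parse_keywords(raw_keywords)]
    related: Set[str] = set()
    for p in posts:
        if not (start <= p.posted <= end):
            continue  # postings outside the selected period are ignored
        haystack = " ".join((p.thread, p.author, p.body)).lower()
        if any(term in haystack for term in terms):
            related.add(p.thread)
    return related
```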
The analysis panel allows the user to select different metrics for SNA, and to set the parameters for graph visualization. Every node in the social network has a set of attributes, including the screen name, the number of postings, and various social network metrics. After the social network is constructed, all nodes are ranked in descending order based on the number of postings. Since the resulting social network usually contains a large number of message authors, which makes the graph too crowded for analysis, the slide bar can be used to display only a portion of the top authors based on the ranking in order to make the graph easier to read. The label as well as the value of the selected metrics can be displayed beside each node by checking the corresponding box. An isolated node is defined as a node that has no connections to any other node. Removing isolated nodes is useful when too many of them add noise to the graph. Checking the “Link to User Post” box will change the color of every node from green to red, and clicking any node will then open a new window that shows all postings by this user during the selected time period.
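The ranking, slide-bar filtering, and isolated-node removal can be sketched as below. The function names, the `fraction` parameter (standing in for the slide bar position), and the dictionary-based graph representation are all assumptions made for this example.

```python
import math
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def top_authors(post_counts: Dict[str, int], fraction: float) -> List[str]:
    """Rank authors by number of postings (descending) and keep the top share.

    fraction is a value in (0, 1]; ties are broken alphabetically.
    """
    ranked = sorted(post_counts, key=lambda a: (-post_counts[a], a))
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


def filter_graph(
    post_counts: Dict[str, int],
    links: Dict[Link, int],
    fraction: float,
    drop_isolated: bool = True,
) -> Tuple[Set[str], Dict[Link, int]]:
    """Keep only top authors and the links between them.

    With drop_isolated set, nodes left without any link are removed too.
    """
    kept = set(top_authors(post_counts, fraction))
    kept_links = {
        pair: weight
        for pair, weight in links.items()
        if pair[0] in kept and pair[1] in kept
    }
    if drop_isolated:
        connected = {node for pair in kept_links for node in pair}
        kept &= connected
    return kept, kept_links
```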
The visualization panel displays the graph based on the settings in the analysis panel, with the thickness of each link proportional to the intensity of interactions between the two nodes. Any node can be dragged to any position in the panel, and all connected nodes and corresponding links are highlighted while the mouse button is held down during the drag. Different layouts are also provided for graph visualization. Four types of layout algorithms are integrated into the component: a static layout, a circle layout, three force-based layouts (Fruchterman-Reingold, Kamada-Kawai, and Spring), and a self-organizing layout (ISOM). If users want to perform advanced analysis on the graph using other SNA tools such as UCINET or Pajek, clicking the “Export Graph” button allows the graph to be exported to a “.net” format file, which is the Pajek graph file format recognized by most SNA tools.
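For readers unfamiliar with the “.net” format, the sketch below writes a weighted, undirected graph in the basic Pajek layout: a `*Vertices` section with numbered, quoted labels, followed by an `*Edges` section with one `source target weight` line per link. The function name `to_pajek` and the in-memory graph representation are assumptions; this is not the portal's export routine, and labels containing double quotes are not handled.

```python
from typing import Dict, Iterable, Tuple


def to_pajek(nodes: Iterable[str], links: Dict[Tuple[str, str], int]) -> str:
    """Serialize an undirected weighted graph as Pajek .net text.

    Vertices are numbered from 1 in alphabetical order of their labels.
    """
    ordered = sorted(nodes)
    index = {name: i for i, name in enumerate(ordered, start=1)}
    lines = [f"*Vertices {len(ordered)}"]
    lines += [f'{index[name]} "{name}"' for name in ordered]
    lines.append("*Edges")
    for (a, b), weight in sorted(links.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines) + "\n"
```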
Funding provided by the National Science Foundation under Grant No. CNS-0709338. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- Hsinchun Chen, MIS
- Mark Patton, MIS
- Cathy Larson, MIS
- Shan (Jonathan) Jiang
- Andy Pressman, MIS
- Hsinchun Chen, Dorothy Denning, Nancy Roberts, Catherine A. Larson, Ximing Yu, and Chun-Neng Huang. "The Dark Web Forum Portal: From Multi-lingual to Video." IEEE Intelligence and Security Informatics 2011 Conference, 2011.
- D. Zimbra, A. Abbasi, and H. Chen, “A Cyber-archeology Approach to Social Movement Research: Framework and Case Study,” Journal of Computer-Mediated Communication, 2010.
- Chun-Neng Huang, Tianjun Fu, and Hsinchun Chen, "Text-based Video Content Classification for Online Video-sharing Sites." Journal of the American Society for Information Science and Technology, 2010.
- Yan Dang, Yulei Zhang, Hsinchun Chen, "A Lexicon-Enhanced Method for Sentiment Classification." IEEE Intelligent Systems 25 (4), 2010.
- Chen, H., and Dark Web Team (2008). "IEDs in the Dark Web: Genre Classification of Improvised Explosive Device Web Pages," IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.
- Chen, H., and the Dark Web Team (2008). "Discovery of Improvised Explosive Device Content in the Dark Web." IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.
- Chen, H. and the Dark Web Team, "Sentiment and Affect Analysis of Dark Web Forums: Measuring Radicalization on the Internet" (2008) IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.