Last modified May 8, 2016
N2 Narrative Corpus
- Mark A. Finlayson (markaf@mit.edu)
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
- Steven R. Corman
Jeffry R. Halverson
Center for Strategic Communications
Arizona State University
The N2 Narrative Corpus is a collection of 96 stories, comprising 42,280 words, that will be a shared resource available to performers on the N2 Narrative Network DARPA program. The corpus provides a set of stories, in a domain of interest to DARPA, that have been annotated for their syntax and semantics. It is intended to jumpstart and focus work in the program and to enable comparison and collaboration between projects. Because many projects are expected to require annotated text, the corpus will serve as a starting point to help performers get off the ground more quickly. In addition to the corpus itself, MIT is providing the annotation tool (the Story Workbench), a Java API for interacting with the annotated data, and extensive documentation.
Update, July 17, 2013: The third release of the corpus is now available. This release contains all the originally planned stories (96 texts), annotated for the first 14 layers. The next version, containing the remaining 4 texts, will be available in the next month.
Corpus | Workbench | Draft Annotation Guides
Approximately half the corpus, by words, is from the Center for Strategic Communication's collection of Islamist Extremist texts. This portion of the corpus is subject to the N2 Data-Use Agreement. Because of the restrictions of that agreement, this portion will in the short term be available only to N2 program performers; the data will likely be made more generally available in the future.
Important: In addition to the 15 different annotation layers, the N2 Narrative Corpus will also contain 2 additional representations to be determined. N2 performers are encouraged to suggest text annotations that would be useful for their work should they use the corpus. Once the two additional layers are chosen, the Story Workbench and Java API will be updated to allow annotation in these new representations.
Note: The purpose of the corpus is solely to provide a shared resource to N2 teams who might find the corpus of use to their work. Use of the corpus is not required by the N2 program.
Please direct questions to Mark Finlayson at markaf@mit.edu.
Overview of Contents
The texts in the N2 Narrative Corpus are focused on the topic of Islamic Extremist Jihadism. This is a topic that is of great interest not only per se, in its impact on many regions around the world, but also as a good example of how stories can be used to shape and guide beliefs and action. The texts are drawn from three sources and organized into four types:
Sources:
- CSC ASU Collection - Texts collected by the Center for Strategic Communications, subject to the N2 Data-Use Agreement.
- Inspire - Texts drawn from the "jihad stories" section of Al Qaeda's magazine.
- Hadith - Islamic religious stories handed down through the generations.
Types:
- Current Affairs - Stories describing battles or other jihadi activities in the modern day.
- History - Stories about jihadi activities in Islamic history.
- Biography - Stories about the lives of modern martyrs or leaders.
- Religious Texts - Hadith about Jihad that are often appealed to by Islamist Extremists.
Text Type | Text Source | Number of Texts | Number of Words |
Current Affairs | CSC ASU Collection* | 13 | 5,419 |
Current Affairs | Inspire (example) | 7 | 7,814 |
History | CSC ASU Collection* | 8 | 2,480 |
Biography | CSC ASU Collection* | 8 | 11,795 |
Religious Texts | Hadith (example) | 64 | 14,972 |
Total | | 100 | 42,480 |
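The totals in the table can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the corpus distribution: the row values are transcribed from the table, the Inspire row is assumed to belong to the Current Affairs type, and the names `ROWS`, `totals`, and `words_by_source` are hypothetical.

```python
# Rows transcribed from the table: (text type, source, texts, words).
# Assumption: the Inspire row is a second Current Affairs row.
ROWS = [
    ("Current Affairs", "CSC ASU Collection", 13, 5419),
    ("Current Affairs", "Inspire", 7, 7814),
    ("History", "CSC ASU Collection", 8, 2480),
    ("Biography", "CSC ASU Collection", 8, 11795),
    ("Religious Texts", "Hadith", 64, 14972),
]


def totals(rows):
    """Return (number of texts, number of words) summed over all rows."""
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


def words_by_source(rows):
    """Return a dict mapping each source to its total word count."""
    out = {}
    for _text_type, source, _texts, words in rows:
        out[source] = out.get(source, 0) + words
    return out


n_texts, n_words = totals(ROWS)
csc_share = words_by_source(ROWS)["CSC ASU Collection"] / n_words
print(n_texts, n_words, round(csc_share, 3))
```

Under these assumptions the rows sum to 100 texts and 42,480 words, and the CSC ASU Collection accounts for a little under half of the words.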
Download
A redacted version of the corpus may be downloaded from the MIT CSAIL DSpace Archive. To obtain the full corpus, please contact Mark Finlayson.
Overview of Annotations
The texts of the corpus will be annotated in 17 different representations, described in the table below. The "annotation style" column indicates how the annotations will be produced, as follows:
- Automatic - generated automatically by computer, and possibly checked and corrected by the lead researcher. These annotations will not be double-annotated.
- Semi-Automatic - initial annotations will be generated by computer, but they will be hand-checked by two human annotators and the results will be merged.
- Manual - annotations will be created from scratch by human annotators; these will be merged in the same way as the semi-automatic annotations.
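The page does not describe how the two annotators' results are merged. As an illustration only, the sketch below assumes a simple scheme in which labels the annotators agree on are kept and disagreements are set aside for adjudication. The function names and the part-of-speech example data are hypothetical and are not taken from the Story Workbench.

```python
def merge_annotations(first, second):
    """Compare two annotators' labels for the same items.

    Each argument maps an item key (here, a token index) to a label.
    Returns (agreed, disputed): labels both annotators chose, and the
    keys where the labels differ or only one annotator gave a label.
    """
    agreed, disputed = {}, {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        if a == b:
            agreed[key] = a
        else:
            disputed[key] = (a, b)
    return agreed, disputed


def agreement_rate(agreed, disputed):
    """Fraction of items on which the two annotators agree."""
    total = len(agreed) + len(disputed)
    return len(agreed) / total if total else 1.0


# Hypothetical part-of-speech tags from two annotators for four tokens.
annotator_a = {0: "DT", 1: "NN", 2: "VBD", 3: "RB"}
annotator_b = {0: "DT", 1: "NN", 2: "VBN", 3: "RB"}

agreed, disputed = merge_annotations(annotator_a, annotator_b)
print(agreed, disputed, agreement_rate(agreed, disputed))
```

Separating agreed from disputed items keeps the adjudication step explicit: only the disputed keys need a third decision before the merged layer is final.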
No. | Name | Description | Annotation Style |
1. | Tokens | the constituent characters of each token, as tokenized according to Penn Treebank tokenization conventions | Automatic |
2. | Multi-Word Expressions | sets of tokens making up multi-word expressions | Semi-Automatic |
3. | Sentences | continuous runs of tokens comprising sentences | Automatic |
4. | Part of Speech Tags | a tag from the Penn Treebank tag set for each token and multiword | Semi-Automatic |
5. | Lemmas | a root form for each inflected token or multiword | Semi-Automatic |
6. | Wordnet Senses | a Wordnet sense for each token or multiword | Semi-Automatic |
7. | CFG Parse | a context-free grammar parse of each sentence | Automatic |
8. | Referring Expressions | sets of tokens that refer to something | Semi-Automatic |
9. | Referent Attributes* | properties (unchanging attributes) of referents | Manual |
10. | Co-reference Bundles | sets of referring expressions that co-refer | Semi-Automatic |
11. | Time Expressions | location, type, and value of temporal expressions (TimeML) | Semi-Automatic |
12. | Events | location, features, and type of event mentions (TimeML) | Semi-Automatic |
13. | Temporal Links | event-event, event-time, or time-time relationships (TimeML) | Semi-Automatic |
14. | Semantic Roles | verb-argument structure (PropBank) | Semi-Automatic |
15. | StateML* | captures static relationships between referents | Manual |
16. | Discourse Relationships | PDTB-style discourse-level relationships between spans of text, as well as attribution | Manual |
17. | Character Roles* | marks major character roles such as Protagonist and Antagonist | Manual |
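Several of the layers above build on one another: sentences are runs of tokens, referring expressions are sets of tokens, and co-reference bundles are sets of referring expressions. The sketch below shows one way such layering could be represented. It is not the Story Workbench file format or its Java API; every name is hypothetical, and the whitespace tokenizer merely stands in for Penn Treebank tokenization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int  # character offset of the token's first character
    end: int    # character offset one past the token's last character


def tokenize(text):
    """Whitespace tokenizer for illustration only (not Penn Treebank)."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        tokens.append(Token(start, start + len(piece)))
        pos = start + len(piece)
    return tokens


def span_text(text, tokens, first, last):
    """Text covered by the continuous token run tokens[first..last]."""
    return text[tokens[first].start:tokens[last].end]


def bundle_mentions(text, tokens, referring, bundle):
    """Surface text of each referring expression in a bundle, by id order."""
    return [
        " ".join(text[tokens[i].start:tokens[i].end] for i in referring[ref_id])
        for ref_id in sorted(bundle)
    ]


TEXT = "Salman went to them . He invited them ."
TOKENS = tokenize(TEXT)
# Sentences layer: continuous runs of tokens, as (first, last) indices.
SENTENCES = [(0, 4), (5, 8)]
# Referring expressions layer: each expression is a list of token indices.
REFERRING = {"r1": [0], "r2": [3], "r3": [5], "r4": [7]}
# Co-reference bundles: sets of referring expressions that co-refer.
BUNDLES = [{"r1", "r3"}, {"r2", "r4"}]

print(span_text(TEXT, TOKENS, *SENTENCES[1]))
print(bundle_mentions(TEXT, TOKENS, REFERRING, BUNDLES[0]))
```

Storing only offsets and indices in the higher layers means the source text is never copied or altered, which is why a correction to a lower layer (for example, a token boundary) carries through to everything built on it.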
Detailed Manifest
The detailed manifest lists the texts that will be included in the corpus, as well as their size and a single-line description of their contents.
Example Modern Text
The following is an excerpt from document Inspire-1-60-62, in which the author describes battles during Jihad in Afghanistan.
/**
 * Title : The Fight Over the Mountains
 * Author : Adnan Muhammad 'Ali as-Sa'igh
 * Source : Al Qaeda Inspire Magazine, Issue 1, pp. 60-62
 * Created on : 28 July 2011
 * Edited by : Mark Finlayson, MIT
 * Words : 1,829
 * Restrictions : None
 *
 */
Ten months before the September operation, in the city of Kandahar, Allah blessed me with the chance to go to the frontline to fight against Ahmad Shah Mas`ud’s army whereas many mujahidin remained behind in the training camps around the country. The journey was very rough since it was cold; I was with five mujahidin throughout this trip. Since the weather was bitterly cold, the Afghani people would stay in their homes most of the time. The Taliban were fighting two wars at the time; one in the North against Ahmad Shah Mas`ud, and the other against Dostum in the North East. The Taliban were having difficulty taking over the Hindu Kush from Mas`ud. There were big battles in North Kabul against his army. Bensheer is the name of the city in the Hindu Kush Mountains; this was the city I was fighting in. This war was a war of shari`ah versus the corrupt man-made laws; because of the man-made laws, people were worshipping graves, stealing from stores, and doing other criminal activity that made those parts of the country dangerous to reside in. In Kabul, there was very good security and peace because of shari`ah rule. The Taliban took all of Afghanistan except Bamiyan, a Shi`a stronghold and Bensheer. The war with the Shi`a was difficult since they were using horses in the mountains; so it was difficult tracking them down.
Example Hadith
The following is the full text of a representative Hadith.
/**
 * Title : Conquering of a Persian Fort
 * Source : Sunan al-Tirmidhi 24.1.1553
 * Translated by : Unknown
 * Created on : 28 July 2011
 * Extracted by : Jeffry Halverson, CSC, ASU
 * Edited by : Mark Finlayson, MIT
 * Words : 225
 * Restrictions : None
 *
 */
Abu Bakhtari narrated: A Muslim army led by Sayyidina Salman Farisi (may Allah be pleased with him) surrounded one of the Persian forts. The men said to him, "Oh Abu Abdullah, shall we not pounce on them?' He said, "Let me invite them (to Islam). I had heard Allah's Messenger (peace be upon him) invite them (the enemy)." So, Salman went to them and said, "Indeed, I am a man of you, a Persian. You see the Arabs obey me. Thus, if you submit to Islam then for you is the like of what is for us and on you is that which is on us. And if you reject only to stay on your religion then we will leave at that and you will pay us the jizyah with your hands, disgraced." The narrator said that Salman (may Allah be pleased with him) spoke in Persian and also said, "You are not praiseworthy. And if you refuse, we warn you of bad things." They said, "We are not among those who pay the jizyah, but we will fight you." The (Muslim) men said, "Oh Abu Abdullah, shall we not pounce on them?" He said, 'No." He invited them in this way for three days and after that said, "Pounce on them." The narrator said: We poured ourselves over them and we conquered that fort.
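Both example texts open with a comment-style header of "Key : value" fields before the story itself. The following is a minimal sketch of how that header might be separated from the body. It is not the Story Workbench loader or the Java API, it assumes that field values never contain an asterisk, and the names `parse_header`, `word_count`, and `SAMPLE` are hypothetical. The sample body is abbreviated.

```python
import re


def parse_header(document):
    """Split a text into (metadata dict, body) using its /** ... */ header."""
    match = re.match(r"\s*/\*\*(.*?)\*/\s*(.*)", document, re.DOTALL)
    if match is None:
        return {}, document
    header, body = match.groups()
    fields = {}
    # Fields are separated by asterisks, whether the header is laid out
    # on one line or with one field per line.
    for chunk in header.split("*"):
        key, sep, value = chunk.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields, body


def word_count(fields):
    """Word count from the header, tolerating thousands separators."""
    return int(fields["Words"].replace(",", ""))


SAMPLE = """/**
 * Title : Conquering of a Persian Fort
 * Source : Sunan al-Tirmidhi 24.1.1553
 * Words : 225
 * Restrictions : None
 *
 */
Abu Bakhtari narrated: A Muslim army surrounded one of the Persian forts."""

fields, body = parse_header(SAMPLE)
print(fields["Title"], word_count(fields))
```

Only the first colon in each field is treated as the separator, so a value that itself contains a colon is kept whole.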
Example of Annotation
An example of an annotated text shows a complete annotation, in the first fifteen representations, of the first paragraph of a Russian fairy tale. The file is encoded in the Story Workbench annotation format. When the corpus is released, the Story Workbench, accompanying documentation, and a Java API for manipulating the annotations, will be released as well.
N2 Data-Use Agreement
The use of the ASU CSC portion of the corpus ("the CSC data") by individuals, institutions, corporations, or other parties (the "Users") is governed by the N2 Corpus Data-Use Agreement:
- Users agree that use of the CSC data is limited to projects funded under the DARPA N2 program under “For Official Use Only” (FOUO) restrictions.
- Users agree not to furnish CSC data to third parties for any reason without written permission of the CSC.
- Users agree that, in the absence of a written exemption from the CSC, any written work relying on the CSC data will be embargoed from publication until after either (a) June 1st, 2013, or (b) the appearance in print of a CSC publication using and explicitly describing the CSC data (to be verified by the CSC), whichever comes first.
The CSC data is UNCLASSIFIED//FOR OFFICIAL USE ONLY (U//FOUO) and contains information that may be exempt from public release under the Freedom of Information Act (5 U.S.C. 552). It is to be controlled, stored, handled, transmitted, distributed, and disposed of in accordance with OpenSource.gov policy relating to FOUO information and is not to be released to the public or other personnel who do not have a valid "need-to-know" without prior approval of an authorized OpenSource.gov official.