Last modified May 8, 2016

N2 Narrative Corpus

The N2 Narrative Corpus is a collection of 96 stories, comprising 42,280 words, that will be a shared resource available to performers on the N2 Narrative Network DARPA program. The purpose of the corpus is to provide a set of stories, in a domain of interest to DARPA, that have been annotated for their syntax and semantics, to jumpstart and focus work in the program, and to enable comparison and collaboration between projects. Because many projects are expected to require annotated text, the corpus will serve as a starting point to help performers get off the ground more quickly. In addition to the corpus itself, MIT is providing the annotation tool (the Story Workbench), a Java API for interacting with the annotated data, and extensive documentation.

Update, July 17, 2013: The third release of the corpus is now available. This release contains all the originally planned stories (96 texts), annotated for the first 14 layers. The next version, containing the remaining 4 texts, will be available in the next month.

Download
Corpus | Workbench | Draft Annotation Guides

Approximately half the corpus, by words, is from the Center for Strategic Communication's collection of Islamist Extremist texts. This portion of the corpus is subject to the N2 Data-Use Agreement. Due to the restrictions of that agreement, this portion will, in the short term, be available only to N2 program performers; the data will likely be made more generally available in the future.

Important: In addition to the 15 different annotation layers, the N2 Narrative Corpus will also contain 2 additional representations to be determined. N2 performers are encouraged to suggest text annotations that would be useful for their work should they use the corpus. Once the two additional layers are chosen, the Story Workbench and Java API will be updated to allow annotation in these new representations.

Note: The purpose of the corpus is solely to provide a shared resource to N2 teams who might find the corpus of use to their work. Use of the corpus is not required by the N2 program.

Please direct questions to Mark Finlayson at markaf@mit.edu.

Overview of Contents

The texts in the N2 Narrative Corpus focus on the topic of Islamic Extremist Jihadism. This topic is of great interest both in its own right, given its impact on many regions around the world, and as a good example of how stories can be used to shape and guide beliefs and actions. The texts are drawn from three sources and organized into four types:

Sources:
  1. CSC ASU Collection - Texts collected by the Center for Strategic Communication, subject to the N2 Data-Use Agreement.
  2. Inspire - Texts drawn from the "jihad stories" section of Al Qaeda's magazine.
  3. Hadith - Islamic religious stories handed down through the generations.
Types:
  1. Current Affairs - Stories describing battles or other jihadi activities in the modern day.
  2. History - Stories about jihadi activities in Islamic history.
  3. Biography - Stories about the lives of modern martyrs or leaders.
  4. Religious Texts - Hadith about Jihad that are often appealed to by Islamist Extremists.
Text Type       | Text Source         | Number of Texts | Number of Words
Current Affairs | CSC ASU Collection* | 13              | 5,419
Current Affairs | Inspire (example)   | 7               | 7,814
History         | CSC ASU Collection* | 8               | 2,480
Biography       | CSC ASU Collection* | 8               | 11,795
Religious Texts | Hadith (example)    | 64              | 14,972
Total           |                     | 100             | 42,480
* Texts from the CSC ASU Collection are governed by the N2 Data-Use Agreement
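
The totals in the table can be checked with a few lines of Python. This is only a sanity-check sketch: the rows are copied by hand from the table, and the names `ROWS`, `totals`, and `restricted_share` are illustrative and not part of the Story Workbench or its Java API.

```python
# Rows copied by hand from the table above: (text type, source, texts, words).
ROWS = [
    ("Current Affairs", "CSC ASU Collection", 13, 5419),
    ("Current Affairs", "Inspire", 7, 7814),
    ("History", "CSC ASU Collection", 8, 2480),
    ("Biography", "CSC ASU Collection", 8, 11795),
    ("Religious Texts", "Hadith", 64, 14972),
]


def totals(rows):
    """Return (total number of texts, total number of words)."""
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


def restricted_share(rows):
    """Fraction of all words that come from the CSC ASU Collection."""
    csc_words = sum(r[3] for r in rows if r[1] == "CSC ASU Collection")
    return csc_words / sum(r[3] for r in rows)


if __name__ == "__main__":
    texts, words = totals(ROWS)
    print(texts, words)                      # 100 42480
    print(f"{restricted_share(ROWS):.1%}")   # 46.4%
```

The restricted share computed this way is roughly 46% of the words, in line with the "approximately half" figure given for the CSC portion.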

Download

A redacted version of the corpus may be downloaded from the MIT CSAIL DSpace Archive. To obtain the full corpus, please contact Mark Finlayson.

Overview of Annotations

The texts of the corpus will be annotated in 17 different representations, described in the table below. The "annotation style" column indicates how the annotations will be produced: automatically, semi-automatically, or manually.

No. | Name | Description | Annotation Style
1. | Tokens | the constituent characters of each token, as tokenized according to Penn Treebank tokenization conventions | Automatic
2. | Multi-Word Expressions | sets of tokens making up multi-word expressions | Semi-Automatic
3. | Sentences | continuous runs of tokens comprising sentences | Automatic
4. | Part of Speech Tags | a tag from the Penn Treebank tag set for each token and multiword | Semi-Automatic
5. | Lemmas | a root form for each inflected token or multiword | Semi-Automatic
6. | Wordnet Senses | a Wordnet sense for each token or multiword | Semi-Automatic
7. | CFG Parse | a context-free grammar parse of each sentence | Automatic
8. | Referring Expressions | sets of tokens that refer to something | Semi-Automatic
9. | Referent Attributes* | properties (unchanging attributes) of referents | Manual
10. | Co-reference Bundles | sets of referring expressions that co-refer | Semi-Automatic
11. | Time Expressions | location, type, and value of temporal expressions (TimeML) | Semi-Automatic
12. | Events | location, features, and type of event mentions (TimeML) | Semi-Automatic
13. | Temporal Links | event-event, event-time, or time-time relationships (TimeML) | Semi-Automatic
14. | Semantic Roles | verb-argument structure (PropBank) | Semi-Automatic
15. | StateML* | captures static relationships between referents | Manual
16. | Discourse Relationships | PDTB-style discourse-level relationships between spans of text, as well as attribution | Manual
17. | Character Roles* | marks major character roles such as Protagonist and Antagonist | Manual
* These representations are not taken from established Computational Linguistics work, but were specifically developed for annotating narrative corpora.

Detailed Manifest

The detailed manifest lists the texts that will be included in the corpus, along with their size and a single-line description of their contents.

Example Modern Text

The following is an excerpt from document Inspire-1-60-62, in which the author describes battles during Jihad in Afghanistan.

/**
  * Title         : The Fight Over the Mountains
  * Author        : Adnan Muhammad 'Ali as-Sa'igh
  * Source        : Al Qaeda Inspire Magazine, Issue 1, pp. 60-62
  * Created on    : 28 July 2011
  * Edited by     : Mark Finlayson, MIT
  * Words         : 1,829
  * Restrictions  : None
  *
  */

Ten months before the September operation, in the city of Kandahar, Allah 
blessed me with the chance to go to the frontline to fight against Ahmad Shah 
Mas`ud’s army whereas many mujahidin remained behind in the training camps 
around the country. The journey was very rough since it was cold; I was with 
five mujahidin throughout this trip. Since the weather was bitterly cold, the 
Afghani people would stay in their homes most of the time. The Taliban were 
fighting two wars at the time; one in the North against Ahmad Shah Mas`ud, and 
the other against Dostum in the North East. The Taliban were having difficulty 
taking over the Hindu Kush from Mas`ud. There were big battles in North Kabul 
against his army. Bensheer is the name of the city in the Hindu Kush 
Mountains; this was the city I was fighting in. This war was a war of shari`ah 
versus the corrupt man-made laws; because of the man-made laws, people were 
worshipping graves, stealing from stores, and doing other criminal activity 
that made those parts of the country dangerous to reside in. In Kabul, there 
was very good security and peace because of shari`ah rule. The Taliban took 
all of Afghanistan except Bamiyan, a Shi`a stronghold and Bensheer. The war 
with the Shi`a was difficult since they were using horses in the mountains; so 
it was difficult tracking them down.

Example Hadith

The following is the full text of a representative Hadith.

/**
  * Title         : Conquering of a Persian Fort
  * Source        : Sunan al-Tirmidhi 24.1.1553
  * Translated by : Unknown
  * Created on    : 28 July 2011
  * Extracted by  : Jeffry Halverson, CSC, ASU
  * Edited by     : Mark Finlayson, MIT
  * Words         : 225
  * Restrictions  : None
  *
  */

Abu Bakhtari narrated:

A Muslim army led by Sayyidina Salman Farisi (may Allah be pleased with him) 
surrounded one of the Persian forts. The men said to him, "Oh Abu Abdullah, 
shall we not pounce on them?' He said, "Let me invite them (to Islam). I had 
heard Allah's Messenger (peace be upon him) invite them (the enemy)." So, 
Salman went to them and said, "Indeed, I am a man of you, a Persian. You see 
the Arabs obey me. Thus, if you submit to Islam then for you is the like of 
what is for us and on you is that which is on us. And if you reject only to 
stay on your religion then we will leave at that and you will pay us the 
jizyah with your hands, disgraced." The narrator said that Salman (may Allah 
be pleased with him) spoke in Persian and also said, "You are not praiseworthy. 
And if you refuse, we warn you of bad things." They said, "We are not among 
those who pay the jizyah, but we will fight you." The (Muslim) men said, "Oh 
Abu Abdullah, shall we not pounce on them?" He said, 'No." He invited them in 
this way for three days and after that said, "Pounce on them." The narrator 
said: We poured ourselves over them and we conquered that fort.
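
Both example texts above begin with a comment-style header of `Key : Value` lines wrapped in `/** ... */`. The following is a minimal sketch of how such a header could be separated from the story body in Python. It assumes that the layout seen in these two examples holds for other files, which this page does not confirm. The function name `parse_header` is hypothetical and is not part of the Story Workbench or its Java API, and `SAMPLE` is an abbreviated copy of the first example's header.

```python
import re

# Matches one " * Key : Value" line inside the /** ... */ header.
_FIELD = re.compile(r"^\s*\*\s*([^:]+?)\s*:\s*(.*?)\s*$")


def parse_header(text):
    """Split a text into (metadata dict, body string).

    Assumes the text starts with a /** ... */ block. If it does not,
    the metadata is empty and the text is returned unchanged as the body.
    """
    stripped = text.lstrip()
    if not stripped.startswith("/**"):
        return {}, text
    end = stripped.find("*/")
    if end == -1:
        return {}, text
    header, body = stripped[3:end], stripped[end + 2:]
    meta = {}
    for line in header.splitlines():
        match = _FIELD.match(line)
        if match:  # lines without a "Key : Value" pair are skipped
            meta[match.group(1)] = match.group(2)
    return meta, body.strip()


# Abbreviated copy of the header of the first example text above.
SAMPLE = """/**
  * Title         : The Fight Over the Mountains
  * Source        : Al Qaeda Inspire Magazine, Issue 1, pp. 60-62
  * Words         : 1,829
  * Restrictions  : None
  *
  */

Ten months before the September operation, in the city of Kandahar
"""

if __name__ == "__main__":
    meta, body = parse_header(SAMPLE)
    print(meta["Title"])
    print(int(meta["Words"].replace(",", "")))  # word count as an integer
    print(body)
```

Values are kept as strings because fields such as `Words` contain thousands separators; callers can convert them as needed, as the usage lines show.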

Example of Annotation

An example of an annotated text shows a complete annotation, in the first fifteen representations, of the first paragraph of a Russian fairy tale. The file is encoded in the Story Workbench annotation format. When the corpus is released, the Story Workbench, accompanying documentation, and a Java API for manipulating the annotations will be released as well.

N2 Data-Use Agreement

The use of the ASU CSC portion of the corpus ("the CSC data") by individuals, institutions, corporations, or other parties (the "Users") is governed by the N2 Corpus Data-Use Agreement:

  1. Users agree that use of the CSC data is limited to projects funded under the DARPA N2 program under “For Official Use Only” (FOUO) restrictions:
  2. The CSC data is UNCLASSIFIED//FOR OFFICIAL USE ONLY (U//FOUO) and contains information that may be exempt from public release under the Freedom of Information Act (5 U.S.C. 552). It is to be controlled, stored, handled, transmitted, distributed, and disposed of in accordance with OpenSource.gov policy relating to FOUO information and is not to be released to the public or other personnel who do not have a valid "need-to-know" without prior approval of an authorized OpenSource.gov official.

  3. Users agree not to furnish CSC data to third parties for any reason without written permission of the CSC.
  4. Users agree that, in the absence of a written exemption from the CSC, any written work relying on the CSC data will be embargoed from publication until after either (a) June 1st, 2013, or (b) the appearance in print of a CSC publication using and explicitly describing the CSC data (to be verified by the CSC), whichever comes first.