Producing the Digital Body
Name | Size (bytes) | Last Modified (UTC) | SHA2-256 | SHA3-256 |
---|---|---|---|---|
README.md | 8,646 | 2024-01-08 14:08:11Z | n/a | n/a |
corpora-video-x00.zip | 1,600,605,511 | 2024-01-08 14:15:03Z | n/a | n/a |
corpora-video-x01.zip | 1,589,036,913 | 2024-01-08 14:15:02Z | n/a | n/a |
corpora-video-x02.zip | 1,519,838,970 | 2024-01-08 14:15:02Z | n/a | n/a |
corpora-video-x03.zip | 1,293,641,634 | 2024-01-08 14:15:02Z | n/a | n/a |
corpora-video-x04.zip | 1,283,782,465 | 2024-01-08 14:15:02Z | n/a | n/a |
corpora-video-x05.zip | 1,280,377,873 | 2024-01-08 14:15:14Z | n/a | n/a |
corpora-video-x06.zip | 1,290,396,228 | 2024-01-08 14:15:20Z | n/a | n/a |
corpora-video-x07.zip | 1,264,021,902 | 2024-01-08 14:15:27Z | n/a | n/a |
corpora-video-x08.zip | 1,287,739,875 | 2024-01-08 14:15:38Z | n/a | n/a |
corpora-video-x09.zip | 1,307,003,371 | 2024-01-08 14:15:50Z | n/a | n/a |
corpora-video-x10.zip | 1,287,035,945 | 2024-01-08 14:15:53Z | n/a | n/a |
corpora-video-x11.zip | 1,293,757,959 | 2024-01-08 14:16:06Z | n/a | n/a |
corpora-video-x12.zip | 1,284,772,937 | 2024-01-08 14:16:09Z | n/a | n/a |
corpora-video-x13.zip | 1,256,479,747 | 2024-01-08 14:16:14Z | n/a | n/a |
corpora-video-x14.zip | 1,287,781,400 | 2024-01-08 14:16:22Z | n/a | n/a |
corpora-video-x15.zip | 1,256,368,568 | 2024-01-08 14:16:24Z | n/a | n/a |
corpora-video-x16.zip | 1,295,441,802 | 2024-01-08 14:16:36Z | n/a | n/a |
corpora-video-x17.zip | 1,258,447,238 | 2024-01-08 14:16:39Z | n/a | n/a |
corpora-video-x18.zip | 1,272,871,307 | 2024-01-08 14:16:44Z | n/a | n/a |
corpora-video-x19.zip | 1,274,769,616 | 2024-01-08 14:16:52Z | n/a | n/a |
corpora-video-x20.zip | 1,278,007,097 | 2024-01-08 14:16:52Z | n/a | n/a |
corpora-video-x21.zip | 1,293,590,250 | 2024-01-08 14:17:07Z | n/a | n/a |
corpora-video-x22.zip | 1,265,229,884 | 2024-01-08 14:17:08Z | n/a | n/a |
corpora-video-x23.zip | 1,303,651,210 | 2024-01-08 14:17:16Z | n/a | n/a |
corpora-video-x24.zip | 1,288,771,215 | 2024-01-08 14:17:22Z | n/a | n/a |
corpora-video-x25-x00.zip | 1,442,508,685 | 2024-01-08 14:17:22Z | n/a | n/a |
corpora-video-x25-x01.zip | 2,180,916,903 | 2024-01-08 14:17:37Z | n/a | n/a |
corpora-video-x25-x02.zip | 810,595,012 | 2024-01-08 14:17:38Z | n/a | n/a |
corpora-video-x25-x03.zip | 1,795,794,536 | 2024-01-08 14:17:47Z | n/a | n/a |
corpora-video-x25-x04.zip | 1,617,047,436 | 2024-01-08 14:17:53Z | n/a | n/a |
corpora-video-x25-x05.zip | 497,195,019 | 2024-01-08 14:17:57Z | n/a | n/a |
corpora-video-x25-x06.zip | 1,029,624,153 | 2024-01-08 14:17:57Z | n/a | n/a |
corpora-video-x25-x07.zip | 1,109,632,953 | 2024-01-08 14:18:09Z | n/a | n/a |
corpora-video-x25-x08.zip | 973,025,482 | 2024-01-08 14:18:22Z | n/a | n/a |
corpora-video-x25-x09.zip | 485,711,102 | 2024-01-08 14:18:29Z | n/a | n/a |
corpora-video-x25-x10.zip | 745,253,955 | 2024-01-08 14:18:30Z | n/a | n/a |
corpora-video-x25-x11.zip | 1,525,394,257 | 2024-01-08 14:18:32Z | n/a | n/a |
corpora-video-x25-x12.zip | 499,977,623 | 2024-01-08 14:18:37Z | n/a | n/a |
corpora-video-x25-x13.zip | 936,817,145 | 2024-01-08 14:18:41Z | n/a | n/a |
corpora-video-x25-x14.zip | 2,358,409,522 | 2024-01-08 14:18:45Z | n/a | n/a |
corpora-video-x25-x15.zip | 1,736,444,545 | 2024-01-08 14:18:47Z | n/a | n/a |
corpora-video-x25-x16.zip | 447,463,904 | 2024-01-08 14:18:49Z | n/a | n/a |
corpora-video-x25-x17.zip | 847,092,600 | 2024-01-08 14:19:00Z | n/a | n/a |
corpora-video-x25-x18.zip | 959,990,209 | 2024-01-08 14:19:04Z | n/a | n/a |
corpora-video-x25-x19.zip | 1,372,981,200 | 2024-01-08 14:19:08Z | n/a | n/a |
corpora-video-x25.zip | 1,289,862,522 | 2024-01-08 14:19:21Z | n/a | n/a |
corpora-video-x26.zip | 1,279,445,380 | 2024-01-08 14:19:29Z | n/a | n/a |
corpora-video-x27.zip | 1,275,999,569 | 2024-01-08 14:19:30Z | n/a | n/a |
corpora-video-x28.zip | 1,261,999,662 | 2024-01-08 14:19:41Z | n/a | n/a |
corpora-video-x29.zip | 1,287,107,332 | 2024-01-08 14:19:42Z | n/a | n/a |
corpora-video-x30.zip | 1,285,028,029 | 2024-01-08 14:19:50Z | n/a | n/a |
corpora-video-x31.zip | 1,270,693,210 | 2024-01-08 14:20:00Z | n/a | n/a |
corpora-video-x32.zip | 1,279,808,537 | 2024-01-08 14:20:01Z | n/a | n/a |
corpora-video-x33.zip | 1,284,038,107 | 2024-01-08 14:20:12Z | n/a | n/a |
corpora-video-x34.zip | 1,160,116,969 | 2024-01-08 14:20:12Z | n/a | n/a |
corpora-video-x35.zip | 1,104,937,770 | 2024-01-08 14:20:23Z | n/a | n/a |
corpora-video-x36.zip | 1,170,951,957 | 2024-01-08 14:20:31Z | n/a | n/a |
corpora-video-x37.zip | 1,172,813,633 | 2024-01-08 14:20:33Z | n/a | n/a |
corpora-video-x38.zip | 1,176,488,498 | 2024-01-08 14:20:41Z | n/a | n/a |
corpora-video-x39.zip | 1,167,686,027 | 2024-01-08 14:20:44Z | n/a | n/a |
corpora-video-x40.zip | 1,149,581,433 | 2024-01-08 14:20:50Z | n/a | n/a |
corpora-video-x41.zip | 1,195,518,418 | 2024-01-08 14:21:00Z | n/a | n/a |
corpora-video-x42.zip | 1,190,054,852 | 2024-01-08 14:21:02Z | n/a | n/a |
corpora-video-x43.zip | 1,164,763,215 | 2024-01-08 14:21:11Z | n/a | n/a |
corpora-video-x44.zip | 1,212,414,906 | 2024-01-08 14:21:13Z | n/a | n/a |
corpora-video-x45.zip | 1,185,607,588 | 2024-01-08 14:21:19Z | n/a | n/a |
corpora-video-x46.zip | 1,163,263,438 | 2024-01-08 14:21:31Z | n/a | n/a |
corpora-video-x47.zip | 1,160,471,547 | 2024-01-08 14:21:33Z | n/a | n/a |
corpora-video-x48.zip | 1,162,925,829 | 2024-01-08 14:21:39Z | n/a | n/a |
corpora-video-x49.zip | 1,177,476,594 | 2024-01-08 14:21:44Z | n/a | n/a |
corpora-video-x50.zip | 1,158,960,004 | 2024-01-08 14:21:48Z | n/a | n/a |
corpora-video-x51.zip | 1,160,182,574 | 2024-01-08 14:21:58Z | n/a | n/a |
corpora-video-x52.zip | 1,171,747,553 | 2024-01-08 14:22:02Z | n/a | n/a |
corpora-video-x53.zip | 1,152,325,983 | 2024-01-08 14:22:07Z | n/a | n/a |
corpora-video-x54.zip | 1,202,252,760 | 2024-01-08 14:22:13Z | n/a | n/a |
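The listing above gives no SHA2-256 or SHA3-256 values (both columns read n/a), so a download can only be checked against the listed size. The sketch below is one possible way to compare an archive's on-disk size with the listing and to compute both digests locally for your own records. The function names `file_digests` and `size_matches` are illustrative and are not part of any tooling published with the corpus.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Return (sha2_256_hex, sha3_256_hex) for the file at `path`.

    The file is read in chunks so multi-gigabyte archives do not
    need to fit in memory.
    """
    sha2 = hashlib.sha256()
    sha3 = hashlib.sha3_256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha2.update(chunk)
            sha3.update(chunk)
    return sha2.hexdigest(), sha3.hexdigest()


def size_matches(path, listed_size):
    """Compare the on-disk size with a size string from the listing, e.g. '8,646'."""
    return Path(path).stat().st_size == int(listed_size.replace(",", ""))
```

For example, `size_matches("README.md", "8,646")` would return `True` for an intact copy of the README, assuming the listed sizes are byte counts.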
Following the release of the curated CC-MAIN-2021-31-PDF-UNTRUNCATED corpus, this new corpus contains over 5.0 million files that were either collected by a team at NASA’s Jet Propulsion Laboratory (JPL) or synthetically generated by Kudu Dynamics’ “Voice of the Offense” (VoO) team for the Defense Advanced Research Projects Agency (DARPA) SafeDocs Program.
The Common Crawl component of this new corpus (over 3.9 million PDF files collected by a team at NASA’s Jet Propulsion Laboratory (JPL)) includes some truncated PDF files, as described in the paper “Building a Wide Reach Corpus for Secure Parser Development” by Allison et al. This component was used within the SafeDocs program as the starting point for many generated files, as discussed below, and as the Program’s evaluation corpus. It was then expanded and improved to become CC-MAIN-2021-31-PDF-UNTRUNCATED.
The corpus contains many files that may cause a parser to crash with a segmentation fault or to hang in an infinite processing loop. Many other files parse differently under different parsers or parser options, which leads to issues such as rendering differentials and pdftotext differentials. At the time of this corpus release, no file in the corpus is known to be "malicious". However, any file downloaded from the Internet can be dangerous: it may contain viruses, malware, or other harmful content that could damage your computer, compromise your personal data, or create other security risks. Anyone downloading these files should therefore handle them with proper security protocols.
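Because files in the corpus may crash a parser or make it hang, it can help to run each parser invocation in a separate process with a time limit. The following is a minimal sketch of that idea, not a procedure prescribed by the SafeDocs program. The function name `run_parser`, the 30-second default, and the outcome labels are assumptions chosen for illustration. Signal-based crash detection as written relies on POSIX behavior, and this sketch provides no sandboxing on its own.

```python
import subprocess


def run_parser(cmd, timeout_s=30.0):
    """Run `cmd` (a list of arguments) in a child process and classify the outcome.

    Returns one of:
      "ok"    - the process exited with status 0
      "error" - the process exited with a non-zero status
      "crash" - the process was killed by a signal (e.g. SIGSEGV; POSIX only)
      "hang"  - the process exceeded `timeout_s` and was killed
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang"
    if proc.returncode < 0:
        # On POSIX, a negative return code means termination by a signal.
        return "crash"
    return "ok" if proc.returncode == 0 else "error"
```

As a usage example, `run_parser(["pdftotext", "sample.pdf", "-"])` would classify one run of pdftotext, assuming that tool is installed and `sample.pdf` is a hypothetical file extracted from the corpus. Running the same file through several parsers and comparing the results is one simple way to surface the differentials described above.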
This corpus contains file and streaming formats used during the SafeDocs program research, Hackathon, and Evaluation events:
- the corpora-pdf-*.zip files contain over 1.5 million unsafe PDF files;
- the corpora-icc-*.zip files contain corpora for ICC color profiles;
- the corpora-nitf-*.zip files contain corpora for the National Imagery Transmission Format (NITF) file format; and
- the corpora-video-*.zip files contain corpora for several streaming video formats.

The corpora-pdf-*.zip files contain over 1.5 million PDF files generated by the VoO team on behalf of the SafeDocs program. The corpus was designed to achieve the necessary diversity and complexity, providing a representative baseline for assessing the security of both performer-created and extant PDF parsers. The PDF files leverage three unique sources developed by the SafeDocs TA3 performer, Kudu Dynamics, for the creation of malformed PDF files that may exhibit potentially dangerous behavior in the anti-pattern known as a “shotgun parser”:
The structure of the SafeDocs corpus consists of the following:
This dataset was gathered by a team at NASA’s Jet Propulsion Laboratory (JPL), California Institute of Technology, while supporting the Defense Advanced Research Projects Agency (DARPA) SafeDocs Program. The JPL team included Chris Mattmann (PI), Wayne Burke, Dustin Graf, Tim Allison, Ryan Stonebraker, Mike Milano, Philip Southam, and Anastasia Menshikova.
The JPL and Kudu Dynamics teams collaborated with Peter Wyatt, the Chief Technology Officer of the PDF Association and PI on the SafeDocs program, in the design and documentation of this corpus. The JPL team and PDF Association would like to thank Simson Garfinkel and the Digital Corpora Project for taking ownership of this dataset and publishing it. Our thanks are extended to the Amazon Open Data Sponsorship Program for enabling this large corpus to be free and publicly available as part of the Digital Corpora Project initiative.
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. Government sponsorship acknowledged.