Memex (Domain-Specific Search)

Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases. To help overcome these challenges, DARPA launched the Memex program in September 2014. Memex seeks to develop software that advances online search capabilities far beyond the current state of the art. The goal is to invent better methods for interacting with and sharing information, so users can quickly and thoroughly organize and search subsets of information relevant to their individual interests. Creation of a new domain-specific indexing and search paradigm will provide mechanisms for improved content discovery, information extraction, information retrieval, user collaboration, and extension of current search capabilities to the deep web, the dark web, and nontraditional (e.g. multimedia) content.

Program Manager: Mr. Wade Shen


The content below has been generated by organizations that are partially funded by DARPA; the views and conclusions contained therein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Last updated: June 27, 2016

InferLink Landmark Extractor Text Extraction Library to extract semi-structured data from similar web pages based on rules. (Python) ALv2
Carnegie Mellon University TAD (Temporal Anomaly Detector) Time series statistics Temporal scan anomaly detection algorithm for time series. (Python) MIT
MIT-LL Text.jl Natural Language Processing Text.jl provides numerous tools for text processing optimized for the Julia language. Supported functionality includes algorithms for feature extraction, text classification, and language identification. (Julia) ALv2
SRI International Hidden Service Forum Spider Infrastructure An interactive web forum analysis tool that operates over Tor hidden services. This tool is capable of passive forum data capture and posting dialog at random or user-specifiable intervals. (Python) SRI open source license
SRI International HSProbe (The Tor Hidden Service Prober) Infrastructure HSProbe is a multi-threaded, STEM-based Python application designed to interrogate the status of Tor hidden services (HSs) and extract hidden service content. It is an HS-protocol-savvy crawler that uses protocol error codes to decide what to do when a hidden service is not reached. HSProbe tests whether specified Tor hidden services (.onion addresses) are listening on one of a range of pre-specified ports and, optionally, whether they are speaking over other specified protocols. As of this version, support for HTTP and HTTPS is implemented. HSProbe takes as input a list of hidden services to be probed and generates as output a similar list of the results of each hidden service probed. (Python) SRI open source license
The Tor Project, SRI International Tor Infrastructure The core software for using and participating in the Tor network. (C) BSDv3
Diffeo, Inc. Dossier Stack Machine Learning Dossier Stack provides a framework of library components for building active search applications that learn what users want by capturing their actions as truth data. The framework's web services and JavaScript client libraries enable applications to efficiently capture user actions, such as organizing content into folders, and allow back-end algorithms to train classifiers and ranking algorithms that recommend content based on those user actions. (Python, JavaScript, Java) MIT
Carnegie Mellon University TJBatchExtractor Natural Language Processing Regex-based information extractor for online advertisements. (Java) MIT
Georgetown University (publications) Dumpling Information Retrieval, Search Algorithms, Machine Learning Dumpling implements a novel dynamic search engine which refines search results on the fly. Dumpling utilizes the Winwin algorithm and the Query Change retrieval Model (QCM) to infer the user's state and tailor search results accordingly. Dumpling provides a friendly user interface for users to compare the static and dynamic results. (Java, JavaScript, HTML, CSS) Public Domain
Georgetown University (publications) TREC-DD Annotation Search Result Evaluation This annotation tool supports the annotation task of creating ground truth data for the TREC Dynamic Domain Track. It adopts a drag-and-drop approach for assessors to annotate passage-level relevance judgments. It also supports multiple ways of browsing and searching the various domains of corpora used in TREC DD. (Python, JavaScript, HTML, CSS) Public Domain
Georgetown University (publications) CubeTest Search Result Evaluation Official evaluation metric for the TREC Dynamic Domain Track. It is a multi-dimensional metric that measures the effectiveness of completing a complex, task-based search process. (Perl) Public Domain
Hyperion Gray, LLC, Scrapinghub Autologin Data Collection AutoLogin is a utility that allows a web crawler to start from any given page of a website (for example the home page) and attempt to find the login page, where the spider can then log in with a set of valid, user-provided credentials to conduct a deep crawl of a site to which the user already has legitimate access. AutoLogin can be used as a library or as a service. (Python) ALv2
Hyperion Gray, LLC, Scrapinghub Formasaurus Data Collection Formasaurus is a Python package that tells users the type of an HTML form: is it a login, search, registration, password recovery, join mailing list, contact form or something else. Under the hood it uses machine learning. (Python) ALv2
Hyperion Gray, LLC, Scrapinghub HG Profiler Data Collection HG Profiler is a tool that allows users to take a list of entities from a particular source and look for those same entities across a pre-defined list of other sources. (Python) ALv2
Hyperion Gray, LLC, Scrapinghub SourcePin Data Collection SourcePin is a tool to assist users in discovering websites that contain content they are interested in for a particular topic, or domain. Unlike a search engine, SourcePin allows a non-technical user to leverage the power of an advanced automated smart web crawling system to generate significantly more results than the manual process typically does, in significantly less time. The user interface of SourcePin allows users to browse through hundreds or thousands of representative images to quickly find the websites they are most interested in. SourcePin also has a scoring system which takes feedback from the user on which websites are interesting and, using machine learning, assigns a score to the other crawl results based on how interesting they are likely to be for the user. The roadmap for SourcePin includes integration with other tools and a capability for users to actually extract relevant information from the crawl results. (Python, JavaScript) ALv2
Hyperion Gray, LLC, Scrapinghub Frontera Data Collection Frontera (formerly Crawl Frontier) is used as part of a web crawler; it can store URLs and prioritize what to visit next. (Python) BSD
Hyperion Gray, LLC, Scrapinghub Distributed Frontera Data Collection Distributed Frontera is an extension to Frontera that provides replication, sharding, and isolation of all parts of a Frontera-based crawler so that it can be scaled and distributed. Frontera (also in the DARPA Open Catalog) is a crawl frontier framework, the part of a crawling system that decides the logic and policies to follow when a crawler is visiting websites (what pages should be crawled next, priorities and ordering, how often pages are revisited, etc.). (Python) BSD
Hyperion Gray, LLC, Scrapinghub Arachnado Data Collection Arachnado is a simple management interface for launching a deep crawl of a specific website. It provides a Tornado-based HTTP API and a web UI for a Scrapy-based crawler. (Python) MIT
Hyperion Gray, LLC, Scrapinghub Splash Data Collection Lightweight, scriptable browser as a service with an HTTP API. (Python) BSD
Hyperion Gray, LLC, Scrapinghub Scrapy-Dockerhub Data Collection Scrapy-Dockerhub is a deployment setup for Scrapy spiders that packages the spider and all dependencies into a Docker container, which is then managed by a Fabric command line utility. With this setup, users can run spiders seamlessly on any server, without the need for Scrapyd, which typically handles the spider management. With Scrapy-Dockerhub, users issue one command to deploy a spider with all its dependencies to the server and a second command to run it. There are also commands for viewing jobs, logs, etc. (Python) ALv2
University of Southern California Information Sciences Institute Karma Infrastructure Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information by modelling it according to an ontology of their choice using a graphical user interface that automates much of the process. (Java, JavaScript) ALv2
University of Southern California Information Sciences Institute, Next Century Corporation DIG Visualization DIG is a visual analysis tool based on a faceted search engine that enables rapid, interactive exploration of large data sets. Users refine their queries by entering search terms or selecting values from lists of aggregated attributes. DIG can be quickly configured for a new domain through simple configuration. (JavaScript) ALv2
IST Research, streamparse Analytics, Distributed Programming, Infrastructure, Systems Integration streamparse runs Python code against real-time streams of data. It allows users to spin up small clusters of stream processing machines locally during development. It also allows remote management of stream processing clusters that are running Apache Storm. It includes a Python module implementing the Storm multi-lang protocol; a command-line tool for managing local development, projects, and clusters; and an API for writing data processing topologies easily. (Python, Clojure) ALv2
IST Research, pykafka Analytics, Distributed Programming, Infrastructure, Systems Integration pykafka is a Python driver for the Apache Kafka messaging system. It enables Python programmers to publish data to Kafka topics and subscribe to existing Kafka topics. It includes a pure-Python implementation as well as an optional C driver for increased performance. It is the only Python driver to have feature parity with the official Scala driver, supporting both high-level and low-level APIs, including balanced consumer groups for high-scale uses. (Python) ALv2
Jet Propulsion Laboratory (publications), Continuum, Kitware ImageSpace Analysis, Visualization ImageSpace provides the ability to analyze and search through large numbers of images. These images may be text searched based on associated metadata and OCR text or a new image may be uploaded as a foundation for a search. (Python) ALv2
Jet Propulsion Laboratory (publications), Continuum FacetSpace Analysis FacetSpace allows the investigation of large data sets based on the extraction and manipulation of relevant facets. These facets may be almost any consistent piece of information that can be extracted from the dataset: names, locations, prices, etc. (JavaScript)
Jet Propulsion Laboratory (publications) ImageCat Infrastructure ImageCat analyses images and extracts their EXIF metadata and any text contained in the image via OCR. It can handle millions of images. (Python, Java) ALv2
Jet Propulsion Laboratory (publications), Continuum MemexGATE Text Analysis, Text Engineering, Indexing Server side framework, command line tool and environment for running large scale General Architecture Text Engineering tasks over document resources such as online ads, debarment information, federal and district court appeals, press releases, news articles, social media streams, etc. (Java) ALv2
Jet Propulsion Laboratory (publications), Kitware SMQTK Analysis Kitware's Social Multimedia Query Toolkit (SMQTK) is an open-source service for ingesting images and video from social media (e.g. YouTube, Twitter), computing content-based features, indexing the media based on the content descriptors, querying for similar content, and building user-defined searches via an interactive query refinement (IQR) process. (Python) BSD
U.S. Naval Research Laboratory (publications) The Tor Path Simulator (TorPS) Experimentation Support, Security TorPS quickly simulates path selection in the Tor traffic-secure communications network. It is useful for experimental analysis of alternative route selection algorithms or changes to route selection parameters. (C++, Python, Bash) BSD
U.S. Naval Research Laboratory (publications) Shadow Experimentation Support, Security Shadow is an open-source network simulator/emulator hybrid that runs real applications like Tor and Bitcoin over a simulated Internet topology. It is light-weight, efficient, scalable, parallelized, controllable, deterministic, accurate, and modular. (C) BSD
NYU (publications) ACHE Data Collection, Information Retrieval ACHE is a focused crawler. Users can customize the crawler to search for different topics or objects on the Web. (Java) GPL
NYU (publications) ACHE - DDT Data Collection, Information Retrieval DDT is an interactive system that helps users explore and better understand a domain (or topic) as it is represented on the Web. It achieves this by integrating human insights with machine computation (data mining and machine learning) through visualization. DDT allows a domain expert to visualize and analyze pages returned by a search engine or a crawler, and easily provide feedback about relevance. This feedback, in turn, can be used to address two challenges: (1) guide users in the process of domain understanding and help them construct effective queries to be issued to a search engine; and (2) configure focused crawlers that efficiently search the Web for additional pages on the topic. DDT allows users to quickly select crawling seeds as well as the positive and negative examples required to create a page classifier for the focus topic. (Python, Java, JavaScript) BSD
Jet Propulsion Laboratory (publications), NYU (publications), Continuum Analytics Memex Explorer Analytics, Visualization, research integration, interface Memex Explorer is a pluggable framework for domain specific crawls, search, and unified interface for Memex Tools. It includes the capability to add links to other web-based apps (not just Memex) and the capability to start, stop, and analyze web crawls using 2 different crawlers - ACHE and Nutch. (Python) BSD
NYU (publications), Continuum Analytics, Jet Propulsion Laboratory (publications) Topic Space Analytics, Visualization Tool for visualizing topics in document collections. (Python) ASL
Uncharted Software TellFinder Visualization, Analytics, Information Retrieval TellFinder provides efficient visual analytics to automatically characterize and organize publicly available Internet data. Compared to standard web search engines, TellFinder enables users to research case-related data in significantly less time. Reviewing TellFinder's automatically characterized groups also allows users to understand temporal patterns, relationships and aggregate behavior. The techniques are applicable to various domains. (JavaScript, Java) MIT
ArrayFire ArrayFire Analytics, API, Distributed Programming, Image Processing, Machine Learning, Signal Processing, Visualization ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array-based function set makes parallel programming simple. ArrayFire's multiple backends (CUDA, OpenCL, and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving users valuable time and lowering development costs. (C, C++, Python, Fortran, Java) BSDv3
Stanford University DeepDive Infrastructure DeepDive is a new type of knowledge base construction system that enables developers to analyze data on a deeper level than ever before. Many applications have been built using DeepDive to extract data from millions of documents, Web pages, PDFs, tables, and figures. DeepDive is a trained system, which means that it uses machine-learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive can deal with noisy and imprecise data by producing calibrated probabilities for every assertion it makes. DeepDive offers a scalable, high-performance learning engine. (SQL, Python, C++) ALv2
Qadium Plumber Infrastructure Plumber is designed to facilitate distributed data exploration. With the "plumb" command line tool, developers and data scientists can deploy and manage data enhancers on a Kubernetes cluster. Plumber provides a system for writing Python scripts that perform data enhancement (e.g. apply a regex, make a call to a database, link data together) and automatically distributes those scripts to optimize performance. (Python) ALv2
Qadium CommonCrawlJob Analysis Extract regular expressions from Common Crawl. This is a useful library for collecting unique identifiers at Internet scale without any crawling, using only Python and AWS. (Python, AWS) ALv2
Qadium Link Visualization, Analytics, Information Retrieval Link is a domain-specific, entity-centric search tool designed for analysts. Link is a front-end web app that sits on top of data enhanced by Plumber and provides a framework that must be tailored on a per-domain basis. (JavaScript) ALv2
Qadium Omakase Infrastructure Omakase provides a simple and flexible interface to share data, computations, and visualizations between a variety of user roles in both local and cloud environments. (Python, Clojure) EPL
Qadium credstmpl Infrastructure Command-line tool to instantiate templates from credentials stored in CredStash. credstmpl makes it easy to share secret credentials across a large team. (Python) ALv2
Qadium linkalytics Analytics Linkalytics is a suite of back-end analytics to link together disparate data. Linkalytics is intended to be hosted as an API that users can use to enhance, group, and cluster data. (Python) ALv2
Sotera Defense Solutions DataWake Analytics, Distributed Programming, Data Collection The Datawake project aggregates user browsing data via a plug-in using domain-specific searches. This captured, or extracted, data is organized into browse paths and elements of interest. This information can be shared or expanded amongst teams of individuals. Elements of interest, which are extracted either automatically or manually by the user, are given weighted values. The exported data can be used to specify a new domain and seed crawlers. (Python, Java, Scala, Clojure, JavaScript) ALv2
Uncharted Software Aperture Tile-Based Visual Analytics Visualization New tools for raw data characterization of 'big data' are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points. (JavaScript/Java) MIT
MIT-LL MITIE Analytics Trainable named entity extractor (NER) and relation extractor. (C) ALv2
MIT-LL Topic Analytics This tool takes a set of text documents, filters by a given language, and then produces documents clustered by topic. The method used is Probabilistic Latent Semantic Analysis (PLSA). (Python) ALv2
IST Research Scrapy Cluster Information Collection, Distributed Environments Scrapy Cluster is a scalable, distributed web crawling cluster based on Scrapy and coordinated via Kafka and Redis. It provides a framework for intelligent distributed throttling as well as the ability to conduct time-limited web crawls. (Python) BSD
Columbia University ColumbiaImageSearch Analytics, Classification, Hashing, Image search ColumbiaImageSearch provides highly efficient solutions for finding images of similar content from large collections in real time. It combines a unique image representation, called DeepSentiBank, and novel hashing techniques to encode each image with a very compact hash code, which can reduce the computational and storage costs by orders of magnitude and allows searching over millions of images in real time. The search tool API is exposed as a PHP file, and input/output processing is performed inside a Python script, while the actual image search is done in C++ for efficiency. (Python, C++, PHP, CSS, JavaScript) BSD
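
The Frontera entries above describe a crawl frontier as the part of a crawler that stores URLs and decides what to visit next. The idea can be illustrated with a minimal in-memory sketch using only the Python standard library. This is not Frontera's API: the `CrawlFrontier` class, its `add` and `next_url` methods, the numeric `score` parameter, and the example.com URLs are all hypothetical names chosen for illustration, and a real frontier would also handle revisit policies, persistence, and distribution.

```python
import heapq
from urllib.parse import urldefrag


class CrawlFrontier:
    """Toy in-memory crawl frontier (hypothetical, not Frontera's API)."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # every URL ever scheduled, to avoid re-queueing
        self._counter = 0    # tie-breaker: equal scores pop in insertion order

    def add(self, url, score=0.0):
        """Schedule a URL; return False if it was already scheduled."""
        url, _fragment = urldefrag(url)  # page#a and page#b are the same page
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def next_url(self):
        """Return the highest-scoring pending URL, or None when empty."""
        if not self._heap:
            return None
        _neg_score, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = CrawlFrontier()
frontier.add("http://example.com/", score=1.0)
frontier.add("http://example.com/about", score=0.2)
frontier.add("http://example.com/login", score=0.9)
frontier.add("http://example.com/#top", score=5.0)  # duplicate of the first URL

# Drain the frontier in priority order.
order = []
while len(frontier):
    order.append(frontier.next_url())
print(order)
```

In this sketch the crawler would call `add` for each link it discovers and `next_url` whenever it is ready to fetch another page; the scoring function that ranks links is left to the caller.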
Georgetown University Designing States, Actions, and Rewards for Using POMDP in Session Search
Georgetown University Detecting the Eureka Effect in Complex Search
Georgetown University Is the First Query the Most Important: An Evaluation of Query Aggregation Schemes in Web Search Sessions
Georgetown University A Fragment-Based Similarity Measure for Concept Hierarchies and Ontologies
Georgetown University Query Aggregation in Session Search
Georgetown University Modeling Rich Interactions in Session Search - Georgetown University at TREC 2014 Session Track
Georgetown University Dynamic Information Retrieval Modeling
Georgetown University Browsing Hierarchy Construction by Minimum Evolution
Georgetown University The Query-Change Model: Modeling Session Search as a Markov Decision Process
Georgetown University A Term-Based Methodology for Query Reformulation Understanding
U.S. Naval Research Laboratory Genuine Onion: Simple, Fast, Flexible, and Cheap Website Authentication
NYU Bootstrapping Focused Crawlers in Sparse Topics
Jet Propulsion Laboratory, Georgetown University Multimedia Metadata-based Forensics in Human Trafficking Web Data
Jet Propulsion Laboratory Clustering Web Pages Based on Structure and Style Similarity
Jet Propulsion Laboratory A New Application of the Cosine Similarity Metric for Scalable Domain Discovery