

XDATA is developing an open source software library for big data to help overcome the challenges of effectively scaling to modern data volume and characteristics. The program is developing the tools and techniques to process and analyze large sets of imperfect, incomplete data. Its programs and publications focus on the areas of analytics, visualization, and infrastructure to efficiently fuse, analyze, and disseminate these large volumes of data.

Program Manager: Mr. Wade Shen


The content below has been generated by organizations that are partially funded by DARPA; the views and conclusions contained therein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.


Last updated: November 13, 2015

Team | Project | Description | Instructional Material | Category | Code | Stats | License
Boeing, University of Pittsburgh SMILE-WIDE: A scalable Bayesian network library SMILE-WIDE is a scalable Bayesian network library. Initially, it is a version of the SMILE library, as in SMILE With Integrated Distributed Execution. The general approach has been to provide an API similar to the existing API SMILE developers use to build "local," single-threaded applications. However, we provide "vectorized" operations that hide a Hadoop-distributed implementation. Apart from invoking a few idioms like generic Hadoop command line argument parsing, these appear to the developer as if they were executed locally. (Java) Analytics stats ALv2
Aptima, Inc. Network Query by Example Hadoop MapReduce-over-Hive based implementation of network query by example utilizing attributed network pattern matching. (Java) Not Available Analytics stats ALv2
Carnegie Mellon University (publications) skl-groups A package extending the Python machine learning toolkit scikit-learn with support for operating on sets ("groups") as features. Available Analytics stats BSD
Continuum Analytics (publications) Blaze Blaze is the next generation of NumPy. It is designed as a foundational set of abstractions on which to build out-of-core and distributed algorithms over a wide variety of data sources and to extend the structure of NumPy itself. Blaze allows easy composition of low-level computation kernels (C, Fortran, Numba) to form complex data transformations on large datasets. In Blaze, computations are described in a high-level language (Python) but executed on a low-level runtime (outside of Python), enabling the easy mapping of high-level expertise to data without sacrificing low-level performance. Blaze aims to bring Python and NumPy into the massively multicore arena, allowing it to leverage many CPU and GPU cores across computers, virtual machines and cloud services. (Python) Available Infrastructure stats BSD
Continuum Analytics (publications) Numba Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the LLVM compiler infrastructure to compile Python syntax to machine code.

It is aware of NumPy arrays as typed memory regions and so can speed up code using NumPy arrays. Other, less well-typed code is translated to Python C-API calls, effectively removing the "interpreter" but not removing the dynamic indirection.

Numba is also not a tracing just-in-time (JIT) compiler. It compiles your code before it runs, using either run-time type information or type information you provide in the decorator.

Numba is a mechanism for producing machine code from Python syntax and typed data structures such as those that exist in NumPy. (Python)
Available Infrastructure stats BSD
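The two compilation modes mentioned in the Numba entry (types inferred at the first call versus types supplied in the decorator) can be sketched as follows. The function names are illustrative, and the `ImportError` fallback is an assumption added here only so the snippet still runs where Numba is not installed; it is not part of Numba.

```python
try:
    from numba import njit  # Numba's nopython-mode JIT decorator
except ImportError:
    # Assumption for illustration: a no-op stand-in when Numba is missing.
    def njit(*args, **kwargs):
        # Used bare (@njit): the decorated function is the only argument.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        # Used with a signature (@njit("...")): return a pass-through decorator.
        return lambda func: func


@njit
def sum_of_squares(n):
    """Compiled on first call, using the run-time type of `n`."""
    total = 0
    for i in range(n):
        total += i * i
    return total


@njit("int64(int64)")
def sum_of_squares_typed(n):
    """Compiled up front from the signature given in the decorator."""
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10), sum_of_squares_typed(10))
```

Both functions compute the same value; they differ only in when the type information becomes available to the compiler.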
Continuum Analytics (publications) Bokeh Bokeh (pronounced bo-Kay or bo-Kuh) is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients. (Python/JavaScript/Coffeescript) Available Visualization stats BSD
Data Tactics Corporation Vowpal Wabbit The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. Support is available through the mailing list. There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation. There are several optimization algorithms available, with the baseline being sparse gradient descent (GD) on a loss function (several loss functions are available). The code should be easily usable. Its only external dependency is the Boost library, which is often installed by default. (C) Visualization stats BSD
Data Tactics Corporation Darpa Open Catalog Generator Code and templates for building the DARPA open catalog as hosted on Analytics stats ALv2
Data Tactics Corporation Circuit Circuit reduces the human development and sustenance costs of complex massively-scaled systems nearly to the level of their single-process counterparts. It is a combination of proven ideas from the Erlang ecosystem of distributed embedded devices and Go's ecosystem of Internet application development. Circuit extends the reach of Go's linguistic environment to multi-host/multi-process applications. (Go) Infrastructure stats ALv2
Georgia Tech / GTRI (publications) SmallK: A high-performance library for nonnegative matrix factorization and hierarchical clustering SmallK is a high-performance, parallel library for nonnegative matrix factorization on both dense and sparse matrices written in C++. Implementations of several different NMF algorithms are provided, including multiplicative updating, hierarchical alternating least squares, nonnegative least squares with block principal pivoting, and a new rank2 algorithm. The library provides an implementation of hierarchical clustering based on the rank2 NMF algorithm. (C++) Available Analytics stats ALv2
Giant Oak, Inc. Markov modulated Poisson process for event detection Markov Modulated Poisson Process for Event Detection allows R users to accurately detect unusual events and anomalies in time series of counts. (R) Available Analytics stats GPLv2
IBM Research (publications) SKYLARK: Randomized Numerical Linear Algebra and ML SKYLARK implements Numerical Linear Algebra (NLA) kernels based on sketching for distributed computing platforms. Sketching reduces dimensionality through randomization, and includes Johnson-Lindenstrauss random projection (JL); a faster version of JL based on fast transform techniques; sparse techniques that can be applied in time proportional to the number of nonzero matrix entries; and methods for approximating kernel functions and Gram matrices arising in nonlinear statistical modeling problems. We have a library of such sketching techniques, built using MPI in C++ and callable from Python, and are applying the library to regression, low-rank approximation, and kernel-based machine learning tasks, among other problems. (C++/Python) Analytics stats ALv2
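As a rough illustration of the Johnson-Lindenstrauss idea named in the SKYLARK entry, the sketch below projects vectors through a scaled Gaussian random matrix and compares a pairwise distance before and after. It is a plain-Python toy under stated assumptions, not SKYLARK's API: the function names, dimensions, and seeds are all hypothetical.

```python
import math
import random


def make_jl_matrix(original_dim, sketch_dim, seed=0):
    """Dense Gaussian sketching matrix, scaled so norms are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(sketch_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(original_dim)]
            for _ in range(sketch_dim)]


def sketch(matrix, vector):
    """Apply the sketching matrix to one vector (matrix-vector product)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(1)
d, k = 500, 200  # original and reduced dimensions (arbitrary choices)
x = [rng.random() for _ in range(d)]
y = [rng.random() for _ in range(d)]

S = make_jl_matrix(d, k)
ratio = distance(sketch(S, x), sketch(S, y)) / distance(x, y)
print(f"distance ratio after sketching: {ratio:.3f}")
```

The ratio should stay close to 1, which is what lets downstream regression or low-rank routines work on the smaller sketched data.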
Institute for Creative Technologies, USC (publications) Immersive Body-Based Interactions Provides innovative interaction techniques to address human-computer interaction challenges posed by Big Data. Examples include:
* Wiggle Interaction Technique: user-induced motion to speed visual search.
* Immersive Tablet Based Viewers: low-cost 3D virtual reality fly-throughs of data sets.
* Multi-touch interfaces: browsing/querying multi-attribute and geospatial data, hosted by SOLR.
* Tablet-based visualization controller: eye-free rapid interaction with visualizations.
Not Available Visualization stats ALv2
USC (publications) Parallel Louvain Community Fast MPI based parallel Louvain community detection algorithm easily mappable to Map Reduce. Available Analytics stats ALv2
USC (publications) Parallel High Betweenness Nodes identification Fast MPI based high betweenness centrality identification algorithm extendible to cloud graph processing platforms such as Giraph++ or GoFFish. Available Analytics stats ALv2
Johns Hopkins University (publications) igraph igraph is a collection of network analysis tools with the emphasis on efficiency, portability and ease of use. igraph can be programmed in GNU R, Python and C/C++. Available Analytics stats GPLv2
Trifacta (Stanford, University of Washington, Kitware, Inc. Team) Vega Vega is a visualization grammar, a declarative format for creating and saving visualization designs. With Vega you can describe data visualizations in a JSON format, and generate interactive views using either HTML5 Canvas or SVG. (JavaScript) Visualization stats BSD
Kitware, Inc. (publications), Sotera Defense Solutions, Inc. (publications) Tangelo Tangelo provides a flexible HTML5 web server architecture that cleanly separates your web applications (pure JavaScript, HTML, and CSS) and web services (pure Python). This software is bundled with some great tools to get you started. (JavaScript/Python) Visualization stats ALv2
Sotera Defense Solutions, Inc. (publications) Newman Newman is a tool to quickly analyze and explore email using advanced analytics and visualization techniques - things not possible with traditional email applications. (JavaScript, Python) Visualization ALv2
Sotera Defense Solutions, Inc. (publications) GEQE Geo Event Query by Example (GEQE) leverages geo-located temporal text data to identify similar locations or events. (JavaScript, Python) Analytics stats UN
Harvard University (publications), Kitware, Inc. (publications) LineUp LineUp is a novel and scalable visualization technique that uses bar charts. This interactive technique supports the ranking of items based on multiple heterogeneous attributes with different scales and semantics. It enables users to interactively combine attributes and flexibly refine parameters to explore the effect of changes in the attribute combination. This process can be employed to derive actionable insights as to which attributes of an item need to be modified in order for its rank to change. Additionally, through integration of slope graphs, LineUp can also be used to compare multiple alternative rankings on the same set of items, for example, over time or across different attribute combinations. We evaluate the effectiveness of the proposed multi-attribute visualization technique in a qualitative study. The study shows that users are able to successfully solve complex ranking tasks in a short period of time. (Java) Visualization stats BSD
Harvard University (publications), Kitware, Inc. (publications) LineUp Web LineUpWeb is the web version of the novel and scalable visualization technique. This interactive technique supports the ranking of items based on multiple heterogeneous attributes with different scales and semantics. It enables users to interactively combine attributes and flexibly refine parameters to explore the effect of changes in the attribute combination. Visualization stats BSD
The New School (publications) Visualization Widgets These visualizations were created to demonstrate the type of standalone visualization widgets that might complement a composite dashboard display for a decision-maker. They are built using D3 and leverage relevant APIs to show the latest available data. (JavaScript) Not Available Visualization stats ALv2
Stanford University, University of Washington, Kitware, Inc. (publications) Lyra Lyra is an interactive environment that makes custom visualization design accessible to a broader audience. With Lyra, designers map data to the properties of graphical marks to author expressive visualization designs without writing code. Marks can be moved, rotated and resized using handles; relatively positioned using connectors; and parameterized by data fields using property drop zones. Lyra also provides a data pipeline interface for iterative, visual specification of data transformations and layout algorithms. Visualizations created with Lyra are represented as specifications in Vega, a declarative visualization grammar that enables sharing and reuse. (JavaScript) Visualization stats BSD
Phronesis stat_agg stat_agg is a Python package that provides statistical aggregators that maximize ensemble prediction accuracy by weighting individual learners in an optimal way. When used with the laputa package, learners may be distributed across a cluster of machines. The package also provides fault-tolerance when one or more learners becomes unavailable. (Python) Analytics stats ALv2
Phronesis flexmem Flexmem is a general, transparent tool for out-of-core (OOC) computing in the R programming environment. It is launched as a command line utility, taking an application as an argument. All memory allocations larger than a specified threshold are memory-mapped to a binary file. When data are not needed, they are stored on disk. It is both process- and thread-safe. (C) Infrastructure stats ALv2
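The memory-mapping idea behind the flexmem entry (a large allocation backed by a binary file rather than by RAM alone) can be sketched with Python's standard `mmap` module. This illustrates the general technique only; flexmem itself is a C tool for R, and every function name below is hypothetical.

```python
import mmap
import os
import struct
import tempfile

DOUBLE_SIZE = struct.calcsize("d")  # bytes per native double


def allocate_file_backed(num_doubles):
    """Create a temporary binary file and map it into memory."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        os.ftruncate(fd, num_doubles * DOUBLE_SIZE)  # zero-filled file
        buffer = mmap.mmap(fd, num_doubles * DOUBLE_SIZE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor closes
    return buffer, path


def store(buffer, index, value):
    """Write one double at the given element index."""
    struct.pack_into("d", buffer, index * DOUBLE_SIZE, value)


def load(buffer, index):
    """Read one double from the given element index."""
    return struct.unpack_from("d", buffer, index * DOUBLE_SIZE)[0]


def release(buffer, path):
    """Unmap the buffer and delete its backing file."""
    buffer.close()
    os.remove(path)


buf, backing_path = allocate_file_backed(1000)
store(buf, 999, 2.5)
print(load(buf, 999), os.path.getsize(backing_path))
```

The operating system pages data between disk and memory on demand, so the program reads and writes the buffer as if it were ordinary memory.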
Phronesis laputa Laputa is a Python package that provides an elastic, parallel computing foundation for the stat_agg (statistical aggregates) package. (Python) Infrastructure stats ALv2
Phronesis bigmemory Bigmemory is an R package to create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped files. Packages biganalytics, bigtabulate, synchronicity, and bigalgebra provide advanced functionality. (R) Infrastructure ALv2
Phronesis bigalgebra Bigalgebra is an R package that provides arithmetic functions for R matrix and big.matrix objects. (R) Infrastructure ALv2
Jet Propulsion Laboratory (publications) Apache OODT Apache OODT enables transparent access to distributed resources, data discovery and query optimization, and distributed processing and virtual archives. OODT provides a software architecture that enables models for information representation, solutions to knowledge capture problems, and unification of technology, data, and metadata. (Java) Available Infrastructure ALv2
USC/Information Sciences Institute Wings WINGS is a semantic workflow system that can be used to automate data analysis processes represented as workflows of computations. A unique feature of WINGS is that its workflow representations incorporate semantic constraints about datasets and workflow components, and are used to create and validate workflows and to generate metadata for new data products. WINGS submits workflows to scalable execution frameworks such as Apache OODT to run workflows at large scale in distributed resources. (Java/JavaScript) Available Infrastructure stats ALv2
Jet Propulsion Laboratory (publications) Apache Tika The Apache Tika(TM) toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. Available Infrastructure ALv2
Jet Propulsion Laboratory (publications) Apache Gora The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. Available Infrastructure stats ALv2
Jet Propulsion Laboratory (publications) Apache Nutch A large scale web crawler framework that implements the search engine architecture as originally defined by Brin and Page. Nutch was started by Doug Cutting and is the predecessor to the Apache Hadoop technology. It includes parsers, a protocol framework, language detection, indexing capabilities and language and query models to fully implement a search engine. Available Infrastructure ALv2
Jet Propulsion Laboratory (publications) DRAT A distributed, parallelized (Map Reduce) wrapper around Apache RAT to allow it to complete on large code repositories of multiple file types where Apache RAT hangs forever. Available Infrastructure stats ALv2
Jet Propulsion Laboratory (publications) Khooshe A Big Data-Points Visualization Tool Available Visualization stats ALv2
MIT-LL (publications) Query By Example (Graph QuBE) Graph QuBE is a tool which enables efficient pattern-of-behavior search in data containing entities transacting over time. Available Analytics stats ALv2
MIT-LL (publications) VizLinc VizLinc is a visual analytics platform that takes as input a corpus of text documents, extracts named entities (people, locations, and organizations) and the relations between those entities from the documents, and allows a user to explore the information contained in the documents from both a high-level corpus viewpoint and with respect to narrower queries. (Java/Groovy) Visualization stats GPL, CDDL
MIT-LL (publications) Julia Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. Analytics stats MIT, GPL, LGPL, BSD
MIT-LL (publications) Topic This tool takes a set of text documents, filters by a given language, and then produces documents clustered by topic. The method used is Probabilistic Latent Semantic Analysis (PLSA). (Python) Analytics stats ALv2
MIT-LL (publications) SciDB Scientific Database for large-scale numerical data. Register on the forum to access the download page. (C++) Infrastructure GPLv3
MIT-LL (publications) Information Extractor Trainable named entity extractor (NER) and relation extractor. (C) Analytics stats ALv2
Next Century Corporation Ozone Widget Framework Ozone Widget Framework provides a customizable open-source web application that assembles the tools you need to accomplish any task and enables those tools to communicate with each other. It is a technology-agnostic composition framework for data and visualizations in a common browser-based display and interaction environment that lowers the barrier to entry for the development of big data visualizations and enables efficient exploration of large data sets. (JavaScript) Visualization stats ALv2
Next Century Corporation Neon Visualization Environment Neon is a framework that gives a datastore agnostic way for visualizations to query data and perform simple operations on that data such as filtering, aggregation, and transforms. It is divided into two parts, neon-server and neon-client. Neon-server provides a set of RESTful web services to select a datastore and perform queries and other operations on the data. Neon-client is a JavaScript API that provides a way to easily integrate neon-server capabilities into a visualization, and also aids in 'widgetizing' a visualization, allowing it to be integrated into a common OWF based ecosystem. (Groovy/JavaScript) Visualization stats ALv2
Uncharted Software (formerly Oculus Info Inc.) ApertureJS ApertureJS is an open, adaptable and extensible JavaScript visualization framework with supporting REST services, designed to produce visualizations for analysts and decision makers in any common web browser. Aperture utilizes a novel layer based approach to visualization assembly, and a data mapping API that simplifies the process of adaptable transformation of data and analytic results into visual forms and properties. Aperture vizlets can be easily embedded with full interoperability in frameworks such as the Ozone Widget Framework (OWF). (JavaScript/Java) Visualization stats MIT
Uncharted Software (formerly Oculus Info Inc.) Influent Influent is an HTML5 tool for visually and interactively following transaction flow, rapidly revealing actors and behaviors of potential concern that might otherwise go unnoticed. Summary visualization of transactional patterns and actor characteristics, interactive link expansion and dynamic entity clustering enable Influent to operate effectively at scale with big data sources in any modern web browser. Influent has been used to explore data sets with millions of entities and hundreds of millions of transactions. (JavaScript/Java) Visualization stats MIT
Uncharted Software (formerly Oculus Info Inc.) Aperture Tile-Based Visual Analytics New tools for raw data characterization of 'big data' are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points. (JavaScript/Java) Visualization stats MIT
Uncharted Software (formerly Oculus Info Inc.) Uncharted Ensemble Clustering Uncharted Ensemble Clustering is a flexible multi-threaded clustering library for rapidly constructing tailored clustering solutions that leverage the different semantic aspects of heterogeneous data. The library can be used on a single machine using multi-threading or distributed computing using Spark. (Java) Analytics stats MIT
Raytheon BBN Content and Context-based Graph Analysis: PINT, Patterns in Near-Real Time Patterns in Near-Real Time will take any corpus as input and quantify the strength of the query match to an SME-based process model, represent the process model as a Directed Acyclic Graph (DAG), and then search and score potential matches. (Python) Not Available Analytics stats ALv2
Raytheon BBN Content and Context-based Graph Analysis: NILS, Network Inference of Link Strength Network Inference of Link Strength will take any text corpus as input and quantify the strength of connections between any pair of entities. Link strength probabilities are computed via shortest path. (Python) Not Available Analytics stats ALv2
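One plausible reading of "link strength probabilities computed via shortest path" is sketched below. It assumes each direct link carries a probability and that the strength between two entities is the largest product of probabilities along any connecting path, found with Dijkstra's algorithm on negative log probabilities. That interpretation, the function name, and the sample data are assumptions for illustration, not a description of the NILS code.

```python
import heapq
import math


def link_strength(edges, source, target):
    """Best path probability between two entities.

    `edges` maps an undirected pair (u, v) to a probability in (0, 1].
    Multiplying probabilities equals adding -log(p), so the most probable
    path is the shortest path under those costs.
    """
    graph = {}
    for (u, v), p in edges.items():
        cost = -math.log(p)
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))

    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return 0.0  # no path: treat the pair as unconnected


sample_edges = {
    ("alice", "bob"): 0.9,
    ("bob", "carol"): 0.8,
    ("alice", "carol"): 0.5,
}
print(link_strength(sample_edges, "alice", "carol"))
```

In the sample data the indirect route through "bob" beats the weaker direct link, which is the kind of inference a shortest-path formulation makes possible.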
Royal Caliber (publications) A vertex-centric CUDA/C++ API for large graph analytics on GPUs using the Gather-Apply-Scatter abstraction Allows users to express graph algorithms as a series of Gather-Apply-Scatter (GAS) steps similar to GraphLab. Runs these vertex programs using a single or multiple GPUs - demonstrates a large speedup over GraphLab. (C++) Not Available Analytics stats ALv2
Scientific Systems Company, Inc. (publications), MIT (publications), University of Louisville (publications) BayesDB BayesDB is an open-source implementation of a predictive database table. It provides predictive extensions to SQL that enable users to query the implications of their data -- predict missing entries, identify predictive relationships between columns, and examine synthetic populations -- based on a Bayesian machine learning system in the back end. (Python) Analytics stats ALv2
Sotera Defense Solutions, Inc. (publications) Zephyr Zephyr is a big data, platform agnostic Extract-Transform-Load (ETL) API, with Hadoop MapReduce, Storm, and other big data bindings. (Java) Infrastructure stats ALv2
Sotera Defense Solutions, Inc. (publications) Aggregate Micro Paths An analytic to help infer movement patterns from large amounts of geospatial-temporal data in a cloud environment. (Python, Scala) Analytics stats ALv2
Sotera Defense Solutions, Inc. (publications) RHIPE ARIMA This implementation of the ARIMA (AutoRegressive Integrated Moving Average) algorithm is based on R, Hadoop, and the RHIPE (Hree-pay) framework. (R) Analytics stats ALv2
Sotera Defense Solutions, Inc. (publications) Correlation Approximation Spark implementation of the Google Correlate algorithm to quickly find highly correlated vectors in huge datasets. (Scala) Analytics stats ALv2
Sotera Defense Solutions, Inc. (publications) Graphene Graphene is a web-based application that provides combined query, visualization, link identification and analysis, and other analytic capabilities within a single system. It allows a user the ability to intelligently search structured data from multiple data sources and can display transactional views, transactional graphs evolving over time, related-entity (shared-attribute network) graphs, drillable transaction histograms, directional transaction charts, activity plots, data export, and more. The combination of these capabilities makes it a valuable tool for analyzing any kind of data that can be manipulated to reveal transactions and relationships of any sort. (Java, JavaScript) Visualization stats ALv2
Sotera Defense Solutions, Inc. (publications) Distributed Graph Analytics Distributed Graph Analytics (DGA) is a compendium of graph analytics written for Bulk-Synchronous-Parallel (BSP) processing frameworks such as Giraph and GraphX. The analytics included are High Betweenness Set Extraction, Weakly Connected Components, Page Rank, Leaf Compression, and Louvain Modularity. (Scala, Java) Analytics stats ALv2
Sotera Defense Solutions, Inc. (publications) Track Communities An analytic for creating networks from geo-temporal track data based on time/space co-occurrence. Includes UI for visualization of communities and tracks. This tool is a synthesis of several analytic components and visualization techniques that allow a user to browse a network of communities, follow tracks of movement, and observe co-location highlights within a dynamic graph. (Python, JavaScript) Analytics, Visualization stats ALv2
Stanford University - Boyd (publications) LowRankModels.jl LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low rank matrix, and include many well-known models in data analysis, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, k-means, and many more. LowRankModels fits GLRMs using an alternating directions proximal gradient algorithm. GLRMs and algorithms for fitting GLRMs are described in detail in an associated paper, at (which also links to the code). (Julia) Analytics stats MIT
Stanford University - Boyd (publications) SCS (Self-dual Cone Solver) Implementation of a solver for general cone programs, including linear, second-order, semidefinite and exponential cones, based on an operator splitting method applied to a self-dual homogeneous embedding. The method and software support both direct factorization, with factorization caching, and an indirect method that requires only the operator associated with the problem data and its adjoint. The implementation includes interfaces to CVX, CVXPY, and MATLAB, as well as test routines. This code is described in detail in an associated paper, at (which also links to the code). (C) Analytics stats ALv2
Stanford University - Boyd (publications) ECOS: An SOCP Solver for Embedded Systems ECOS is a lightweight primal-dual homogeneous interior-point solver for SOCPs, for use in embedded systems as well as a base solver for use in large scale distributed solvers. It is described in the paper at (C) Analytics stats ALv2
Phronesis, LLC Geofin Geofin allows exploration of patterns in data at global, national, local, and human scales. Quickly triage locations and entities to find patterns and trends using customizable summary statistics and visualizations. (JavaScript, Elasticsearch) Visualization ALv2
Stanford University - Boyd (publications) CVXPY CVXPY is a Python-embedded modeling language for convex optimization problems. It allows you to express your problem in a natural way that follows the math, rather than in the restrictive standard form required by solvers. (Python) Analytics stats GPLv3
Stanford University - Hanrahan (publications) imMens imMens is a web-based system for interactive visualization of large databases. imMens uses binned aggregation to produce summary visualizations that avoid the shortcomings of standard sampling-based approaches. Through data decomposition methods (to limit data transfer) and GPU computation via WebGL (for parallel query processing), imMens enables real-time (50fps) visual querying of billion+ element databases. (JavaScript) Visualization stats BSD
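The binned aggregation mentioned in the imMens entry can be illustrated with a small sketch: instead of drawing or sampling individual records, points are counted into a fixed grid, and the grid is what gets visualized. The function name, grid size, and sample points are assumptions for illustration and are unrelated to the imMens code base.

```python
def bin_counts(points, x_range, y_range, bins_x, bins_y):
    """Count 2D points into a bins_y-by-bins_x grid over the given ranges."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * bins_x for _ in range(bins_y)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted region
        # Clamp so values exactly on the upper edge fall in the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * bins_x), bins_x - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins_y), bins_y - 1)
        grid[row][col] += 1
    return grid


sample_points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (1.0, 1.0), (2.0, 2.0)]
grid = bin_counts(sample_points, (0.0, 1.0), (0.0, 1.0), bins_x=2, bins_y=2)
print(grid)
```

The size of the grid depends on the number of bins rather than the number of records, which is why the summary stays small even for very large inputs.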
PNNL (publications), Purdue University (publications) Tessera Tessera is an open source environment for deep analysis of big data. At the front end, the analyst programs in R, and has access to the thousands of methods of statistics, machine learning, and visualization implemented in R. At the back end is a distributed parallel computational environment such as Hadoop, that enables scaling to big data. In between are three Tessera packages: datadr, Trelliscope, and RHIPE (R and Hadoop Integrated Programming Environment). These packages enable the data analyst to communicate with the back end by simple R commands, and not have to worry about the details of distributed parallel computation. Tessera is powered by a statistical approach to large complex data, Divide and Recombine (D&R). The data are parallelized, not the thousands of methods, which makes back end computation typically very nearly embarrassingly parallel, and therefore very fast. But at the same time, D&R statistical division and recombination methods ensure good statistical performance. Infrastructure, Analytics, Visualization stats BSD, ALv2
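The Divide and Recombine pattern described in the Tessera entry can be sketched in a few lines: split the data into subsets, apply the same analytic method to each subset independently, then combine the per-subset outputs. The example below uses the mean as the method and is written in Python for illustration only; Tessera itself is R-based, and these function names are hypothetical.

```python
def divide(records, num_subsets):
    """Split the data into roughly equal subsets (round-robin division)."""
    return [records[i::num_subsets] for i in range(num_subsets)]


def apply_to_subset(subset):
    """Per-subset analytic: here, a partial sum and count for a mean."""
    return sum(subset), len(subset)


def recombine(partials):
    """Combine per-subset outputs into one overall estimate."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))
subsets = divide(data, num_subsets=4)
# Each subset is processed independently, so this map step could run in parallel.
partials = [apply_to_subset(subset) for subset in subsets]
print(recombine(partials))
```

Because no subset needs information from any other, the middle step is the embarrassingly parallel part that a back end such as Hadoop would distribute.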
Stanford University - Hanrahan (publications) Riposte Riposte is a fast interpreter and JIT for R. The Riposte VM has two cooperative subVMs for R scripting (like Java) and for R vector computation (like APL). Our scripting code has been 2-4x faster in Riposte than in R's recent bytecode interpreter. Vector-heavy code is 5-10x faster. Speeding up R can greatly increase the analyst's efficiency. (C/R) Analytics stats BSD
Stanford University - Olukotun (publications) Delite Delite is a compiler framework and runtime for parallel embedded domain-specific languages (DSLs). (Scala) Infrastructure stats BSD
Stanford University - Olukotun (publications) DeepDive DeepDive is a new type of knowledge base construction system that enables developers to analyze data on a deeper level than ever before. Many applications have been built using DeepDive to extract data from millions of documents, Web pages, PDFs, tables, and figures. DeepDive is a trained system, which means that it uses machine-learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive can deal with noisy and imprecise data by producing calibrated probabilities for every assertion it makes. DeepDive offers a scalable, high-performance learning engine. (SQL, Python, C++) Infrastructure stats ALv2
Stanford University - Olukotun (publications) SNAP Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. (C++) Infrastructure stats BSD
Stanford University - Olukotun (publications) is a Python interface for SNAP. SNAP is a general purpose, high performance system for analysis and manipulation of large networks. Infrastructure stats BSD
SYSTAP, LLC bigdata Bigdata enables massively parallel graph processing on GPUs and many-core CPUs. The approach is based on the decomposition of a graph algorithm as a vertex program. The initial implementation supports an API based on the GraphLab 2.1 Gather Apply Scatter (GAS) API. Execution is available on GPUs, Intel Xeon Phi (aka MIC), and multi-core CPUs. Not Available Infrastructure stats GPLv2
SYSTAP, LLC MapGraph MapGraph enables massively parallel graph processing on GPUs and many core CPUs. The approach is based on the decomposition of a graph algorithm as a vertex program. The initial implementation supports an API based on the GraphLab 2.1 Gather Apply Scatter (GAS) API. Execution is available on GPUs, Intel Xenon Phi (aka MIC), and multi-core GPUs. Available Analytics stats ALv2
University of California - Davis Gunrock Gunrock is a CUDA library for graph primitives that refactors, integrates, and generalizes best-of-class GPU implementations of breadth-first search, connected components, and betweenness centrality into a unified code base useful for future development of high-performance GPU graph primitives. (CUDA C/C++) Analytics stats ALv2
Draper Laboratory (publications) USER-ALE: User Activity Logging Engine Analytic Activity Logger is an API that creates a common message passing interface to allow heterogeneous software components to communicate with an activity logging engine. Recording a user's analytic activities enables estimation of operational context and workflow. Combined with psychophysiology sensing, analytic activity logging further enables estimation of the user's arousal, cognitive load, and engagement with the tool. (JavaScript) Available Infrastructure stats ALv2
University of California - Berkeley (publications) BDAS BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data. Infrastructure ALv2, BSD
University of California - Berkeley (publications) Spark Apache Spark is an open source cluster computing system that aims to make data analytics both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. To make programming faster, Spark provides clean, concise APIs in Python, Scala and Java. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. (Java/Scala/Python) Infrastructure stats ALv2
University of California - Berkeley (publications) Shark Shark is a large-scale data warehouse system for Spark that is designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. (Scala) Infrastructure stats ALv2
University of California - Berkeley (publications) BlinkDB BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have evaluated BlinkDB on the well-known TPC-H benchmarks and on a real-world analytic workload derived from Conviva Inc., and are in the process of deploying it at Facebook Inc. (Scala) Infrastructure stats ALv2
University of California - Berkeley (publications) Mesos Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes. (C++/Java/Python) Infrastructure stats ALv2
University of California - Berkeley (publications) Tachyon Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read. (Java) Infrastructure stats BSD
USC (publications) GoFFish The GoFFish project offers a distributed framework for storing timeseries graphs and composing graph analytics. It takes a clean-slate approach that leverages best practices and patterns from scalable data analytics such as Hadoop, HDFS, Hive, and Giraph, but with an emphasis on performing native analytics on graph (rather than tuple) data structures. This offers a more intuitive storage, access, and programming model for graph datasets while also ensuring performance optimized for efficient analysis over large graphs (millions-billions of vertices) and many instances of them (thousands-millions of graph instances). (Slash/Java) Available Infrastructure stats ALv2
Data Tactics Corporation Escher Escher is a minimal metaphor programming language that acts as a Lego block for intelligent translation across the foreign semantics of heterogeneous technologies. (Go) Infrastructure stats ALv2
Harvard, Kitware, Inc. (publications) UpSet Understanding relationships between sets is an important analysis task that has received widespread attention in the visualization community. The major challenge in this context is the combinatorial explosion of the number of set intersections if the number of sets exceeds a trivial threshold. To address this, we introduce UpSet, a novel visualization technique for the quantitative analysis of sets, their intersections, and aggregates of intersections.
UpSet is focused on creating task-driven aggregates, communicating the size and properties of aggregates and intersections, and a duality between the visualization of the elements in a dataset and their set membership. UpSet visualizes set intersections in a matrix layout and introduces aggregates based on groupings and queries. The matrix layout enables the effective representation of associated data, such as the number of elements in the aggregates and intersections, as well as additional summary statistics derived from subset or element attributes.
Sorting according to various measures enables a task-driven analysis of relevant intersections and aggregates. The elements represented in the sets and their associated attributes are visualized in a separate view. Queries based on containment in specific intersections, aggregates or driven by attribute filters are propagated between both views. UpSet also introduces several advanced visual encodings and interaction methods to overcome the problems of varying scales and to address scalability. (JavaScript)
Available Visualization stats MIT
Qadium Data Microscopes Data Microscopes is a collection of robust, validated Bayesian nonparametric models for discovering structure in data. Models for tabular, relational, text, and time-series data can accommodate multiple data types, including categorical, real-valued, binary, and spatial data. Inference and visualization of results respects the underlying uncertainty in the data, allowing domain experts to feel confident in the quality of the answers they receive. (Python, C++) Analytics stats BSD
Carnegie Mellon University (publications) Active Search ActiveSearch takes a collection of emails (or any dataset where a similarity can be generated between elements) and recommends related messages based on user feedback. The user provides an initial seed email, then enters a cycle in which ActiveSearch provides a similar email and the user reports whether or not that email was interesting. ActiveSearch is useful for anyone navigating a large set of emails and looking for related messages on a specific topic. Because it considers the similarities between emails as well as user feedback, it improves on basic text search or a brute-force search in accuracy, time, and effort. (Java, Perl) Analytics stats MIT
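The recommend-and-feedback cycle described in the Active Search entry can be illustrated with a small Python sketch. This is not the ActiveSearch algorithm or its API: the scoring rule (sum of similarities to items marked interesting, minus similarities to items marked uninteresting), the function names `recommend_next` and `run_session`, and the toy similarity matrix are all assumptions made for illustration.

```python
# Illustrative sketch only: a simple similarity-plus-feedback recommender.
# The scoring rule and all names here are assumptions, not ActiveSearch itself.

def recommend_next(similarity, feedback):
    """Return the index of the unlabeled item with the highest score.

    similarity: square list of lists, similarity[i][j] in [0, 1]
    feedback:   dict mapping item index -> True (interesting) / False (not)
    """
    best_item, best_score = None, float("-inf")
    for item in range(len(similarity)):
        if item in feedback:
            continue  # already shown to the user
        # Reward similarity to interesting items, penalize similarity to rejected ones.
        score = sum(
            similarity[item][seen] if liked else -similarity[item][seen]
            for seen, liked in feedback.items()
        )
        if score > best_score:
            best_item, best_score = item, score
    return best_item


def run_session(similarity, seed, oracle, rounds):
    """Run the recommend/feedback cycle, starting from a seed item."""
    feedback = {seed: True}  # the seed is treated as interesting
    shown = []
    for _ in range(rounds):
        item = recommend_next(similarity, feedback)
        if item is None:
            break  # nothing left to recommend
        feedback[item] = oracle(item)  # stands in for the user's yes/no answer
        shown.append(item)
    return shown


# Toy data (made up): items 0-2 share one topic, items 3-4 another.
SIMILARITY = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.3],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.3, 0.9, 1.0],
]

if __name__ == "__main__":
    on_topic = {0, 1, 2}
    print(run_session(SIMILARITY, seed=0, oracle=lambda i: i in on_topic, rounds=3))
```

With item 0 as the seed, the sketch surfaces the two on-topic items before anything from the other topic, which is the behavior the entry attributes to combining similarity with user feedback.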
Boeing/Pitt Impact of Precision of Bayesian Network Parameters on Accuracy of Medical Diagnostic Systems
Boeing/Pitt An Empirical Comparison of Bayesian Network Parameter Learning Algorithms for Continuous Data Streams
Carnegie Mellon University Efficient Learning on Point Sets
Carnegie Mellon University Learning from Point Sets with Observational Bias
Carnegie Mellon University On Learning from Collective Data
Carnegie Mellon University More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
Carnegie Mellon University A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
Carnegie Mellon University Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models
Continuum Analytics, Indiana University Preserving Data While Rendering
Continuum Analytics, Indiana University Overplotting: Unified Solutions under Abstract Rendering
Continuum Analytics, Indiana University Abstract Rendering: Out-of-core Rendering for Information Visualization
Georgia Tech / GTRI To Gather Together for a Better World: Understanding and Leveraging Communities in Micro-lending Recommendation
Georgia Tech / GTRI A Better World for All: Understanding and Promoting Micro-finance Activities in
Georgia Tech / GTRI UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Georgia Tech / GTRI Dyadic Event Attribution in Social Networks with Mixtures of Hawkes Processes
Georgia Tech / GTRI Scalable Influence Estimation in Continuous-Time Diffusion Networks
Georgia Tech / GTRI Uncover Topic-Sensitive Information Diffusion Networks
Georgia Tech / GTRI Hierarchical Clustering of Hyperspectral Images using Rank-Two Nonnegative Matrix Factorization
Georgia Tech / GTRI Fast Rank-2 Nonnegative Matrix Factorization for Hierarchical Document Clustering
Georgia Tech / GTRI Augmenting MATLAB with Semantic Objects for an Interactive Visual Environment
Georgia Tech / GTRI Mixture of Mutually Exciting Processes for Viral Diffusion
Georgia Tech / GTRI Learning Social Infectivity in Sparse Low-rank Networks Using Multi-dimensional Hawkes Processes
Georgia Tech / GTRI Learning Triggering Kernels for Multi-dimensional Hawkes Processes
IBM Research Random Projections for Support Vector Machines
IBM Research Efficient Dimensionality Reduction for Canonical Correlation Analysis
IBM Research Improved Matrix Algorithms via the Subsampled Randomized Hadamard Transform
IBM Research Near-optimal Coresets For Least-Squares Regression
IBM Research Deterministic Feature Selection for K-means Clustering
IBM Research Low-Rank Approximation and Regression in Input Sparsity Time
IBM Research Subspace Embeddings and lp-Regression Using Exponential Random Variables
IBM Research Revisiting Asynchronous Linear Solvers: Provable Convergence Rate Through Randomization
IBM Research Highly Scalable Linear Time Estimation of Spectrograms - A Tool for Very Large Scale Data Analysis
IBM Research Near-Optimal Column-Based Matrix Reconstruction
IBM Research Faster Subset Selection for Matrices and Applications
IBM Research Sketching Structured Matrices for Faster Nonlinear Regression
IBM Research Quantile Regression for Large-scale Applications
Johns Hopkins University Locality Statistics for Anomaly Detection in Time Series of Graphs
Johns Hopkins University Universally Consistent Vertex Classification for Latent Positions Graphs
Johns Hopkins University Seeded Graph Matching for Large Stochastic Block Model Graphs
Johns Hopkins University Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding
Johns Hopkins University Out-of-sample Extension for Latent Position Graphs
Johns Hopkins University Generalized Canonical Correlation Analysis for Classification in High Dimensions
Johns Hopkins University Seeded Graph Matching for Correlated Erdos-Renyi Graphs
Johns Hopkins University On the Incommensurability Phenomenon
Johns Hopkins University Vertex Nomination Schemes for Membership Prediction
Johns Hopkins University Robust Vertex Classification
Johns Hopkins University Consistent Latent Position Estimation and Vertex Classification for Random Dot Product Graphs
Johns Hopkins University A Central Limit Theorem for Scaled Eigenvectors of Random Dot Product Graphs
Johns Hopkins University Statistical Inference on Errorfully Observed Graphs
Johns Hopkins University Seeded Graph Matching
Harvard University Graphlet Decomposition of a Weighted Network
Harvard University, Kitware, Inc. Entourage: Visualizing Relationships between Biological Pathways using Contextual Subsets
Harvard University, Kitware, Inc. LineUp: Visual Analysis of Multi-Attribute Rankings
MDA Information Systems, Inc., University of Southern California/Information Sciences Institute Unlocking Big Data
MDA Information Systems, Inc., University of Southern California/Information Sciences Institute Mapping Semantic Workflows to Alternative Workflow Execution Engines
MDA Information Systems, Inc., University of Southern California/Information Sciences Institute Capturing Data Analytics and Visualization Expertise with Workflows
MDA Information Systems, Inc., University of Southern California/Information Sciences Institute Time-Bound Analytic Tasks on Large Datasets through Dynamic Configuration of Workflows
MDA Information Systems, Inc., University of Southern California/Information Sciences Institute Large-Scale Multimedia Content Analysis Using Scientific Workflows
University of Southern California/Information Sciences Institute A Semantic Framework for Automatic Generation of Computational Workflows Using Distributed Data and Component Catalogs
University of Southern California/Information Sciences Institute Intelligent Workflow Systems and Provenance-Aware Software
University of Southern California/Information Sciences Institute Assisting Scientists with Complex Data Analysis Tasks through Semantic Workflows
University of Southern California/Information Sciences Institute Towards Workflow Ecosystems Through Semantic and Standard Representations
University of Southern California/Information Sciences Institute Structured Analysis of the ISI Atomic Pair Actions Dataset Using Workflows
University of Southern California/Information Sciences Institute Making Data Analysis Expertise Broadly Accessible through Workflows
University of Southern California/Information Sciences Institute A Framework for Efficient Data Analytics through Automatic Configuration and Customization of Scientific Workflows
MIT-LL Content + Context Networks for User Classification in Twitter
Oculus Info, Inc. Visual Thinking Design Patterns
Oculus Info, Inc. Aperture: An Open Web 2.0 Visualization Framework
Oculus Info, Inc. Tile Based Visual Analytics for Twitter Big Data Exploratory Analysis
Oculus Info, Inc. Interactive Data Exploration with 'Big Data Tukey Plots'
Oculus Info, Inc. Louvain Clustering for Big Data Graph Visual Analytics
Scientific Systems Company, Inc., MIT, University of Louisville Advanced Machine Learning and Statistical Inference Approaches for Big Data Analytics and Information Fusion
Sotera Defense Solutions, Inc. A Survey of Big Data Methods, Assessments, and Approaches
Sotera Defense Solutions, Inc. Correlation Using Pair-wise Combinations of Multiple Data Sources and Dimensions at Ultra-Large Scales
Sotera Defense Solutions, Inc. Movement Inference through Aggregate Trajectory Extraction
Stanford University - Hanrahan, Purdue University, PNNL Large-Scale Exploratory Analysis, Cleaning, and Modeling for Event Detection in Real-World Power Systems Data
Stanford University - Hanrahan, Purdue University, PNNL EDA and ML - A Perfect Pair for Large-Scale Data Analysis
Stanford University - Hanrahan, Purdue University, PNNL Power Grid Data Analysis with R and Hadoop
Stanford University - Hanrahan, Purdue University, PNNL imMens: Real-time Visual Querying of Big Data
Stanford University - Boyd Proximal Algorithms
Stanford University - Boyd A Primal-Dual Operator Splitting Method for Conic Optimization
Stanford University - Boyd Operator Splitting for Conic Optimization via Homogeneous Self-Dual Embedding
Stanford University - Boyd ECOS: An SOCP Solver for Embedded Systems
Stanford University - Boyd Code Generation for Embedded Second-Order Cone Programming
Stanford University - Hanrahan, Purdue University, PNNL Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data
Stanford University - Olukotun NIFTY: A System for Large Scale Information Flow Tracking and Clustering
Stanford University - Olukotun Composition and Reuse with Compiled Domain-Specific Languages
Stanford University - Olukotun Dimension Independent Similarity Computation
Stanford University - Olukotun On the Precision of Social and Information Networks
Stanford University - Olukotun Forge: Generating a High Performance DSL Implementation from a Declarative Specification
Stanford University - Olukotun Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference
The New School IAM - Incremental Agent-Based Mapping
The New School Expediting Cooperation in Government-funded Open Source Programs: Incremental Agent-based Mapping, a Pattern Language for Collaborative Cognition
The New School Design Methodology of the XDATA Program
The New School Data Visualization Design Guidelines
The New School Big Data and Knowledge Discovery Through Metapictorial Visualization
The New School Design and Visualization Best Practices for Big Data: Enhancing Data Discovery through Improved Usability
University of California - Berkeley Carat: Collaborative Energy Diagnosis for Mobile Devices
University of California - Berkeley Discretized Streams: Fault-Tolerant Streaming Computation at Scale
University of California - Berkeley Sparrow: Distributed, Low Latency Scheduling
University of California - Berkeley A General Bootstrap Performance Diagnostic
University of California - Berkeley MLI: An API for Distributed Machine Learning
University of California - Berkeley Leveraging Endpoint Flexibility in Data-Intensive Clusters
University of California - Berkeley Shark: SQL and Rich Analytics at Scale
University of California - Berkeley GraphX: A Resilient Distributed Graph System on Spark
University of California - Berkeley RTP: Robust Tenant Placement for Elastic In-Memory Database Clusters
University of California - Berkeley Bolt-on Causal Consistency
University of California - Berkeley BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
University of California - Berkeley MDCC: Multi-Data Center Consistency
University of California - Berkeley The Case for Tiny Tasks in Compute Clusters
University of California - Berkeley Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
University of California - Berkeley MLbase: A Distributed Machine-learning System
University of California - Berkeley Coflow: A Networking Abstraction for Cluster Applications
Royal Caliber VertexAPI2 - A Vertex-Program API for Large Graph Computations on the GPU
Draper Laboratory Measuring the Value of Big Data Exploitation Systems: Quantitative, Non-Subjective Metrics with the User as a Key Component
USC Continuous Dataflow Update Strategies for Mission-Critical Applications
USC Exploiting Application Dynamism and Cloud Elasticity for Continuous Data Flows
USC GoFFish: A Framework for Distributed Analytics Over Timeseries Graphs
USC GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics
USC Cost-efficient and Resilient Job Life-cycle Management on Hybrid Clouds
USC Constraint-Driven Adaptive Scheduling for Dynamic Dataflows on Elastic Clouds
USC Enabling Realtime Pro-Active Analytics On Time Evolving Graphs
USC Efficient Extraction of High Centrality Vertices in Distributed Graphs
USC Fast Parallel Algorithm for Unfolding of Communities in Large Graphs
USC PLAStiCC: Predictive Look-Ahead Scheduling for Continuous Dataflows on Clouds
Jet Propulsion Laboratory Ashok Meena View Presentation, AGU Fall Meeting 2013 Presentations
Stanford Simplifying Scalable Graph Processing with a Domain-Specific Language
Stanford Hardware Acceleration of Database Operations