Trisha Shetty (Editor)

Dataflow programming

Updated on
Edit
Like
Comment
Share on FacebookTweet on TwitterShare on LinkedInShare on Reddit

In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the term Datastream instead of Dataflow to avoid confusion with Dataflow Computing or Dataflow architecture, based on an indeterministic machine paradigm. Dataflow programming was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.

Contents

Properties of dataflow programming languages

Traditionally, a program is modelled as a series of operations happening in a specific order; this may be referred to as sequential, procedural, Control flow (indicating that the program chooses a specific path), or imperative programming. The program focuses on commands, in line with the von Neumann vision of sequential programming, where data is normally "at rest"

In contrast, dataflow programming emphasizes the movement of data and models programs as a series of connections. Explicitly defined inputs and outputs connect operations, which function like black boxes. An operation runs as soon as all of its inputs become valid. Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems.

State

One of the key concepts in computer programming is the idea of state, essentially a snapshot of various conditions in the system. Most programming languages require a considerable amount of state information, which is generally hidden from the programmer. Often, the computer itself has no idea which piece of information encodes the enduring state. This is a serious problem, as the state information needs to be shared across multiple processors in parallel processing machines. Most languages force the programmer to add extra code to indicate which data and parts of the code are important to the state. This code tends to be both expensive in terms of performance, as well as difficult to read or debug. Explicit parallelism is one of the main reasons for the poor performance of Enterprise Java Beans when building data-intensive, non-OLTP applications.

Where a linear program can be imagined as a single worker moving between tasks (operations), a dataflow program is more like a series of workers on an assembly line, each doing a specific task whenever materials are available. Since the operations are only concerned with the availability of data inputs, they have no hidden state to track, and are all "ready" at the same time.

Representation

Dataflow programs are represented in different ways. A traditional program is usually represented as a series of text instructions, which is reasonable for describing a serial system which pipes data between small, single-purpose tools that receive, process, and return. Dataflow programs start with an input, perhaps the command line parameters, and illustrate how that data is used and modified. The flow of data is explicit, often visually illustrated as a line or pipe.

In terms of encoding, a dataflow program might be implemented as a hash table, with uniquely identified inputs as the keys, used to look up pointers to the instructions. When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it will typically output data, thereby making another operation become valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime. On machines with a single processor core where an implementation designed for parallel operation would simply introduce overhead, this overhead can be removed completely by using a different runtime.

History

A pioneer dataflow language was BLODI (BLOck DIagram), developed by John Larry Kelly, Jr., Carol Lochbaum and Victor A. Vyssotsky for specifying sampled data systems. A BLODI specification of functional units (amplifiers, adders, delay lines, etc.) and their interconnections was compiled into a single loop that updated the entire system for one clock tick.

More conventional dataflow languages were originally developed in order to make parallel programming easier. In Bert Sutherland's 1966 Ph.D. thesis, The On-line Graphical Specification of Computer Procedures, Sutherland created one of the first graphical dataflow programming frameworks. Subsequent dataflow languages were often developed at the large supercomputer labs. One of the most popular was SISAL, developed at Lawrence Livermore National Laboratory. SISAL looks like most statement-driven languages, but variables should be assigned once. This allows the compiler to easily identify the inputs and outputs. A number of offshoots of SISAL have been developed, including SAC, Single Assignment C, which tries to remain as close to the popular C programming language as possible.

The United States Navy funded development of ACOS and SPGN (signal processing graph notation) starting in early 1980's. This is in use on a number of platforms in the field today.

A more radical concept is Prograph, in which programs are constructed as graphs onscreen, and variables are replaced entirely with lines linking inputs to outputs. Incidentally, Prograph was originally written on the Macintosh, which remained single-processor until the introduction of the DayStar Genesis MP in 1996.

There are many hardware architectures oriented toward the efficient implementation of dataflow programming models. MIT's tagged token dataflow architecture was designed by Greg Papadopoulos.

Data flow has been proposed as an abstraction for specifying the global behavior of distributed system components: in the live distributed objects programming model, distributed data flows are used to store and communicate state, and as such, they play the role analogous to variables, fields, and parameters in Java-like programming languages.

Languages

  • Agilent VEE
  • Alteryx
  • ANI
  • DARTS
  • Swift
  • ASCET
  • AviSynth scripting language, for video processing
  • BLODI
  • BMDFM Binary Modular Dataflow Machine
  • CAL
  • SPGN - Signal Processing Graph Notation
  • COStream
  • Cassandra-vision - A Visual programming language with OpenCV support and C++ extension API
  • Сorezoid - cloud OS with visual development environment for business process management within any company
  • Cuneiform, a functional workflow language.
  • Curin
  • CMS Pipelines
  • Hume
  • Joule (programming language)
  • Juttle
  • KNIME
  • Korduene
  • LabVIEW, G
  • Linda
  • Lucid
  • Lustre
  • M, used as the backend of Microsoft Excel's ETL plugin Power Query.
  • MaxCompiler - Designed by Maxeler Technologies and is compatible with OpenSPL
  • Max/MSP
  • Microsoft Visual Programming Language - A component of Microsoft Robotics Studio designed for Robotics programming
  • Nextflow - A data-driven toolkit for computational pipelines based on the dataflow programming model
  • Node-RED - used for the design of Javascript functions for wiring together APIs, hardware and online services as part of the Internet of Things
  • OpenWire - adds visual dataflow programming capabilities to Delphi via VCL or FireMonkey components and a graphical editor (homonymous binary protocol is unrelated)
  • OpenWire Studio - A visual development environment which allows the development of software prototypes by non developers.
  • Orange - An open-source, visual programming tool for data mining, statistical data analysis, and machine learning.
  • Oz now also distributed since 1.4.0
  • Pifagor
  • Pipeline Pilot
  • POGOL
  • Prograph
  • Pure Data
  • Pythonect
  • Quartz Composer - Designed by Apple; used for graphic animations and effects
  • RaftLib - Massive multi-threading with C++ IOStream-like Operators
  • SAC Single Assignment C
  • Scala (with a library SynapseGrid)
  • SIGNAL (a dataflow-oriented synchronous language enabling multi-clock specifications)
  • Simulink
  • SIMPL
  • SISAL
  • StreamIt - A programming language and a compilation infrastructure, specifically engineered for modern streaming systems.
  • stromx - A visual programming environment focused on industrial vision (open source)
  • SPL - Data Flow description language of IBM InfoSphere Streams streaming engine
  • SWARM
  • SystemVerilog - A hardware description language
  • Tersus - Visual programming platform (open source)
  • Verilog - A hardware description language absorbed into the SystemVerilog standard in 2009
  • VHDL - A hardware description language
  • Vignette's VBIS language for business processes integration
  • vvvv
  • VSXu
  • Widget Workshop, a "game" designed for children which is technically a simplified dataflow programming language.
  • XEE (Starlight) XML Engineering Environment
  • XProc
  • AnandamideScript, flexible data-flow scripting language for C++ and QT
  • Application programming interfaces

  • Apache Airflow: Library for Python aimed at coordinating large computational workflows.
  • Apache Flink: Java/Scala library that allows streaming (and batch) computations to be run atop a distributed Hadoop (or other) cluster
  • DC: Library that allows the embedding of one-way dataflow constraints in a C/C++ program.
  • SystemC: Library for C++, mainly aimed at hardware design.
  • TensorFlow: A machine-learning library based on dataflow programming.
  • MDF: Library for python aimed at financial models
  • References

    Dataflow programming Wikipedia