Design notes

Pipelines v2.1

 

This 32-bit version of Pipelines is designed to work in conjunction with ooRexx 32-bit version 4 or later. It will install on any platform that hosts, as a minimum requirement, the Microsoft Windows .NET Framework 2.0.

 


Multiple Pipelines instances may execute concurrently.

 

Pipelines dispatches the stages in the order in which they appear in the pipeline; however, any stage may be the first to begin processing records. The relative order of the records flowing through a pipeline can be predicted as long as the stage path comprises only stages that do not delay records.

 

Unless the pipeline comprises a stage or stages that accumulate records (for example, the SORT stage), and provided that the input records are not excessively long, Pipelines requires only a small amount of memory to process input files of any size, as only a handful of records will be in the pipeline at any one time.

 

Pipelines is not pre-emptive. When a stage reports an initialisation or runtime error, Pipelines begins terminating the pipeline by instructing all active stages to quiesce. When all active stages in the pipeline chain have responded to the quiesce command and have terminated, Pipelines (the StageManager) terminates.

 

Pipelines is designed to execute on a single processor, where each stage/process vies for service from the StageManager; the specific design of a stage controls how it interoperates within a multi-stream pipeline configuration.

 

Pipelines does not verify that a pipeline is semantically correct, only that it is syntactically correct. This means that you may construct a pipeline that does not execute in the way that you expect it to: it may produce output records in a format or an order that you did not intend, or it may not produce any output records at all. In view of this, when developing a pipeline that replaces the contents of a disk file, it is particularly prudent to test the pipeline against a copy of that file. Pipelines does not issue "are you sure?" messages!

 

Pipelines does not work with records containing MBCS or Unicode data; only the single-byte ASCII character set is supported. (This will be addressed in a future version of Pipelines, although it will require a massive re-work of the application and will take time.) As a consequence, you should ensure that only ASCII-type input files are selected for modification. Pipelines cannot determine the format of an input file; it simply executes the pipeline that you specify.

 

Pipelines includes a stall-detection mechanism that determines when a pipeline is stalled. A stall occurs when Pipelines determines that every stage is waiting either to read a record or to write a record; that is, no stage is currently processing a record, and all stages are either read-pending or write-pending. Pipelines writes the current status of each stage in the pipeline to a dump file, which can be inspected to determine the combination of stream connections that caused the stall.
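As an illustration, the classic stall pattern arises when a FANOUT feeds both streams of a FANIN. This is a hypothetical sketch: the stage names and multi-stream notation follow the CMS Pipelines conventions that this application mirrors, and the stall analysis assumes the CMS-style semantics of FANOUT (write each record to every output stream before consuming it) and FANIN (drain the primary stream to end-of-file before reading the secondary).

```rexx
Address Rxpipe

/* Hypothetical example of a pipeline that stalls.                  */
/* FANOUT writes the first record to its primary stream, then       */
/* blocks writing the same record to its secondary stream; FANIN,   */
/* however, is blocked reading the next record on its primary       */
/* stream. Every stage is now read-pending or write-pending.        */
'pipe (endchar ?)',
     '< myfile1.txt',
     '| a: fanout',
     '| b: fanin',
     '| > myoutput.txt',
     '?',
     'a:',
     '| b:'
```

When this pipeline stalls, the dump file should show each of the four stages as read-pending or write-pending, pointing at the crossed stream connections between the two labelled stages.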

 

When a stage does not specifically limit the number of input and/or output streams, it may process up to 4096 input streams and up to _MAX_INT_ (the maximum unsigned integer value) output streams. However, a pipeline configuration that connects more than a handful of input or output streams to any one stage should be considered badly designed.

 

Consider the following ooRexx script which concatenates three input files:

 

   **** Top of file ****
Address Rxpipe

'pipe (endchar ?)',
     '< myfile1.txt',
     '| a: fanin',
     '| > myjoinedfiles.txt',
     '?',
     '< myfile2.txt',
     '| a:',
     '?',
     '< myfile3.txt',
     '| a:'

Exit 0
   **** End of file ****

 

The pipeline above is limited and not easily extensible; a better approach might be:

 
   **** Top of file ****
Address Rxpipe

'pipe filelist file=myfile* ext=txt',
     '| > myjoinedfiles.txt'

Exit 0
   **** End of file ****

 

This pipeline is extensible by design. The FILELIST stage will select all the files matching the pattern mask myfile*.txt.

 

Pipelines itself is extensible; it comprises an MS VC++ stage command API library which contains all the stage initialisation parsing functions and runtime extraction routines that support the current set of builtin stage filters. The API allows you to create new stage DLLs that augment the current builtin set. The API addresses most of the needs that a stage might reasonably have: console locking and synchronisation, multi-stream connectivity, multiple column, word and field isolation, pre-process functionality, character range expansions, input and output record availability, and more. Pipelines ships with DEBUG and RELEASE versions of the API library.

 

The Pipelines Stage command API utilises the Microsoft Foundation Class (MFC) CString class extensively, and other MFC-specific classes under the covers as and when required.

 

Pipelines supports third-party non-API WIN32 console applications/modules through the SHELLEXECUTE stage command. SHELLEXECUTE will load and service any WIN32 application, reading records from that process's STDOUT and STDERR I/O streams and writing them to the SHELLEXECUTE stage's primary and secondary output streams, respectively.
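For example, an external command's output could be captured into a file, with its error messages routed separately. This is a sketch only: the exact SHELLEXECUTE argument syntax, and the use of a label to pick up the secondary (STDERR) output stream, are assumptions based on the multi-stream notation shown earlier.

```rexx
Address Rxpipe

/* Hypothetical sketch: run a WIN32 command and capture its STDOUT  */
/* records into one file; records from STDERR, arriving on the      */
/* stage's secondary output stream, are written to a log file.      */
'pipe (endchar ?)',
     's: shellexecute cmd.exe /c dir',
     '| > dirlisting.txt',
     '?',
     's:',
     '| > direrrors.txt'
```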

 

Since Pipelines version 1.6, the application documentation has been available online. That change involved separating the package documentation from the install package and allowing Pipelines to be installed on a disk drive and in a directory of your choice. As the location of the input files for the example pipelines cannot be determined prior to installation, rather than statically setting the example input-file source locations during the install process, I replaced the input-file path in each example pipeline with a 'place-holder', or 'macro'. These definitions allow you to save or relocate an example pipeline to another directory (as I may introduce new versions of example pipelines which illustrate new or extended functionality, you may want to retain older example versions for future reference), and an example pipeline provided by Pipelines version 1.6, or by any future version, will always reference the currently installed input-file directory.

 

The pipeline is not interpreted; Pipelines performs a single-pass parse of the pipeline, allocating the resources required by each stage, and then begins dispatching the stages.

 

Pipelines is an ALLUSERS application – every profile on the machine will have access to Pipelines.

 

Pipelines supports the sub-commands PEEKTO, READTO and OUTPUT, which provide functionality similar to their CMS Pipelines counterparts. They work across the IPC divide between a calling pipeline and a called pipeline (subroutine), maintaining the relative record order.
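A subroutine pipeline written in ooRexx might use these sub-commands like this. This is a sketch, assuming the sub-commands are issued through the Rxpipe environment as they are in CMS Pipelines REXX stages; the variable name is illustrative.

```rexx
Address Rxpipe

/* Sketch of a subroutine that upper-cases each record.             */
Do Forever
  'peekto record'                 /* look at the next input record  */
  If rc <> 0 Then Leave           /* non-zero rc: no more input     */
  'output' Translate(record)      /* write the transformed record   */
  'readto'                        /* now consume the input record   */
End

Exit 0
```

Peeking before consuming preserves the relative record order: the output record is committed downstream before the input record is removed from the stream.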

 

The IN and OUT Stage commands are designed to be dual purpose. A pipeline which utilises the IN and/or OUT Stage command, launched through the CALLPIPE stage command, will service input and output records read from and written to the calling pipeline's CALLPIPE stage. Similarly, the IN and OUT Stage commands specified in a pipeline which is connected (piped through) from and to another WIN32 process will happily service their respective STDIN and STDOUT streams.
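A calling pipeline and its subroutine might look like the following sketch. The way CALLPIPE names its subroutine, the subroutine's file name, and the availability of an XLATE stage with an UPPER option are all assumptions here.

```rexx
/* Calling pipeline: route records through a subroutine pipeline.   */
Address Rxpipe
'pipe < myfile.txt',
     '| callpipe mysub.rex',
     '| > result.txt'
Exit 0
```

and mysub.rex:

```rexx
/* mysub.rex: reads records from the caller via IN and returns them */
/* via OUT; here simply upper-casing each record (assuming an       */
/* XLATE-style builtin stage exists in this implementation).        */
Address Rxpipe
'pipe in',
     '| xlate upper',
     '| out'
Exit 0
```

Run standalone (piped from and to another WIN32 process), the same mysub.rex would instead service its STDIN and STDOUT streams, which is the dual-purpose behaviour described above.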

 

Rather than limiting subroutine pipelines in the way that traditional stored procedures do (by embedding the called routine within the calling script), a subroutine pipeline operates as an autonomous unit. For example, by specifying the CALLPIPE QUIET option, you might use a pipeline as a back-end utility in an application that searches, replaces, sorts, translates or collates data.

 

Pipelines provides a convenient and easy way to create a new ooRexx script: simply right-click anywhere on your desktop or within a folder to access the 'New->Pipelines file' option. Selecting this option will create a very simple skeleton ooRexx file with the extension .REX. File associations under Windows can be troublesome, especially when you try to rename a file by extension; using this method, you can create a new ooRexx file with the minimum of effort.

 

Pipelines comprises two distinct processing phases: the initialisation-phase and the runtime-phase. The first involves the parsing and validation of the pipeline source, the allocation of resources needed to support the pipeline, and the dispatch of the stage command DLLs. The second involves the actual execution (servicing input and output record requests), the monitoring of record throughput, and the de-allocation of acquired resources. The following two sections describe each phase in a little more detail.

 

Initialisation-phase

 

Pipelines' first task is to determine who called the application. Pipelines can be called by the system CMD processor, from a third-party process or from another pipeline. Who called Pipelines determines whether a new console should be initialised and attached, or whether the caller's console will be used for user input and message output. Next, Pipelines reads through both the builtin and user loadlib directories and builds a list of all the stage DLLs that are available. Builtin stages are stages that are shipped with Pipelines; user stages are stages that have been created by a third-party developer utilising the Pipelines API library.

 

Once the list of available stages has been created, the StageManager loads the corresponding loadlib DLL for each entry in the list and attaches it to an execution thread. For each stage, the StageManager creates a set of locks that name/define which stream connects from the write-end of each input stream and which stream connects to the read-end of each output stream. At this point, if the StageManager has determined that the pipeline is a multi-stream pipeline, the Stall Detection Monitor is activated so that record input/output can be monitored. The StageManager then begins dispatching the stages. Once all of the stages have been dispatched and have begun executing, the StageManager waits for each one to parse its stage command argument, allocate whatever resources are required and then report back to indicate its initialisation status. When all the stages have reported that they have successfully completed their initialisation-phase, the StageManager calls into each stage DLL and releases them one by one, allowing them to enter their runtime-phase. The pipeline initialisation-phase is now complete.

 

Runtime-phase

 

From this point on, the StageManager does not have any control over the longevity of, or the centralised coordination of, the stages. There are no commit levels, nor any other deterministic control over the order in which the stages are serviced or over the lifetime of an individual stage, unless a runtime error occurs (in which case the StageManager issues a pipeline quiesce). Instead, the design provides each stage with six basic stream I/O service routines: PeekRecord(), PeekAnyRecord(), ReadRecord(), ReadAnyRecord(), WriteRecord() and ConsumeRecord(). Each of these service routines instructs the StageManager to get and set input and output stream locking, in effect locking and releasing stage execution as record input and output is requested. The behaviour of the pipeline is entirely dependent on the pipeline configuration and the order in which each stage makes service calls into the StageManager. In this way, developing a new stage command is very straightforward; a stage designer need not know anything about Pipelines internals, only how to read and write records.

The operational characteristic of each stage depends on what it has been designed to do. In general, a stage might typically complete its processing when it determines that one, some or all of its input or output streams are disconnected (that is, there are no more records to read, or there is no attached output stream). Pipelines is not pre-emptive and cannot anticipate how an unrecoverable error in one stage might affect the processing of any other stage in the pipeline; so, to quiesce the pipeline, Pipelines disconnects the input and output streams for all of the stages in the pipeline, and the stages are allowed to cycle to termination as if at end-of-file.
Once all of the stages/threads in the pipeline have run to completion, be that successfully or due to a quiesce, and have reported back to the StageManager, the loadlib stage DLLs are unloaded, all allocated resources are released and the Stall Detection Monitor is shut down (if it is a multi-stream pipeline). The Pipelines runtime-phase is then complete and the process returns to its caller.

 

Pipelines is offered freely and without evaluation caveats; you may use it as you please. If you have any comments, suggestions or requests; please contact me via the link below.

If you use Pipelines, you use it at your own risk! I do not take any responsibility, implied or otherwise, for any damage caused through its use.