UNIQUE stage v1.1

Pipelines v2.1

 


 

Purpose, Operands, Streams, Usage, Examples, Related


 

Syntax

 
                       ┌─NOPAD────┐                          ┌─LAST─────┐
>>──UNIQue──┬───────┬──┼──────────┼──┬─────────┬── Range ───┼──────────┼────────────><
            └─COUNT─┘  └─PAD char─┘  └─ANYCase─┘            ├─FIRST────┤
                                                            └─COLLAPSE─┘
 
Range:
 
   ┌─1-*──────────────────────────────┐
├──┼──────────────────────────────────┼───────────────────────────────────────────────┤
   ├─inputrange───────────────────────┤
   │    ┌─◄────────────────────────┐  │
   └─(──┴─inputrange──┬──────────┬─┴─)┘
                      ├─NOPAD────┤
                      └─PAD char─┘
 

Purpose

 

Use the UNIQUE stage to select unique or duplicate records.

 

By default, UNIQUE reads records from its primary input stream and compares each one with the following record to determine whether they are the same. When the records are the same, UNIQUE continues reading records until it reaches a record that is different. The comparison is based on the contents of the entire input record. When a contiguous set of duplicates is read, UNIQUE selects only the last record in each set and discards the others. UNIQUE writes the selected records to its primary output stream. The discarded records are written to the secondary output stream, if it is connected. Two matching records that are not contiguous are not considered to be duplicates. Therefore, the input stream for the UNIQUE stage must be in sorted order for UNIQUE to determine all the unique and duplicate records in its input stream.

 

Optionally, you can make the record comparison case-insensitive, and you can base the comparison on one or more key fields, each defined as a range of columns, words, or fields.
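
Because UNIQUE only compares adjacent records, an input stream that is not already in order is usually passed through a SORT stage first. The following is a minimal sketch in the style of the examples later in this section; the file name sample.txt and the use of SORT's default ordering are assumptions.

Address Rxpipe

'pipe < sample.txt',      /* Read the (possibly unsorted) input file. */
   '| sort',              /* Bring matching records together.         */
   '| unique',            /* Keep the last record of each set.        */
   '| console'            /* Display the selected records.            */

Exit 0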

 

Operands

 

COUNT

When used in conjunction with the FIRST or LAST operand, COUNT prefaces each record in the primary output stream with a 10-character field that gives the record's position in its set of duplicate records. The number is right-justified with leading spaces. Consecutive records that have the same key fields are considered a set of duplicate records; the count is 1 when a record is unique. For example, when combined with the default operand LAST, if the first three records of an input stream are duplicates, the third record is written to the primary output stream prefaced by the number 3: the position of the last record in the set, which is also the total number of records in that set. When used in conjunction with the COLLAPSE operand, COUNT counts the records that lie between the first and last records in a set of duplicates.

 

NOPAD

specifies that shorter key fields are not extended with a pad character before they are compared with longer key fields of other records. The NOPAD operand can be specified in two positions on the UNIQUE stage:

 

 

If NOPAD is specified before the inputrange operands, or if inputrange is not specified, NOPAD applies to the entire record. This is the default.

If you specify NOPAD after inputrange, NOPAD only applies to that particular key field.

 

PAD

specifies that shorter key fields are extended with a pad character before they are compared with longer key fields of other records. The PAD operand can be specified in two positions:

 

 

If PAD is specified before the inputrange operands or if inputrange is not specified, PAD applies to the entire record.

If you specify PAD after inputrange, PAD only applies to the particular key field.

 

char

is the pad character.
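
For example, the following sketch pads the key field in columns 20 through 30 before comparison; the input file is reused from Example 2 below, and the pad character _ is chosen purely for illustration. To restrict the padding to a single key field, PAD char can instead be placed after that inputrange inside the parentheses, for example (20-30 PAD _).

Address Rxpipe

'pipe < workout.txt',              /* Read the input file.             */
   '| unique pad _ 20-30 first',   /* Pad short key fields with '_'.   */
   '| console'                     /* Display the selected records.    */

Exit 0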

 

ANYCase

specifies that key fields are compared in uppercase. In effect this means that a non-case-sensitive comparison is made.
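
For example, a sketch of a case-insensitive comparison over the entire record; the file name names.txt is an assumption.

Address Rxpipe

'pipe < names.txt',        /* Read the input file.                        */
   '| unique anycase',     /* Records differing only in case match.       */
   '| console'             /* Display the selected records.               */

Exit 0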

 

inputrange

is a column, word, or field range that defines a key field. If you do not specify inputrange, the key field is the entire record. When you specify more than one inputrange, you must enclose the set of inputrange operands in parentheses.
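
For example, a sketch that compares two key fields, columns 1 through 8 and columns 20 through 30; the parentheses are required because more than one inputrange is specified, and the file name is reused from Example 2 below.

Address Rxpipe

'pipe < workout.txt',             /* Read the input file.                */
   '| unique (1-8 20-30) first',  /* Compare two key fields per record.  */
   '| console'                    /* Display the selected records.       */

Exit 0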

 

LAST

writes all unique records and the last record of each set of duplicate records to the primary output stream. All duplicate records that are not written to the primary output stream are discarded, or written to the secondary output stream if it is connected. This is the default.

 

FIRST

writes all unique records and the first record of each set of duplicate records to the primary output stream. All duplicate records that are not written to the primary output stream are discarded or written to the secondary output stream, if it is connected.

 

COLLAPSE

writes all unique records and the first and last record of each set of duplicate records to the primary output stream. When COLLAPSE is used on its own, all duplicate records that are not written to the primary output stream are discarded, or written to the secondary output stream if it is connected. When COLLAPSE is used in conjunction with the COUNT operand, COLLAPSE counts the duplicate records that lie between the first and last record in the set of duplicates. Once the last record in the set has been determined, a single record containing the number of duplicates (excluding the first and last in the set) is written to the secondary output stream. The number is a 10-character field, right-justified with leading spaces.
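
For example, a sketch of COLLAPSE on its own, modeled on Example 2 below; the key field and the output file names are illustrative only.

Address Rxpipe

'pipe (endchar ?)',
   '< workout.txt',               /* Read input file.                   */
   '| u: unique 20-30 collapse',  /* Keep first and last of each set..  */
   '| > endpoints.txt',           /* ..and write them out.              */
   '?',
   'u:',
   '| > discarded.txt'            /* Records between first and last.    */

Exit 0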

 

Streams used

 

The following streams are used by the UNIQUE stage:

 

Stream                     Action

Primary input stream       UNIQUE reads records from its primary input stream.

Primary output stream      After selecting the specified records from its primary
                           input stream, UNIQUE writes the selected records to its
                           primary output stream.

Secondary output stream    UNIQUE writes the unselected input records to its
                           secondary output stream.

 

Usage notes

 

1.

UNIQUE FIRST does not delay the records. UNIQUE LAST and UNIQUE COLLAPSE delay one record.

 

2.

If the UNIQUE stage discovers that none of its output streams is connected, the stage ends.

 

3.

UNIQUE waits to write a record to its output stream until it has compared it to the next record in its input stream. However, if you specify the FIRST operand, the input record does not wait to be compared before it is written to the output stream.

 

4.

Use the SORT stage with the UNIQUE operand instead of separate SORT and UNIQUE stages when the input stream has many duplicate records and you do not wish to process the duplicate records further; a sketch of this combined form follows these notes.

 

5.

UNIQUE verifies that its secondary input stream is not connected and then begins execution.
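
As mentioned in note 4, the sorting and the removal of duplicates can be combined in a single SORT stage. The following is a sketch only, assuming the file from Example 1 and the UNIQUE operand of the SORT stage described in that note.

Address Rxpipe

'pipe < exercise.txt',     /* Read the input file.                        */
   '| sort unique',        /* Sort and drop duplicates in a single stage. */
   '| console'             /* Display the unique records.                 */

Exit 0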

 

Examples

 

1.

In the following example, the pipeline reads the file exercise.txt and writes a record to the console for each contiguous set of identical input records. Note the SPECS stage, which repositions the input data so that a blank separates the leading count number from the remainder of the output record.

 

exercise.txt (input)

 

...|...+....1....+....2....+....3....+....4....

   **** Top of file ****

 1 push-up

 2 sit-up

 3 sit-up

 4 knee-lift

 5 push-up

 6 push-up

 7 push-up

   **** End of file ****

 

...|...+....1....+....2....+....3....+....4....

   **** Top of file ****
 1 Address Rxpipe
 2
 3 'pipe < exercise.txt',         /* Read input file. */
 4    '| unique count',           /* Count unique records. */
 5    '| specs 1-10 1 11-* 12',   /* Position count and record. */
 6    '| console'                 /* Display on the console. */
 7
 8 Exit 0
   **** End of file ****

 

Output:

         1 push-up

         2 sit-up

         1 knee-lift

         3 push-up

 

Note. Although the first record and the last three records contain the same data (push-up), they are considered unique because they are not adjacent to one another.

 

2.

This example reads the input file workout.txt and, using a UNIQUE stage that specifies columns 20 through 30 as the key field, writes records to both its primary and secondary output streams:

 

workout.txt (input)

 

...|...+....1....+....2....+....3....+....4....

   **** Top of file ****

 1 touch toes         flexibility

 2 bench press        strength

 3 jog                aerobic

 4 bicycle            aerobic

 5 row                aerobic

 6 biceps curl        strength

 7 heel press         flexibility

 8 lunge              flexibility

   **** End of file ****

 

...|...+....1....+....2....+....3....+....4....

   **** Top of file ****
 1 Address Rxpipe
 2
 3 'pipe (endchar ?)',
 4    '< workout.txt',             /* Read input file. */
 5    '| u: unique 20-30 first',    /* Take first in each group.. */
 6    '| > activity.txt',           /* ..and write them out. */
 7    '?',
 8    'u:',
 9    '| > duplicate.txt'           /* Duplicates in the set are.. */
10                                  /* written out here! */
11 Exit 0
   **** End of file ****

 

The resulting output files, activity.txt and duplicate.txt, are shown below.

 

activity.txt (output)
 
...|...+....1....+....2....+....3....+....4....         
   **** Top of file ****
 1 touch toes         flexibility
 2 bench press        strength
 3 jog                aerobic
 4 biceps curl        strength
 5 heel press         flexibility
   **** End of file ****

duplicate.txt (output)
 
...|...+....1....+....2....+....3....+....4....
   **** Top of file ****
 1 bicycle            aerobic
 2 row                aerobic
 3 lunge              flexibility
   **** End of file ****

 

3.

Collapsing duplicate record sets

 

This example uses the UNIQUE COUNT and COLLAPSE operands to reduce each set of identical, contiguous input records to its first and last records. The records that lie between the first and last record of each set are collapsed, and a record detailing the number of collapsed input records is produced for each set.
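
A sketch of such a pipeline, modeled on Example 2; the output file names are assumptions.

Address Rxpipe

'pipe (endchar ?)',
   '< workout.txt',                /* Read input file.                  */
   '| u: unique count collapse',   /* Collapse each set of duplicates.  */
   '| > collapsed.txt',            /* Unique records, first and last.   */
   '?',
   'u:',
   '| > counts.txt'                /* Secondary output stream.          */

Exit 0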

 

Related

 

SORT

 

History

 

Version   Date         Action    Description                Pipelines

1.1       28.12.2021   changed   Application-wide rewrite.  2.1

1.0       06.09.2007   created   First version.             1.0