XSP TECHNOLOGY    
Extended Set Processing Technology


DATA ACCESS  for  USER  PRODUCTIVITY
Record Accessing Structures vs. Set Processing Operations
D L Childs

For over four decades DBMS vendors have ignored ARPA research for improving user productivity. Fifty years of research and commercial development have produced a proven, non-proprietary mathematical foundation for data access, capable of providing users over an order of magnitude better performance than traditional physically dependent data access systems.

INTRODUCTION
In 1965 all data I/O strategies depended on prior knowledge of user applications. Many government agencies were collecting large volumes of data, but had no clue how applications might use the accumulated data. The Advanced Research Projects Agency, ARPA, reasoned that information contained in data relationships was intrinsic to the data itself and should not be diminished by a computer representation.[ARPA]

ARPA:  Data Relationships Are Independent of Data Representations.

In 1965 ARPA sponsored research to investigate the feasibility of a machine-independent data structure. Since mathematical objects maintain there relationships independent of their representations, and since set theory already provided mathematical representations of relationships, the research effort focused on how sets could represent data relationships and what set operations would be needed to support the requirements of data structures.[Why]

Record Accessing Structures rely on physical organizations of data representations.
Set Processing Operations rely only on intrinsic properties of data relationships.

- Nvdda.png - The basic distinction between record-accessing and set-processing systems is how data is delivered from storage. Record-accessing systems use application shared index structures. Set-processing systems use application independent set operations. Index structures degrade multiple application performance by sharing stored data. Set operations support simultaneous execution of applications, each having its own data. Index structures have to be constructed for all new data. Set operations are installed only once. Index structures are machine-dependent. Set operations are machine-independent.

By 1968 an ARPA developed set-theoretic data structure, STDS, for representing, managing, and accessing all data as mathematically well-defined sets, had been implemented on an IBM mainframe at the University of Michigan. STDS supported interactive ad hoc queries and served as the data access engine for a commercially available RDBMS from 1971 through 1998.[MICRO]

STDS: Set-Theoretic Data Structure
STDS was developed under ARPA as a data access alternative to using index structures. Though index structures are recognized as a superior mechanism for accessing individual records with known locations, ARPA's interests were in accessing collections of records satisfying common properties.

Since information is machine independent before being forced into a machine representation, research was initiated to separate those properties of data structures which were machine independent, the application environment, from those properties which were machine dependent, the storage environment, then to develop a data structure supporting isolation and separate control of these properties.[CONCOMP-D]

The ARPA specification was quite emphatic about separating application data representations from storage data representations, but rather vague about how data was to be exchanged between the two.

Sets & Mappings
If both the application environment and the storage environment are set-theoretically well-defined, then data embedding and data extraction functions can be defined to provide machine independent data exchange between applications and storage. Since feasibility of a machine-independent data structure had to be demonstrated, simple array structures were chosen for proof of concept.

Arrays can easily be represented as sets of n-tuples. Which is a nice property for a storage environment. Unfortunately, an array with just eight domains has over 40,000 different representations for the same data relationship. The research objective now focused on finding a unique set representation for data relationships and a mapping mechanism for coupling application data with appropriate storage data.

"What we lack in rigor, we make up for with notation." -David Maier

Need for Notation
Since a unique n-array relationship may have n! unique storage representations, a notation is needed that provides a single application representation and supports a 1-1 mapping with any possible storage representation. A slight perturbation of classical set notation provides such a solution.

Records as Sets
It is common practice to model records, [a,b,c], using tuples, a,b,c. Records impose an ordering structure. The record [a,b,c] is not the same as [b,c,a], even thought the relationship between a, b, and c may be the same.

Classical set theory does not support the concept of structure as part of set membership. The use of the tuple notation ‹a,b,c› is mostly a conceptual convince to denote an ordered sequence of items. Since the tuple notation does not represent a specific set, tuples can not be used as operands for set operations.[Skolem]

To support tuple operations, an alternate notation was chosen by assigning superscripts to elements of sets. The new notation equates tuples with specific structured sets: a,b,c ≡ {a1, b2, c3}. Classical sets are structured sets with null structure: {a,b,c} ≡ {a, b, c}.

Labeled Arrays
Being able to define records as sets was essential for modeling stored data where item ordering is relevant. But it did not provide any structure independence for modeling application data. Application data still needs knowledge of how data is structured in storage. An 8-column application array still has over 40,000 possible representations.

A simple solution was to expand the use of superscripts to include names along with numbers. Assume an 8-column array with arbitrary (but unique) names A, B, C, D, E, F, G, H. The typical element of a j-row array would be a set of the following form: {aA, bB, cC, dD, eE, fF, gG, hH} . Depending on how the system decided to physically represent the array in storage, any one of 40,320 mapping tuples could be used to relate application-array-set relationships to storage-array-set relationships.

Mapping Tuples
Mapping tuples are essentially array headers that define how application column names map to storage column numbers. For example: C,D,H,F,G,A,B,E and B,A,H,C,G,D,F,E are two system defined mappings of the same application array relationships to two different storage representations.

Another example uses mapping tuples to select a subset of columns A,B,C,H which returns a sub-array to an application with just the first three and the last column of an 8-column array.

Operations on Structured Sets
Since the initial objective was to demonstrate the feasibility of machine independent data operations using set operations, a necessary and sufficient condition was to implement all relevant Classical set operations.

The relevant operations were four Boolean and seven Relational.
        Boolean: Union, Intersection, Relative Compliment, and Symmetric Difference.
        Relational: Image, Range, Domain, Converse, Restriction, Relative Product, and Cartesian Product.
A total of 25 operations using structured sets as operands were implemented by 1968.[AFIP] These operations provided the data access engine for the University of Michigan's MICRO-STDS RDBMS. The final report of the CONCOMP Project in 1970 concluded that STDS established the feasibility and realization of a machine-independent data structure.

MICRO-STDS RDBMS
Interest in the data analysis capabilities of STDS operations resulted in development of an English language interface, MICRO, using STDS for data storage organization and access. Guided by ARPA's observation that data relationships were independent of their representation, that English was a familiar mechanism for expressing relationships, and that set membership was easily expressible in English it seemed quite natural to couple STDS with an English language user interface. Which MICRO developers did in 1970.

MICRO Architecture
Since the discovery and manipulation of data relationships from raw was the project's directive, index structures were not needed for locating individual records. Though not appreciated at the time, in retrospect the requirement for ad hoc querying of bulk data and the lack of index structures negated any need for schemas, normalization, destructive updates, file locking, or an SQL language.

Even though STDS had proven remarkably fast under research experiments, it was assumed that the lack of index structures would hinder overall performance. Since mining for unknown relationships was the primary objective, any performance at all was considered a success. To the surprise of everyone involved, performance defied expectations.

Access Acceleration
Since the MICRO-STDS architecture completely separated the English query statements from the set-theoretic organization of data, it was not obvious why extreme I/O activity should be fast. The IBM time sharing environment complicated any direct understanding since many applications were running simultaneously.

The explanation turned out to be quite simple, results of set operations could not be distinguished from previously existing sets. As these sets were generated in the course of a query execution they became used by the continuing execution, even though they had not existed at the start of the execution. These intermediate sets were generally quite small and contained highly relevant data. The overall effect was that the database of relevant data shrank rapidly as applications executed. Once this was discovered, set operation sequences were tuned for informationally dense I/O.

If MICRO were available today, it might be interesting to see how a set-processing I/O system would compare to the best available record-accessing I/O systems.

STDS & TPC-H
Though the MICRO-STDS implementation terminated in 1998 when the mainframe was abandon, STDS research and commercial usage continued. By 2001 STDS had been used in products for GM, XEROX, FAA, Delco, Arthur Anderson, and had been incorporated in a backend database machine developed by the STIS Corp. STIS was acquired by an affiliate of Hitachi in 1984. From 1985 trough today STDS has been incorporated in an interactive data access and manipulation software system, iXSP.

iXSP (interactive STDS)
iXSP is the current incarnation of STDS. It provides an interactive interface to STDS that allows ad hoc queries using set operations accessing available disk files. iXSP currently recognizes over 90 set operations and is extensible. iXSP also recognizes XML documents as sets.[SAG]

Since MICRO-STDS used an English user interface mapping to set operations to access storage and since current RDBMS use SQL translations of English queries to SELECT Statements, a performance comparison of Set Processing and Record Accessing systems can be made by using the SQL Select statement as input to iXSP.

Since the TPC-H Benchmark is the industry standard for record-accessing I/O systems, it is reasonable to assume that TPC-H results reflect the best performance of SQL Selects being supported by Index Structures. Since iXSP uses no Index Structures, one way to compare user productivity is to run the most challenging TPC-H SQL query on iXSP and compare with current RDBMSs on the same platform.

This is exactly how Information Builders Inc. evaluated set-processing I/O with record-accessing I/O. Results demonstrated that set-processing I/O provided well over an order of magnitude improvement in performance.[IBI]

CONCLUSION
The fundamental result of ARPA's research on defining and implementing a machine-independent data structure was the discovery that records could be recognized as sets. A notation was developed to manipulate records as sets. An implementation of this notation became STDS. The ability to manipulate records as sets is what distinguishes STDS implementations from traditional DBMS implementations.

STDS is a body of well-defined set operations that are non-proprietary and machine independent. The only restriction on such a body of set operations, being referenced as a STDS, is that the operations must be explicitly defined in terms of extended set notation, publicly available and verifiable under execution.

The key to STDS performance is being able to manipulate records as sets.[Ellis]

Since any logical or physical representation of data has a set identity, in terms of extended set notation, STDS implementations are ideal for supporting future applications involving: Data Access, Data Mining, Data Sharing, Data Analysis, Data Security, Data Validation, Data Integration, Data Reliability, Big Data, and Cloud Computing. .

Data That Can't Be Accessed, Can't Be Processed.
If Data Can Be Accessed, It Has A Set Identity.
If Data Has A Set Identity, It Can Be Processed By Set Operations.
If Data Can Be Processed By Set Operations, Processes Are Limited Only By Imagination.

ARPA Research
         STDS Development ARPA research on machine-independent data structure. 1965-1970
         Description of STDS: AFIP fall joint computer conference San Francisco CA, December 1968
         Extended Sets Data representations for near-optimal I/O performance.

Commercial RDBMS
         MICRO-STDS RDBMS from 1971-1998: English Queries - Set Storage
         Set I/O vs. Record I/O Performance comparison of STDS with Oracle & IBM - 2005
         RDM & SQL A missed opportunity for I/O access.

Sets Not Records?
         Why Not Sets? All computer behavior is set processing.
         Extended Set Theory A General Model For Very Large, Distributed, Backend Information Systems.
         RDM+XML XSP: An Integration Technology for Systems Development and Evolution - 2001
         SQL Rejuvenated An Open Source Opportunity

Formal Foundations
         Axioms and Models for an Extended Set Theory Tuples uniquely defined as sets. - 2011
         Functions as Set Behavior Formal modeling of systems behavior. - 2016

Data as Sets
        Twelve RDM tables R1 - R12 expressed by a single Labeled set Ri, RDM.
        A very simple XML-structure expressed as a labeled set, XML.
        Three extended relations expressed as labeled sets, Xrel1,
        Two complex extended relations expressed as labeled sets, Xrel2.

More detailed information to assist implementations and I/O optimization strategies using set operations to Manipulate Data by its Mathematical Identity can be found at Extended Set Processing.

Since STDS operations and the foundations for all extended set processing operations are already available on the web, it would seem that open source developers might get a jump on the rest of the industry by providing systems designed to improve user productivity.


References

  1. [IFIP] Feasibility of a Set-theoretic Data Structure A general structure based on a reconstituted definition of relation IFIP Congress, Edinburgh Scotland, August 1968
  2. [AFIP] Description of a Set-theoretic Data Structure: AFIPS fall joint computer conference San Francisco CA, December 1968
  3. [ARPA] Westervelt, F. H.: CONCOMP: Research in Conversational use of Computers: final report Office of Research Administration, Ann Arbor, December 1970
  4. [CONCOMP] CONCOMP Project Appendix D: Description of a Set-Theoretic Data Structure
  5. [MICRO] MICRO-STDS RDBMS 1971-1998
  6. [VLDB77] Extended Set Theory: A General Model For Very Large, Distributed, Backend Information Systems - VLDB77
  7. [LLL] Lawrence Livermore Lab.: Set Theoretic Data Structures (STDS): a tutorial Technical Report, 1977.
  8. [VLDB84] 1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize The Design Of Database Management Systems Proceedings of the Tenth International Conference on Very Large Data Bases. Singapore, August, 1984
  9. [NATO] A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986
  10. [SAG] Champion, M.: XSP: An Integration Technology for Systems Development and Evolution - Software AG - 2001
  11. [IBIl Stout, R.: Information Access Accelerator Information Builders Inc. - 2005
  12. [WHY] Why Not Sets? - 2010
  13. [Nor10] North, K.: Three articles presenting a short historical perspective on the role of set theory, mathematically sound data models, and the importance of data independence. - 2010
    PART I: Sets, Data Models and Data Independence
    PART II: Laying the Foundation: Revolution, Math for Databases and Big Data
    PART III: Information Density, Mathematical Identity, Set Stores and Big Data
  14. [XST] Blass, A., Childs, D L: Axioms and Models for an Extended Set Theory - 2011
  15. [Skolem] Skolem, T.: Two Remarks on Set Theory (2. The ordered n-tuples as sets) MATH. SCAND, 5 (1957) 40-46
  16. [Ellis] Ellis, T.: Extended Set Theory: A Summary - 2015


Supplement

  1. Pebble Piles & Index Structures: A Parable
  2. XSP: Extended Set Processing: Mathematically Managing Data Representations (1 page summary)
  3. Set Processing vs. Record Processing Performance: Dynamic Data Restructuring vs. Prestructured Data Storage
  4. Functions As Set Behavior Essential Concepts: Conceptual & Formal
  5. SQL Rejuvenated Data, Relationships, RDM, Tables, NoSQL & Set Theory
  6. Content-Container Data Access Strategies Content for Functionality - Containers for Performance
  7. Set-Accessing I/O for SQL Relational Capabilities without SQL Overhead
  8. iXSP Information Access Accelerator Information Builders Inc. STDS performance evaluation results
  9. Data Representations as Mathematical Objects Considering Content Compatibility of Relational & XML Data Representations
  10. Adaptive Data Restructuring Functions A High Performance Alternative to Indexed Data Access Structures
  11. STDS: Glossary Obscura Exposing Key Concepts & Definitions
  12. Processes, Functions & Composition Essential Concepts


Copyright 2017   INTEGRATED INFORMATION SYSTEMS   Last modified on 10/13/2017