Getting Started with SGML

A Guide to the Standard Generalized Markup Language
 and Its Role in Information Management.


As part of the move toward integrated information management, whole industries are implementing standards for exchanging information. Companies that keep up-to-date with these standards will be able to do business more efficiently and better compete in global markets. The Standard Generalized Markup Language (SGML) is one such standard that works as part of an overall information management strategy.

The business challenge

With the ever-changing and growing global market, companies and large organizations are searching for ways to become more viable and competitive. Downsizing and other cost-cutting measures demand more efficient use of corporate resources. One very important resource is an organization's information. An organization's success can depend on how effectively it identifies, manages, and uses its information.

The most forward-thinking businesses are analyzing their information requirements and looking for long-term solutions to the problem of managing that information efficiently. This concept of information management takes an enterprise-wide approach to streamlining the information flow in order to bring products to the market more quickly and support and maintain them in a cost-effective way.

 Unleashing the power of information

 Traditional documents, and the methods for handling them, have many inherent limitations. The printed document is often the by-product of a sophisticated information process. Cut off from its information source, the printed document represents a dead-end in the information flow because the data has no link to the electronic information base. The raw data may start in the form of technical specifications or engineering data. This information must be gathered, sorted, organized, and then manually assembled into hard copy documents. With each step in the documentation process, the information may have changed. The problem can become so large that the majority of documents go out of date as soon as they are printed.

The further removed the process is from the original source of information, the greater the risk of erroneous data.

A systems approach to information management treats data as part of an organization's electronic information base. This method gives all systems the necessary access to the information. By taking a broad view of the information creation and delivery process, you can see documents as any kind of information--the output from a database query, a printed document, an online diagnostic manual, an illustrated parts catalog, or a read-only hypertext document. An integrated information management strategy seeks to coordinate all sources of information within an organization.

SGML allows you to manage information as data objects instead of as characters on a page. Rather than a stream of indistinguishable bits and bytes, the data is broken into discrete objects of information that carry intelligence about their meaning within the overall system. This technology enables you to store and reuse the information efficiently, share it with many users, and maintain it in a database.

Getting to know SGML

This white paper gives you an introduction to the existing SGML technology, its advantages and benefits, as well as an overview of some related standards and how they fit into an overall approach to managing information. We also define some of the terminology and acronyms to familiarize you with the language associated with SGML.

While SGML is a fairly recent technology, the use of markup in computer-generated documents has existed for a while. Let's first look at some earlier markup schemes that led to SGML.

 What is markup?

 Markup is everything in a document that is not content. The traditional meaning of markup is the manual marking up of typewritten text to give instructions for a typesetter or compositor about how to fit the text on a page and what typefaces to use. This kind of markup is known as procedural markup.

 Procedural markup

Most electronic publishing systems today use some form of procedural markup. Procedural markup is typically dedicated to a particular formatting system or word processing software. Each system has its own set of markup codes that make sense only to a specific typesetting system or software program running on a particular machine. Often this markup takes the form of formatting codes that are mixed in with or embedded in the text of the document. Procedural markup codes are good for one presentation of the information, such as a specific printed page format.

 Generic markup

Generic markup (also known as descriptive markup) describes the purpose of the text in a document, rather than its physical appearance on the page. A basic concept of generic markup is that the content of a document must be separate from the style. Generic markup identifies whole elements within a structure--such as a chapter, a section, or a table of contents--using codes describing what the element is, but not how it will appear. Generic markup codes allow for multiple presentations of the information.

Drawbacks of procedural markup

Industries involved in large-scale documentation increasingly prefer generic over procedural markup schemes. Procedural markup is often tedious and error-prone. If style guidelines change, or if you need to present the same information in a different format, massive re-keying is required. When a company changes software or hardware systems, enormous data translation tasks arise, often resulting in errors. Because procedural markup is tied to one final printed product, you cannot change formats easily. Interchanging documents based on procedural markup works easily only if both parties have the same system.

What is SGML?

 The Standard Generalized Markup Language, or SGML, is an international standard (ISO 8879) published in 1986. SGML takes generic markup even further and specifies a method for setting up document hierarchy models where every element in a document fits into a logical, predictable structure. SGML defines a strict markup scheme with a syntax for defining document data elements and an overall framework for marking up documents.

 SGML is not a fixed set of document markup tags or types of documents. You can use SGML to design documents for each of your specific information needs. It allows for any set of tags and rules for virtually any type of document.

 SGML can describe and create documents that are not dependent on any hardware, software, formatter, or operating system. Since SGML documents conform to an international standard, they are portable. You can exchange them seamlessly with users who have different systems.

The world of photography provides an interesting analogy illustrating the power of standards. Today you can purchase a roll of film marked "ISO 100." You put the film in your camera, set the camera's film speed setting to 100, and you're ready to shoot. You don't have to worry that the brand of film is not compatible with your particular make of camera. The film and camera manufacturing industries--through the International Organization for Standardization (ISO) and American Standards Association (ASA)--have agreed on standards for film speeds. Many industries plan to use SGML so that documents are as easy to use on different computers as film is to use in different cameras.

 How does SGML work?

You can break a typical document into three layers: structure, content, and style. SGML works by separating these three aspects and deals mainly with the relationship between structure and content.

 Structure

At the heart of an application is a file called the DTD, or Document Type Definition. The DTD sets up the structure of a document, much like a database schema describes the types of information it handles. A DTD provides a framework for the types of elements (such as chapters and chapter headings, sections, and topics) that constitute a document.

A database schema also defines the relationships between the various types of data. Similarly, a DTD specifies rules such as "A chapter heading tag must be the first element after a chapter tag." These context rules defined in the DTD help ensure documents have a consistent, logical structure. A DTD should accompany a document wherever it goes. A document instance is a document whose content has been tagged in conformance with a particular DTD.
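
As a rough illustration of such context rules, a minimal DTD fragment might look like the sketch below. The element names (chapter, heading, topic, par) are assumptions chosen to match the examples in this paper; they do not come from any published DTD.

```sgml
<!-- Hypothetical DTD fragment; element names are illustrative only. -->

<!-- A chapter must start with one heading, followed by one or more topics. -->
<!ELEMENT chapter - - (heading, topic+)>

<!-- A heading contains only character data. -->
<!ELEMENT heading - - (#PCDATA)>

<!-- A topic contains one or more paragraphs. -->
<!ELEMENT topic   - - (par+)>

<!-- A paragraph contains only character data. -->
<!ELEMENT par     - - (#PCDATA)>
```

In this sketch, the content model (heading, topic+) is what expresses the rule that the chapter heading comes first. The two hyphens in each declaration state that neither the start tag nor the end tag may be omitted.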

Content

Content is the information itself. The DTD provides the framework for organizing the information. The method for identifying the information and its meaning within this framework is called tagging. Creating an SGML document involves adding text and inserting tags around the text. These descriptive tags mark the beginning and end of each structure. An SGML application specifies rules for how to do this, such as "every element must have a start tag and an end tag." For example:

 

<par>Content is the information itself.</par>

You can nest elements within other elements, like this:

<topic><par>Content is the information itself.</par></topic>

This kind of tag nesting determines the organization of the document.

 Human beings cannot be expected to interpret a DTD when creating documents. The DTD is written in a rigorously defined language and is meant to be processed by a computer. The computer uses a software program called a parser to examine the document and the DTD together to verify that the document follows the rules of the DTD. You also need a parser to develop and modify DTDs--to verify that they are structurally correct and follow the rules of the standard.
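
To make the parser's role more concrete, here is a sketch of a document instance that a validating parser could check. It assumes a hypothetical external DTD file, chapter.dtd, that declares the chapter, heading, topic, and par elements and requires a heading as the first element of each chapter.

```sgml
<!-- The document type declaration names the DTD this instance claims to follow. -->
<!-- "chapter.dtd" is a hypothetical file name. -->
<!DOCTYPE chapter SYSTEM "chapter.dtd">
<chapter>
<heading>How does SGML work?</heading>
<topic>
<par>Content is the information itself.</par>
<par>Tag nesting determines the organization of the document.</par>
</topic>
</chapter>
```

Under that assumed DTD, a validating parser would accept this instance. It would report an error if the heading element were removed or moved after the topic, because the document would no longer match the declared structure.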

Style

Printed output is just one display option for an SGML document. SGML does not standardize style or other processing methods for information stored in SGML. This issue is addressed by the international standard called Document Style Semantics and Specification Language (DSSSL), which was approved in early 1995 for release in final form in the second half of 1995.

Before this international standard was developed, the U.S. Department of Defense CALS initiative developed a standard Output Specification (OS). The OS is a special DTD for creating a Formatting Output Specification Instance, or FOSI. The FOSI itself is a separate file, expressed in SGML. The FOSI is like a large style sheet which specifies the formatting characteristics for each tag (in each of its contexts) in the DTD. With the FOSI, the document, and the DTD, you have a complete interchange package for printed documents. The complete DSSSL standard covers a large scope, but subsets are being developed to handle varying levels of functionality. A subset whose functionality is approximately equivalent to FOSIs is expected, and work on tools to map FOSIs to and from DSSSL subsets is underway. It is expected that both DSSSL and FOSIs will be important standards for the foreseeable future.

 What does SGML give you?

In the life cycle of a product, the cost of gathering, producing, and maintaining the necessary technical information can exceed the initial hardware and equipment cost. For many industries, technical information is part of a deliverable product, or a product in itself that must be rigorously maintained. Any industry whose product line is heavily dependent on information can benefit from SGML.

SGML is most useful as a tool in an integrated information management strategy. A company's high-level management should make this strategic choice and plan the implementation. There will be initial implementation costs in moving to SGML, but the payback comes from benefits that accrue over time and enhance your information investment. Any organization that exchanges information between systems, applications, departments, and companies will realize these benefits.

 Increased productivity

A structured approach to documents helps writers organize information within a meaningful hierarchy. With SGML, you keep document content separate from style. This separation enables you to set up centrally-controlled style guidelines, so authors can focus on actual document content, rather than on document style. You can improve productivity by keeping only one copy of information that's used by many so that authors don't re-create the same information. With SGML, you also save time by eliminating the endless data translation cycles involved when a hardware or software system becomes obsolete.

 Reusability

SGML gives many applications access to the same set of information. A printed document is just one of many possible output applications of SGML-based information. For example, a publications group can use tags to identify a chunk of data with a set of procedures as a task module. In this case, you identify the beginning and end of the procedure, and each step in the procedure. The task can now appear in several forms: maintenance and operational manuals, online manuals, training guides, etc. More importantly, the SGML identification of the task is machine readable. This feature enables a computer to manage and maintain the different uses of the task in a single place.
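
A task module of this kind might be tagged as in the sketch below. The tag names, the id value, and the sample steps are all hypothetical, chosen only to show where the procedure and each step begin and end.

```sgml
<!-- Hypothetical task module; tag and attribute names are illustrative. -->
<task id="replace-filter">
<title>Replace the air filter</title>
<procedure>
<step>Remove the access panel.</step>
<step>Slide out the old filter.</step>
<step>Insert the new filter and close the panel.</step>
</procedure>
</task>
```

Because the boundaries of the task and its steps are explicit, the same module could feed a maintenance manual, an online manual, or a training guide without being retyped.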

 Information longevity

Since information in SGML is system independent, it is not locked into your current hardware or software environment. Once you define documents, the information will always be viable. The information carries with it everything needed to create a document. So even if your hardware or software system becomes obsolete, your information won't.

 Improved data integrity

Document structures help ensure that the right information is in the right place, bringing more organization to the overall data pool. Because SGML eliminates data translation, you reduce the risk of losing information through filtering data from one format to another.

Better data control

With SGML, you can define and manipulate information elements at any level of detail. Tagged elements can have attributes that provide characteristics or properties about the element. This attribute information is not intended for printing but can control and manage the data elements. An ID (identifier) attribute can uniquely identify a single paragraph, a whole section, a legal notice, an illustration, a task, or any element, as seen in this example:

<para id=431>Content is the information itself.</para>

Because IDs are machine readable, they can link related information and be used for a variety of information management controls. These controls can help you to:

  • manage the security of information by allowing only certain people to view or change information.
  • automate the information flow--for example, updating the data in one place can trigger the update of the same information in other applications.
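
Attributes like these are declared in the DTD. The sketch below assumes a hypothetical notes document type, a hypothetical security attribute, and a hypothetical xref element for linking; none of them comes from a published DTD. It uses the value p431 rather than 431 on the assumption that the default (reference) concrete syntax is in force, where an ID value must begin with a letter.

```sgml
<!DOCTYPE notes [
  <!ELEMENT notes - - (para+)>

  <!-- A paragraph holds text and, optionally, cross-references. -->
  <!ELEMENT para  - - (#PCDATA | xref)*>
  <!ATTLIST para
            id        ID                     #IMPLIED
            security  (public | restricted)  public>

  <!-- An empty cross-reference element that points at an ID elsewhere. -->
  <!ELEMENT xref  - O EMPTY>
  <!ATTLIST xref
            refid     IDREF                  #REQUIRED>
]>
<notes>
<para id="p431" security="restricted">Content is the information itself.</para>
<para>For background, see <xref refid="p431">.</para>
</notes>
```

A validating parser can check that every refid value matches an id that exists in the document. An application could then use the security value to decide who may view or change a paragraph.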

 Shareability

Because SGML works with structured document components, you can build entire documents out of data components from various parts of the organization. This feature enables users to share the latest information without duplicating it. An example of this might be a standard legal notice or copyright statement appearing in documents throughout a company. The legal department maintains this module of information, updating it on occasion. A single tag in a document can pull in the current notice and you can print it on demand in any number of publications, eliminating needless duplication of information.
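
One common way to pull shared text into a document is an entity reference. The sketch below assumes hypothetical file names (manual.dtd and legal.sgm) and a hypothetical manual document type; the legal department would maintain legal.sgm in one place.

```sgml
<!DOCTYPE manual SYSTEM "manual.dtd" [
  <!-- Declare the shared legal notice as an external text entity. -->
  <!ENTITY legal SYSTEM "legal.sgm">
]>
<manual>
<title>Operator's Guide</title>
<!-- The reference below is replaced by the current contents of legal.sgm. -->
&legal;
</manual>
```

Every document that declares and references the entity picks up the latest notice the next time it is processed, so nothing is copied by hand.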

Portability of information

Today, information networks proliferate where different computers, operating systems, and applications must share information. In networks of this sort, portability becomes the key to making sure all who need the information can access it. Because SGML is hardware and software independent, you can exchange documents easily among different and networked environments.

 Flexibility beyond traditional publishing

The information you create today may be used a year from now in ways you haven't yet anticipated. SGML permits you and others to use that information for applications that extend beyond traditional publishing. For example:

  • active documents
  • information databases
  • diagnostic/expert systems
  • electronic mail
  • hypermedia and hypertext documents
  • database publishing
  • CD-ROM publishing
  • Internet publishing
  • Interactive Electronic Technical Manuals (IETMs)
  • electronic review

 Because SGML tags clearly identify the meaning of information, you can use whole classes of information objects in future applications. As long as these objects contain the most recent data, the changes continue to flow back to the information base.

 Ability to participate in global markets

Because SGML has been adopted by the International Organization for Standardization, a large body of more than 90 member countries, using it will become a key benchmark of large businesses that are ready to do business internationally.

 Is SGML right for you?

In evaluating how SGML can help your organization, you may wish to consider some strategic business issues to help in your information management plan. A strategic approach should prompt you to examine your current information needs and your current document management methodology. Some questions to consider include:

  • Does your information require a long lifespan? (For example, technical information related to airplanes often needs to be maintained for over 20 years.)
  • Is it costing you too much to keep information up-to-date?
  • Do you need to exchange documents across mixed hardware environments?
  • Do you need to produce large documents with a recognizable structure?
  • Do your documents contain information common to other documents within a department, across corporate divisions, or even across separate organizations?
  • Do you ever have information that's used for different purposes? (For example, a part number may appear in a maintenance manual as well as a parts inventory database.)
  • Does your information change frequently and get used often?
  • Do you produce information that needs to comply with industry or company guidelines?

 By examining your requirements, you can evaluate how SGML fits into your information management strategy. Standardizing on SGML doesn't mean you need to use it for all documents. SGML is most useful for documents with a definable structure. Since SGML handles documents as collections of distinguishable data elements, it is useful to think in terms of modules of information, rather than complete printed documents.

 Who uses SGML now?

A number of organizations representing whole industries have recognized the benefits SGML offers and have adopted it for information management. These groups include:

 AAP - The Association of American Publishers developed The American National Standard for Electronic Manuscript Preparation and Markup, a general purpose book DTD for publishers, authors and editors.

 ATA - The Air Transport Association, a consortium representing the commercial airline industry, developed several DTDs under the ATA-100 specification. The ATA's European counterpart, AECMA, is also adopting standards based on SGML.

 Davenport - The DocBook DTD was developed for computer software user manuals and programming references. DocBook maintenance is performed under the aegis of the Davenport Group, a discussion forum sponsored by individuals representing large-scale producers and consumers of software documentation.

 DoD - The U.S. Department of Defense created the Computer-aided Acquisition and Logistic Support (CALS) initiative. Through CALS, the government hopes to reduce information costs for design through maintenance of military equipment.

Pinnacles - The Pinnacles Initiative is an effort to define an information interchange standard that will enable electronic components manufacturers to create Electronic Data Books that include all of the data that a company wishes to provide to facilitate the design and support of a component. This information may include not only the information currently provided in print publications but also such computer-sensible data types as CAD files, behavioral and functional models, audio, and video.

 SAE - The Society of Automotive Engineers developed the SAE J2008 DTD for electronic interchange of diagnostic and repair information.

 TCIF - The Telecommunications Industry Forum is an international association of carriers and major vendors of telecommunications products and services. The TCIF initiative is focused on the re-use of technical information across multiple applications and different environments.

 In Europe, SGML is gaining wide acceptance. The European Airbus, a consortium of companies in the commercial airline industry in Europe, adopted SGML. Telecommunications, aerospace, manufacturing, and other commercial and military interests throughout Europe are also using SGML.

 Glossary

ASCII (American Standard Code for Information Interchange) - This standard character encoding scheme is used extensively in data transmission.

ANSI (American National Standards Institute) - This group is the U.S. member organization that belongs to the ISO, the International Organization for Standardization.

attribute - An attribute provides more information about an element such as classification level, unique reference identifiers, or formatting information.

CCITT Group 4 (International Consultative Committee on Telegraphy and Telephony) - This CALS standard for raster graphics incorporates tiling, which divides a large image into smaller tiles. You can exchange graphic files in CCITT/4 format in a compressed state so they take up much less file space.

CITIS (Contractor Integrated Technical Information Service) - As part of CALS Phase II, CITIS is a draft functional specification for services. DoD acquisition managers designed CITIS as a plan to gain access to product-related digital technical information.

CGM (Computer Graphics Metafile) - CGM is one of the CALS standard formats for representing 2-D technical illustrations. CGM is an object-oriented graphic format.

DSSSL (Document Style Semantics and Specification Language) - This draft international standard (DIS 10179) applies to the specification of processing information for SGML documents. DSSSL became an international standard in 1995.

DTD (Document Type Definition) - A DTD is the formal definition of the elements, structures, and rules for marking up a given type of SGML document. You can store a DTD at the beginning of the document or externally in a separate file.

EDI (Electronic Data Interchange) - This is a set of computer interchange standards for business documents such as invoices, bills, and purchase orders.

element - An element is a piece of data within a document that may contain either text or other subelements such as a paragraph, a chapter, and so on.

element declaration - A statement in the DTD defining an element and declaring the order in which it may appear in the document and what other elements it may include.

entity - An entity is a self-contained piece of data that can be referenced as a unit. You can refer to an entity by a symbolic name in the DTD or the document. An entity can be a string of characters, a character that cannot be entered on a keyboard (such as a special symbol), a separate text file, or a separate graphic file.

entity declaration - A statement in the DTD or document that assigns an SGML name to an entity so you can reference it.

FOSI (Formatting Output Specification Instance) - A FOSI is used for formatting SGML documents. It is a separate file that contains formatting information for each element in a document.

IGES (Initial Graphics Exchange Specification) - The IGES standard for engineering, product design, and manufacturing drawings is one of the CALS standard graphics formats.

ISO (International Organization for Standardization) - The ISO is an industry-supported organization that establishes world-wide standards for everything from data interchange formats to film speed specifications.

markup - Markup is anything added to the content of the document that describes the text.

parser - A parser is a specialized software program that recognizes SGML markup in a document. A parser that reads a DTD and checks and reports on markup errors is a validating SGML parser. A parser can be built into an SGML editor to prevent incorrect tagging and to check whether a document contains all the required elements.

PDES/STEP (Product Data Exchange Standard/Standard for the Exchange of Product Model Data) - PDES/STEP are standards under development for communicating a complete product model with sufficient information content that advanced CAD/CAM applications can interpret. PDES is under development as a national standard, and STEP is under development as its international counterpart.

 Getting Started With SGML was written by ArborText, Inc., a member of OASIS.

An international consortium, OASIS is dedicated to accelerating the widespread adoption of ISO 8879, the Standard Generalized Markup Language. Members include vendors providing a broad range of SGML software and services, augmented by an advisory board of industry leaders and analysts and liaison relationships with customer user groups.

 Copyright © 1995 by ArborText, Inc. All rights reserved.

 
