The character is the most basic, indivisible unit of the COBOL language. The basic character set includes the letters of the Latin alphabet, digits, and special characters. Individual characters are joined to form character-strings and separators, which in turn form the words, literals, phrases, clauses, statements, and sentences of the language.
The basic characters used to form character-strings and separators in source code are shown in Table 1, Basic COBOL character set.
For certain language elements, the basic character set is extended with the ASCII Double-Byte Character Set (DBCS).
Each DBCS character occupies two adjacent bytes. DBCS characters are also called multibyte characters. A character-string in source code that contains DBCS characters is a multibyte character-string.
Multibyte characters can be used in forming user-defined words.
The content of alphanumeric literals, comment lines, and comment entries can include any of the characters in the computer's compile-time character set, and can include both single-byte and multibyte characters.
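To make the single-byte and multibyte distinction concrete, the following minimal sketch counts the bytes each character occupies in an ASCII-based double-byte code page. It is written in Python rather than COBOL, and it assumes Python's `cp932` codec (a Shift-JIS variant) as a stand-in for a DBCS code page; the helper name `byte_widths` is hypothetical.

```python
# Assumption: Python's "cp932" codec stands in for an ASCII-based
# double-byte code page. It is not taken from the COBOL documentation.
CODE_PAGE = "cp932"


def byte_widths(text, code_page=CODE_PAGE):
    """Return (character, encoded byte count) pairs for each character."""
    return [(ch, len(ch.encode(code_page))) for ch in text]


# A mixed string: two single-byte characters followed by two DBCS characters.
for ch, width in byte_widths("A1あ漢"):
    print(f"{ch!r}: {width} byte(s)")
```

In this code page the letter and the digit each take one byte, while each of the two Japanese characters takes two adjacent bytes.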
Runtime data can include any characters from the runtime character set of the computer. The runtime character set of the computer can include alphanumeric characters, multibyte characters, and national characters. National characters are represented in UTF-16, a 16-bit encoding form of Unicode.
When the NSYMBOL(NATIONAL) compiler option is in effect, literals identified by the opening delimiter N" or N' are national literals and can contain any single-byte or multibyte characters, or both, that are valid for the compile-time code page. Characters contained in national literals are represented as national characters at run time.
For details, see User-defined words with multibyte characters, DBCS literals, and National literals.
| Character | Meaning |
|---|---|
| (space) | Space |
| + | Plus sign |
| - | Minus sign or hyphen |
| * | Asterisk |
| / | Forward slash or solidus |
| = | Equal sign |
| $ | Currency sign |
| , | Comma |
| ; | Semicolon |
| . | Decimal point or period |
| " | Quotation mark |
| ( | Left parenthesis |
| ) | Right parenthesis |
| > | Greater than |
| < | Less than |
| : | Colon |
| ' | Apostrophe |
| A - Z | Alphabet (uppercase) |
| a - z | Alphabet (lowercase) |
| 0 - 9 | Numeric characters |
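As an illustration of how Table 1 might be used, the following sketch builds the basic character set as a Python set and reports which characters of a string fall outside it. The names `BASIC_COBOL_CHARS` and `non_basic_characters` are hypothetical, and the check says nothing about whether a string is valid COBOL, only whether its characters appear in the table.

```python
import string

# The 17 special characters from Table 1 (including the space),
# plus uppercase letters, lowercase letters, and digits.
BASIC_COBOL_CHARS = frozenset(
    " +-*/=$,;.\"()><:'"
    + string.ascii_uppercase
    + string.ascii_lowercase
    + string.digits
)


def non_basic_characters(text):
    """Return characters outside the basic set, in order of first appearance."""
    seen = []
    for ch in text:
        if ch not in BASIC_COBOL_CHARS and ch not in seen:
            seen.append(ch)
    return seen


print(non_basic_characters("MOVE A TO B."))  # every character is in Table 1
print(non_basic_characters("RATE@5%"))       # '@' and '%' are not in Table 1
```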
A character set is a set of letters, numbers, special characters, and other elements used to represent information. A character set is independent of a coded representation. A coded character set is the coded representation of a set of characters, where each character is assigned a numerical position, called a code point, in the encoding scheme. The basic COBOL character set is an example of a character set that is independent of a coded representation. ASCII and EBCDIC are examples of types of coded character sets. Each variation of ASCII or EBCDIC is a specific coded character set.
The term code page refers to a coded character set. Each code page that IBM defines is identified by a code page name, for example IBM-1252, and a coded character set identifier (CCSID), for example 1252.
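The difference between a character and its code point can be seen by encoding the same character in two code pages. The sketch below assumes Python's `cp1252` and `cp037` codecs as stand-ins for an ASCII code page and an EBCDIC code page; the helper name `code_point` is hypothetical.

```python
def code_point(ch, code_page):
    """Return the single-byte code point of ch in the given code page."""
    encoded = ch.encode(code_page)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single-byte character in {code_page}")
    return encoded[0]


# Assumption: "cp1252" represents an ASCII code page and "cp037" an EBCDIC one.
for page in ("cp1252", "cp037"):
    print(page, hex(code_point("A", page)), hex(code_point("0", page)))
```

The letter A and the digit 0 are the same characters in both cases, but they are assigned 0x41 and 0x30 in the ASCII code page and 0xC1 and 0xF0 in the EBCDIC code page.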
The compile-time code page must be an ASCII single-byte or ASCII double-byte code page. The specific code page is indicated by the compile-time locale.
The source program (including user-defined words and the content of alphanumeric, DBCS, and national literals) is encoded in the code page indicated by the locale in effect at compile time.
The code page used at run time is determined by a combination of a data item's USAGE clause, the compiler options in effect, and the locale (or environment variable value) in effect.
When the CHAR(NATIVE) compiler option is in effect, data items described with USAGE DISPLAY or USAGE DISPLAY-1 are encoded in an ASCII code page as indicated by the runtime locale.
When the CHAR(EBCDIC) compiler option is in effect, data items described with USAGE DISPLAY or USAGE DISPLAY-1 are encoded in an EBCDIC code page, except when the NATIVE phrase is specified in the item's USAGE clause. If the NATIVE phrase is specified, the code page used is the ASCII code page indicated by the runtime locale.
For EBCDIC, the code page is determined from the EBCDIC_CODEPAGE environment variable, if set. If the EBCDIC_CODEPAGE environment variable is not set, the default EBCDIC code page associated with the current runtime locale is used. The default EBCDIC code page for each supported locale is listed under "Locales and code pages supported" in the COBOL for Windows Programming Guide.
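The selection rules above for USAGE DISPLAY and USAGE DISPLAY-1 items can be summarized as a small decision function. This is only a sketch of the rules as stated here, not the compiler's or runtime's actual logic: the function `display_code_page` and its parameters are hypothetical, and the code page names in the usage lines are placeholder values.

```python
def display_code_page(char_option, native_phrase, env,
                      locale_ascii_page, locale_default_ebcdic_page):
    """Pick the code page for a USAGE DISPLAY or DISPLAY-1 item.

    char_option                -- "NATIVE" or "EBCDIC" (the CHAR compiler option)
    native_phrase              -- True if the USAGE clause specifies NATIVE
    env                        -- mapping of environment variables
    locale_ascii_page          -- ASCII code page of the runtime locale
    locale_default_ebcdic_page -- default EBCDIC code page for that locale
    """
    # CHAR(NATIVE), or the NATIVE phrase under CHAR(EBCDIC), selects ASCII.
    if char_option == "NATIVE" or native_phrase:
        return locale_ascii_page
    if char_option == "EBCDIC":
        # EBCDIC_CODEPAGE wins if set; otherwise use the locale default.
        if "EBCDIC_CODEPAGE" in env:
            return env["EBCDIC_CODEPAGE"]
        return locale_default_ebcdic_page
    raise ValueError(f"unknown CHAR option: {char_option!r}")


# Placeholder code page names, chosen only for illustration.
print(display_code_page("EBCDIC", False, {}, "IBM-1252", "IBM-037"))
print(display_code_page("EBCDIC", False, {"EBCDIC_CODEPAGE": "IBM-1140"},
                        "IBM-1252", "IBM-037"))
```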
The code page for data items described with USAGE NATIONAL and national literals is UTF-16LE (little endian), CCSID 1202. The source text representation of national literals is converted at run time from the compile-time code page to UTF-16LE.
A reference to UTF-16 in this document is a reference to UTF-16LE.
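The following sketch shows what UTF-16LE storage looks like at the byte level, and how literal content encoded in a compile-time code page maps to it. It assumes Python's `utf-16-le` codec and, as a stand-in for a double-byte compile-time code page, the `cp932` codec; the function names are hypothetical and the conversion shown is an illustration, not the product's implementation.

```python
def national_bytes(text):
    """Encode text as UTF-16LE: low-order byte first, no byte order mark."""
    return text.encode("utf-16-le")


def convert_literal(source_bytes, compile_time_page):
    """Re-encode literal content from a compile-time code page to UTF-16LE."""
    return national_bytes(source_bytes.decode(compile_time_page))


# "A" is U+0041, so the low-order byte 0x41 comes first.
print(national_bytes("A").hex())

# Assumption: "cp932" plays the role of the compile-time code page.
# The two source bytes 0x82 0xA0 represent U+3042, stored as 0x42 0x30.
print(convert_literal(b"\x82\xa0", "cp932").hex())
```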