EBCDIC → ASCII Conversion Issues
To migrate resources from a legacy mainframe system to an open system environment using OpenFrame, you must convert the resources from EBCDIC to ASCII.
There are a number of issues that may arise during the conversion process, especially in relation to COBOL application source code. The three main issues are:
- Hexadecimal value processing
- Character sort order processing
- Double-byte space processing
This chapter describes the causes of these issues and provides some solutions to them.
1. Hexadecimal Processing
The following sample COBOL application source file includes logic that uses hex values. This source has been converted to ASCII by using the dsmigin tool.
01 WORK06-AREA.
   05 W06-KOUZA-NO.
      10 FILLER    PIC X(01) VALUE X'F1'.
      10 W06-NO-1  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F2'.
      10 W06-NO-2  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F3'.
      10 W06-NO-3  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F4'.
      10 W06-NO-4  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F5'.
      10 W06-NO-5  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F6'.
      10 W06-NO-6  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'F7'.
      10 W06-NO-7  PIC X(01).
Although the previous example seems to be converted successfully from EBCDIC to ASCII, it may contain the following problems.
Let’s look at the following line from the sample source.
10 FILLER PIC X(01) VALUE X'F1'.
During the conversion process, the application logic has been unintentionally modified.
Assume that X'F1' in the original code is meant as the EBCDIC encoding of the character '1', not as an arbitrary byte value. A typical character set conversion translates the source text character by character, so the literal is still written as X'F1' after conversion, but the byte X'F1' does not represent '1' in ASCII. Therefore, after converting the source code to ASCII, you must manually change the value to X'31', which is the ASCII encoding of '1', as shown in the following.
10 FILLER PIC X(01) VALUE X'31'.
The following is the result of converting the original source file to ASCII by using dsmigin and then replacing each EBCDIC hex value with the hex value of the corresponding ASCII character.
01 WORK06-AREA.
   05 W06-KOUZA-NO.
      10 FILLER    PIC X(01) VALUE X'31'.
      10 W06-NO-1  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'32'.
      10 W06-NO-2  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'33'.
      10 W06-NO-3  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'34'.
      10 W06-NO-4  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'35'.
      10 W06-NO-5  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'36'.
      10 W06-NO-6  PIC X(01).
      10 FILLER    PIC X(01) VALUE X'37'.
      10 W06-NO-7  PIC X(01).
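The replacement values above can be cross-checked mechanically. The following Python sketch is only an illustration, not part of dsmigin: it assumes that code page cp037 is an acceptable stand-in for the mainframe's EBCDIC code page, and the function name is hypothetical.

```python
# Sketch: recompute hex literals that stand for EBCDIC characters.
# Assumption: cp037 stands in for the mainframe's actual EBCDIC code page.

def ebcdic_hex_to_ascii_hex(hex_literal: str) -> str:
    """Return the ASCII hex for the characters an EBCDIC hex literal encodes."""
    text = bytes.fromhex(hex_literal).decode("cp037")  # e.g. F1 -> '1'
    return text.encode("ascii").hex().upper()          # e.g. '1' -> 31


for value in ["F1", "F2", "F3", "F4", "F5", "F6", "F7"]:
    print(f"X'{value}' -> X'{ebcdic_hex_to_ascii_hex(value)}'")
```

Such a check is only valid for literals that are known to represent characters. A hex literal that is meant as a raw byte value must be left unchanged, which is why the decision still requires analysis of the program.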
The hexadecimal processing problem is common in source code that explicitly specifies hex values. However, the opposite situation may also occur.
01 YEAR-TABLE.
   05 FILLER PIC X(12) VALUE '{{{{{{JJJJJJ'.
   05 FILLER PIC X(12) VALUE '{{{{{{{JJJJJ'.
   05 FILLER PIC X(12) VALUE '{{{{{{{{JJJJ'.
   05 FILLER PIC X(12) VALUE '{{{{{{{{{JJJ'.
   05 FILLER PIC X(12) VALUE '{{{{{{{{{{JJ'.
   05 FILLER PIC X(12) VALUE '{{{{{{{{{{{J'.
   05 FILLER PIC X(12) VALUE '{{{{{{{{{{{{'.
   05 FILLER PIC X(12) VALUE 'A{{{{{{{{{{{'.
   05 FILLER PIC X(12) VALUE 'AA{{{{{{{{{{'.
   05 FILLER PIC X(12) VALUE 'AAA{{{{{{{{{'.
   05 FILLER PIC X(12) VALUE 'AAAA{{{{{{{{'.
   05 FILLER PIC X(12) VALUE 'AAAAA{{{{{{{'.
In the previous example, a similar problem occurs with the character '{'. It is used not as a text character but for its byte value: in EBCDIC, '{' is X'C0', which is the zoned decimal (ZD) representation of the digit 0 with a positive sign.
In ASCII, the corresponding ZD value is X'30', which is the character '0', so each '{' must be changed to '0' to preserve the application logic. The other sign characters in the table need the same treatment, as the following corrected source shows. Because an automatic conversion cannot tell that these characters are used as ZD values, it is recommended that you modify the source code manually after it has been converted from EBCDIC to ASCII.
01 YEAR-TABLE.
   05 FILLER PIC X(12) VALUE '000000qqqqqq'.
   05 FILLER PIC X(12) VALUE '0000000qqqqq'.
   05 FILLER PIC X(12) VALUE '00000000qqqq'.
   05 FILLER PIC X(12) VALUE '000000000qqq'.
   05 FILLER PIC X(12) VALUE '0000000000qq'.
   05 FILLER PIC X(12) VALUE '00000000000q'.
   05 FILLER PIC X(12) VALUE '000000000000'.
   05 FILLER PIC X(12) VALUE '100000000000'.
   05 FILLER PIC X(12) VALUE '110000000000'.
   05 FILLER PIC X(12) VALUE '111000000000'.
   05 FILLER PIC X(12) VALUE '111100000000'.
   05 FILLER PIC X(12) VALUE '111110000000'.
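The character substitutions in the corrected table follow from the zone and digit nibbles of each EBCDIC byte. The Python sketch below reproduces them under two assumptions that are not stated in this chapter: cp037 is used as the EBCDIC code page, and negative digits on the ASCII side are written as X'70' through X'79' ('p' through 'y'), which is inferred from the 'J' to 'q' change in the example. The function names are hypothetical.

```python
# Sketch: rewrite EBCDIC zoned-decimal sign characters for an ASCII platform.
# Assumptions: cp037 as the EBCDIC code page; negative digits map to
# X'70'-X'79' ('p'-'y'), inferred from the 'J' -> 'q' change in the example.

def convert_zoned_char(ch: str) -> str:
    byte = ch.encode("cp037")[0]
    zone, digit = byte >> 4, byte & 0x0F
    if digit > 9 or zone not in (0xC, 0xD, 0xF):
        raise ValueError(f"{ch!r} is not a zoned-decimal digit")
    if zone == 0xD:                  # zone D: negative sign
        return chr(0x70 | digit)
    return chr(0x30 | digit)         # zone C (positive) or F (unsigned)


def convert_zoned_literal(literal: str) -> str:
    return "".join(convert_zoned_char(ch) for ch in literal)


print(convert_zoned_literal("{{{{{{JJJJJJ"))
print(convert_zoned_literal("AAAAA{{{{{{{"))
```

The sign convention on the target platform depends on the compiler, so the negative-digit mapping should be confirmed before any such rewrite is applied to real source.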
As stated earlier, hex values are difficult to convert because it is uncertain whether a hex literal is meant as a raw byte value or as the encoding of an EBCDIC character. The reverse also applies: a character literal may be meant as a byte value.
In order to correctly interpret hex values, you must use COBOL syntax for COBOL program source files and BMS macro syntax for BMS map files.
To evaluate hex values in a source file, an in-depth analysis must be performed by an experienced TmaxSoft consultant before the source code is converted from EBCDIC to ASCII. An automatic analysis tool is planned for a future version of OpenFrame.
2. Character Sort Order Processing
After source code is converted from EBCDIC to ASCII, you may not detect any visible errors at first. However, because of the unique characteristics of each character set, problems may become evident when you compile and run the program.
This section describes the character sort order problem caused by character set conversion and provides a solution to it.
The following example is from a source code that has been converted from EBCDIC to ASCII.
IF W01-XX <= '99' THEN
    MOVE 'Y' TO W01-CC
ELSE
    MOVE 'N' TO W01-CC
END-IF.
Although the previous example seems to be converted successfully from EBCDIC to ASCII, the conversion process may have unintentionally modified the application logic.
The following shows the sort orders for EBCDIC and ASCII characters.
- EBCDIC: a < z < A < Z < 0 < 9
- ASCII: 0 < 9 < A < Z < a < z
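In the previous example, assume that the value of W01-XX is 'AA'. In EBCDIC, letters sort before digits, so 'AA' <= '99' is true and W01-CC is set to 'Y' when the program is run on the mainframe. In ASCII, digits sort before letters, so when the same example is converted to ASCII and run on UNIX, the comparison is false and W01-CC is set to 'N'.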
To ensure that application logic is preserved, you must manually modify the previous example as follows:
IF W01-XX <= 'zz' THEN
    MOVE 'Y' TO W01-CC
ELSE
    MOVE 'N' TO W01-CC
END-IF.
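The effect of the two collating sequences on this comparison can be reproduced outside COBOL. The following Python sketch compares the encoded bytes directly. It assumes that cp037 stands in for the mainframe's EBCDIC code page, and the helper name is hypothetical.

```python
# Sketch: compare two strings by their encoded byte values.
# Assumption: cp037 stands in for the mainframe's EBCDIC code page.

def less_or_equal(left: str, right: str, encoding: str) -> bool:
    return left.encode(encoding) <= right.encode(encoding)


print(less_or_equal("AA", "99", "cp037"))  # True: letters sort before digits
print(less_or_equal("AA", "99", "ascii"))  # False: digits sort before letters
print(less_or_equal("AA", "zz", "ascii"))  # True: 'zz' as the upper bound
```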
The character sort order issue described above is not simple to fix, but it can be addressed to a certain extent by modifying the user program. However, there is another character sort order problem that may seriously affect application end users.
Application developers generally understand that they need to account for sort order differences between EBCDIC and ASCII ('ZZ' < '99' in EBCDIC and '99' < 'ZZ' in ASCII). However, application end users are generally not aware of this difference.
The following example illustrates the sort order issue that is presented to an end-user.
[User Address List]
--------------------------------------------------------------------------
ID : AAAAAAAA
ID          NAME    ADDRESS
--------------------------------------------------------------------------
AAAAAAAA    KIM     SEOUL
BBBBBBBB    LEE     PUSAN
CCCCCCCC    PARK    SEOUL
HHHHHHHH    AHN     DAEGU
LLLLLLLL    CHO     GWANGJU
MMMMMMMM    CHOI    INCHEON
NNNNNNNN    KWAK    BUPYOUNG
XXXXXXXX    IM      SUNGNAM
ZZZZZZZZ    SEO     GURI
--------------------------------------------------------------------------
<F1> Menu <F2> Prev <F3> Next <Enter> Search
If this application is run on the mainframe, "AAAAAAAA" (the smallest ID value) could be used to query the entire ID list. However, if the same application is converted to ASCII and run on Unix, querying "AAAAAAAA" would not provide the user with the entire ID list, because IDs that begin with a digit sort before "AAAAAAAA" in ASCII.
As another example, assume that the ID "11111111" exists. If the application is run on mainframe, pressing <F3> will display "11111111" on the next screen. But if the application is converted to ASCII and run on Unix, no IDs beyond "ZZZZZZZZ" will be displayed. End users accustomed to the mainframe environment might not realize that the ID "11111111" exists in the system.
The following example shows how to query the entire ID list in an open system environment by using "00000000" instead of "AAAAAAAA".
[User Address List]
--------------------------------------------------------------------------
ID : 00000000
ID          NAME    ADDRESS
--------------------------------------------------------------------------
11111111    NOH     SEOUL
88888888    KANG    DAEJEON
AAAAAAAA    KIM     SEOUL
BBBBBBBB    LEE     PUSAN
CCCCCCCC    PARK    SEOUL
HHHHHHHH    AHN     DAEGU
LLLLLLLL    CHO     GWANGJU
MMMMMMMM    CHOI    INCHEON
NNNNNNNN    KWAK    BUPYOUNG
--------------------------------------------------------------------------
<F1> Menu <F2> Prev <F3> Next <Enter> Search
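The browsing behavior in the two screens can be simulated with a keyed lookup that returns every ID greater than or equal to the search key. The Python sketch below is an illustration only: the ID list is taken from the screens above, cp037 is assumed as the EBCDIC code page, and `browse_from` is a hypothetical helper, not an OpenFrame function.

```python
# Sketch: browse a keyed ID list from a start key under two collating orders.
# Assumption: cp037 stands in for the mainframe's EBCDIC code page.

IDS = [
    "AAAAAAAA", "BBBBBBBB", "CCCCCCCC", "HHHHHHHH", "LLLLLLLL", "MMMMMMMM",
    "NNNNNNNN", "XXXXXXXX", "ZZZZZZZZ", "11111111", "88888888",
]


def browse_from(start: str, encoding: str) -> list[str]:
    """Return all IDs at or after the start key, in the encoding's byte order."""
    def key(value: str) -> bytes:
        return value.encode(encoding)

    return sorted((i for i in IDS if key(i) >= key(start)), key=key)


print(browse_from("AAAAAAAA", "cp037"))  # mainframe order: digits come last
print(browse_from("AAAAAAAA", "ascii"))  # open system: numeric IDs are skipped
print(browse_from("00000000", "ascii"))  # open system: full list
```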
The character sort order issue is most easily identified through a professional analysis of each user application. This must be performed by an experienced TmaxSoft consultant. Tools will be provided in future versions of OpenFrame to automatically perform this analysis.
3. 2-byte Space Processing
In a mainframe environment, a 2-byte space is X'4040', which is recognized as two 1-byte space values (X'40').
10 W-K-00.
   20 W-K-00-1 PIC X(01).
   20 W-K-00-2 PIC X(01).
10 W-K-01 REDEFINES W-K-00.
   20 W-K-01-1 PIC G(01).
* . . .
MOVE SPACE TO W-K-01-1.
IF W-K-00-1 = SPACE THEN
    DISPLAY 'DOUBLE BYTE SPACE = SINGLE BYTE SPACE * 2'
END-IF.
If this COBOL program is executed on mainframe, the message "DOUBLE BYTE SPACE = SINGLE BYTE SPACE * 2" is displayed. If, however, the same application is converted to ASCII and then executed on Unix, no message is displayed. This demonstrates that when an application that uses 2-byte space is converted for an open system environment, the application logic can be modified unintentionally.
To use double-byte Korean characters in an open environment, they are converted to EUC-KR. However, the EUC-KR character set uses X'A1A1' for a 2-byte space and X'20' for a 1-byte space. This means that the formula "double byte space = single byte space + single byte space" is not applicable for the EUC-KR character set.
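The difference between the two environments comes down to whether the first byte of a 2-byte space equals the 1-byte space, which is what the REDEFINES test in the COBOL sample checks. The Python sketch below illustrates this. It assumes cp037 for the EBCDIC 1-byte space, takes X'4040' from the text above, and uses whatever byte pair Python's `euc_kr` codec produces for the ideographic space (U+3000). A COBOL runtime may behave differently.

```python
# Sketch: does the first byte of a 2-byte space equal the 1-byte space?
# Assumptions: cp037 for EBCDIC; Python's euc_kr codec for the open system.

EBCDIC_SBCS_SPACE = " ".encode("cp037")
EBCDIC_DBCS_SPACE = bytes.fromhex("4040")        # value given in the text
ASCII_SBCS_SPACE = " ".encode("ascii")
EUCKR_DBCS_SPACE = "\u3000".encode("euc_kr")     # ideographic space


def first_byte_is_space(dbcs_space: bytes, sbcs_space: bytes) -> bool:
    return dbcs_space[:1] == sbcs_space


print(first_byte_is_space(EBCDIC_DBCS_SPACE, EBCDIC_SBCS_SPACE))
print(first_byte_is_space(EUCKR_DBCS_SPACE, ASCII_SBCS_SPACE))
print(EUCKR_DBCS_SPACE.hex().upper())
```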
This problem may or may not be solvable depending on the functions provided by the compiler.
OpenFrame attempts to resolve this problem by avoiding 2-byte spaces wherever possible. If necessary, OpenFrame uses two 1-byte spaces to replace 2-byte spaces. Sometimes 2-byte spaces can be ignored, easily resolving the problem.
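The idea of replacing each 2-byte space with two 1-byte spaces can be sketched as follows. This is not OpenFrame's implementation. It is a hypothetical helper that assumes the data is valid EUC-KR text and relies on Python's `euc_kr` codec. The replacement is done on decoded text rather than on raw bytes, so that a byte pair straddling two adjacent double-byte characters is never mistaken for a space.

```python
# Sketch: replace each 2-byte (ideographic) space with two 1-byte spaces.
# Assumption: the input is valid EUC-KR text; this is a hypothetical helper.

IDEOGRAPHIC_SPACE = "\u3000"


def replace_double_byte_spaces(data: bytes, encoding: str = "euc_kr") -> bytes:
    text = data.decode(encoding)
    # Work on characters, not raw bytes, to respect character boundaries.
    return text.replace(IDEOGRAPHIC_SPACE, "  ").encode(encoding)


sample = "가\u3000나".encode("euc_kr")
result = replace_double_byte_spaces(sample)
print(len(sample), len(result))  # the field length in bytes is unchanged
```

Because one 2-byte space becomes two 1-byte spaces, fixed-length fields keep their byte length.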
You can use the OpenFrame CPM utility in situations where you must use 2-byte spaces, such as when OpenFrame data must be transferred to a mainframe environment or to a TN3270 terminal emulator where 1-byte and 2-byte characters cannot be intermixed.