Java Charset that encodes only ACH-safe characters, which are a subset of US-ASCII

bhanafee/ACHCharset

ACH-safe Charset

ACH files may use only a subset of US-ASCII. This character set ensures that no disallowed characters are encoded or decoded. The allowed subset consists of "ASCII values greater than hexadecimal 0x1F." US-ASCII itself is limited to 7 bits, so the nominal range is 0x20 through 0x7F. The exceptions to that range, including the exclusion of 0x7F, are described under Newline handling.

Newline handling

Although the ACH specification does not allow values below 0x20 or above 0x7F, there are some exceptions implemented by this ACH Charset:

  • 0x0A is a linefeed. Although it falls below the 0x20 lower bound, it is often used as a record separator. It is allowed for compatibility with common implementations.
  • 0x0D is a carriage return. It is sometimes found in conjunction with a linefeed in files generated by Windows and related operating systems. Both the encoder and decoder silently skip carriage returns, and the encoder's canEncode((char) 0x000D) method returns false.

The above rules cause CRLF to be encoded and decoded as LF on all platforms. In addition:

  • 0x7F is an unprintable control character called DEL. It is not allowed.
  • U+0085 is encoded as a linefeed (0x0A). It is the Unicode next-line character (NEL), equivalent to the EBCDIC NL character used by mainframe systems as a line delimiter. For compatibility, this character set always writes a newline as a linefeed. Encoding is safe because the character is unambiguously a Unicode newline. A 0x85 byte is not decoded, because doing so would require guessing the actual (non-ASCII) encoding of the input stream. In UTF-8, 0x85 would be the second or later byte of a multibyte sequence. In Windows-1252, 0x85 would be a horizontal ellipsis (…). In ISO-8859-1, 0x85 would be undefined.
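
The encoding-side rules above can be summarized as a small classification function. This is a minimal sketch of the rules as this README states them, not the project's actual encoder. The names `AchEncodeRules`, `Action`, and `classify` are hypothetical and exist only for illustration.

```java
/** Hypothetical helper: classifies a character by the encoding rules listed above. */
final class AchEncodeRules {
    enum Action { PASS, SKIP, TO_LINEFEED, REJECT }

    static Action classify(char c) {
        if (c == 0x0A) return Action.PASS;         // linefeed: allowed as a record separator
        if (c == 0x0D) return Action.SKIP;         // carriage return: silently skipped
        if (c == 0x85) return Action.TO_LINEFEED;  // Unicode NEL: written as a linefeed
        if (c >= 0x20 && c <= 0x7E) return Action.PASS; // printable US-ASCII
        return Action.REJECT;                      // other controls, DEL (0x7F), non-ASCII
    }

    public static void main(String[] args) {
        System.out.println(classify('A'));         // PASS
        System.out.println(classify((char) 0x7F)); // REJECT
    }
}
```

A rejected character is then handled according to the configured error action, as described in the next section.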

Disallowed character handling

The encoder and decoder of a Java Charset can each be configured for one of three actions when they encounter an error encoding or decoding a character:

  1. Report, which in most cases results in a CharacterCodingException
  2. Replace, which replaces the unknown code with a predefined placeholder
  3. Ignore, which causes the output to be shorter

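Replacing the character is often the best approach, and it is what readers and writers constructed from a charset name do by default. An encoder or decoder obtained directly from newEncoder() or newDecoder() reports errors unless configured otherwise. ACH files have a fixed-width record format, so ignoring errors by skipping characters may cause downstream processing to fail. Reporting errors with an exception may lead to an unrecoverable error requiring manual intervention.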
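
The effect of the three actions on output length can be seen with a short experiment. The sketch below uses the JDK's built-in US-ASCII charset as a stand-in, on the assumption that the ACH charset behaves analogously for a character outside its range. The class `ErrorActionDemo` and its `encode` method are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ErrorActionDemo {
    // Encodes text with US-ASCII (a stand-in for ACH) under the given error action.
    static String encode(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            // Decode the bytes back only so the result can be displayed as text
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last character is outside US-ASCII
        System.out.println(encode(text, CodingErrorAction.REPORT));  // error: UnmappableCharacterException
        System.out.println(encode(text, CodingErrorAction.REPLACE)); // caf?
        System.out.println(encode(text, CodingErrorAction.IGNORE));  // caf
    }
}
```

Only the replace action keeps the output the same length as the input, which is why it suits fixed-width records.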

Examples

Decoding an InputStream to a Reader

Length-preserving reads of an ACH input stream

An input stream that is expected to contain only characters allowed by ACH may still contain an unexpected value. Reporting the error with an exception could abort and delay the entire file ingestion stage because of a single field on a single record. Ignoring the error by skipping the unexpected character shifts every later offset, which breaks subsequent processing of fixed-width fields. The best approach may be to substitute a replacement character into the stream and allow processing to continue. Substituting the Unicode replacement character (�, U+FFFD) is what an InputStreamReader constructed from a charset name does by default.

InputStream bytesIn = new FileInputStream("input.ach");
// Charset can be passed by name because it has a provider resource in the classpath
Reader reader = new InputStreamReader(bytesIn, "ACH");
// Reader will replace unexpected bytes with the Unicode replacement character
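
To check that replacement really preserves length, the read can be tried against an in-memory stream. This sketch substitutes the built-in US-ASCII charset for "ACH" so that it runs without the provider on the classpath. The class `ReplacingReadDemo` and the method `readAll` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReplacingReadDemo {
    // Reads every character from the bytes using a reader that replaces bad input.
    static String readAll(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.US_ASCII)) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = { 'A', 'B', (byte) 0xE9, 'C' }; // 0xE9 is not valid US-ASCII
        String text = readAll(input);
        System.out.println(text.length());             // 4: same length as the input
        System.out.println(text.charAt(2) == '\uFFFD'); // true
    }
}
```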

Forcing failure if the input contains unexpected characters

If an input stream that is expected to contain only characters allowed by ACH contains an unexpected value, the reader can be configured to report the error with an exception. This prevents missing or replacement characters from being passed along, which ensures that only completely clean inputs continue processing. A reader constructed from a charset name replaces errors, so the reader must instead be given an explicit CharsetDecoder that reports them.

InputStream bytesIn = new FileInputStream("input.ach");
// Retrieve Charset by name because it has a provider resource in the classpath
Charset ACH = Charset.forName("ACH");
// Obtain an explicit decoder and make it report malformed input
CharsetDecoder decoder = ACH.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
// Use the constructor that accepts a CharsetDecoder
Reader reader = new InputStreamReader(bytesIn, decoder);
// Reader will throw an exception if it encounters an unexpected byte
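
A caller usually wants to distinguish a bad byte from other I/O failures. The sketch below shows one way to do that, again using US-ASCII as a stand-in for "ACH". It sets both error actions because this README does not say whether the ACH decoder classifies a disallowed byte as malformed or unmappable. The class `StrictReadDemo` and the method `isClean` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictReadDemo {
    // Returns true only if every byte decodes without a coding error.
    static boolean isClean(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            while (reader.read() != -1) {
                // Drain the stream; a bad byte surfaces as an exception
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // a coding error, which is a subclass of IOException
        } catch (IOException e) {
            throw new UncheckedIOException(e); // any other I/O failure
        }
    }

    public static void main(String[] args) {
        System.out.println(isClean(new byte[] { 'O', 'K' }));         // true
        System.out.println(isClean(new byte[] { 'N', (byte) 0xE9 })); // false
    }
}
```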

Encoding a Writer to an OutputStream

Length-preserving writes to an ACH output stream

ACH files require each record to be 94 characters. The critical fields necessary for processing a file are usually generated by well-tested templates. A template may include text fields from a source that contains a wider range of characters than ACH allows. Injecting an unexpected character could cause problems for downstream systems. Reporting the error with an exception could abort and delay the entire file generation stage because of a single field on a single record, and skipping the character would shorten the record. The best solution in this case is often to substitute the encoder's replacement for the offending character, which is what an OutputStreamWriter constructed from a charset name does by default. The replacement is a question mark (?).

OutputStream bytesOut = new FileOutputStream("output.ach");
// Charset can be passed by name because it has a provider resource in the classpath
Writer writer = new OutputStreamWriter(bytesOut, "ACH");
// Writer will replace unexpected characters with '?'
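
The length-preserving property can be checked on a single record. This sketch writes a 94-character record through a replacing writer and inspects the bytes, with US-ASCII standing in for "ACH". The class `ReplacingWriteDemo`, the method `encodeRecord`, and the sample field value are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ReplacingWriteDemo {
    // Encodes one record through a writer that replaces unencodable characters.
    static byte[] encodeRecord(String record) {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytesOut, StandardCharsets.US_ASCII)) {
            writer.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed and flushed by this point
        return bytesOut.toByteArray();
    }

    public static void main(String[] args) {
        // Pad a sample field to the 94-character record width
        String record = String.format("%-94s", "Jos\u00e9 Example");
        byte[] bytes = encodeRecord(record);
        System.out.println(bytes.length);    // 94
        System.out.println((char) bytes[3]); // ?
    }
}
```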

Forcing failure if the output contains unexpected characters

If a single bad character is considered sufficient cause to abort generation of an ACH file, the encoding can be configured to throw an exception rather than continuing. A writer constructed from a charset name replaces errors, so the writer must instead be given an explicit CharsetEncoder that reports them.

OutputStream bytesOut = new FileOutputStream("output.ach");
// Retrieve Charset by name because it has a provider resource in the classpath
Charset ACH = Charset.forName("ACH");
// Obtain an explicit encoder and make it report unmappable characters
CharsetEncoder encoder = ACH.newEncoder().onUnmappableCharacter(CodingErrorAction.REPORT);
// Use the constructor that accepts a CharsetEncoder
Writer writer = new OutputStreamWriter(bytesOut, encoder);
// Writer will throw an exception if it encounters an unexpected character
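
An alternative to catching the exception mid-write is to check a record before writing any of it, using the encoder's canEncode method. The sketch below locates the first offending character. It uses US-ASCII as a stand-in, and the class `PrecheckDemo` and the method `firstUnencodable` are hypothetical names. With the ACH charset, a carriage return would also be flagged, since canEncode returns false for it even though the encoder silently skips it.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class PrecheckDemo {
    // Returns the index of the first character the encoder cannot encode, or -1 if none.
    static int firstUnencodable(CharSequence text, CharsetEncoder encoder) {
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        System.out.println(firstUnencodable("cafe", encoder));      // -1
        System.out.println(firstUnencodable("caf\u00e9", encoder)); // 3
    }
}
```

Checking first lets the caller reject or repair a record without leaving a partially written file behind.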
