Evolution of character encoding

Posted on October 20, 2016 by Marco Barnig

In computing, character encoding is used to represent a repertoire of characters by some kind of encoding system. Character encoding started with the telegraph code. The numerical values that make up a code space are called code points (or code positions).

ASCII

One of the most widely used character encoding schemes is ASCII, abbreviated from American Standard Code for Information Interchange. ASCII comprises 128 code points in the range 00hex to 7Fhex. Work on ASCII standardization began in 1960 and the first standard was published in 1963. The first commercial use of ASCII was as a seven-bit teleprinter code promoted by Bell Data Services in 1963. ASCII encodes 128 specified characters into seven-bit integers. The characters encoded are the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, a space and some non-printing control codes. The eighth bit in an ASCII byte, unused for coding, was often used for error control (parity) in transmission protocols.
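The use of the eighth bit for error control can be illustrated with a parity bit. The following is a minimal sketch in Swift, assuming even parity; the function name withEvenParity is made up for this example and real transmission protocols differed in the parity scheme they used.

```swift
// Even parity (an assumption for this sketch): set bit 7 so that the
// total number of 1-bits in the transmitted byte is even.
func withEvenParity(_ ascii: UInt8) -> UInt8 {
    precondition(ascii < 0x80, "ASCII code points fit in seven bits")
    let ones = ascii.nonzeroBitCount
    return ones % 2 == 0 ? ascii : ascii | 0x80
}

let letterA: UInt8 = 0x41   // "A" = 100 0001, two 1-bits
let letterC: UInt8 = 0x43   // "C" = 100 0011, three 1-bits

print(String(withEvenParity(letterA), radix: 16))   // 41
print(String(withEvenParity(letterC), radix: 16))   // c3
```

A receiver that counts an odd number of 1-bits in a byte knows that a single-bit error occurred during transmission.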

Extended ASCII

Extended ASCII uses all 8 bits of a byte and comprises 256 code points in the range 00hex to FFhex. The term is misleading because it does not mean that the ASCII standard has been updated to include more than 128 characters, nor does it unambiguously identify a single encoding. There are hundreds of character encoding schemes, based on ASCII, which use the eighth bit to encode 128 additional characters used in languages other than American English or for special purposes. Some 8-bit codes are listed hereafter:

EBCDIC : Extended Binary Coded Decimal Interchange Code, used mainly by IBM (an 8-bit code that is not based on ASCII)
ISO 8859 : a joint ISO and IEC series of standards, comprising 15 variants; the most popular is ISO 8859-1 (called Latin 1)
ATASCII and PETSCII : introduced by ATARI and Commodore for the first home computers
Mac OS Roman : launched by Apple Computer

Unicode

Unicode Logo

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems and historical scripts. The standard is maintained by the Unicode Consortium, a non-profit organization. The most recent version of Unicode is 9.0, published in June 2016. Unicode comprises 1,114,112 code points in the range 00hex to 10FFFFhex. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by all industry leaders in the information technology domain.
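The arithmetic behind the code space can be checked with a few lines of Swift. This is only an illustrative sketch; the helper plane(of:) is a name chosen for this example.

```swift
let planeCount = 17
let planeSize = 1 << 16                  // 65,536 code points per plane
let codeSpace = planeCount * planeSize   // size of the whole code space

// The plane of a code point is given by the bits above the lower 16 bits.
func plane(of scalar: Unicode.Scalar) -> Int {
    return Int(scalar.value >> 16)
}

print(codeSpace)                 // 1114112
print(plane(of: "A"))            // 0 (basic multilingual plane)
print(plane(of: "\u{1F600}"))    // 1 (a supplementary plane)
```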

Emoji

Emoji are ideograms and smileys used in electronic messages and web pages. The characters exist in various genres, including facial expressions, common objects, places and animals. Originating on Japanese mobile phones in the late 1990s, emoji have become increasingly popular worldwide since their international inclusion in Apple’s iPhone, which was followed by similar adoption by Android and other mobile operating systems. The word emoji comes from Japanese and means pictogram; the resemblance to the English words “emotion” and “emoticon” is just a coincidence. Emoji are now included in Unicode.

Emoji Candidates

Anyone can submit a proposal for a new emoji character, but the proposal needs to have all the right information for it to have a chance of being accepted. The conditions and the process are described on the Submitting Emoji Character Proposals webpage of the Unicode Consortium. The following figure shows the 8 emoji candidates for the next meeting (Q4 2016) of the Unicode Technical Committee (UTC). When approved, these characters will be added to Unicode 10.0, for release in June 2017.

New Emoji Candidates 2016

Unicode Adopt-a-Character

Unicode launched the initiative Adopt-a-Character to help the non-profit consortium in its goal to support the world’s languages. There are three sponsorship levels : Gold, Silver and Bronze. All sponsors are acknowledged in Unicode’s Sponsors of Adopted Characters and their public Twitter feed with their level of support, and they receive a custom digital badge for their character. A bronze adoption costs only 100 US$.

Unicode Encoding

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16, UTF-32. A comparison of the encoding schemes is available at Wikipedia.

UTF-8

UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. UTF-8 was originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. There are three sorts of units in a variable-width encoding (multibyte encoding) :

Singleton : a single unit (one byte)
Lead : a lead unit comes first in a multibyte encoding
Trail : a trail unit comes afterwards in a multibyte encoding

UTF-8 was first presented in 1993.
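In UTF-8 the three sorts of units can be recognized from the leading bits of each byte. The following sketch classifies the bytes of a string; the enum UTF8Unit and the function classify are names invented for this illustration, and the function does not validate that a byte sequence is well-formed.

```swift
enum UTF8Unit { case singleton, lead, trail }

// Classify a UTF-8 code unit by its leading bits.
func classify(_ byte: UInt8) -> UTF8Unit {
    if byte & 0x80 == 0x00 { return .singleton }   // 0xxxxxxx : ASCII
    if byte & 0xC0 == 0x80 { return .trail }       // 10xxxxxx
    return .lead                                   // 11xxxxxx
}

// "é" (U+00E9) needs two bytes, "€" (U+20AC) needs three bytes.
for byte in "\u{E9}\u{20AC}".utf8 {
    print(String(byte, radix: 16), classify(byte))
}
// c3 lead
// a9 trail
// e2 lead
// 82 trail
// ac trail
```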

UTF-16

UTF-16 is a character encoding capable of encoding all 1,112,064 possible characters in Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 is incompatible with ASCII files. UTF-16 was developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.
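Code points above FFFFhex are encoded in UTF-16 as a pair of 16-bit units, called a surrogate pair. The 2,048 code points reserved for these surrogates (D800hex to DFFFhex) explain the figure of 1,112,064 characters : 1,114,112 - 2,048. The sketch below shows the calculation; utf16Units is a name chosen for this example, and the result is compared with the UTF-16 view that Swift provides.

```swift
// Encode one Unicode scalar value as one or two UTF-16 code units.
func utf16Units(for scalar: Unicode.Scalar) -> [UInt16] {
    let value = scalar.value
    if value < 0x10000 {
        return [UInt16(value)]                     // one unit is enough
    }
    let offset = value - 0x10000                   // 20 significant bits
    let high = UInt16(0xD800 + (offset >> 10))     // upper 10 bits
    let low = UInt16(0xDC00 + (offset & 0x3FF))    // lower 10 bits
    return [high, low]
}

let units = utf16Units(for: "\u{1F600}")
print(units.map { String($0, radix: 16) })         // ["d83d", "de00"]
print(units == Array("\u{1F600}".utf16))           // true
print(1_114_112 - 2_048)                           // 1112064
```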

UTF-32

UTF-32 is a protocol to encode Unicode code points that uses exactly 32 bits per Unicode code point. This makes UTF-32 a fixed-length encoding, in contrast to all other Unicode transformation formats which are variable-length encodings. The UTF-32 form of a code point is a direct representation of that code point’s numerical value.

The main advantage of UTF-32, versus variable-length encodings, is that the Unicode code points are directly indexable. This makes UTF-32 a simple replacement in code that uses integers to index characters out of strings, as was commonly done for ASCII. The main disadvantage of UTF-32 is that it is space inefficient.
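The trade-off between direct indexing and space can be made visible by comparing the storage needed for the same text in the three encodings. This is a small sketch using the views that the Swift String type offers; the sample text is arbitrary.

```swift
let sample = "C\u{E9}line \u{1F600}"               // "Céline 😀"

let utf8Bytes  = sample.utf8.count                 // 1 to 4 bytes per code point
let utf16Bytes = sample.utf16.count * 2            // 2 or 4 bytes per code point
let utf32Bytes = sample.unicodeScalars.count * 4   // always 4 bytes per code point

print(utf8Bytes, utf16Bytes, utf32Bytes)           // 12 18 32

// With a fixed-width array of code points, the n-th code point is
// reached directly by its index.
let codePoints = Array(sample.unicodeScalars)
print(String(codePoints[1].value, radix: 16))      // e9
```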

Unicode Equivalence

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets. Unicode provides two such notions, canonical equivalence and compatibility.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.

Unicode Normalization

The Unicode standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones). Each of these four normal forms can be used in text processing :

NFD : Normalization Form Canonical Decomposition
NFC : Normalization Form Canonical Composition
NFKD : Normalization Form Compatibility Decomposition
NFKC : Normalization Form Compatibility Composition

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
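In Swift, the four normal forms are available through Foundation as string properties. The following sketch applies them to a text that contains both a compatibility character (the ligature ﬁ) and a combining accent; the helper scalars is a name introduced only for this example.

```swift
import Foundation

// Returns the code points of a string. Swift compares strings by canonical
// equivalence, so the code points must be inspected to see the difference.
func scalars(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

// "ﬁancé" written as ligature ﬁ + a + n + c + e + combining acute accent
let text = "\u{FB01}ance\u{301}"

let nfd  = text.decomposedStringWithCanonicalMapping
let nfc  = text.precomposedStringWithCanonicalMapping
let nfkd = text.decomposedStringWithCompatibilityMapping
let nfkc = text.precomposedStringWithCompatibilityMapping

print(scalars(nfd).count)    // 6 : ligature kept, accent separate
print(scalars(nfc).count)    // 5 : ligature kept, é precomposed
print(scalars(nfkd).count)   // 7 : ligature split into f + i, accent separate
print(scalars(nfkc).count)   // 6 : ligature split into f + i, é precomposed

// Idempotence : normalizing an already normalized string changes nothing.
print(scalars(nfc.precomposedStringWithCanonicalMapping) == scalars(nfc))   // true
```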

Combining characters

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process. The following table shows the name Céline in both forms :

NFC character	C	é		l	i	n	e
NFC code point	0043	00e9		006c	0069	006e	0065
NFD code point	0043	0065	0301	006c	0069	006e	0065
NFD character	C	e	◌́	l	i	n	e
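The code points of the table above can be reproduced with Foundation. This is a minimal sketch; hexScalars is a helper name made up for this example.

```swift
import Foundation

// Lists the code points of a string as four-digit hexadecimal numbers.
func hexScalars(_ s: String) -> [String] {
    return s.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16)
        return String(repeating: "0", count: max(0, 4 - hex.count)) + hex
    }
}

let name = "C\u{E9}line"

print(hexScalars(name.precomposedStringWithCanonicalMapping))
// ["0043", "00e9", "006c", "0069", "006e", "0065"]

print(hexScalars(name.decomposedStringWithCanonicalMapping))
// ["0043", "0065", "0301", "006c", "0069", "006e", "0065"]
```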

Unicode support in OS X Swift

In Swift, strings are made up of values of a special type called Character (with capital C), which are in turn made of Unicode scalar values (unique 21-bit numbers). We have seen in the previous section that "é" can be described in two ways :

let eAcute = "\u{E9}" // é
let combinedEAcute = "\u{65}\u{301}" // e followed by ´

The accent scalar (U+0301) is called "COMBINING ACUTE ACCENT". A Swift Character can contain multiple Unicode scalars, but only if those scalars are supposed to be displayed as a single entity. Otherwise the Swift compiler reports an error.
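A consequence of this design is that Swift compares characters and strings by canonical equivalence and counts characters rather than code points. A short sketch :

```swift
let eAcute: Character = "\u{E9}"                  // one scalar
let combinedEAcute: Character = "\u{65}\u{301}"   // two scalars, one Character

print(eAcute == combinedEAcute)                        // true
print(String(eAcute).unicodeScalars.count)             // 1
print(String(combinedEAcute).unicodeScalars.count)     // 2

// The decomposed spelling of Céline still counts six characters.
let decomposedName = "Ce\u{301}line"
print(decomposedName.count)                            // 6
print(decomposedName.unicodeScalars.count)             // 7
```

The difference between the two forms only becomes visible when the string leaves Swift, for example as bytes passed to another program.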

The Swift class Process (formerly NSTask), used to run Unix executables in OS X apps, encodes characters using decomposed Unicode forms (NFD). This can be a problem when handling special characters in Swift (hash, dcmodify, …).

The following example shows the use of the DCMTK function dcmodify where the arguments to insert a new tag are handled with the argument

 let argument = ["-i", "(4321,1011)=Ostéomyélite à pyogènes"]

The accented characters are passed in the decomposed form. A workaround is to first save the value Ostéomyélite à pyogènes in a text file (Directory/Temp.txt) in a temporary folder and then to use the dcmodify -if option to pass the text file in the argument:

 let argument = ["-if", "(4321,1011)=" + pathToTempFile] // pathToTempFile : path to Directory/Temp.txt

The accented characters are now passed in the composed form as wanted.
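The workaround can be sketched as follows. This is only an illustration under assumptions : the function name insertFromFileArguments is made up, the file name Temp.txt follows the example above, and the call of dcmodify through Process is left out. The value is explicitly converted to the composed form (NFC) before it is written to the file.

```swift
import Foundation

// Hypothetical helper : writes the tag value in composed form (NFC) to
// Temp.txt in the given directory and returns the dcmodify arguments.
func insertFromFileArguments(tag: String, value: String, directory: URL) throws -> [String] {
    let file = directory.appendingPathComponent("Temp.txt")
    let composed = value.precomposedStringWithCanonicalMapping
    try composed.write(to: file, atomically: true, encoding: .utf8)
    return ["-if", "\(tag)=\(file.path)"]
}

let directory = FileManager.default.temporaryDirectory
if let argument = try? insertFromFileArguments(tag: "(4321,1011)",
                                               value: "Oste\u{301}omye\u{301}lite a\u{300} pyoge\u{300}nes",
                                               directory: directory) {
    print(argument)   // ["-if", "(4321,1011)=<temporary folder>/Temp.txt"]
}
```

The returned array can then be assigned to the arguments property of the Process instance that runs dcmodify.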

UTF-8 support in DICOM

Today UTF-8 characters are supported in most DICOM applications, if they are configured correctly. The Specific Character Set (0008,0005) identifies the Character Set that expands or replaces the Basic Graphic Set for values of Data Elements that have Value Representation of SH, LO, ST, PN, LT or UT. For UTF-8, the defined term (multi-byte character sets without code extensions) is ISO_IR 192.

The next figure shows the use of UTF-8 characters in the Orthanc DICOM server.

Orthanc Explorer in 3 Web-Browsers : Safari, Firefox and Microsoft Edge

Orthanc Explorer “Patient Page” in different Web-Browsers : Safari, Firefox and Microsoft Edge

Orthanc Explorer “Instance Page” in different Web-Browsers : Safari, Firefox and Microsoft Edge

OrthancMac OS X El Capitan

Posted on October 18, 2016 by Marco Barnig

Last update : February 28, 2018

Introduction

I started writing this post in June 2015 when I did my first trials with the Orthanc server. In the meantime I created OrthancPi, a mini headless PACS server which is used to host the DICOM teaching files for RadioLogic, an educational tool for radiologists which is currently in alpha test state. It’s now time to update and finalize my post about the installation of the Orthanc server on my MacBook Air computer. The goal is the development of OrthancMac, a midi PACS server for RadioLogic which is more powerful and user-friendly than OrthancPi. Some figures included in the present post refer to earlier versions of Orthanc and to OS X Yosemite because it would be a waste of time to replace them all with current screenshots.

Some of the information provided in the present post is trivial and redundant with my other posts about DICOM and Orthanc. I assembled it for my own needs to get familiar with Orthanc and OS X developments.

Orthanc

Orthanc is an open-source, lightweight DICOM server for healthcare and medical research. It’s now also called a VNA (Vendor Neutral Archive). Orthanc can turn any computer running Windows, Linux or OS X into a PACS (picture archiving and communication system). Orthanc provides a RESTful API and is built on top of DCMTK (a collection of libraries and applications implementing large parts of the DICOM standard). Orthanc is standalone because all the dependencies can be statically linked.

The developer of Orthanc is Sébastien Jodogne, a Belgian medical imaging engineer (2011) of the CHU of Liège (University Hospital) who holds a PhD in computer science (2006) from the University of Liège (ULG).

Orthanc source code

The Orthanc source code is available at Bitbucket. The latest stable Orthanc version is 1.3.1, released on November 2, 2017. Some changes have been made since that date. I downloaded the default (mainline) zip file from the Bitbucket project page and saved the unzipped orthanc folder into a directory named orthancmac located on the Mac OS X (El Capitan) desktop. My configuration is slightly different from the structure assumed in the Darwin compilation instructions, but I prefer this development setup.

The following folders and files are included in the orthanc folder :

Core/
OrthancExplorer/
OrthancServer/
Plugins/
Resources/
UnitTestSources/
.travis.yml (to trigger automated builds)
CMakeLists.txt
README, AUTHORS, COPYING, INSTALL, NEWS, THANKS
LinuxCompilation.txt and DarwinCompilation.txt

The build infrastructure of Orthanc is based upon CMake. The build scripts are designed to embed all the third-party dependencies directly inside the Orthanc executable. CMake supports out-of-source builds, where the build directory is separated from the source directory.

I created a folder build inside the orthanc directory and opened a terminal window inside this build folder.

cd desktop/orthancmac/orthanc/build

To prepare the build process (configuration) on Mac OS X El Capitan, I entered the following command in the terminal window :

cmake .. -GXcode -DCMAKE_OSX_DEPLOYMENT_TARGET=10.11 \
  -DSTATIC_BUILD=ON -DSTANDALONE_BUILD=ON -DALLOW_DOWNLOAD=ON \
  ~/desktop/orthancmac/build

The cmake options are :

-G : specify a build system generator (here Xcode)
-D : create a cmake cache entry

The cmake cache entries are :

CMAKE_OSX_DEPLOYMENT_TARGET : 10.11
STATIC_BUILD : ON
STANDALONE_BUILD : ON
ALLOW_DOWNLOAD : ON

The following figure shows the configuration process when using the CMake-GUI :

Orthanc configuration with CMake GUI

During the configuration process, the following files were downloaded from the website http://www.montefiore.ulg.ac.be/~jodogne/Orthanc/ThirdPartyDownloads/ :

boost_1_60_0_bcpdigest-1.0.1.tar.gz : (boost C++ libraries)
curl-7.50.3.tar.gz : (curl tool and library for transferring data with URL syntax)
dcmtk-3.6.0.zip : (DICOM libraries and applications)
gtest-1.7.0.zip : (C++ Google framework to write tests)
jsoncpp-0.10.5.tar.gz : (C++ library that allows manipulating JSON values)
jpegsrc.v9a.tar.gz : (library for JPEG image compression from the IJG)
libpng-1.5.12.tar.gz : (official PNG reference C89 library)
lua-5.1.5.tar.gz : (powerful, fast, lightweight, embeddable scripting language)
mongoose-3.8.tgz : (easy to use web server)
openssl-1.0.2d.zip : (toolkit implementing SSL v2/v3 and TLS protocols)
pugixml-1.4.tar.gz : (C++ light-weight XML processing library)
zlib-1.2.7.tar.gz : (Unobtrusive Compression Library)

All the files were saved in a new folder orthancmac/orthanc/ThirdPartyDownloads. The programs SQLite3 and Python 2.7.10 were found already installed.

Configuration messages, warnings and errors

During the configuration process, the following messages, warnings and errors were reported :

Files not found

The following files and definitions have not been found during the processing of DCMTK : fstream, malloc, ieeefp, iomanip, iostream, io, png, ndir, new, sstream, stat, strstream, strstrea, sync, sys/ndir, sys/utime, thread, unix, cuserid, _doprnt, itoa, sysinfo, _findfirst, isinf, isnan, uchar, ulong, longlong, ulonglong

Patching

The following files have been patched :

dcmtk-3.6.0/dcmnet/libsrc/dulfsm.cc
dcmtk-3.6.0/dcmnet/libsrc/dul.cc

CMake policies

– policy CMP0042 not set
– policy CMP0054 not set

To avoid the cmake_policy warnings, I added the following commands at the beginning of the CMakeLists.txt file :

if(POLICY CMP0042)
  cmake_policy(SET CMP0042 NEW)
endif()
if(POLICY CMP0054)
  cmake_policy(SET CMP0054 NEW)
endif()

INFOS

DCMTK’s builtin private dictionary support will be disabled
Thread support will be disabled
OS X Path not specified for the following targets:
– ModalityWorklists
– ServeFolders

DOXYGEN

Doxygen not found. The documentation will not be built.

Orthanc Xcode Building

The Build directory contains the following folders and files after the configuration and generation process :

AUTOGENERATED/
boost_1_60_0/
curl-7.50.3/
dcmtk-3.6.0/
gtest-1.7.0/
jpeg-9a/
jsoncpp-0.10.5/
libpng-1.5.12/
lua-5.1.5/
mongoose/
openssl-1.0.2d/
pugixml-1.4/
zlib-1.2.7/
CMakeFiles/
CMakeTmp/
CMakeScripts/
CMakeCache.txt
cmake_install.cmake
cmake_uninstall.cmake
Orthanc.xcodeproj

To build the project, I entered the following command in the terminal window inside the build folder :

xcodebuild -configuration Release

The build was successful; the following numbers of warnings were issued :

3 -> curl
1 -> lua
4 -> mongoose
5 -> zlib
3 -> openssl
> 100 -> DCMTK
1 -> Orthanc

The following figure shows the building process when using the Xcode GUI.

Orthanc Building with Xcode GUI

The eight targets are shown on the left in red. To build the Release version, I modified the scheme in the Xcode GUI. Building with the command line is much easier.

After the successful build, the following folders and files were added to the build folder :

Release/
Orthanc.build/

The Release folder contains the executables Orthanc, UnitTest and OrthancRecoverCompression, the libraries libCoreLibrary.a, libOpenSSL.a, libServerLibrary.a, libServeFolders.mainline.dylib and libModalityWorklists.mainline.dylib. These files are the targets of the Xcode building process.

Running the Orthanc server

The Orthanc configuration file Configuration.json is located in the folder orthanc-default/Resources. I copied this file into the Release folder and started the DICOM server with the command

./Orthanc Configuration.json

inside the Release folder.

Orthanc Server Start with Terminal Window

At the first start of the server, a new folder OrthancStorage is created inside the Release directory. The OrthancStorage folder contains the SQLite files index, index-shm and index-wal.

Entering the URL localhost:8042 in the Safari address field opens the main window (explorer) of the Orthanc server.

Orthanc Explorer at localhost

Clicking the upload button opens a new window in the Orthanc server where I added some DICOM files from CDs (drag and drop).

Uploading DICOM files with Orthanc Server

The DICOM files are saved in sub-folders in the OrthancStorage directory in a flat structure.

I modified the Configuration.json file to allow remote access to the server from another computer located in the same local network.

/**
* Security-related options for the HTTP server
**/
// Whether remote hosts can connect to the HTTP server
"RemoteAccessAllowed" : true,

The remote anonymous access is now possible. When the remote access is not allowed, the server requests a user-id and password when entering the URL in the browser address bar :

http://192.168.168.55:8042/app/explorer.html

Orthanc allows the following actions :

Action	Patient	Study	Series	Instance
protect/unprotect	x	–	–	–
delete	x	x	x	x
send to remote modality	x	x	x	x
anonymize	x	x	x	–
preview	–	–	x	x
download ZIP	x	x	x	–
download DICOMDIR	x	x	x	–
download DICOM file	–	–	–	x
download JSON file	–	–	–	x

Protection

Because of its focus on low-end computers, Orthanc implements disk space recycling: the oldest series of images can be automatically deleted when the available disk space drops below a threshold, or when the number of stored series grows above a threshold. This enables the automated control of the disk space. Recycling is controlled by the MaximumStorageSize and the MaximumPatientCount options in the Orthanc configuration file. It is possible to prevent patient data from being automatically recycled by using the Unprotected/Protected switch that is available in Orthanc Explorer.

Testing the server

When I launched the UnitTests executable from the terminal window in the Release folder, 163 tests from 43 test cases were run. All 163 tests passed; two further tests were disabled.

Orthanc UnitTests

Two new folders were created in the Release folder by the testing process : UnitTestsResults and UnitTestsStorage.

RESTful API

The following list shows the main RESTful commands (links work only in my local network):

Patients : http://localhost:8042/patients
Studies : http://localhost:8042/studies
Series : http://localhost:8042/series
Instances : http://localhost:8042/instances
Patient Name : http://localhost:8042/patients/ba1682fb-3fc01dc1-acaf1294-c0d61888-69ba054b
Study CT Colonne cervicale : http://localhost:8042/studies/d4c42ef2-91794610-dfda2fe3-89fff37f-6d38b159
Series Col. Cervicale Mou 2.0 MPR spine multi : http://localhost:8042/series/e7f7f651-aeacf5d4-a3832d08-3c7a3efa-2eff3c3a
Instance 4 : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32
Download instance.dcm : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/file
Simplified tags : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/simplified-tags
Tags : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/tags
Content : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/content
Preview : http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/preview

A complete grid of the Orthanc RESTful API is available as Google Spreadsheet.
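The URLs in the list above all follow the same pattern : resource type, identifier and an optional sub-resource. A minimal sketch in Swift that assembles such URLs is shown below; the function name orthancURL is made up for this example, the host and port are those of my local setup, and the sketch only builds the URL without sending a request.

```swift
import Foundation

// Hypothetical helper : builds an Orthanc REST URL from path components.
func orthancURL(_ path: [String], host: String = "localhost", port: Int = 8042) -> URL? {
    var components = URLComponents()
    components.scheme = "http"
    components.host = host
    components.port = port
    components.path = "/" + path.joined(separator: "/")
    return components.url
}

let instance = "8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32"

if let url = orthancURL(["instances", instance, "simplified-tags"]) {
    print(url.absoluteString)
    // http://localhost:8042/instances/8d5edbe5-073a70b9-c46dcaa7-9d54a495-6dc5ed32/simplified-tags
}
```

The resulting URL can be passed to URLSession, or opened in a browser, to get the JSON answer of the server.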
