UNICODE

27 de septiembre de 2012 - 15:02 - Oracle

Unicode

Unicode on Oracle

Note: If you are unfamiliar with Unicode see the brief primer at this link before continuing.

Oracle Character Sets

Unicode characters are stored in an Oracle database using one of Oracle's Unicode character sets. An Oracle character set defines a group of characters with the following common features.

same region of the world
same encoding
same code point map

The following table contains a sample of Unicode character sets available in a Oracle database as well as some older character sets commonly used before Unicode was invented.

Oracle Character Set Name	Region	IANA Character Set Name	Can be Used as		Notes
Oracle Character Set Name	Region	IANA Character Set Name	Database Character Set	National Character Set	Notes
AL32UTF8	ALL	UTF-8	Y	N	the default database character set for new installations this is Oracle's first recommended database character set choice for new system deployment
AL16UTF16	ALL	UTF-16BE	N	Y	the default national character set for new installations processing AL16UTF16 strings is typically faster than processing UTF8 strings because Oracle processes most AL16UTF16 characters as fixed-width characters
AL16UTF16LE	ALL	UTF-16LE	N	N	can only be used for character set conversion
UTF8	ALL	CESU-8	Y	Y	exists for backward compatibility with Oracle 8 databases
WE8ISO8859P15	Western Europe	ISO-8859-15	Y	N
US7ASCII	US	US-ASCII	Y	N	many character sets are a superset of US7ASCII; i.e. the first 128 characters of the superset are the same as US7ASCII

Database Character Set

Much of the following information was taken from Oracle Database Globalization Support Guide - Choosing a Character Set.

Oracle uses the "database character" set for:

storing data in SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG)
storing table names, column names, and PL/SQL variable names
storing SQL and PL/SQL source code.

The database character set is defined using the CREATE DATABASE statement. The database parameter NLS_CHARACTERSET identifies the database character set used on a given database.

The database character set must be a superset of either EBCDIC or 7-bit ASCII, whichever is the native character set on the platform. It is therfore not possible to use a fixed-width, multibyte character set as the database character set.

In some cases it is possible to change the database character set after the database has been created. In most of these cases a full export/import will be required to do the conversion.

National Character Set

The national character set is used for data stored inside NCHAR, NVARCHAR, NCLOB columns. It is defined using the CREATE DATABASE statement. The database parameter NLS_NCHAR_CHARACTERSET identifies the national character set used on a given database.

The national character set can be either UTF8 or AL16UTF16. The default is AL16UTF16.

Values returned by Oracle's UNISTR() function are always in the national character set.

Determining the Character Sets in Use for a Database

To determine the database character set and the national character set for a given database use the following query.

select parameter, value
from   nls_database_parameters
where  parameter like 'NLS%CHARACTERSET' ;

PARAMETER                      VALUE
------------------------------ ---------------
NLS_CHARACTERSET               AL32UTF8
NLS_NCHAR_CHARACTERSET         AL16UTF16

The CHR(n) Function

Oracle's CHR function takes a numeric argument n and returns the VARCHAR2 value that is the binary equivalent of n in the database character set (or the national character set if USING NCHAR_CS is specified). Since one character set's binary representation of a character can differ from another character set's binary representation CHR is a poor choice for representing Unicode characters in SQL or PL/SQL code.

For example, say we wanted to write a query that returned values containing the copyright symbol © (U+00A9). If we were working on a database whose database character set was WE8ISO8859P1 a WHERE clause like this would work.

where description like '%' || chr(169) || '%'

However, if the code were later moved to a database with a database character set of AL32UTF8 the WHERE clause would no longer function properly. On an AL32UTF8 database the copyright symbol has a binary representation of 49833, not 169.

In place of CHR programmers can use the UNISTR function. This function's return value depends on an encoded string, not a binary value. Thus, a WHERE clause like the following will work on any database regardless of its database character set.

where description like unistr( '%