Unihan
31 August 2003
謝淑真
Unihan.txt is a text file that describes many of the Han (Chinese, Japanese and Korean) characters in the Unicode standard. For each character it gives pronunciations and a definition, along with references for finding the character in a number of standard dictionaries. A typical entry looks like this:
U+885B kAlternateKangXi 1109.029
U+885B kAlternateMorohashi 34073
U+885B kBigFive BDC3
U+885B kCCCII 215749
U+885B kCNS1986 1-6E6C
U+885B kCNS1992 1-6E6C
U+885B kCangjie HODQN
U+885B kCantonese WAI6
U+885B kDaeJaweon 1574.110
U+885B kDefinition guard, protect, defend
U+885B kEACC 215749
U+885B kFrequency 3
U+885B kGB1 4632
U+885B kHanYu 20845.040
U+885B kIRGDaeJaweon 1574.110
U+885B kIRGDaiKanwaZiten 34073
U+885B kIRGHanyuDaZidian 20845.040
U+885B kIRGKangXi 1109.290
U+885B kIRG_GSource 1-4E40
U+885B kIRG_JSource 0-3152
U+885B kIRG_KPSource KP0-FDD4
U+885B kIRG_KSource 0-6A5B
U+885B kIRG_TSource 1-6E6C
U+885B kIRG_VSource 1-6644
U+885B kJapaneseKun MAMORU MAMORI
U+885B kJapaneseOn EI E
U+885B kJis0 1750
U+885B kKPS0 FDD4
U+885B kKSC0 7459
U+885B kKangXi 1109.290
U+885B kKorean WI
U+885B kMandarin WEI4
U+885B kMatthews 7089
U+885B kMorohashi 34073
U+885B kNelson 1596
U+885B kPhonetic 1433
U+885B kRSKangXi 144.9
U+885B kRSUnicode 144.9
U+885B kSBGY 375.17
U+885B kSimplifiedVariant U+536B
U+885B kTaiwanTelegraph 5898
U+885B kTotalStrokes 15
U+885B kXerox 244:077
U+885B kZVariant U+536B
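Each line of such an entry holds three things: a code point, a field name, and a value. As a minimal sketch of how lines in this layout might be read, the Python function below groups them by code point. It assumes the three parts are separated by whitespace, as in the listing above, and that lines starting with # are comments. The name parse_unihan is hypothetical and used only for illustration.

```python
from collections import defaultdict


def parse_unihan(lines):
    """Group Unihan-style lines into {code_point: {field: value}}."""
    entries = defaultdict(dict)
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        # Split into code point, field name, and the rest of the line.
        # The value may itself contain spaces ("guard, protect, defend").
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that does not have all three parts
        code_point, field, value = parts
        entries[code_point][field] = value
    return dict(entries)


sample = [
    "U+885B kCantonese WAI6",
    "U+885B kDefinition guard, protect, defend",
    "U+885B kTaiwanTelegraph 5898",
]
entries = parse_unihan(sample)
print(entries["U+885B"]["kDefinition"])
```

To read the real file, the same function could be given an open file object, for example parse_unihan(open("Unihan.txt", encoding="utf-8")), since iterating over a file yields its lines.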
For each entry like the one above in the basic file, I want to generate a paragraph like the following:
衛
guard, protect, defend. It is pronounced WAI6 in Cantonese. It is written U+885b in the Unicode system. It has a 'commercial code' of 5898. The HTML entity &#x885B; can be used to include it in a webpage. The 6 after the pronunciation means that it has a low level tone. (This is written in the PRC 'simplified' character set as 卫.)
(Plus all the other information, of course, as well as links to the various dictionary definitions where possible.)
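One way the paragraph could be assembled from the fields of a single character is sketched below in Python. This is only an illustration under stated assumptions, not a finished generator. It assumes that kTaiwanTelegraph is the field behind the 'commercial code', as the matching value 5898 suggests. It also assumes that variant fields hold plain code points as in the listing, and that the tone is the final digit of the first Cantonese reading. The names describe, char_from_code_point and CANTONESE_TONES are hypothetical.

```python
# Tone descriptions for the digit that ends a Cantonese reading.
# Only tone 6 ("low level") comes from the text above; the rest of the
# table would have to be filled in from a Cantonese reference.
CANTONESE_TONES = {"6": "low level"}


def char_from_code_point(code_point):
    """Turn a string such as 'U+885B' into the character it names."""
    return chr(int(code_point[2:], 16))


def describe(code_point, fields):
    """Build a short paragraph from the fields known for one character."""
    sentences = []
    if "kDefinition" in fields:
        sentences.append(fields["kDefinition"] + ".")
    if "kCantonese" in fields:
        sentences.append(f"It is pronounced {fields['kCantonese']} in Cantonese.")
    sentences.append(f"It is written {code_point} in the Unicode system.")
    if "kTaiwanTelegraph" in fields:  # assumed source of the 'commercial code'
        sentences.append(f"It has a 'commercial code' of {fields['kTaiwanTelegraph']}.")
    # A hexadecimal numeric character reference built from the code point.
    sentences.append(
        f"The HTML entity &#x{code_point[2:]}; can be used to include it in a webpage."
    )
    if "kCantonese" in fields:
        tone = fields["kCantonese"].split()[0][-1]  # last character of first reading
        if tone in CANTONESE_TONES:
            sentences.append(
                f"The {tone} after the pronunciation means that it has a "
                f"{CANTONESE_TONES[tone]} tone."
            )
    if "kSimplifiedVariant" in fields:
        variants = " ".join(
            char_from_code_point(v) for v in fields["kSimplifiedVariant"].split()
        )
        sentences.append(
            f"(This is written in the PRC 'simplified' character set as {variants}.)"
        )
    # The character goes on its own line, followed by the paragraph.
    return char_from_code_point(code_point) + "\n" + " ".join(sentences)


fields = {
    "kDefinition": "guard, protect, defend",
    "kCantonese": "WAI6",
    "kTaiwanTelegraph": "5898",
    "kSimplifiedVariant": "U+536B",
}
text = describe("U+885B", fields)
print(text)
```

Each sentence is added only when its field is present, so characters with sparser entries still produce a readable paragraph. Links to dictionary definitions are left out of this sketch.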