Posted on 2009-03-26 14:30

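A Chinese character takes 2 bytes of storage on the computer. When the database uses a single-byte character set, it will accept half a Chinese character, because each single byte counts as one valid piece of data; the common English locales en_us.819 and en_us.utf8 are examples. When the database uses a multi-byte character set, however, half a Chinese character is an illegal, incomplete character, and the database reports "illegal character" when it tries to store such data; the common Chinese locales zh_cn.gb and zh_cn.GB18030-2000 are examples. To deal with this, I wrote a small program that filters half Chinese characters out of database data.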

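How it works:
A Chinese character consists of 2 bytes, and each of the two bytes has a value greater than 127 (its high bit is set); this is the case in GB2312. So whenever we find a byte whose value is greater than 127, we check whether the byte immediately after it is also greater than 127. If it is, the two bytes form a complete Chinese character; if not, the first byte is half a Chinese character.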

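Steps:
1. Unload the data from the database as plain text.
2. Filter the data with trim infile outfile; this removes every half Chinese character that is immediately followed by a non-Chinese character.
3. Set the Chinese character set, then load the data back into the database.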

/*******************************************************************************
*
*    Module:       trim
*    Author:       Richard ZHAN
*    Description: Eliminate half Chinese character followed by a non Chinese character in a plain data file
*
*    Change Log
*
*    Date          Name          Description.................
*    03/20/2009    Richard ZHAN Start Program
*
*******************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define LEN 4096

static void usage (void);

int
main (int argc, char *argv[])
{
  int rfd, wfd;
  ssize_t len1, len2, i;
  /* str2 has one extra byte: a lead byte carried over from the previous
     chunk can make the output one byte longer than the chunk just read. */
  char hi, *p2, str1[LEN], str2[LEN + 1];

  if (argc != 3)
    {
      usage ();
      exit (1);
    }
  if ((rfd = open (argv[1], O_RDONLY)) == -1)
    {
      perror ("Cannot open read file");
      exit (1);
    }
  /* O_TRUNC: an existing, longer outfile must not keep its old tail. */
  if ((wfd = open (argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
    {
      perror ("Cannot open write file");
      close (rfd);
      exit (1);
    }

  /* hi holds a pending lead byte; '\0' means none. It is kept across
     reads so that a character split over two chunks is still paired. */
  hi = '\0';
  while ((len1 = read (rfd, str1, LEN)) > 0)
    {
      len2 = 0;
      p2 = str2;
      for (i = 0; i < len1; i++)
        {
          if ((unsigned char) str1[i] > 0x7F)
            {
              if (hi == '\0')
                {
                  /* first high byte: wait for its partner */
                  hi = str1[i];
                }
              else
                {
                  /* second high byte: emit the complete character */
                  *p2++ = hi;
                  *p2++ = str1[i];
                  len2 += 2;
                  hi = '\0';
                }
            }
          else
            {
              /* low byte: emit it; any pending lead byte is dropped */
              *p2++ = str1[i];
              len2++;
              hi = '\0';
            }
        }
      if (write (wfd, str2, len2) != len2)
        {
          perror ("Encounter write error");
          close (rfd);
          close (wfd);
          exit (1);
        }
    }
  if (len1 < 0)
    {
      perror ("Encounter read error");
      close (rfd);
      close (wfd);
      exit (1);
    }
  /* A lead byte still pending at end of file is a half character
     and is simply not written. */
  close (rfd);
  close (wfd);
  return 0;
}

static void
usage (void)
{
  fprintf (stderr, "Usage: trim infile outfile\n");
}