Indexing Project

FamilySearch.org contains tens of thousands of scanned images from Korean genealogies. Unfortunately, they are not indexed. In other words, those on a quest to find their ancestors must view thousands of pages searching for a name. If the images were indexed, then one could simply type the name into an automated search engine. Indexing those records by hand is a massive project that has not yet begun, to the best of my knowledge. However, I have been testing some open source software packages to determine how feasible it would be to automatically index the images.

So far the results are promising. I took a sample page, pre-processed it through Scan Tailor, then ran it through Tesseract-OCR (via PDFOCRX) with the traditional Chinese and Korean OCR training libraries installed. It did a great job of identifying the individual characters. The biggest remaining challenge is to get it to read top-down, left-to-right and follow the page formats of Korean genealogies. Once I get that working for one page, I can work on developing the software necessary for automated retrieval, formatting, batch processing, and outputting. It is no small task, but it would be well worth the effort. It would open up Korean genealogy research to search engines on an unprecedented scale.

The first step is to get it to recognize sons and daughters listed in the record. A more advanced step will be to get it to recognize wives within the text. I’m not sure if I could develop something sophisticated enough to actually build family tree structures automatically, but just locating the names automatically would be a huge step forward, so I’ll start there. I’ll start digging through the source code… I hope it works!

You can see the initial test results below. Tesseract-OCR incorrectly tries to read the characters as left-to-right lines, as the text results show. However, the important thing is that it recognized the HanJa characters rather well.

Preliminary test results:

Original Image: [image]

Processed Image: [image]

Resulting Text:

女 女
暈 安
堂士廣
山 柳 忍孝州
封干舊十丙后父配堤直莫
九笠八子戊孝月南閣洞
人泰后寔人弼 英封日四寅永坡丙公五
籌亡 幕父: 洞後o月生愚慄坐基代
進榮 濂道遠 , 合移摹二忌齋氐 西租
子子 女子 踟
屾 金安轟女 儅父配英違在日四壬一薑子 父配洞
長世 竇鑸 山 李 樟準宣洞川醣卒月戌九玄璉 永醞午
魽 屾章后蠶睾桽鯔孛壼壬 束 杓霖坐
_ 杰 山氐 九西薹四子覃 聿 氏
子 女 子 女 女
李 銓支 乙一生 泅月 后月 泗
羶 忘風郭亥九父亨后城崔 城霜后
玉人鍚生五大煥 人弘 人
‘_ 煥 八東 噎 可木 盆洙
年 巧 戌,.,、 跚-
子子 子子
崔崔 李李
諴燾 東東
客圭 元玩

 

2/1/13 Update: I am finally able to get back to this project. It looks like there is a solution to identifying words despite the complex formatting of Korean JokBo (족보) genealogical records. Tesseract-OCR has an embedded “boxer” that identifies individual characters and creates a box around them. Refer to https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 . I can take the coordinates of the boxes and write my own code to turn each into its own image, recognize them with the OCR routines, then try to automatically piece them together in the right order based on analysis of the coordinates and box sizes. There are quite a few complexities involved in the format detection, but with the impressive array of tools Google created with the Tesseract-OCR project, it is certainly within the realm of possibility…
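To make the idea of working from box coordinates concrete, here is a minimal sketch of reading one line of a box file and converting it into a crop rectangle. It assumes the six-field box file layout used by Tesseract 3 (character, left, bottom, right, top, page), with coordinates measured from the bottom-left corner of the image. The names Box, parse_box_line and to_crop_rect are my own placeholders for illustration, not part of Tesseract.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """One character box; coordinates use a bottom-left origin."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line of the assumed form 'char left bottom right top page'."""
    # Split from the right so the five numeric fields are always separated
    # cleanly, whatever the character field contains.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected box line: {line!r}")
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(char, left, bottom, right, top, page)


def to_crop_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a box to (left, upper, right, lower) with a top-left origin.

    Most imaging libraries measure y downward from the top of the image,
    so the vertical coordinates have to be flipped.
    """
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    box = parse_box_line("女 120 900 160 950 0")
    print(box)
    print(to_crop_rect(box, image_height=1000))
```

The rectangle returned by to_crop_rect is in the (left, upper, right, lower) order that an imaging library such as Pillow expects for cropping, so each character could be cut out into its own image from there. Using Pillow is an assumption on my part; any library with a top-left origin would work the same way.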

2/3/13 Update: I used tesseract to make a boxfile that outlines each character that it automatically detects. It did a great job of detecting where each of the characters is. A sample graphical representation of the boxfile results is shown below:

boxer_sample

This is important because the coordinates of each box will be used by a program that I will write to interpret the structure of a Korean JokBo record. Instead of the improperly ordered results shown above under “Resulting Text,” I’ll be able to order each character and word properly for indexing. People will then be able to search for a name and find out which page of a JokBo it appears on. The progress is good so far and shows that the concept of using Tesseract-OCR to index JokBo records could work.
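As a rough illustration of how box coordinates could be turned into a reading order, the sketch below groups boxes into vertical columns by their horizontal centers, then reads each column top-down. It is only a simplified starting point and ignores the harder parts of the JokBo layout (multiple horizontal bands, small annotation text, and so on). The function name reading_order, the tolerance of half the median box width, and the right_to_left switch are all my own assumptions, and the coordinates in the example are made up.

```python
from statistics import median
from typing import List, Tuple

# (char, left, bottom, right, top), with a bottom-left origin as in a box file.
BoxTuple = Tuple[str, int, int, int, int]


def center_x(box: BoxTuple) -> float:
    """Horizontal center of a box."""
    return (box[1] + box[3]) / 2


def reading_order(boxes: List[BoxTuple], right_to_left: bool = False) -> List[BoxTuple]:
    """Group boxes into vertical columns, then read each column top-down."""
    if not boxes:
        return []
    # Boxes whose centers are within half a typical character width
    # of a column's average center are treated as the same column.
    tolerance = median(b[3] - b[1] for b in boxes) / 2
    columns: List[List[BoxTuple]] = []
    for box in sorted(boxes, key=center_x):
        if columns:
            current = columns[-1]
            column_center = sum(center_x(b) for b in current) / len(current)
            if abs(center_x(box) - column_center) <= tolerance:
                current.append(box)
                continue
        columns.append([box])
    if right_to_left:
        columns.reverse()
    ordered: List[BoxTuple] = []
    for column in columns:
        # With a bottom-left origin, a larger "top" value is higher on the page.
        ordered.extend(sorted(column, key=lambda b: -b[4]))
    return ordered


if __name__ == "__main__":
    # Six characters from the end of the sample output, with invented coordinates.
    sample = [
        ("崔", 100, 860, 140, 900), ("崔", 160, 860, 200, 900),
        ("諴", 100, 810, 140, 850), ("燾", 160, 810, 200, 850),
        ("客", 100, 760, 140, 800), ("圭", 160, 760, 200, 800),
    ]
    print("".join(b[0] for b in reading_order(sample)))
```

Read as horizontal lines, these six boxes come out as 崔崔, 諴燾, 客圭, which is the kind of scrambling seen in the “Resulting Text” above. Read column by column, the characters that belong together stay together.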

4/1/13 Update: I now have some JokBo records in the proper format and have worked out a detailed plan for carrying out the initial tests. I’ll include a searchable database on this website that will refer people to specific pages of online records when they search for a name or word in a JokBo. The initial full-page indexing test is almost ready. It will be followed by several volumes and a searchable online database.
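To give a feel for what the searchable database might do, here is a small in-memory sketch that maps a searched name to the volume and page where it appears. It is not the planned implementation, just an illustration of the lookup. The PageIndex class, its method names, and the volume labels are hypothetical, and the page text is a few characters borrowed from the sample output.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

PageKey = Tuple[str, int]  # (volume label, page number)


class PageIndex:
    """Tiny in-memory index from characters to the pages that contain them."""

    def __init__(self) -> None:
        self._pages: Dict[PageKey, str] = {}
        self._by_char: Dict[str, Set[PageKey]] = defaultdict(set)

    def add_page(self, volume: str, page: int, text: str) -> None:
        """Store the recognized text of one page and index its characters."""
        key = (volume, page)
        self._pages[key] = text
        for ch in set(text):
            if not ch.isspace():
                self._by_char[ch].add(key)

    def search(self, query: str) -> List[PageKey]:
        """Return the pages whose text contains the query as a substring."""
        chars = [c for c in query if not c.isspace()]
        if not chars:
            return []
        # Narrow down to pages containing every character of the query,
        # then confirm the characters appear together and in order.
        candidates = set.intersection(*(self._by_char.get(c, set()) for c in chars))
        return sorted(k for k in candidates if query in self._pages[k])


if __name__ == "__main__":
    index = PageIndex()
    index.add_page("vol1", 12, "崔諴客 李東元")
    index.add_page("vol1", 13, "李東玩")
    print(index.search("李東"))
    print(index.search("崔諴客"))
```

A real version would need to store the text somewhere permanent and link each result to the online image of the page, but the basic idea of looking up a name and getting back a list of pages is the same.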

9 thoughts on “Indexing Project”

  1. Pingback: Korean Genealogy

  2. Hi,
    I served in Korea for my mission, and I would like to help index for those people. I didn’t learn a lot of Chinese on my mission, but is there some other way I can help?

    • I served with a Michael Nielsen in the Taejon Mission… 🙂

      I couldn’t find any official Korean indexing projects open yet, so I decided to create my own index by starting with OCR then manually verifying the results. That’ll require lots of proofreading help when it gets moving… and it’s a great way to learn HanJa characters.

  3. 우리 한국조상을 구원하기 위한 노력에 감사합니다.
    Thank you for your effort to save our Korean ancestors.
    Our family members are doing family history work with our own Jokbo. It is our great responsibility in this life. It is special for Korean people to have their own Jokbo. I know that Heavenly Father expects us, Korean people, to do the family history work with great enthusiasm.
    -Incheon stake Mansu ward-

      • I am researching my family line back to China, confirming the assumption that my ancestor left China as a dignitary to Korea and was given the name “Kay, Gae or Ke”. I found this information on Korean sites and am having trouble finding confirmation. Does anybody know where to direct me? Any help would be much appreciated, I am taking the initiative as nobody in my family seems to have the answers.

  4. This website was recently loaded in koreanlds.org. I am so excited to discover that somebody like you is working on getting 족보 indexed. I would be very grateful if all Korean 족보 were translated into Korean so that I could find my ancestors. I can’t read Chinese. Is there any progress in this work? Thank you.

    • Yes. The progress is slow, but I am working on training OCR for the HanJa fonts used in JokBo and writing programs to deal with the JokBo format when converting to HanJa and HanGul text formats.
