Generative AI for Chinese Studies Workshop (Introductory)¶
3-4:30 PM, 23 Feb 2024 K262 Bowie-Vernon Room, CGIS Knafel
Instructor: Kwok-leong Tang (Managing Director for Digital China Initiative, FCCS; Lecturer, EALC)
Introduction¶
Two workshops in Spring 2024¶
- Introductory workshop
- Advanced workshop (April 5, 2024)
  - Running LLMs locally
  - Using APIs to query LLMs
What has happened since April 2023?¶
In April 2023, the China Biographical Database Project (CBDB) and the Digital China Initiative (DCI) organized the first workshop on generative AI. The teaching materials for the workshop can be found on this GitHub repository:
The slides of the previous workshop included one titled “Chatbot limitations”, which listed the following:
- Cannot access internet;
- Limitations in input and tokens;
- Limitations in memorizing chat history;
- Restrictions in content;
- Time frame
Many of these constraints have been eased, to some degree, due to the following advancements:
- OpenAI (GPT-4-Turbo, GPTs, Sora, ChatGPT with memory)
- New tools and new players. You can find them at: https://www.futurepedia.io/
- A boom in open-source LLMs (Llama series, Mistral series, Qwen series, and many others)
- Running LLMs on local machines, even on laptops.
AI Assistants¶
Microsoft Copilot: Your everyday AI companion
Gemini - chat to supercharge your ideas
To compare the performance of different models, try Chatbot Arena:
Chat with Open Large Language Models
Basic Concepts about Generative AI¶
- Neither large language models (LLMs) nor machine learning was new in November 2022.
- From an individual user's perspective, the revolution brought by ChatGPT, and by generative AI at large, lies in the experience of interaction between humans and machines.
- We interact with machines in natural language instead of through a graphical user interface (GUI) or a command-line interface (CLI), which includes coding.
- GenAI affects scholars and students in Chinese Studies in many ways. I’d like to highlight two key impacts:
  - It reduces technological complexity, making it easier to adopt digital tools in research. Less coding, less training (of both humans and machines).
  - In digital research workflows, domain expertise has become more important because the output must be evaluated.
LLMs are “autocomplete”: they go beyond completing words or sentences and generate full paragraphs.¶
Try the following prompts in any AI assistant:
The capital of China is
The most beautiful city in China is
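The “autocomplete” idea can be sketched with a toy next-word predictor. This is only an illustration under strong assumptions: the three-sentence corpus is made up, and the model looks at nothing but the last word of the prompt, whereas a real LLM conditions on far more context.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus: str) -> dict:
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        follows[current][following] += 1
    return follows


def autocomplete(follows: dict, prompt: str, max_new_words: int = 1) -> str:
    """Extend the prompt by repeatedly appending the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1]) if words else None
        if not candidates:
            break  # the toy model has never seen this word, so it stops
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical "training corpus", invented for this illustration.
toy_corpus = (
    "The capital of China is Beijing . "
    "The capital of China is Beijing . "
    "The capital of Japan is Tokyo ."
)
model = train_bigrams(toy_corpus)
print(autocomplete(model, "The capital of China is"))
print(autocomplete(model, "The capital of Japan is"))
```

The second completion also ends in "Beijing", because the toy model only knows which word most often follows "is". The model completes text with whatever is statistically likely, whether or not it is true.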
Issues, concerns, and potential solutions¶
- Inconsistency: Given the inherent characteristics of generative AI, the outcomes it produces may lack consistency. Potential solutions include adjusting the prompts and fine-tuning the models. (These solutions will not be discussed in this context.)
- Hallucinations: Hallucination has been a common criticism of generative AI since November 2022. Potential solutions include raising token limits so that prompts can carry more context, adjusting prompts, and retrieval-augmented generation (RAG).
- Privacy and Security: There is a concern among researchers regarding the protection and confidentiality of their research data when utilizing commercial AI tools. One potential solution could be the local deployment of large language models.
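The retrieval step behind RAG can be sketched in a few lines. This is a minimal sketch under stated assumptions, not how any particular product implements it: passages are ranked by shared character bigrams (production systems typically use embeddings), and the function names and prompt wording are hypothetical.

```python
def char_bigrams(text: str) -> set:
    """Return the set of two-character sequences in the text, ignoring whitespace and case."""
    cleaned = "".join(text.lower().split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def retrieve(question: str, passages: list, top_k: int = 1) -> list:
    """Rank passages by how many character bigrams they share with the question."""
    q = char_bigrams(question)
    ranked = sorted(passages, key=lambda p: len(q & char_bigrams(p)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question: str, passages: list, top_k: int = 1) -> str:
    """Put the retrieved passages in front of the question (hypothetical wording)."""
    context = "\n".join(retrieve(question, passages, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


passages = [
    "Until the Soviet revolution brought Leninism to China after 1917, "
    "anarchists were the chief socialists on the scene.",
    "Anarchist writers quoted Kropotkin's dictum that the state had become "
    "the God of the present day.",
]
question = "Who were the chief socialists in China before 1917?"
print(build_rag_prompt(question, passages))
```

Because the model is asked to answer from the retrieved passage rather than from memory, a reader with domain expertise can check the answer against a known source.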
Create prompts (prompt engineering)¶
Markdown¶
Given that the prompts are primarily composed of text, Markdown serves as an excellent means to express formatting or styling within content that is purely text-based.
Presenting a table in markdown:
| journal title | issn | publisher |
| -------------------------- | --------- | ---------------------------------------------------------------------------- |
| *Journal of Asian Studies* | 0021-9118 | [Duke University Press](https://www.dukeupress.edu/journal-of-asian-studies) |
When it renders,
| journal title | issn | publisher |
|---|---|---|
| Journal of Asian Studies | 0021-9118 | Duke University Press |
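When the data already sits in a spreadsheet or a script, the Markdown table can be generated rather than typed by hand. The helper below is a hypothetical sketch (the function name is not from any library); it escapes pipe characters so that a cell cannot break the table.

```python
def to_markdown_table(headers: list, rows: list) -> str:
    """Render headers and rows as a pipe-delimited Markdown table."""

    def fmt(cells):
        # Escape literal pipes so a cell value cannot start a new column.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(headers), fmt("---" for _ in headers)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = to_markdown_table(
    ["journal title", "issn", "publisher"],
    [["*Journal of Asian Studies*", "0021-9118", "Duke University Press"]],
)
print(table)
```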
Prompt Components¶
- Role
- Purpose
- Instruction
- Context
- Exclusions
- Result
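The six components can be assembled into a prompt programmatically. The sketch below is one possible arrangement, not the workshop's prompt builder: the class name, the `##` section labels, and the reading of “Result” as the desired output format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The six prompt components; empty ones are left out of the prompt."""
    role: str = ""
    purpose: str = ""
    instruction: str = ""
    context: str = ""
    exclusions: str = ""
    result: str = ""  # assumed here to mean the desired output format

    def render(self) -> str:
        sections = [
            ("Role", self.role),
            ("Purpose", self.purpose),
            ("Instruction", self.instruction),
            ("Context", self.context),
            ("Exclusions", self.exclusions),
            ("Result", self.result),
        ]
        # Markdown headings keep the components visually separate for the model.
        return "\n\n".join(f"## {name}\n{text}" for name, text in sections if text)


parts = PromptParts(
    role="You are a professor writing an exam.",
    instruction="Formulate a single question that captures an important fact "
                "or insight from the context.",
    context="(paste the text chunk here)",
    exclusions="Restrict the question to the context information provided.",
)
print(parts.render())
```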
Prompt builder:
More on prompt engineering:
Prompt Engineering Guide – Nextra
Use Cases¶
In this part, we will explore some use cases of generative AI in Chinese Studies.
For language learning¶
Language learners can practice speaking and listening with AI.
- The ChatGPT applications for iOS and Android have the capability to interact with you through voice, eliminating the need for button presses. They can offer role-play games as a fun way to improve your listening and speaking skills. Please note that these features necessitate a paid subscription.
- If you do not subscribe to ChatGPT Plus, Microsoft Copilot offers a dictation feature. However, you might need to set the system’s default language to the one you wish to practice. Google Gemini offers similar functions.
Generating questions for book chunks¶
The idea for this scenario was inspired by Professor Peter Bol. The scenario involves providing a section of writing and asking an AI tool to generate the best questions that the section can answer. This is a way to summarize the excerpt.
Case reference: Project: Using Mixtral 8x7b Instruct v0.1 Q8 to Generate a Synthetic Dataset for LLM Finetuning : r/LocalLLaMA (reddit.com)
Prompt:
You are a professor writing an exam. Using the provided context: '{text_chunk}', formulate a single question that captures an important fact or insight from the context, e.g. 'Who was Mao Zedong?' or 'What is John Fairbank's argument?' or 'What was the civil service examination in the Qing dynasty?' or 'Who killed the last Ming emperor?' or 'Why did the Mongols defeat the Southern Song?' or 'Why did Christianity appeal to Confucian scholars during the late Ming?'. Restrict the question to the context information provided.
context: '{From the early 1900s Marxism in China was preceded by a widespread interest in anarchism. Until the Soviet revolution brought Leninism to China after 1917, anarchists were the chief socialists on the scene. Chinese students in both Paris and Tokyo were much attracted to Proudhon, Bakunin, and Kropotkin and their denunciation of all authority, beginning with governments, nations, militarism, and the family. Anarchist writers quoted Kropotkin’s dictum that the state had become the God of the present day. They eloquently put forward ideas of egalitarianism, especially emancipation of women from family bonds and of the peasantry from exploitation that would become part of the Chinese vocabulary of revolution. Anarchists wanted to rely not on the state but on individual liberation and its bloodless re-creation of the egalitarian community of the far past. Yet Peter Zarrow’s (1990) analysis of Chinese anarchist writings gives one a feeling that they indulged in utopian hopes that with one great leap they could somehow jump out of the Confucian straitjacket into complete freedom—a pathetically flawed ideal. No action but assassination ever eventuated. What could really be done?}'
Source of the passage: Fairbank, John King, and Merle Goldman. 2006. China: A New History. 2nd enl. ed. Cambridge, Mass: Belknap Press of Harvard University Press.
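To run this prompt over a whole book, the text first has to be cut into chunks that each fill the `{text_chunk}` slot. The sketch below is one simple way to do that, under stated assumptions: chunks are packed greedily at sentence boundaries, sentences are joined with a space (which suits English better than Chinese), and the 500-character default is arbitrary. The template is a shortened version of the prompt above.

```python
import re

# Shortened version of the workshop prompt; {text_chunk} is filled per chunk.
QUESTION_TEMPLATE = (
    "You are a professor writing an exam. Using the provided context: "
    "'{text_chunk}', formulate a single question that captures an important "
    "fact or insight from the context. Restrict the question to the context "
    "information provided."
)

# A sentence is a run of non-terminators plus any trailing Latin or CJK terminators.
SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def chunk_text(text: str, max_chars: int = 500) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in SENTENCE.findall(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompts(text: str, max_chars: int = 500) -> list:
    """Return one filled-in prompt per chunk."""
    return [QUESTION_TEMPLATE.format(text_chunk=c) for c in chunk_text(text, max_chars)]


sample = (
    "Anarchists were the chief socialists. "
    "No action but assassination ever eventuated. "
    "What could really be done?"
)
for prompt in build_prompts(sample, max_chars=60):
    print(prompt, end="\n\n")
```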
- Result
![Untitled](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%201.png)
- Experiment
context:{从历史方面说,我们可以确定以下几项事实:一、以前学者只能推测戴章初识在乾隆三十一年丙戌(1766),但现在我们则确切地知道这件事发生在是年的春夏之交,而且是实斋主动地到休宁会馆去正式拜访东原。据(东原年谱),东原是年入都会试不第,居新安(即休宁)会馆,与实斋所言正合,可以无疑。二、吴孝琳和倪文孙都猜测实斋初识东原是因为朱的关系。这个猜想显然错了。戴、章之同的介绍人并非朱筠,而是郑诚斋(名虎文,1714—1784)。三、自来研究实斋思想发展者,根据〈与族孙汝楠论学书》,颇以为戴、章初见,实斋所受的影响乃在于东原所坚持的考证观点。今据〈答邵二云书》,可知东原给予实裔最初同时也是最深的印象实在于他在义理方面的成就,如《原善》等哲学作品。此点关系最大,下文将另有讨论。}
- Result
![Untitled](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%202.png)
The passage has been processed through OCR from Yu Ying-Shih’s work, 論戴震與章學誠.
Formatting citations and references¶
Formatting citations and references to align with the style guidelines of publishers is a task that often consumes a significant amount of our time. In this particular scenario, we will be tailoring our citations to adhere to the style guide of JAS.
Journal of Asian Studies Style Guide: https://dukeupress.edu/Assets/Downloads/JAS_style_guide.pdf
Prompt
Please help me to prepare a list of references. The general rules are based on *The Chicago Manual of Style*, 17th ed. For references to works published in Chinese, Japanese, Korean, or any other Asian script, please provide titles in the following format: Transcription <original characters> [English translation]. Translations of journal or periodical names are not required. Here are two examples:
Huang Shang 黃裳. 1947. *Guanyu Meiguo bing* 關於美國兵 [About American soldiers]. Shanghai: Shanghai chuban gongsi.
Peng Su 彭蘇. 2006. “Beida ‘fanMei’ nüsheng Ma Nan de zhenshi jinkuang” 北大“反美”女生馬楠的真實近況 [Updates on Ma Nan, “anti-American” female student from Peking University]. *Nanfang renwu zhoukan* 南方人物週刊 9:38–39.
Please confirm your understanding of the task. Once confirmed, I will proceed to provide you with the references for formatting.
Once the AI confirms, you can feed the following references to it.
石崎又造:《近世における支那俗語文学史》(東京:清水弘文堂書房,1967年)
林俊宏:《朱舜水在日本的活動及其貢獻研究》(臺北:秀威資訊科技股份有限公司,2004年)
湯沢質幸:《近世儒学韻学と唐音 —訓読の中の唐音直読の軌跡》(東京:勉誠出版,2014年)
劉元卿:《劉聘君全集》,《四庫全書存目叢書》(濟南:齊魯書社,1997年)影印清咸豐二年(1852)重刻本
余英時:《朱熹的歷史世界》(北京:生活 • 讀書 • 新知三聯書店,2011年)
- Results
![Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%203.png)
Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)
Highlights:
- Accuracy of Japanese transliteration? 湯沢質幸’s page on KAKEN: KAKEN — 研究者をさがす | 湯沢 質幸 (90007162) (nii.ac.jp)
- The transliterations in Chinese are remarkably precise, and the “words” are properly segmented.
- 生活 • 讀書 • 新知三聯書店 is correctly translated into SDX Joint Publishing Company.
Extracting information and tagging¶
Leveraging generative AI for data extraction has become a widely embraced approach within the field of Chinese Studies. This section will start by discussing a case introduced in April 2023. Following that, we will delve into the methods of utilizing AI for data tagging.
Extracting data¶
The case of extracting data from deeds in our previous workshop:
Taiwanese deeds were obtained from:
Prompt
You will extract information from deeds. You will be given a few examples, then you should follow the same output format to extract the information from other deeds. The output is a table with five columns:
Title (標題): the title of the deed;
People/Organizations (人名/團體): people and organizations, and their roles, mentioned in the deed;
Location (地點);
Contractor (立契者);
Date of Contract (立契時間).
Here are the examples:
Deed A:
立承佃鄭復禮,今在李德榮處承出根靣全民田壹號,坐產洋下洋,圡名大頭坵,現丈弍畝伍分弍厘捌毛正,承來耕作,靣约遞年認纳乾凈上籠租谷壹仟觔正,早六晚四送還理清,不得包箂,不得少欠,如有少欠,係保代賠,其田即听李家召佃別耕,且鄭家復禮不得阻留之理。今欲有慿,立承佃字壹帋為照。
光緒叁拾年弍月初七日立承佃字鄭復禮
保佃:胞兄鄭復奎
信字
Result:
| 標題 | 人名/團體 | 地點 | 立契者 | 立契時間 |
| ------------ | ------------------------------------------ | ------------- | ------ | ------------------ |
| 立承佃鄭復禮 | 鄭復禮(立據人);李德榮(給墾人);鄭復奎(保人) | 洋下洋;大頭坵 | 鄭復禮 | 光緒三十年二月七日 |
Deed B:
立承佃高仁旋,今在嶺頂村李德荣處承出民田一號,坐產洋下洋,圡名三夷,經丈叁畝弍分零,面約每年納出上籠乾淨租谷壹仟壹百觔止,早六晚四理还清椘,不敢少欠,如有少欠,係保代賠,恐口無慿,立承佃為據。
光緒叁拾叁年弍月初三日立承佃高仁旋
保佃:胞姪高居鏘
信字
Result:
| 標題 | 人名/團體 | 地點 | 立契者 | 立契時間 |
| -------------------------------- | ------------------------------------------ | ------------------ | ------ | -------------------- |
| 光緒三十三年二月三日立承佃高仁旋 | 高仁旋(立據人);李德荣(給墾人);高居鏘(保人) | 嶺頂村;洋下洋;三夷 | 高仁旋 | 光緒三十三年二月三日 |
You do not generate deeds yourself. Once you understand the task, you can tell me to give you the deeds for extraction.
Using the same prompt and examples, can we correctly extract the data from deeds from Cixi, another location? Deeds source: 章均立. 2018. 慈溪契約文書. 寧波出版社.
139.光绪七年(1881)十一月 鸣鹤场芦东管 方林氏等卖地契
【录文】
立永远杜绝卖契方林氏、同卖子春潮,今因正用,情愿挽中,将故夫遗下芦东管课地壹爿,计地叁亩正,坐落下宝山前,土名大龙湾,东至王姓地合港,南至胡姓地沟各半,西至小路,北至河外路为界,四址坐落分明,凭中出永绝卖与胡冠三兄处为永业。三面议明,当受时值价直(值)银念四两,其银当日一并收足,归身正用。是永卖之后,任凭受主管业布种,开割过户,入册输粮无阻。其地并无上下等争执,亦无重行典押在外。如有诸般违碍等情,卖主自行理直,不涉受主之事。此系两愿,各无异言。恐后无凭,立此杜绝卖契,永远存照行。
尾未请到,随后补粘。
光绪七年十一月× 日 立永远杜绝卖契 方林氏(押)
同卖 子 春潮(押)
见卖 叔 正表(押)
---
140. 光绪七年(1881)十二月 鸣鹤场芦东管 方林氏等卖地找契
【录文】
立永远杜绝找叹契方林氏、同找子春潮,今又乏用,仍挽原中,将前月间曾永卖得芦东管课地壹爿,计地叁亩,土名坐落下宝山前大龙湾,坐落四址俱载明正契内。因前价不足,仍挽原中,出绝找与 × × 处为永业。三面议定,当受时值价找银××两,其银当日收足,归身正用。是找之后,永无再找、再叹、异言。恐后无凭,立此杜绝找叹契,永远存照行。
内注:“前”字一个,并照行。
又注:“东”字一个,并照行。
光绪七年十二月 × 日 立永远杜绝找叹契 方林氏(押)
同找子 春潮(押)
见卖 叔正表(押)
中
陈声球
代
岑积大
找契大吉行
---
106.咸丰元年(1851)十二月 慈溪县新升管 林源陶卖地契
【录文】
立永远杜绝卖契林源陶,今因缺银正用,情愿浼中,将祖遗下分授自已叶家路弟(第)六节,坐落土名三塘下新升地二沟,计地捌亩正,东至四塘,南至受主河心,西至横头地,北至学圣公地河心为界,四址坐落分明,凭中出永绝卖与曹咸福处为业。三面议定,时值价银 ××两文正,其银当日交足,归身正用。其地并无房分上下人等争执,如有违碍等情,卖主自行理直,不涉银主之事。其地自卖之后,任从银主管业布种,开割过户,入册输粮无阻,永无异言。恐后无凭,立此永远杜绝卖契,存照行。
咸丰元年十二月 ×日 立永远杜绝卖契 林源陶(押)
中人 性陶(押)
代字 林泉(押)
永绝卖契大吉行
- Result
![Untitled](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%204.png)
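Replies in the Markdown table format requested by the extraction prompt can be read back into structured records and spot-checked against the deed. The sketch below assumes the reply is a pipe-delimited table like the examples in the prompt; the function name is hypothetical, and the check only confirms that the extracted contractor occurs verbatim in the source text.

```python
def parse_markdown_table(text: str) -> list:
    """Turn a pipe-delimited Markdown table into a list of dicts keyed by header."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set("-: ") for c in cells):
            continue  # skip the separator row of dashes
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# The example result for Deed A, as given in the prompt.
reply = """
| 標題 | 人名/團體 | 地點 | 立契者 | 立契時間 |
| --- | --- | --- | --- | --- |
| 立承佃鄭復禮 | 鄭復禮(立據人);李德榮(給墾人);鄭復奎(保人) | 洋下洋;大頭坵 | 鄭復禮 | 光緒三十年二月七日 |
"""
deed_text = "立承佃鄭復禮,今在李德榮處承出根靣全民田壹號,坐產洋下洋,圡名大頭坵"

records = parse_markdown_table(reply)
for record in records:
    # A contractor that never appears in the deed is a likely hallucination.
    print(record["立契者"], record["立契者"] in deed_text)
```

Dates are harder to verify this way, since the model normalizes numerals (for example 叁拾 to 三十), so they still need a human reader.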
Discern the color references within classical poetry¶
In this scenario, our aim is to discern the color references within classical poetry and investigate methods for generating visual representations.
Credits to Wenfei Wang and Wan-Chun Chiu for helping me to prepare some of the materials.
Prompt 1:
Please analyze the colors mentioned in the given poetry sentence and provide an explanation for your reasoning:
漆灰骨末丹水砂,凄凄古血生铜花。
- Result
![Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%205.png)
Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)
Prompt 2:
Please analyze the colors mentioned in the given poetry sentence and return the words with color implications in <word> (HEX Color code).
漆灰骨末丹水砂,凄凄古血生铜花。
- Result
![Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled%206.png)
Result from HUIT AI Sandbox (Model: GPT-4-Turbo preview)
![Generated by ChatGPT. Credit: Wan-Chun Chiu ](Generative%20AI%20for%20Chinese%20Studies%20Workshop%20(Introd%207bfd97862edf4a2ab90ab912283254ff/Untitled.jpeg)
Generated by ChatGPT. Credit: Wan-Chun Chiu
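A reply in the `<word> (HEX Color code)` format from Prompt 2 can be parsed and converted into RGB values for plotting or tabulation. This is a sketch under stated assumptions: the reply puts each word before a six-digit HEX code in parentheses, and the sample reply below uses made-up placeholder colors, not actual model output.

```python
import re

# A word, optional whitespace, then a six-digit HEX code in ASCII or full-width parentheses.
COLOR_ITEM = re.compile(r"(?P<word>\S+?)\s*[((]\s*(?P<hex>#[0-9A-Fa-f]{6})\s*[))]")


def parse_colors(reply: str) -> list:
    """Return (word, HEX code) pairs found in the reply."""
    return [(m["word"], m["hex"].upper()) for m in COLOR_ITEM.finditer(reply)]


def hex_to_rgb(code: str) -> tuple:
    """Convert '#RRGGBB' into an (R, G, B) tuple of integers."""
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))


def colors_to_table(pairs: list) -> str:
    """Present the pairs as a Markdown table with an extra RGB column."""
    lines = ["| word | hex | rgb |", "| --- | --- | --- |"]
    for word, code in pairs:
        r, g, b = hex_to_rgb(code)
        lines.append(f"| {word} | {code} | {r}, {g}, {b} |")
    return "\n".join(lines)


# Placeholder reply for illustration only; the colors are NOT from a model.
sample_reply = "丹 (#FF0000)\n铜花 (#00FF00)"
print(colors_to_table(parse_colors(sample_reply)))
```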
Exercise: Try to get the color values from the following lines. Present the results as a table.
谁遣虞卿裁道帔,轻绡一匹染朝霞。
剑如霜兮胆如铁,出燕城兮望秦月。
我有辞乡剑,玉锋堪截云。
竹马梢梢摇绿尾,银鸾睒光踏半臂。