GenAI for Databases

Digital China Initiative, Harvard University

2025-02-21

Digital China Initiative

  • An initiative of Harvard University to prompt digital tools and methods for Chinese Studies.
  • PIs: Prof. Peter Bol and Prof. Michael Szonyi
  • Provide support to Harvard faculty, students, and researchers in Chinese Studies.

The GenAI Turn

  • In 2022, the release of ChatGPT (and other Large Language Models, LLMs) has brought generative AI turn to the usage of digital tools and methods in Chinese Studies.
  • In April 2023, CBDB and DCI held the first GenAI workshop for Chinese Studies and humanties at Harvard University.
  • We decided that our future projects should incline to the GenAI approach.
  • DCI has hosted a series of GenAI workshops for Chinese Studies since 2023.

Projects Involving GenAI

Nehru Papers

  • This is a NEH project that led by Tansen Sen (NYU), Gal Gvili (McGill), and Arunabh Ghosh (Harvard).
  • It is a project supported by Digital China Initiative.
  • The project has to digitize documents collected from the Indian archives which are related to Sino-Indian relations during the 60s and 70s.
  • In this stage, we are focusing on creating a catalog for the documents (~50K+?).

Nehru Papers

  • Most of the documents are in English, a few are in Indian languages and Chinese.
  • Tech stacks
    • OCR: done at NYU Shanghai
    • Data extraction: Self-hosted n8n + OpenAI models
    • Data storage and collaboration: Self-hosted Nocodb

Data Extraction Workflow

Reviewing First Draft

Collaborating

Standardization

Next Steps

  • The current experiment are limited to 3000 English documents.
  • We hope that we will have a few dictionaries (code tables) for entities after expert review.
  • Building an API for evaluating and standardizing the extracted entities.
  • Import the data into two databases: the NYU Shanghai China-India database and the Woodrow Wilson Center database.

Digital China Worldwide Chatbot

Digital China Worldwide Chatbot

  • Tech stacks (current version)
    • Qdrant for vector database
    • Chanlit for interface
    • OpenAI gpt-4o-mini model
    • Nocodb for data storage and collaboration
  • Tech stacks (future version)
    • Qdrant for vector database
    • Flowise for chat workflow
    • n8n for api and interface (integrate with Slack and other tools)
    • Nocodb for data storage and collaboration
    • Scraping machanism

Monitoring GenAI Applications

  • LiteLLM: https://litellm.ai/
    • Monitoring the usage of LLM APIs.
    • Creating API keys for different projects, people, and apps.
  • Langfuse: https://langfuse.com/
    • Monitoring GenAI applications.
    • Recording tokens usage, model cost and other metrics.

Challenges

  • The cost for building a GenAI apps are different from “traditional” databases and platforms. Every actions (e.g., query, update, delete) will cost tokens/money.
  • Building and maintaining approaches. For example, fine-tune models vs. RAG or Agentic?
  • Rapid development of all aspect of GenAI. How to build in a flexible way?