SSP Forum: Harun Khan and Merve Tekgurler (M.S. Candidates)
Room 126
(See description for Notes on Entry)
The
Symbolic Systems Forum(
community sessions of SYMSYS 280 - Symbolic Systems Research Seminar)
presents
Formalizing the Theory of Bit-vectors for Integration of SMT solvers and Theorem Provers
Harun Khan (M.S. Candidate)
Symbolic Systems Program
Semantics of Empire: Machine Translation of Ottoman Turkish into English
Merve Tekgurler (M.S. Candidate)
Symbolic Systems Program
Monday, May 6, 2024
12;30-1:20 pm PT
Margaret Jacks Hall (Bldg. 460), Room 126
In-person event, not recorded
(see below for entry instructions if you are not an active Stanford affiliate)
Note: Lunch is provided, if pre-ordered, only for members of SYMSYS 280, but others are welcome to bring a lunch and eat during the presentation.
ABSTRACTS:
(1) Harun Khan, "Formalizing the Theory of Bit-vectors for Integration of SMT solvers and Theorem Provers" (Primary Advisor: Clark Barrett, Computer Science)
To integrate the Lean Theorem Prover and cvc5, an SMT solver, we formalize an efficient representation of bit-vectors in Lean. We additionally formalize all bit-blasting theorem and many bit-vector rewrite rules used by cvc5. We will discuss implementation challenges we faced to achieve the integration for the theory of bit-vectors.
(2) Merve Tekgurler, "Semantics of Empire: Machine Translation of Ottoman Turkish into English" (Primary Advisor: Mark Algee-Hewitt, English)
Semantics of Empire will enhance accessibility of Ottoman Turkish sources. We are building a first-pass translator. Our translator will facilitate the integration of non-English historical texts into teaching and research, thereby democratizing access to diverse historical accounts. The project consists of two parts: training a neural machine translation (NMT) model and crowdsourcing a translation dataset from scholars in Ottoman studies. For the NMT training, we fine-tuned a state-of-the-art Turkish to English translation model on a dataset of 50k parallel sentences. We created this dataset by computationally aligning sentences from Turkish novels with their English translations. Our preliminary results show that this dataset improved the model performance by 5 chr-F points and 1 BLEU point, which are common metrics in MT evaluation. Concurrently, we tested the performance of our model and that of several Large Language Models (LLMs) on Ottoman Turkish manuscripts. We identified a concern regarding the use of Google’s multimodal model Gemini in translation tasks. Using Reinforcement Learning from Human Feedback (RLHF), Gemini was programmed to refuse to output texts that contain harmful language, such as depictions of warfare or violence. Historical materials complicate these processes, raising critical questions about the criteria used to classify language as harmful. In the translation tests, GPT-4 had the highest BLEU and chr-F scores. Motivated by these results, we built a website that uses the GPT-4 API to translate Ottoman Turkish into English and collect more sentence examples. We shared this website with Ottomanists, who were asked to input sentences from their own research to be translated by GPT-4 and offer corrections and improvements. Our findings illustrate how we can align models with the needs of underresourced language communities and how AI research can benefit from interdisciplinary scholarship.
NOTES ON ENTRY TO THE MEETING ROOM:
Entry to the building is open to anyone with an active Stanford ID via the card readers next to each door. If you do not have a Stanford ID, you can gain entry between 12:15 and 12:30pm ONLY by knocking on the exterior windows of room 126. These windows are to the left of the west side exterior door on the first floor of Margaret Jacks Hall, which faces the back east side of Building 420. Please do not knock on these windows after 12:30pm when the talk has started. We will not be able to come out and open the door for you at that point.