In November, the paper "FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction" and its accompanying dataset were published in Findings of the Association for Computational Linguistics: EMNLP 2025. The paper describes the assembly and release of FicSim, a dataset of long-form, recently written fiction. It includes scores along 12 axes of similarity that are informed by author-produced metadata and validated by digital humanities scholars for use in evaluating the usefulness of language models for computational literary studies. Maria-Emil Deal, who graduated from the MLIS program in May 2025, took part in evaluating the dataset and writing the paper alongside co-authors Natasha Johnson, Amanda Bertsch, and Emma Strubell of the Language Technologies Institute at Carnegie Mellon University. Deal provided subject matter expertise and information on the use of user-generated tags in information organization and cataloging.
Johnson, N., Bertsch, A., Deal, M.-E., & Strubell, E. (2025). FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction. Findings of the Association for Computational Linguistics: EMNLP 2025, 25228–25246. https://aclanthology.org/2025.findings-emnlp.1375/