Computational Thematics

dc.bibliographicCitation.issue: 1
dc.bibliographicCitation.volume: 11
dc.contributor.author: Sobchuk, Oleg
dc.contributor.author: Šeļa, Artjoms
dc.date.accessioned: 2025-03-14T10:30:45Z
dc.date.available: 2025-03-14T10:30:45Z
dc.date.issued: 2024
dc.date.updated: 2025-01-28T05:41:05Z
dc.description.abstract: What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call “computational thematics”. These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options. Every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the “ground truth” genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
dc.identifier.doi: 10.1057/s41599-024-02933-6
dc.identifier.uri: http://resolver.sub.uni-goettingen.de/purl?fidaac-11858/3440
dc.language.iso: eng
dc.relation.journal: Humanities and Social Sciences Communications
dc.rights: L::CC BY 4.0
dc.subject.ddc: ddc:800
dc.subject.field: literarystudies
dc.subject.field: digitalhumanities
dc.title: Computational Thematics
dc.title.alternative: Comparing Algorithms for Clustering the Genres of Literary Fiction
dc.type: article
dc.type.version: publishedVersion
dspace.entity.type: Publication

Files

Original bundle
Name: 41599_2024_Article_2933.pdf
Size: 2.11 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 5.84 KB
Format: Item-specific license agreed upon to submission