GOT: Generalization over Taxonomies, a Software Toolkit for Content Analysis with Taxonomies
GOT is a Python 3 software toolkit for content analysis of text collections using domain taxonomies. The toolkit's structure follows a hybrid methodology developed in recent research. The effectiveness of this methodology was illustrated in an analysis of research tendencies in Data Science: the findings yielded insights into those tendencies that could not be derived with more conventional techniques. The methodology takes a collection of texts and a domain taxonomy as input. It comprises three steps: (1) computing matrices of relevance between texts and taxonomy leaf concepts using a purely structural string-to-text relevance measure based on suffix trees that represent the texts and are annotated with string frequencies; (2) finding fuzzy clusters of taxonomy leaf topics using an in-house method involving both additive and spectral properties; and (3) finding the most specific generalizations of the fuzzy clusters in the rooted tree of the taxonomy. Such a generalization parsimoniously lifts a cluster to its 'head subject' in the higher ranks of the taxonomy so as to cover the cluster tightly, up to a few errors, namely 'gaps' and/or 'offshoots'. Users may run the implementation of the whole methodology or use its individual modules, including a visualization module. The GOT toolkit supports two usage scenarios: (a) a console mode, for use from the command line, and (b) an import mode, for use in Python 3 source code.
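
Step (3) can be illustrated with a minimal sketch. It is a simplified, crisp (non-fuzzy) variant of parsimonious lifting and is not the GOT implementation or its API: the function name `lift`, the `Lift` record, the toy taxonomy, and the three penalty values are all assumptions made for illustration. At each taxonomy node the sketch compares two scenarios, placing a head subject at the node (paying for the head and for every cluster-free subtree beneath it, the 'gaps') or passing the decision down to the children (where an isolated cluster leaf becomes an 'offshoot'), and keeps the cheaper one.

```python
from dataclasses import dataclass

# Illustrative penalties (assumed values, not taken from GOT).
HEAD, GAP, OFFSHOOT = 1.0, 0.4, 0.9


@dataclass
class Lift:
    heads: list     # nodes chosen as head subjects or offshoots
    penalty: float  # total penalty of the chosen scenario
    gaps: int       # number of maximal cluster-free subtrees below
    covered: bool   # True if the subtree contains a cluster leaf


def lift(node, children, cluster):
    """Bottom-up choice between 'head here' and 'decide in children'."""
    kids = children.get(node, [])
    if not kids:  # leaf of the taxonomy
        if node in cluster:
            return Lift([node], OFFSHOOT, 0, True)
        return Lift([], 0.0, 1, False)

    parts = [lift(k, children, cluster) for k in kids]
    if not any(p.covered for p in parts):
        # The whole subtree misses the cluster: one maximal gap.
        return Lift([], 0.0, 1, False)

    gaps = sum(p.gaps for p in parts)
    split_penalty = sum(p.penalty for p in parts)
    lift_penalty = HEAD + GAP * gaps
    if lift_penalty < split_penalty:
        return Lift([node], lift_penalty, gaps, True)
    heads = [h for p in parts for h in p.heads]
    return Lift(heads, split_penalty, gaps, True)


# Toy taxonomy (hypothetical): two branches with four leaves each.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
}
CLUSTER = {"a1", "a2", "a3", "b1"}

result = lift("root", CHILDREN, CLUSTER)
print(result.heads)  # ['A', 'b1']
```

With these assumed penalties the cluster is lifted to the head subject `A`, with `a4` as a gap, while `b1` remains an offshoot: covering branch `B` or the root would cost more in gaps than it saves. Changing the penalty values shifts this trade-off between fewer head subjects and fewer gaps.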