CRL and ITAL

Eric Lease Morgan; converted into TEI-conformant markup by Eric Lease Morgan. Eric Lease Morgan, © University of Notre Dame

emorgan@nd.edu

Available through the Distant Reader at . 20

This document is distributed under a GNU Public License.

This is some preliminary and rudimentary analysis done against a carrel of CRL and ITAL content.

This posting was originally shared on the Code4Lib Slack channel (November 2, 2022).

2022-11-14 readings 2022-11-14 Eric Lease Morgan initial TEI encoding

I've begun reading the whole of two library-related journals: 1) College & Research Libraries (CRL), and 2) Information Technology and Libraries (ITAL), and this blurb outlines what I've learned so far.

I began by exploiting OAI-PMH to download the whole of the journals. The corpus includes about 6,100 articles and is 26 million words long. [1] CRL dates from 1938, and ITAL from 1968. [2]
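For readers unfamiliar with OAI-PMH, the harvesting step might look something like the following. This is only a minimal sketch, not the script actually used: the `base_url` argument is a placeholder for whatever endpoint a journal exposes, the function names are made up for illustration, and the sketch collects Dublin Core metadata only. Getting the full text would mean following the links found in those records.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "date": record.findtext(".//dc:date", default="", namespaces=NS),
        })
    # An empty or missing resumptionToken means the list is complete
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield every record from a repository (base_url is an assumption)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            records, token = parse_list_records(response.read())
        yield from records
        if token is None:
            break
        # Later requests carry only the verb and the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Looping on the resumption token is what makes it possible to page through thousands of records, since repositories typically return only a batch at a time.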

I then did some topic modeling against the corpus, and for grins, I limited the number of topics to eight. This resulted in the following themes:

themes    weights  features
books     0.43275  books use materials work subject collections s...
american  0.27160  american history books index bibliography refe...
academic  0.26037  academic work true false management change kno...
faculty   0.21954  faculty state staff academic committee status ...
catalog   0.15720  catalog data records use systems used search s...
students  0.14023  students academic study student instruction fa...
journals  0.11281  journals study use science articles data acade...
data      0.07789  data digital web users technology search conte...

'Looks about right, if you ask me. I then visualized the results using the ubiquitous pie chart, and again, it looks pretty much like what I expected.
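Since the weights listed above do not sum to one, a pie chart has to rescale them into shares of the whole. A small sketch of that arithmetic follows. The weights are copied from the table, but the function name `pie_shares` is made up for illustration, and the plotting itself is left out.

```python
# Theme weights as reported in the table above
WEIGHTS = {
    "books": 0.43275,
    "american": 0.27160,
    "academic": 0.26037,
    "faculty": 0.21954,
    "catalog": 0.15720,
    "students": 0.14023,
    "journals": 0.11281,
    "data": 0.07789,
}


def pie_shares(weights):
    """Rescale weights so they sum to 1, i.e. the size of each pie slice."""
    total = sum(weights.values())
    return {theme: weight / total for theme, weight in weights.items()}


# Print each theme with its slice of the pie as a percentage
for theme, share in pie_shares(WEIGHTS).items():
    print(f"{theme:<10}{share:6.1%}")
```

Rescaled this way, "books" takes up roughly a quarter of the pie, and "data" a bit under five percent.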

I then augmented the underlying model with date values, pivoted the table, and visualized the result as a stacked area chart. From the results you can see that the theme of "books" was very prevalent until 1967. Then, starting around 2006, the theme of "data" became much more prominent.
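The pivoting step can be sketched as below. This is an illustration under assumptions, not the code behind the chart: it supposes the model has been flattened into `(year, theme, weight)` rows, one per article and theme, and the name `pivot_by_year` is hypothetical.

```python
from collections import defaultdict


def pivot_by_year(rows, themes):
    """Turn (year, theme, weight) rows into {year: {theme: share}}.

    Each year's weights are rescaled to sum to 1, so the themes stack
    to a constant height in an area chart.
    """
    totals = defaultdict(lambda: dict.fromkeys(themes, 0.0))
    for year, theme, weight in rows:
        totals[year][theme] += weight

    table = {}
    for year in sorted(totals):
        # Guard against a year whose weights are all zero
        year_sum = sum(totals[year].values()) or 1.0
        table[year] = {t: totals[year][t] / year_sum for t in themes}
    return table
```

Each theme then becomes one band of the chart: the sorted years go on the x-axis, and each theme's column of shares can be handed to a stacked-area routine such as matplotlib's `stackplot`.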

Since my corpus is relatively large, and since I iterated the modeling process relatively few times, these results ought to be considered preliminary. Still, it looks about right to me.
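To give a feel for what "iterating the modeling process" means, here is a toy topic modeler written as a collapsed Gibbs sampler for LDA. It is a swapped-in illustration, not the tool used for the results above. The names `lda_gibbs` and `summarize`, and the `alpha` and `beta` defaults, are assumptions, and the weights it reports are simple token shares, which need not match the weighting in the table.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=8, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per doc/topic
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, z in zip(doc, topics):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1

    # Each pass resamples every token's topic given all the other tokens
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def summarize(doc_topic, topic_word, top_n=5):
    """Return (weight, top words) per topic, heaviest topic first."""
    total = sum(map(sum, doc_topic))
    rows = []
    for k, words in enumerate(topic_word):
        weight = sum(row[k] for row in doc_topic) / total
        features = [w for w, count in words.most_common(top_n) if count > 0]
        rows.append((weight, features))
    return sorted(rows, reverse=True)
```

The sampler starts from random assignments and only gradually settles, so with a small `iterations` value the topics can still shift from run to run. That is why results from few iterations are best read as preliminary.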

Fun with comprehensive, full text collections.

[1] To put this into context, Moby Dick is about 0.25 million words long.

[2] For additional descriptive statistics about the collection, see: https://distantreader.org/stacks/carrels/crl-ital/index.htm