<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.infomotions.com/alex/dtd/tei2.dtd" [
<!ENTITY % TEI.XML         'INCLUDE' >
<!ENTITY % TEI.prose       'INCLUDE' >
<!ENTITY % TEI.linking     'INCLUDE' >
<!ENTITY % TEI.figures     'INCLUDE' >
<!ENTITY % TEI.names.dates 'INCLUDE' >
<!ATTLIST xptr   url CDATA #IMPLIED >
<!ATTLIST xref   url CDATA #IMPLIED >
<!ATTLIST figure url CDATA #IMPLIED >
]> 
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>CRL and ITAL</title> 
        <author>Eric Lease Morgan</author>
        <respStmt>
          <resp>converted into TEI-conformant markup by</resp>
          <name>Eric Lease Morgan</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Eric Lease Morgan, &#169; University of Notre Dame</publisher>
        <address>
        	<addrLine>emorgan@nd.edu</addrLine>
        </address>
        <distributor>Available through the Distant Reader at <xptr url='https://distantreader.org/blog/crl-ital/' />.</distributor>
        <idno type='reader'>20</idno>
        <availability status='free'>
          <p>This document is distributed under the GNU General Public License.</p>
        </availability>
      </publicationStmt>
      <notesStmt>
       <note type='abstract'>This is some preliminary and rudimentary analysis done against a carrel of CRL and ITAL content.</note>
      </notesStmt>
      <sourceDesc>
        <p>This posting was originally shared on the Code4Lib Slack channel (November 2, 2022).</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date>2022-11-14</date>
      </creation>
      <textClass>
        <keywords>
          <list><item>readings</item></list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
        <date>2022-11-14</date>
        <respStmt>
          <name>Eric Lease Morgan</name>
        </respStmt>
        <item>initial TEI encoding</item>
      </change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
    </front>
    <body>
      <div1>
<p>
I've begun reading the whole of two library-related journals: 1) College &amp; Research Libraries (CRL), and 2) Information Technology and Libraries (ITAL). This blurb outlines what I've learned so far.
</p>

<p>
I began by exploiting OAI-PMH to download the whole of the journals. The corpus includes about 6,100 articles and is 26 million words long. [1] CRL dates from 1938, and ITAL from 1968. [2]
</p>
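
<p>
An OAI-PMH harvest of this kind can be sketched as follows. This is a minimal illustration and not the code used for this posting. The base URL is a placeholder, and the function names (list_records_url, parse_page, harvest) and the injected fetch callable are hypothetical. The sketch relies only on the protocol's ListRecords verb and its resumptionToken paging.
</p>

<p rend='pre'>
```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace used by every element of an OAI-PMH 2.0 response.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for one page of results."""
    if token:
        # A resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumptionToken or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [e.text for e in root.iter(OAI_NS + "identifier")]
    token = root.find(".//" + OAI_NS + "resumptionToken")
    next_token = token.text.strip() if token is not None and token.text else None
    return identifiers, next_token


def harvest(base_url, fetch):
    """Yield every record identifier, following resumption tokens page by page.

    fetch is any callable that takes a URL and returns the response body as text
    (an assumption of this sketch, so the network layer can be swapped out).
    """
    token = None
    while True:
        identifiers, token = parse_page(fetch(list_records_url(base_url, token)))
        yield from identifiers
        if not token:
            break
```
</p>

<p>
Passing fetch in as an argument keeps the paging logic separate from the network layer, so the loop can be exercised against saved responses before it is pointed at a journal's real OAI-PMH endpoint.
</p>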

<p>
I then did some topic modeling against the corpus, and for grins, I limited the number of topics to eight. This resulted in the following themes:
</p>

<p rend='pre'>     themes  weights                                           features
      books  0.43275  books use materials work subject collections s...
   american  0.27160  american history books index bibliography refe...
   academic  0.26037  academic work true false management change kno...
    faculty  0.21954  faculty state staff academic committee status ...
    catalog  0.15720  catalog data records use systems used search s...
   students  0.14023  students academic study student instruction fa...
   journals  0.11281  journals study use science articles data acade...
       data  0.07789  data digital web users technology search conte...</p>

<p>
Looks about right, if you ask me. I then visualized the results using the ubiquitous pie chart, and again, it looks pretty much like what I expected.
</p>

<p><figure url='./topics.jpg' rend='center' /></p>
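
<p>
For a sense of the proportions behind such a chart, the weights in the table above can be converted to percentage shares. This is a small worked sketch, assuming each slice is simply proportional to its topic weight. The variable names are illustrative.
</p>

<p rend='pre'>
```python
# Topic weights copied from the table of themes above.
weights = {
    "books": 0.43275,
    "american": 0.27160,
    "academic": 0.26037,
    "faculty": 0.21954,
    "catalog": 0.15720,
    "students": 0.14023,
    "journals": 0.11281,
    "data": 0.07789,
}

# The weights do not sum to 1, so divide each by the total to get its share.
total = sum(weights.values())
shares = {theme: 100 * weight / total for theme, weight in weights.items()}

for theme, share in shares.items():
    print(f"{theme:>10}  {share:5.1f}%")
```
</p>

<p>
Under that assumption, "books" accounts for roughly 26 percent of the whole, and "data" for a bit under 5 percent.
</p>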

<p>
I then augmented the underlying model with date values, pivoted the table, and visualized the result as a stacked area chart. From the results you can see that the theme of "books" was very prevalent until 1967. Then, starting around 2006, the theme of "data" became much more prominent.
</p>

<p><figure url='./topics-over-time.jpg' rend='center' /></p>
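
<p>
The pivot step described above can be sketched in plain Python. This is an illustration only: the rows are toy values and not output from the actual model, the function names pivot and normalize are hypothetical, and scaling each year to sum to 1 is an assumption about how the chart was prepared.
</p>

<p rend='pre'>
```python
from collections import defaultdict

# Hypothetical long-format rows: (year, theme, weight), one per document and topic.
# Toy values for illustration only.
rows = [
    (1960, "books", 0.6), (1960, "data", 0.1),
    (1960, "books", 0.4),
    (2010, "books", 0.2), (2010, "data", 0.5),
]


def pivot(rows):
    """Sum weights into a {year: {theme: weight}} table, one row per year."""
    themes = sorted({theme for _, theme, _ in rows})
    table = defaultdict(lambda: dict.fromkeys(themes, 0.0))
    for year, theme, weight in rows:
        table[year][theme] += weight
    return dict(sorted(table.items()))


def normalize(table):
    """Scale each year's weights to sum to 1, so the stacked bands fill the chart."""
    result = {}
    for year, row in table.items():
        total = sum(row.values())
        result[year] = {t: (w / total if total else 0.0) for t, w in row.items()}
    return result


table = normalize(pivot(rows))
for year, row in table.items():
    print(year, {theme: round(share, 2) for theme, share in row.items()})
```
</p>

<p>
Each theme in the pivoted table then becomes one band of a stacked area chart, with years along the horizontal axis. Plotting is left out here to keep the sketch free of third-party libraries.
</p>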

<p>
Since my corpus is relatively large, and since I iterated the modeling process relatively few times, these results ought to be considered preliminary. Still, they look about right to me.
</p>

<p>
Fun with comprehensive, full-text collections.
</p>

<p>[1] To put this into context, Moby Dick is about 0.25 million words long.<lb />
[2] For additional descriptive statistics about the collection, see: <xref url='https://distantreader.org/stacks/carrels/crl-ital/index.htm'>https://distantreader.org/stacks/carrels/crl-ital/index.htm</xref>.</p>

</div1>
    </body>
    <back>
    </back>
  </text>
</TEI.2>
