<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.infomotions.com/alex/dtd/tei2.dtd" [
<!ENTITY % TEI.XML         'INCLUDE' >
<!ENTITY % TEI.prose       'INCLUDE' >
<!ENTITY % TEI.linking     'INCLUDE' >
<!ENTITY % TEI.figures     'INCLUDE' >
<!ENTITY % TEI.names.dates 'INCLUDE' >
<!ATTLIST xptr   url CDATA #IMPLIED >
<!ATTLIST xref   url CDATA #IMPLIED >
<!ATTLIST figure url CDATA #IMPLIED >
]> 
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Distant Reader Indexes</title> 
        <author>Eric Lease Morgan</author>
        <respStmt>
          <resp>converted into TEI-conformant markup by</resp>
          <name>Eric Lease Morgan</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Eric Lease Morgan, &#169; University of Notre Dame</publisher>
        <address>
        	<addrLine>emorgan@nd.edu</addrLine>
        </address>
        <distributor>Available through the Distant Reader at <xptr url='https://distantreader.org/blog/indexes/' />.</distributor>
        <idno type='reader'>27</idno>
        <availability status='free'>
          <p>This document is distributed under a GNU Public License.</p>
        </availability>
      </publicationStmt>
      <notesStmt>
       <note type='abstract'>I have begun to integrate indexes into the Distant Reader for the purposes of creating new data sets ("study carrels") and downloading previously created ones. This posting introduces the work done to date.</note>
      </notesStmt>
      <sourceDesc>
        <p>This is the first publication of this posting.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date>2022-11-25</date>
      </creation>
      <textClass>
        <keywords>
          <list><item>Distant Reader</item><item>indexes</item></list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
        <date>2022-11-25</date>
        <respStmt>
          <name>Eric Lease Morgan</name>
        </respStmt>
        <item>initial TEI encoding</item>
      </change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
    </front>
    <body>
      <div1>
  <head>Distant Reader Indexes</head>
  <p>I have begun to integrate indexes into the Distant Reader for the purposes of creating new data sets ("study carrels") and downloading previously created ones. This posting introduces the work done to date.</p>
  <p>The Reader creates and provides services against data sets affectionately called "study carrels". To create a study carrel, the student, researcher, or scholar assembles a set of files for analysis into a single directory. The directory is used as input to the Reader's build command, and the result is a data set that can be modeled for the purposes of addressing research questions. That said, you would be surprised how difficult it is for people to create a directory filled with files; nowadays we seem to distribute links to splash pages instead of the content itself. Sigh! The Reader's set of indexes is intended to make it easier to create a set of files for analysis and thus demonstrate distant reading.</p>
  <p>To date, a small set of indexes has been created. They include:</p>
  <list type='bulleted'>
    <item><xref url='https://distantreader.org/stacks/indexes/arxiv'>arXiv</xref> - More than 2 million pre-print journal articles, mostly in the areas of physics, astronomy, and computer science (interesting query: <xref url='https://distantreader.org/stacks/indexes/search?index=arxiv&amp;query=%22computer+science+is%22&amp;format=html'>"computer science is"</xref>)</item>
    <item><xref url='https://distantreader.org/stacks/indexes/gutenberg'>Project Gutenberg electronic texts (ebooks)</xref> - About 60,000 books, mostly from the Western canon (interesting query: <xref url='https://distantreader.org/stacks/indexes/search?index=gutenberg&amp;query=subject%3Alove+AND+subject%3Awar&amp;format=html'>subject:love AND subject:war</xref>)</item>
    <item><xref url='https://distantreader.org/stacks/indexes/ital'>ITAL</xref> - The full run (approximately 800) of a journal called Information Technology &amp; Libraries (interesting query: <xref url='https://distantreader.org/stacks/indexes/search?index=ital&amp;query=%22libraries+are%22&amp;format=html'>"libraries are"</xref>)</item>
    <item><xref url='https://distantreader.org/stacks/indexes/carrels'>Distant Reader study carrels</xref> - Approximately 3,000 previously and automatically created study carrels (interesting query: <xref url='https://distantreader.org/stacks/indexes/search?index=carrels&amp;query=love+AND+sources%3Afreebo+NOT+purcell&amp;format=html'>love AND sources:freebo NOT purcell</xref>)</item>
  </list>
  <p>The idea behind the indexes is this:</p>
  <list type='ordered'>
    <item>query an index</item>
    <item>use the HTML output to sort, filter, and refine the query</item>
    <item>export the results as CSV or JSON</item>
    <item>import the CSV or JSON files into your favorite analysis program (like a database, spreadsheet, or OpenRefine)</item>
    <item>curate the results to make them even more exact</item>
    <item>programmatically loop through the results and cache content locally</item>
    <item>transform the results into a simple metadata (CSV) file</item>
    <item>create a study carrel with the cached content and the metadata file</item>
  </list>
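  <p>The first and next-to-last steps lend themselves to a script. What follows is a minimal Python sketch, not the Reader's own code. It assumes the search URL used in the links above accepts format=json in the same way it accepts format=html, and the record fields (author, title, url) are hypothetical; adjust both to whatever a given index actually returns.</p>
  <p rend='code'>
```python
import csv
import io
from urllib.parse import urlencode

# Base URL taken from the search links listed above.
SEARCH = "https://distantreader.org/stacks/indexes/search"


def build_search_url(index, query, fmt="json"):
    """Return a search URL for the given index and query.

    Assumption: fmt="json" is accepted; only fmt="html" appears in the links above.
    """
    return SEARCH + "?" + urlencode({"index": index, "query": query, "format": fmt})


def to_metadata_csv(records, fields=("author", "title", "url")):
    """Reduce a list of result dictionaries to a simple CSV string.

    The field names are hypothetical. Extra keys are dropped, and
    missing values become empty cells.
    """
    handle = io.StringIO()
    writer = csv.DictWriter(handle, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({field: record.get(field, "") for field in fields})
    return handle.getvalue()
```
  </p>
  <p>For example, build_search_url("gutenberg", "subject:love AND subject:war", "html") reproduces the Project Gutenberg query link listed above. Looping through the results and caching the content is left out of the sketch because it depends on where each item lives.</p>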
  <p>Yes, at first glance, the process seems complicated, but a whole lot of it can be automated.</p>
  <p>"Again, why are you doing this?" Because we continue to drink from the proverbial firehose; because we are suffering from information overload; because typical search results return so many relevant items, and we have few means to really read and analyze all of them. The Distant Reader is intended to address this problem and more.</p>
  <p>Finally, software is never done. If it were, then it would be called "hardware". That being the case, the links above will probably break sooner than I desire, but the process will remain the same: 1) search index, 2) refine result, 3) create data set, and ultimately, 4) do analysis -- read.</p>
</div1>

    </body>
    <back>
    </back>
  </text>
</TEI.2>
