<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.infomotions.com/alex/dtd/tei2.dtd" [
<!ENTITY % TEI.XML         'INCLUDE' >
<!ENTITY % TEI.prose       'INCLUDE' >
<!ENTITY % TEI.linking     'INCLUDE' >
<!ENTITY % TEI.figures     'INCLUDE' >
<!ENTITY % TEI.names.dates 'INCLUDE' >
<!ATTLIST xptr   url CDATA #IMPLIED >
<!ATTLIST xref   url CDATA #IMPLIED >
<!ATTLIST figure url CDATA #IMPLIED >
]> 
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Generative-AI Summarization</title> 
        <author>Eric Lease Morgan</author>
        <respStmt>
          <resp>converted into TEI-conformant markup by</resp>
          <name>Eric Lease Morgan</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Eric Lease Morgan, &#169; University of Notre Dame</publisher>
        <address>
        	<addrLine>emorgan@nd.edu</addrLine>
        </address>
        <distributor>Available through the Distant Reader at <xptr url='https://distantreader.org/blog/summarization/' />.</distributor>
        <idno type='reader'>68</idno>
        <availability status='free'>
          <p>This document is distributed under a GNU Public License.</p>
        </availability>
      </publicationStmt>
      <notesStmt>
       <note type='abstract'>Ann Blair's book Too Much to Know overflows with techniques scholars used to cope with information overload before the modern age. One of the most frequently used techniques is summarization. With the advent of generative AI, it is almost trivial to create more-than-plausible summaries of documents.</note>
      </notesStmt>
      <sourceDesc>
        <p>This is the original publication of this posting.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date>2024-06-27</date>
      </creation>
      <textClass>
        <keywords>
          <list><item>libraries and librarianship</item><item>large-language models (LLMs)</item><item>summarization</item></list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
<date>2024-06-27</date>
<respStmt>
<name>Eric Lease Morgan</name>
</respStmt>
<item>initial TEI encoding</item>
</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
    </front>
    <body>
      <div1>
<p>Ann Blair's book Too Much to Know overflows with techniques scholars used to cope with information overload before the modern age. [1] One of the most frequently used techniques is summarization. With the advent of generative AI, it is almost trivial to create more-than-plausible summaries of documents.</p>

<p>The <xref url='./summarize.py'>linked Python script</xref> is an example. Given the path to a plain text file, the script loads a configured large language model, vectorizes the given file, compares the two, and outputs a three-sentence summary. I enhanced the script to run in batch mode, and I have used it to summarize whole collections of items:</p>

<list type='bulleted'><item>each chapter in each book written by <xref url='./austen.txt'>Jane Austen</xref></item>
<item>250 journal articles on the topic <xref url='./rheumatoid-arthritis.txt'>rheumatoid arthritis</xref></item>
<item>another 250 journal articles on the topic of <xref url='./climate-change.txt'>climate change</xref></item>
<item>130 articles on the topic of <xref url='./cataloging.txt'>cataloging</xref></item>
</list>
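
<p>The general idea behind such a script can be sketched without a model at all. The following Python is not the linked script; it is a minimal illustration that assumes plain word counts stand in for the vectors a large language model would produce, and it is extractive, meaning it selects three existing sentences instead of generating new ones. The function names (vectorize, cosine, and summarize) are hypothetical and chosen only for illustration.</p>

<p rend='code'><![CDATA[
```python
import math
import re
from collections import Counter


def vectorize(text):
    """Return a bag-of-words vector (word counts) for the given text.

    Assumption: word counts stand in for model-generated embeddings.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Return the cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def summarize(text, size=3):
    """Return the `size` sentences most similar to the whole document."""
    # Split into sentences, keeping each sentence's closing punctuation.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    document = vectorize(text)

    # Rank sentence positions by similarity to the document as a whole.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectorize(sentences[i]), document),
        reverse=True,
    )

    # Keep the top-ranked sentences, restored to their original order.
    keep = sorted(ranked[:size])
    return " ".join(sentences[i] for i in keep)
```
]]></p>

<p>Given the path to a plain text file, something like summarize(open(path).read()) would return the three sentences most similar to the document as a whole. Swapping the word counts for vectors from a language model would likely change the ranking, but not the overall shape of the process.</p>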

<p>For any given document there is no single, 100% correct summary; everybody will summarize a document differently. That said, the results of this automated process look pretty good to me. Moreover, each list of summaries addresses difficult-to-answer questions such as:</p>

<list type='bulleted'><item>how can Jane Austen's works be characterized?</item>
<item>what is rheumatoid arthritis and what are some of its treatments?</item>
<item>how is climate change being manifested across the globe?</item>
<item>how has the practice of cataloging changed over time?</item>
</list>

<p>The lists of summaries may themselves be deemed information overload, and one might consider summarizing the summaries. That exercise is left to the reader.</p>

<p>I believe libraries and librarians ought to learn how to exploit generative AI for summarization purposes. Just as the migration from printed cards to MARC transformed how libraries hosted catalogs, migrating from hand-crafted summaries to computed summaries will transform how information overload is managed.</p>

<p>[1] Blair, Ann. 2010. Too Much to Know: Managing Scholarly Information Before the Modern Age. New Haven, Conn.: Yale University Press.</p>

</div1>

    </body>
    <back>
    </back>
  </text>
</TEI.2>
