<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.infomotions.com/alex/dtd/tei2.dtd" [
<!ENTITY % TEI.XML         'INCLUDE' >
<!ENTITY % TEI.prose       'INCLUDE' >
<!ENTITY % TEI.linking     'INCLUDE' >
<!ENTITY % TEI.figures     'INCLUDE' >
<!ENTITY % TEI.names.dates 'INCLUDE' >
<!ATTLIST xptr   url CDATA #IMPLIED >
<!ATTLIST xref   url CDATA #IMPLIED >
<!ATTLIST figure url CDATA #IMPLIED >
]> 
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Lexicon Enhancers</title> 
        <author>Eric Lease Morgan</author>
        <respStmt>
          <resp>converted into TEI-conformant markup by</resp>
          <name>Eric Lease Morgan</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Eric Lease Morgan, &#169; University of Notre Dame</publisher>
        <address>
        	<addrLine>emorgan@nd.edu</addrLine>
        </address>
        <distributor>Available through the Distant Reader at <xptr url='https://distantreader.org/blog/lexicon-enhancers/' />.</distributor>
        <idno type='reader'>70</idno>
        <availability status='free'>
          <p>This document is distributed under a GNU Public License.</p>
        </availability>
      </publicationStmt>
      <notesStmt>
       <note type='abstract'>This posting describes a number of Python scripts used to enhance a lexicon, where a lexicon is defined as a list of desirable, meaningful words.</note>
      </notesStmt>
      <sourceDesc>
        <p>This is the original publication of this posting.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date>2024-07-12</date>
      </creation>
      <textClass>
        <keywords>
          <list><item>hacks</item><item>lexicons</item></list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
<date>2024-07-12</date>
<respStmt>
<name>Eric Lease Morgan</name>
</respStmt>
<item>initial TEI encoding</item>
</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
    </front>
    <body>
      <div1>
<p>This posting describes a number of Python scripts used to enhance a lexicon, where a lexicon is defined as a list of desirable, meaningful words. Compare a lexicon to a stop word list: a stop word list contains words of little use or interest, while a lexicon is a set of words of great significance.</p>

<p>The first script, <xref url='keywords2lexicon.py'>keywords2lexicon.py</xref>, takes the name of a <xref url='https://distantreader.org/'>Distant Reader</xref> study carrel and an integer (N) as input. It then computes the N most frequent keywords and outputs them to the carrel's etc/lexicon.txt file. This is a decent way to jump-start a lexicon. Alternatively, create a list of words by hand.</p>
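<p>The counting step described above might be sketched as follows. This is a minimal illustration and not the code of keywords2lexicon.py: the function name top_keywords and the sample keyword list are hypothetical, and reading keywords from a real carrel is left out.</p>
<eg><![CDATA[
```python
from collections import Counter


def top_keywords(keywords, n):
    """Return the n most frequent keywords, most frequent first."""
    counts = Counter(keyword.lower() for keyword in keywords)
    return [keyword for keyword, _ in counts.most_common(n)]


# hypothetical input: one entry for each time a keyword was assigned to a document
keywords = ['man'] * 5 + ['jove'] * 4 + ['great'] * 3 + ['father'] * 2 + ['ship']

# keep the four most frequent keywords and print them one per line,
# which is the shape a lexicon file such as etc/lexicon.txt is assumed to have
lexicon = top_keywords(keywords, 4)
print('\n'.join(lexicon))
```
]]></eg>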

<p>Once you have a lexicon, you may want to enhance it, and there are three supported methods:</p>

<list type='ordered'>
<item><xref url='lexicon2variants.py'>lexicon2variants.py</xref> - given a study carrel, this script will find the lemma of each word in the lexicon, identify the tokens ("words") associated with those lemmas, and send the result to standard output. This is a good way to identify variations in spelling, but only the variations that exist in the carrel's corpus.</item>
<item><xref url='lexicon2related.py'>lexicon2related.py</xref> - given a study carrel and a number (N), this script will loop through the lexicon to identify semantically related words. The related words can be either the top N most similar words or, when N is a floating point number, all the words whose similarity value is greater than N. The script will output the related words. All of this requires first semantically indexing the carrel, and it only really becomes useful if the carrel's size can be measured in millions of words. This technique is a root of the current generative-AI trend.</item>
<item><xref url='lexicon2synonyms.py'>lexicon2synonyms.py</xref> - given a study carrel, this script will loop through the carrel's lexicon and use <xref url='https://wordnet.princeton.edu'>WordNet</xref> to identify synonyms. This technique ought to be seen as complementary to lexicon2related.py as it will introduce words outside the carrel's corpus.</item>
</list>
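<p>The two ways of choosing related words, top N or greater than N, might be sketched like this. The sketch assumes that an integer selects the N highest-scoring words and a floating point number acts as a threshold; lexicon2related.py may tell the two cases apart differently. The function name select_related and the similarity scores are hypothetical, and computing the scores from a semantic index is not shown.</p>
<eg><![CDATA[
```python
def select_related(similarities, n):
    """Select related words from a list of (word, score) pairs.

    An integer n keeps the n highest-scoring words; a floating point n
    keeps every word whose score is greater than n.
    """
    ranked = sorted(similarities, key=lambda pair: pair[1], reverse=True)
    if isinstance(n, int):
        return [word for word, _ in ranked[:n]]
    return [word for word, score in ranked if score > n]


# hypothetical similarity scores for a single lexicon word
similarities = [('loud', 0.62), ('mighty', 0.81), ('aside', 0.35), ('redoubtable', 0.74)]

print(select_related(similarities, 2))    # the two most similar words
print(select_related(similarities, 0.6))  # every word scoring above 0.6
```
]]></eg>
<p>A threshold tends to return a different number of words for each lexicon word, while a top N always returns the same number, even when the similarities are weak.</p>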

<p>Used in different orders, with different parameters, and combined with one another, this tiny system of scripts will generate lists of words that may be of interest to the student, researcher, or scholar.</p>

<p>Given a refined lexicon, it is possible to create sophisticated full-text database queries, map the lexicon's words in a networked space, feed the words to a concordance for quick reading, etc. One might even calculate weights of individual documents based on the occurrences of lexicon words. Hmmm...</p>
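<p>One possible way to calculate such weights is sketched below. It is only an assumption about what a weight could be: the share of a document's tokens that appear in the lexicon. The function name weigh, the file names, and the two sample sentences are invented for illustration.</p>
<eg><![CDATA[
```python
import re


def weigh(text, lexicon):
    """Return the share of a text's tokens that are lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


lexicon = {'father', 'great', 'jove', 'man'}

# hypothetical documents; the sentences are made up, not quoted from Homer
documents = {
    'a.txt': 'Jove, father of gods and men, was a great god.',
    'b.txt': 'The ships sailed at dawn.',
}

# weigh each document, then list the documents from heaviest to lightest
weights = {name: weigh(text, lexicon) for name, text in documents.items()}
for name, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}\t{weight:.2f}')
```
]]></eg>
<p>Dividing by the number of tokens keeps long documents from outweighing short ones merely because they are long.</p>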

<p>As an example, here is a tiny lexicon generated from the list of computed keywords in Homer's Iliad and Odyssey:</p>

<quote>father; great; jove; man</quote> 

<p>Here is a list of these words and their variants found in the corpus:</p>

<quote>father; fathers; great; greater; greatest; jove; man; manned; men</quote>
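<p>The lemma-based lookup behind a list like the one above can be sketched as follows. This is not the code of lexicon2variants.py: the function name variants is hypothetical, and the (token, lemma) pairs are hand-written here to mirror the example, whereas a real carrel would supply them from its own part-of-speech data.</p>
<eg><![CDATA[
```python
from collections import defaultdict


def variants(lexicon, token_lemma_pairs):
    """Return the lexicon words plus every token sharing a lemma with them."""
    lemmas_by_token = defaultdict(set)
    tokens_by_lemma = defaultdict(set)
    for token, lemma in token_lemma_pairs:
        token, lemma = token.lower(), lemma.lower()
        lemmas_by_token[token].add(lemma)
        tokens_by_lemma[lemma].add(token)

    found = set(lexicon)
    for word in lexicon:
        # first find the lemmas of the word, then the tokens of each lemma
        for lemma in lemmas_by_token.get(word, ()):
            found.update(tokens_by_lemma[lemma])
    return sorted(found)


# hypothetical (token, lemma) pairs standing in for a carrel's corpus
pairs = [
    ('father', 'father'), ('fathers', 'father'),
    ('great', 'great'), ('greater', 'great'), ('greatest', 'great'),
    ('jove', 'jove'),
    ('man', 'man'), ('men', 'man'), ('manned', 'man'),
    ('ships', 'ship'),
]

print('; '.join(variants(['father', 'great', 'jove', 'man'], pairs)))
```
]]></eg>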

<p>Here is a list of these words and their semantically related words:</p>

<quote>aside; father; great; jove; loud; man; mighty; murderous; redoubtable; sarpedon; shook; thereon</quote>

<p>Here is a list of these words and their synonyms:</p>

<quote>Church Father; Father; Father-God; Father of the Church; Fatherhood; Isle of Man; Jove; Jupiter; Man; Padre; adult male; bang-up; beget; begetter; beginner; big; bring forth; bully; capital; corking; cracking; dandy; don; enceinte; engender; expectant; father; forefather; founder; founding father; generate; gentleman; gentleman's gentleman; get; gravid; great; groovy; heavy; homo; human; human being; human beings; human race; humanity; humankind; humans; jove; keen; large; majuscule; male parent; man; mankind; military man; military personnel; mother; neat; nifty; not bad; outstanding; peachy; piece; serviceman; sire; slap-up; smashing; swell; valet; valet de chambre; with child; world
</quote>
<p>Finally, and very importantly, the output of these scripts is not intended to be taken whole cloth. Instead, you are expected to get the output, peruse it, and then season it to your own taste. Computers are stupid. You are not.</p>

</div1>

    </body>
    <back>
    </back>
  </text>
</TEI.2>
