<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.infomotions.com/alex/dtd/tei2.dtd" [
<!ENTITY % TEI.XML         'INCLUDE' >
<!ENTITY % TEI.prose       'INCLUDE' >
<!ENTITY % TEI.linking     'INCLUDE' >
<!ENTITY % TEI.figures     'INCLUDE' >
<!ENTITY % TEI.names.dates 'INCLUDE' >
<!ATTLIST xptr   url CDATA #IMPLIED >
<!ATTLIST xref   url CDATA #IMPLIED >
<!ATTLIST figure url CDATA #IMPLIED >
]> 
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Using OpenRefine to Remove Duplicates</title> 
        <author>Eric Lease Morgan</author>
        <respStmt>
          <resp>converted into TEI-conformant markup by</resp>
          <name>Eric Lease Morgan</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Eric Lease Morgan, &#169; University of Notre Dame</publisher>
        <address>
        	<addrLine>emorgan@nd.edu</addrLine>
        </address>
        <distributor>Available through the Distant Reader at <xptr url='https://distantreader.org/blog/removing-duplicates/' />.</distributor>
        <idno type='reader'>40</idno>
        <availability status='free'>
          <p>This document is distributed under a GNU Public License.</p>
        </availability>
      </publicationStmt>
      <notesStmt>
       <note type='abstract'>This blog posting describes how to remove duplicate items using OpenRefine's clustering functionality.</note>
      </notesStmt>
      <sourceDesc>
        <p>This is the original source of this publication.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date>2023-06-01</date>
      </creation>
      <textClass>
        <keywords>
          <list><item>miscellaneous</item></list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
<date>2023-06-01</date>
<respStmt>
<name>Eric Lease Morgan</name>
</respStmt>
<item>initial TEI encoding</item>
</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
    </front>
    <body>
      <div1>
  <p>tl;dr - Through the use of OpenRefine, one can create more useful HathiTrust collection files.</p>
  <div2>
    <head>Introduction</head>
    <p>I often take advantage of the HathiTrust and its very large collection of public domain documents, but when I search the collection for just about anything, I am often faced with numerous duplicate items. Because the volume of search results is so large, filtering duplicates by hand is tedious, but I have learned how to take advantage of OpenRefine's clustering functions to quickly and easily remove them. This blog posting describes how.</p>
  </div2>
  <div2>
    <head>The Problem</head>
    <p>For simplicity's sake, let's use a HathiTrust featured collection as an example, specifically, the one named "Adventure Novels: G.A. Henty". [1] At first blush, the collection includes 47 items, but after downloading the collection file, importing it into any spreadsheet application, and sorting/grouping it by title, one can see there are duplicate items, including but not limited to:</p>
    <list type='bulleted'>
      <item>A Knight of the White Cross (two listings)</item>
      <item>Bonnie Prince Charlie (three listings)</item>
      <item>In the Reign of Terror (seven listings)</item>
    </list>
    <p>While manually looping through 47 items and removing duplicates is not onerous, the problem becomes acute when the student, researcher, or scholar tries to create a complete and authoritative list of all Henty's titles; an author search for Henty, even when filtered by language, place of publication, and specific library, returns many copies of the same things. The tedious process of manually removing duplicates from any sizable collection will significantly impede anybody trying to do research on whole collections, and leaving the duplicates in place will cause any computer-based analysis to be woefully inaccurate. This, in turn, will encourage some people to disregard computer-based analysis. From my point of view, such is undesirable, and this is where OpenRefine comes to the rescue.</p>
  </div2>
  <div2>
    <head>OpenRefine, the solution</head>
    <p>OpenRefine bills itself as "a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data", and from my point of view, it most certainly lives up to its description. [2] Some of my research would be a whole lot more difficult if it weren't for OpenRefine.</p>
    <p>With the idea of removing duplicates and creating a more useful HathiTrust collection file, the first step is to create an OpenRefine project and choose the given collection file as input. [3] Since OpenRefine eats delimited files (like comma-separated value files and tab-separated value, or TSV, files) for lunch, it will recognize the collection file as a TSV file and present you with additional parsing options. In this case, you can accept the defaults and finish initializing the project by clicking the "Create project" button.</p>
    
    <p rend='center'><figure url='image-01.png' rend='center'/><lb />create a project</p>

    <p>The next step is to apply a text facet to the title column and sort the result by count. You will see that a number of items are listed numerous times, and upon closer inspection, you will see some titles with very similar manifestations (differences in cataloging practice). These are the sorts of things we want to both normalize and deduplicate. Click the "Cluster" button.</p>

    <p rend='center'><figure url='image-02.png' rend='center'/><lb />viewing text facets</p>
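    <p>For readers who think in code, a text facet sorted by count is essentially a tally of distinct values. The following is a minimal sketch of that idea in Python, not OpenRefine's own implementation; the sample titles and the function name text_facet are made up for illustration.</p>
    <p><eg><![CDATA[
```python
from collections import Counter

# Hypothetical sample of titles; a real collection file would supply these.
titles = [
    "In the Reign of Terror",
    "Bonnie Prince Charlie",
    "In the Reign of Terror",
    "A Knight of the White Cross",
    "Bonnie Prince Charlie",
    "In the Reign of Terror",
]


def text_facet(values):
    """Return (value, count) pairs, most frequent value first."""
    return Counter(values).most_common()


# Any title with a count greater than one is a candidate duplicate.
for title, count in text_facet(titles):
    print(count, title)
```
]]></eg></p>
    <p>Note that a tally like this only catches titles that match character for character; near matches are what the clustering step is for.</p>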

    <p>After clicking the Cluster button you will be presented with a dialog box displaying a large number of clustering options/algorithms. Apply each and every option to the collection, and the titles will become normalized. When you are finished, the number of items in the collection will not have changed, but the number of unique titles will have decreased; many titles are now repeated verbatim. Click the "Remove all" button to exit the faceting process.</p>

    <p rend='center'><figure url='image-03.png' rend='center'/><lb />clustering</p>
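    <p>To give a feel for what clustering does under the hood, here is a simplified Python sketch of a "fingerprint" style key-collision method: titles are reduced to a normalized key, and titles sharing a key are grouped together. This is an approximation written for illustration and assumes plain string titles; OpenRefine offers several methods, and its own implementations may differ in detail. The function names and sample titles are hypothetical.</p>
    <p><eg><![CDATA[
```python
import string
import unicodedata


def fingerprint(value):
    """Build a simplified fingerprint key for a title."""
    # Fold accented characters to their closest ASCII equivalents.
    folded = unicodedata.normalize("NFKD", value)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Lowercase the text and strip ASCII punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = folded.lower().translate(table)
    # Split into tokens, drop repeated tokens, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by key; keep only groups with two or more variants."""
    groups = {}
    for value in values:
        groups.setdefault(fingerprint(value), set()).add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical variants of the kind differing cataloging practice produces.
titles = [
    "In the Reign of Terror",
    "In the reign of terror;",
    "Bonnie Prince Charlie",
]
print(cluster(titles))
```
]]></eg></p>
    <p>Each group returned is a set of variants that could be merged into a single spelling, which is what accepting a cluster in the dialog box accomplishes.</p>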

    <p>The next step is permanently sorting the collection by title, "blanking down", and removing the blanked items. This is the actual process for removing the duplicates. Here's how:</p>
    <list type='ordered'>
      <item>Sort the titles alphabetically, and then make the sort permanent by selecting "Reorder rows permanently" from the Sort menu.</item>
      <item>Choose "blank down" from the Edit menu. This retains the first title of many duplicates but makes any subsequent titles empty.</item>
      <item>Text facet on the title column and select the last value from the facets, which is labeled "(blank)".</item>
      <item>Finally, select "Delete matching rows".</item>
    </list>
    
    <table rend="center">
<row>
<cell>
<figure url="./image-05.png"/>
<lb/>
sort titles
</cell>
<cell>
<figure url="./image-06.png"/>
<lb/>
make sort permanent</cell>
</row>
</table>

    <table rend="center">
<row>
<cell>
<figure url="./image-07.png"/>
<lb/>
blank down
</cell>
<cell>
<figure url="./image-08.png"/>
<lb/>
delete matching rows
</cell>
</row>
</table>
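    <p>The sort, blank down, and delete steps above can be sketched in a few lines of Python. This is only an illustration of the logic, assuming each row is a dictionary with a "title" key; the "id" values and the function name deduplicate are hypothetical and do not come from OpenRefine or the collection file.</p>
    <p><eg><![CDATA[
```python
def deduplicate(rows, column="title"):
    """Sort rows by a column, blank repeated values, and drop the blanked rows."""
    # 1. Sort by the chosen column (the "permanent" reordering).
    ordered = sorted(rows, key=lambda row: row[column])
    # 2. Blank down: keep the first value of each run and empty the repeats.
    blanked = []
    previous = None
    for row in ordered:
        value = row[column]
        blanked.append({**row, column: "" if value == previous else value})
        previous = value
    # 3. Delete the rows whose column is now blank.
    return [row for row in blanked if row[column] != ""]


# Hypothetical rows standing in for a clustered collection file.
rows = [
    {"id": "1", "title": "Bonnie Prince Charlie"},
    {"id": "2", "title": "A Knight of the White Cross"},
    {"id": "3", "title": "Bonnie Prince Charlie"},
]
for row in deduplicate(rows):
    print(row["id"], row["title"])
```
]]></eg></p>
    <p>As in OpenRefine, the first row of each run of identical titles survives, so the order of the rows before sorting determines which copy is kept.</p>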


    <p>If you have been using the Henty collection, then your collection has been reduced to 29 items, and none of the titles are duplicated. Use the "Export" button to save your good work to a file and use the file for further analysis. For example, upload the new file to the HathiTrust Research Center and do processing against it. [4]</p>


    <p rend='center'><figure url='image-09.png' rend='center'/><lb />export final result</p>
    
  </div2>
  <div2>
    <head>Extra credit</head>
    <p>For extra credit, you might want to apply other cleaning/normalizing processes against the collection file, but removing duplicates is probably the most important. Some of these other processes include making sure the access column includes a value of "1". Otherwise, you may not be able to download the full text of the associated item. You might also want to take a look at the rights_date_used column and make sure there are no dates similar to "9999". You might also want to remove leading articles from the titles so sorted titles... sort correctly.</p>
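    <p>These extra cleaning steps could also be scripted. The sketch below is one possible approach in Python, under several assumptions: rows are dictionaries keyed by the column names mentioned above (access, rights_date_used, and title), all values are strings, only a date of exactly "9999" is treated as suspect, and only English leading articles are removed. The function names and sample rows are hypothetical.</p>
    <p><eg><![CDATA[
```python
import re

# Assumption: only the English articles "a", "an", and "the" are stripped.
ARTICLES = re.compile(r"^(a|an|the)\s+", re.IGNORECASE)


def strip_article(title):
    """Remove one leading article so titles sort on their first real word."""
    return ARTICLES.sub("", title, count=1)


def is_usable(row):
    """Keep rows whose full text is accessible and whose date looks real."""
    return row["access"] == "1" and row["rights_date_used"] != "9999"


def clean(rows):
    """Filter out unusable rows, strip leading articles, and sort by title."""
    kept = [dict(row, title=strip_article(row["title"])) for row in rows if is_usable(row)]
    return sorted(kept, key=lambda row: row["title"].lower())


# Hypothetical rows; the values are invented for illustration.
rows = [
    {"title": "A Knight of the White Cross", "access": "1", "rights_date_used": "1896"},
    {"title": "The Young Buglers", "access": "1", "rights_date_used": "9999"},
    {"title": "In the Reign of Terror", "access": "0", "rights_date_used": "1888"},
    {"title": "Bonnie Prince Charlie", "access": "1", "rights_date_used": "1888"},
]
for row in clean(rows):
    print(row["title"])
```
]]></eg></p>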
  </div2>
  <div2>
    <head>Summary</head>
    <p>The 'Trust is a leading provider of digitized books in the public domain. Using the 'Trust, the student, researcher, or scholar can study entire genres or the complete works written by a given author, but removing duplicates is critical to such analysis. This posting outlined one way to do this with the help of OpenRefine's clustering functions. The OpenRefine website describes how to use clustering in greater detail. [5]</p>
  </div2>
  <div2>
    <head>Links and notes</head>
    <list type='ordered'>
      <item>Adventure Novels - <xref url='https://babel.hathitrust.org/cgi/mb?a=listis;c=464226859'>https://babel.hathitrust.org/cgi/mb?a=listis;c=464226859</xref></item>
      <item>OpenRefine - <xref url='https://openrefine.org'>https://openrefine.org</xref></item>
      <item>Downloading, installing, and launching OpenRefine is an exercise left up to the reader.</item>
      <item>My deduplicated file is linked <xref url='./henty.tsv'>here</xref>.</item>
      <item>clustering - <xref url='https://openrefine.org/docs/manual/cellediting#cluster-and-edit'>https://openrefine.org/docs/manual/cellediting#cluster-and-edit</xref></item>
    </list>
  </div2>
</div1>

    </body>
    <back>
    </back>
  </text>
</TEI.2>
