org.apache.nutch.indexer
Class DeleteDuplicates

java.lang.Object
  extended by org.apache.hadoop.util.ToolBase
      extended by org.apache.nutch.indexer.DeleteDuplicates
All Implemented Interfaces:
Configurable, Closeable, JobConfigurable, Mapper, OutputFormat, Reducer, Tool

public class DeleteDuplicates
extends ToolBase
implements Mapper, Reducer, OutputFormat

Delete duplicate documents in a set of Lucene indexes. Two documents are duplicates if they have either the same content (matched by MD5 hash) or the same URL. This tool uses the following algorithm:

Author:
Andrzej Bialecki
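
The two duplicate criteria named in the class description (same MD5 content hash, or same URL) can be illustrated with a small standalone sketch. It is not this class's implementation: the Doc type, the helper names, and the "first occurrence wins" policy are assumptions made for the example, and it runs in memory rather than against Lucene indexes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateCriteriaSketch {

    /** Hypothetical stand-in for an indexed document: a URL plus its content. */
    static final class Doc {
        final String url;
        final String content;

        Doc(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Hex-encoded MD5 digest of the content, used as the content signature. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the positions of documents whose URL or content hash was
     * already seen earlier in the list. Assumption: the first occurrence
     * is the one that is kept.
     */
    static List<Integer> findDuplicates(List<Doc> docs) {
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Doc d = docs.get(i);
            // Record both keys for every document, then test them.
            boolean newUrl = seenUrls.add(d.url);
            boolean newHash = seenHashes.add(md5Hex(d.content));
            if (!newUrl || !newHash) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }
}
```

Comparing fixed-size hashes instead of full page content keeps the per-document comparison cheap, which is presumably why a content signature is used here.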

Nested Class Summary
static class DeleteDuplicates.HashPartitioner
           
static class DeleteDuplicates.HashReducer
           
static class DeleteDuplicates.IndexDoc
           
static class DeleteDuplicates.InputFormat
           
static class DeleteDuplicates.UrlsReducer
           
 
Field Summary
 
Fields inherited from class org.apache.hadoop.util.ToolBase
conf
 
Constructor Summary
DeleteDuplicates()
           
DeleteDuplicates(Configuration conf)
           
 
Method Summary
 void checkOutputSpecs(FileSystem fs, JobConf job)
           
 void close()
           
 void configure(JobConf job)
           
 void dedup(Path[] indexDirs)
           
 RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress)
          Write nothing.
static void main(String[] args)
           
 void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
          Map [*,IndexDoc] pairs to [index,doc] pairs.
 void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
          Delete docs named in values from index named in key.
 int run(String[] args)
           
 void setConf(Configuration conf)
           
 
Methods inherited from class org.apache.hadoop.util.ToolBase
doMain, getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DeleteDuplicates

public DeleteDuplicates()

DeleteDuplicates

public DeleteDuplicates(Configuration conf)

Method Detail

configure

public void configure(JobConf job)
Specified by:
configure in interface JobConfigurable

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class ToolBase

close

public void close()
Specified by:
close in interface Closeable

map

public void map(WritableComparable key,
                Writable value,
                OutputCollector output,
                Reporter reporter)
         throws IOException
Map [*,IndexDoc] pairs to [index,doc] pairs.

Specified by:
map in interface Mapper
Throws:
IOException
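
As a rough illustration of the [*,IndexDoc] to [index,doc] mapping, the sketch below regroups records by the index they belong to, using plain Java collections instead of the Hadoop OutputCollector. The fields of the IndexDoc stand-in (index, doc, keep) are assumptions for the example, as is the rule that records flagged keep are not emitted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapStepSketch {

    /** Hypothetical record: which index a document lives in, and its number there. */
    static final class IndexDoc {
        final String index;  // name of the index holding the document
        final int doc;       // document number inside that index
        final boolean keep;  // assumption: true for the copy that should survive

        IndexDoc(String index, int doc, boolean keep) {
            this.index = index;
            this.doc = doc;
            this.keep = keep;
        }
    }

    /**
     * Emits one [index, doc] pair per record that is not flagged keep.
     * The incoming key is ignored, which is what the "*" in [*,IndexDoc]
     * is taken to mean here.
     */
    static List<Map.Entry<String, Integer>> map(List<IndexDoc> values) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (IndexDoc d : values) {
            if (d.keep) {
                continue; // retained documents are never scheduled for deletion
            }
            out.add(new SimpleEntry<String, Integer>(d.index, d.doc));
        }
        return out;
    }
}
```

Keying the output by index name means that all deletions for one index arrive together at a single reduce call.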

reduce

public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
            throws IOException
Delete docs named in values from index named in key.

Specified by:
reduce in interface Reducer
Throws:
IOException
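
The reduce contract (key names an index, values name the documents to delete from it) can be sketched without Hadoop or Lucene. In the example below a Map from index name to the set of live document numbers stands in for the real indexes; that representation and the method signature are assumptions for illustration only.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class ReduceStepSketch {

    /**
     * Removes every document number in docs from the index named by key.
     * indexes is an in-memory stand-in: index name -> live document numbers.
     */
    static void reduce(String key, Iterator<Integer> docs,
                       Map<String, Set<Integer>> indexes) {
        Set<Integer> live = indexes.get(key);
        if (live == null) {
            return; // unknown index: nothing to delete
        }
        while (docs.hasNext()) {
            live.remove(docs.next());
        }
    }
}
```

Because each call touches exactly one index, a real implementation could open that index once per key and apply all of its deletions in a single pass.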

getRecordWriter

public RecordWriter getRecordWriter(FileSystem fs,
                                    JobConf job,
                                    String name,
                                    Progressable progress)
                             throws IOException
Write nothing.

Specified by:
getRecordWriter in interface OutputFormat
Throws:
IOException

checkOutputSpecs

public void checkOutputSpecs(FileSystem fs,
                             JobConf job)
Specified by:
checkOutputSpecs in interface OutputFormat

dedup

public void dedup(Path[] indexDirs)
           throws IOException
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2006 The Apache Software Foundation