Packages that use CrawlDatum | |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.crawl | Crawl control code. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Maintain Lucene full-text indexes. |
org.apache.nutch.indexer.basic | A basic indexing plugin. |
org.apache.nutch.indexer.more | A more indexing plugin. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.apache.nutch.protocol | |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient) |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP protocol. |
org.apache.nutch.scoring | |
org.apache.nutch.scoring.opic | |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Uses of CrawlDatum in org.apache.nutch.analysis.lang |
---|
Methods in org.apache.nutch.analysis.lang with parameters of type CrawlDatum | |
---|---|
Document | LanguageIndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) |
Uses of CrawlDatum in org.apache.nutch.crawl |
---|
Fields in org.apache.nutch.crawl declared as CrawlDatum | |
---|---|
CrawlDatum | Generator.SelectorEntry.datum |
Methods in org.apache.nutch.crawl that return CrawlDatum | |
---|---|
CrawlDatum | CrawlDbReader.get(String crawlDb, String url, Configuration config) |
static CrawlDatum | CrawlDatum.read(DataInput in) |
Methods in org.apache.nutch.crawl with parameters of type CrawlDatum | |
---|---|
static boolean | CrawlDatum.hasDbStatus(CrawlDatum datum) |
static boolean | CrawlDatum.hasFetchStatus(CrawlDatum datum) |
void | CrawlDatum.set(CrawlDatum that) - Copy the contents of another instance into this instance. |
Uses of CrawlDatum in org.apache.nutch.fetcher |
---|
Methods in org.apache.nutch.fetcher that return CrawlDatum | |
---|---|
CrawlDatum | FetcherOutput.getCrawlDatum() |
Constructors in org.apache.nutch.fetcher with parameters of type CrawlDatum | |
---|---|
FetcherOutput(CrawlDatum crawlDatum, Content content, ParseImpl parse) | |
Uses of CrawlDatum in org.apache.nutch.indexer |
---|
Methods in org.apache.nutch.indexer with parameters of type CrawlDatum | |
---|---|
Document | IndexingFilters.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) - Run all defined filters. |
Document | IndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) - Adds fields or otherwise modifies the document that will be indexed for a parse. |
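
The relationship between `IndexingFilters.filter` ("run all defined filters") and `IndexingFilter.filter` (modify the document) can be pictured as a simple chain. The sketch below is illustrative only: `Doc`, `Datum`, `Filter`, and `runAll` are hypothetical stand-ins, not the Nutch or Lucene types, and the real signature also carries `Parse` and `Inlinks` arguments that are omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingChainSketch {

    /** Hypothetical stand-in for a Lucene Document: an ordered field map. */
    static class Doc {
        final Map<String, String> fields = new LinkedHashMap<>();
    }

    /** Hypothetical stand-in for CrawlDatum, carrying only a score. */
    static class Datum {
        final float score;
        Datum(float score) { this.score = score; }
    }

    /** Simplified filter contract: take a document, return the (possibly modified) document. */
    interface Filter {
        Doc filter(Doc doc, String url, Datum datum);
    }

    /** Runs every filter in order, feeding each one the result of the previous one. */
    static Doc runAll(List<Filter> filters, Doc doc, String url, Datum datum) {
        for (Filter f : filters) {
            doc = f.filter(doc, url, datum);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Filter> filters = new ArrayList<>();
        // First filter adds the URL, second adds the score taken from the datum.
        filters.add((doc, url, datum) -> { doc.fields.put("url", url); return doc; });
        filters.add((doc, url, datum) -> { doc.fields.put("score", Float.toString(datum.score)); return doc; });

        Doc out = runAll(filters, new Doc(), "http://example.com/", new Datum(1.0f));
        System.out.println(out.fields);
    }
}
```

Each filter sees the fields added by the filters before it, which is why the order of the chain matters in a design like this.
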
Uses of CrawlDatum in org.apache.nutch.indexer.basic |
---|
Methods in org.apache.nutch.indexer.basic with parameters of type CrawlDatum | |
---|---|
Document | BasicIndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) |
Uses of CrawlDatum in org.apache.nutch.indexer.more |
---|
Methods in org.apache.nutch.indexer.more with parameters of type CrawlDatum | |
---|---|
Document | MoreIndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) |
Uses of CrawlDatum in org.apache.nutch.microformats.reltag |
---|
Methods in org.apache.nutch.microformats.reltag with parameters of type CrawlDatum | |
---|---|
Document | RelTagIndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) |
Uses of CrawlDatum in org.apache.nutch.protocol |
---|
Methods in org.apache.nutch.protocol with parameters of type CrawlDatum | |
---|---|
ProtocolOutput | Protocol.getProtocolOutput(Text url, CrawlDatum datum) - Returns the Content for a fetchlist entry. |
RobotRules | Protocol.getRobotRules(Text url, CrawlDatum datum) - Retrieve robot rules applicable for this url. |
Uses of CrawlDatum in org.apache.nutch.protocol.file |
---|
Methods in org.apache.nutch.protocol.file with parameters of type CrawlDatum | |
---|---|
ProtocolOutput | File.getProtocolOutput(Text url, CrawlDatum datum) |
RobotRules | File.getRobotRules(Text url, CrawlDatum datum) |
Constructors in org.apache.nutch.protocol.file with parameters of type CrawlDatum | |
---|---|
FileResponse(URL url, CrawlDatum datum, File file, Configuration conf) | |
Uses of CrawlDatum in org.apache.nutch.protocol.ftp |
---|
Methods in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum | |
---|---|
ProtocolOutput | Ftp.getProtocolOutput(Text url, CrawlDatum datum) |
RobotRules | Ftp.getRobotRules(Text url, CrawlDatum datum) |
Constructors in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum | |
---|---|
FtpResponse(URL url, CrawlDatum datum, Ftp ftp, Configuration conf) | |
Uses of CrawlDatum in org.apache.nutch.protocol.http |
---|
Methods in org.apache.nutch.protocol.http with parameters of type CrawlDatum | |
---|---|
protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) |
Constructors in org.apache.nutch.protocol.http with parameters of type CrawlDatum | |
---|---|
HttpResponse(HttpBase http, URL url, CrawlDatum datum) | |
Uses of CrawlDatum in org.apache.nutch.protocol.http.api |
---|
Methods in org.apache.nutch.protocol.http.api with parameters of type CrawlDatum | |
---|---|
ProtocolOutput | HttpBase.getProtocolOutput(Text url, CrawlDatum datum) |
protected abstract Response | HttpBase.getResponse(URL url, CrawlDatum datum, boolean followRedirects) |
RobotRules | HttpBase.getRobotRules(Text url, CrawlDatum datum) |
Uses of CrawlDatum in org.apache.nutch.protocol.httpclient |
---|
Methods in org.apache.nutch.protocol.httpclient with parameters of type CrawlDatum | |
---|---|
protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) |
Constructors in org.apache.nutch.protocol.httpclient with parameters of type CrawlDatum | |
---|---|
HttpResponse(HttpBase http, URL url, CrawlDatum datum) | |
Uses of CrawlDatum in org.apache.nutch.scoring |
---|
Methods in org.apache.nutch.scoring that return CrawlDatum | |
---|---|
CrawlDatum | ScoringFilters.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
CrawlDatum | ScoringFilter.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) - Distribute score value from the current page to all its outlinked pages. |
Methods in org.apache.nutch.scoring with parameters of type CrawlDatum | |
---|---|
CrawlDatum | ScoringFilters.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
CrawlDatum | ScoringFilter.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) - Distribute score value from the current page to all its outlinked pages. |
float | ScoringFilters.generatorSortValue(Text url, CrawlDatum datum, float initSort) - Calculate a sort value for Generate. |
float | ScoringFilter.generatorSortValue(Text url, CrawlDatum datum, float initSort) - This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation. |
float | ScoringFilters.indexerScore(Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
float | ScoringFilter.indexerScore(Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) - This method calculates a Lucene document boost. |
void | ScoringFilters.initialScore(Text url, CrawlDatum datum) - Calculate a new initial score, used when adding newly discovered pages. |
void | ScoringFilter.initialScore(Text url, CrawlDatum datum) - Set an initial score for newly discovered pages. |
void | ScoringFilters.injectedScore(Text url, CrawlDatum datum) - Calculate a new initial score, used when injecting new pages. |
void | ScoringFilter.injectedScore(Text url, CrawlDatum datum) - Set an initial score for newly injected pages. |
void | ScoringFilters.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) |
void | ScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) - This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
void | ScoringFilters.updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked) - Calculate updated page score during CrawlDb.update(). |
void | ScoringFilter.updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked) - This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages. |
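
To make the role of `generatorSortValue` more concrete, here is a minimal sketch of how a sort value could drive top-N selection during fetchlist generation. It is not the Nutch Generator: `Entry` and `selectTopN` are hypothetical names, and the formula "score multiplied by `initSort`" is an assumption chosen for illustration, since the table only says that a sort value is prepared.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GeneratorSortSketch {

    /** Hypothetical stand-in for a (url, CrawlDatum) pair; only the score is modeled. */
    static class Entry {
        final String url;
        final float score;
        Entry(String url, float score) { this.url = url; this.score = score; }
    }

    /** Assumed sort value: the datum's score scaled by the incoming initSort. */
    static float generatorSortValue(Entry e, float initSort) {
        return e.score * initSort;
    }

    /** Returns the URLs of at most topN entries, highest sort value first. */
    static List<String> selectTopN(List<Entry> entries, int topN) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> generatorSortValue(e, 1.0f)).reversed());
        List<String> urls = new ArrayList<>();
        for (Entry e : sorted.subList(0, Math.min(topN, sorted.size()))) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("http://a.example/", 0.2f));
        entries.add(new Entry("http://b.example/", 1.5f));
        entries.add(new Entry("http://c.example/", 0.7f));
        // Keeps the two highest-scoring URLs: b first, then c.
        System.out.println(selectTopN(entries, 2));
    }
}
```
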
Uses of CrawlDatum in org.apache.nutch.scoring.opic |
---|
Methods in org.apache.nutch.scoring.opic that return CrawlDatum | |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) - Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
Methods in org.apache.nutch.scoring.opic with parameters of type CrawlDatum | |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) - Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
float | OPICScoringFilter.generatorSortValue(Text url, CrawlDatum datum, float initSort) - Use getScore(). |
float | OPICScoringFilter.indexerScore(Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) - Dampen the boost value by scorePower. |
void | OPICScoringFilter.initialScore(Text url, CrawlDatum datum) - Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. |
void | OPICScoringFilter.injectedScore(Text url, CrawlDatum datum) - Set to the value defined in config, 1.0f by default. |
void | OPICScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) - Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. |
void | OPICScoringFilter.updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked) - Increase the score by a sum of inlinked scores. |
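
The arithmetic described in the OPIC rows above can be sketched with plain floats. This is a simplified illustration, not the `OPICScoringFilter` source: the class and method names are hypothetical, the table does not say which outlink count (`allCount` or `validCount`) is the divisor, and the exact dampening formula (score raised to `scorePower`, times `initScore`) is an assumption.

```java
public class OpicSketch {

    /** Initial score for newly discovered pages, per the table: 0.0f (unknown value). */
    static final float INITIAL_SCORE = 0.0f;

    /** Default score for injected pages, per the table: 1.0f unless configured otherwise. */
    static final float INJECTED_SCORE = 1.0f;

    /** Share of a page's score passed to each outlink: score divided by the outlink count. */
    static float scorePerOutlink(float pageScore, int outlinkCount) {
        if (outlinkCount <= 0) {
            return 0.0f; // nothing to distribute to
        }
        return pageScore / outlinkCount;
    }

    /** Updated score: the current score increased by the sum of inlinked contributions. */
    static float updateDbScore(float currentScore, float[] inlinkedScores) {
        float sum = 0.0f;
        for (float s : inlinkedScores) {
            sum += s;
        }
        return currentScore + sum;
    }

    /** Assumed dampening: raise the stored score to scorePower and scale by initScore. */
    static float indexerScore(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // An injected page with four outlinks gives each outlink a quarter of its score.
        float share = scorePerOutlink(INJECTED_SCORE, 4);
        // A newly discovered page that receives two such contributions.
        float updated = updateDbScore(INITIAL_SCORE, new float[] { share, share });
        System.out.println("share=" + share + " updated=" + updated);
        System.out.println("boost=" + indexerScore(4.0f, 0.5f, 1.0f));
    }
}
```

With a `scorePower` below 1, large scores are compressed more than small ones, which is one plausible reading of "dampen" here.
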
Uses of CrawlDatum in org.creativecommons.nutch |
---|
Methods in org.creativecommons.nutch with parameters of type CrawlDatum | |
---|---|
Document | CCIndexingFilter.filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) |