deeperlib.core package¶

Submodules¶

deeperlib.core.smartcrawl module¶

deeperlib.core.smartcrawl.smartCrawl(top_k, count, pool_thre, jaccard_thre, threads, budget, api, sampledata, localdata, hiddendata)[source]¶

Given a budget of b queries, SMARTCRAWL first constructs a query pool based on the local database and then iteratively issues b queries to the hidden database such that the union of the query results covers the maximum number of records in the local database. Finally, it performs entity resolution between the local database and the crawled records. (DeepER: Deep Entity Resolution)
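The iterative selection described above can be illustrated with a greedy coverage loop. This is a minimal sketch, not the library's implementation: `greedy_select` and its arguments are hypothetical names, the inverted index is assumed to map each query to the set of local record ids it matches, and the sketch ignores the top-k constraint, benefit estimation, and the actual API calls.

```python
def greedy_select(inverted_index, budget):
    """Pick at most `budget` queries, each time taking the query that
    covers the most local records not yet covered (hypothetical helper)."""
    covered = set()
    chosen = []
    remaining = dict(inverted_index)  # shallow copy, so the input stays intact
    for _ in range(budget):
        if not remaining:
            break
        # Query with the largest number of still-uncovered records.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining query adds coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


index = {
    ("deep",): {1, 2, 3},
    ("entity",): {3, 4, 6},
    ("crawl",): {5},
}
chosen, covered = greedy_select(index, budget=2)
```

With a budget of 2, the sketch selects the two queries that together cover the most local records and leaves the least useful query unissued.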

Parameters:


top_k – top-k constraint of the specific API
count – size of the hidden database
pool_thre – frequency threshold for queries in the query pool
jaccard_thre – Jaccard similarity threshold
threads – number of queries issued at each iteration
budget – maximum number of API calls
api – an implementation of simapi for the specific API
sampledata – SampleData object
localdata – LocalData object
hiddendata – HiddenData object

Returns:

deeperlib.core.utils module¶

deeperlib.core.utils.add_naiveIndex(queries, data, index)[source]¶

To make index construction more efficient, naive queries are added to the query pool and the inverted index only after the queries whose frequency exceeds the threshold have been processed.

Parameters:

queries – query pool without naive queries
data – local database
index – inverted index without naive queries

Returns:

query pool and inverted index with naive queries

deeperlib.core.utils.forwardIndex(D1index)[source]¶

A forward index maps a local record to all the queries that the record satisfies. Such a list is called a forward list. To build the index, we initialize a hash map F and let F(d) denote the forward list for d.

Parameters:	D1index – inverted index of local database.
Returns:	a dict of forward index.
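A minimal sketch of building a forward index by inverting an inverted index. The name `build_forward_index` is hypothetical, and the input is assumed to have the shape {query: set(uniqueid)}; the library's own function may differ in detail.

```python
from collections import defaultdict


def build_forward_index(inverted_index):
    """Map each record id d to F(d), the set of queries that d satisfies."""
    forward = defaultdict(set)
    for query, record_ids in inverted_index.items():
        for record_id in record_ids:
            forward[record_id].add(query)
    return dict(forward)


inverted = {
    ("deep",): {1, 2},
    ("deep", "learning"): {2},
}
forward = build_forward_index(inverted)
```

Here record 2 satisfies both queries, so its forward list contains both, while record 1 only appears under `("deep",)`.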

deeperlib.core.utils.initScore_biased(sampleindex, k, sr, Dratio, queries)[source]¶

Biased benefit estimation.

Parameters:

sampleindex – inverted index of sample
k – top-k restriction
sr – sample rate
Dratio – local database rate
queries – query pool

Returns:

query pool with biased benefit

deeperlib.core.utils.initScore_unbiased(sampleindex, D1index, k, sr, queries)[source]¶

Unbiased benefit estimation.

Parameters:

sampleindex – inverted index of sample
D1index – inverted index of local database
k – top-k restriction
sr – sample rate
queries – query pool

Returns:

query pool with unbiased benefit

deeperlib.core.utils.invertedIndex(queries, data)[source]¶

An inverted index maps each keyword to a list of local records that contain the keyword. Such a list is called an inverted list. To build the index, we initialize a hash map I and let I(w) denote the inverted list of keyword w. For each local record d in D, we enumerate each keyword w in document(d) and add d to I(w). Given a query q, we generate q(D) by intersecting the inverted lists of the keywords in the query.

Parameters:

queries – query pool, which is the set of closed frequent itemsets of the local database
data – local database or sample database

Returns:

an inverted index {query: set(uniqueid)}
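The construction described above can be sketched in two steps: build I(w) per keyword, then intersect the inverted lists of a query's keywords to obtain q(D). The helper names `build_keyword_index` and `query_result` are hypothetical, and the data is assumed to have the shape {uniqueid: [keywords]}.

```python
from collections import defaultdict


def build_keyword_index(data):
    """Build I(w): for every keyword w, the set of record ids containing w."""
    index = defaultdict(set)
    for record_id, keywords in data.items():
        for word in keywords:
            index[word].add(record_id)
    return index


def query_result(index, query):
    """Compute q(D) by intersecting the inverted lists of the query's keywords."""
    lists = [index.get(word, set()) for word in query]
    return set.intersection(*lists) if lists else set()


data = {
    "r1": ["deep", "entity"],
    "r2": ["deep", "crawl"],
    "r3": ["entity"],
}
keyword_index = build_keyword_index(data)
matches = query_result(keyword_index, ("deep", "entity"))
```

Only `r1` contains both keywords, so it is the only record in the result for the query `("deep", "entity")`.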

deeperlib.core.utils.queryGene(D1, thre)[source]¶

Use FP-growth to generate a finite query pool.

Parameters:

D1 – local database {'uniqueid': ['database', 'laboratory']}
thre – frequency threshold for queries

Returns:

the closed frequent itemsets of the local database
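To illustrate what a frequency threshold on queries means, the sketch below enumerates small keyword sets by brute force and keeps the frequent ones. It is deliberately not FP-growth and does not filter for closed itemsets; `frequent_keyword_sets` and `max_size` are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_sets(data, thre, max_size=2):
    """Brute-force stand-in for FP-growth: count every keyword set of up to
    `max_size` keywords and keep those occurring in at least `thre` records."""
    counts = Counter()
    for keywords in data.values():
        unique = sorted(set(keywords))  # sorted so each set has one canonical tuple
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    # Assumption: a set is frequent when its count is >= thre.
    return {combo for combo, n in counts.items() if n >= thre}


D1 = {
    "r1": ["database", "laboratory"],
    "r2": ["database", "systems"],
    "r3": ["database", "laboratory"],
}
pool = frequent_keyword_sets(D1, thre=2)
```

The keyword "systems" occurs in only one record, so no set containing it enters the pool.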

deeperlib.core.utils.results_simjoin(er_result, D1_ER, jaccard_thre)[source]¶

An adapter for similarity join and smart crawl.

Parameters:

er_result – documents returned by the API at each iteration
D1_ER – local database
jaccard_thre – Jaccard threshold

Returns:

match index and pair at each iteration
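A minimal sketch of matching returned documents against local records with a Jaccard threshold. It uses a nested loop rather than an optimized similarity join, and the names `jaccard` and `match_pairs`, the token-list data shapes, and the inclusive comparison are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))


def match_pairs(returned, local, jaccard_thre):
    """Return (local_id, returned_id) pairs whose similarity reaches the threshold."""
    pairs = []
    for returned_id, returned_tokens in returned.items():
        for local_id, local_tokens in local.items():
            if jaccard(returned_tokens, local_tokens) >= jaccard_thre:
                pairs.append((local_id, returned_id))
    return pairs


local = {"l1": ["deep", "entity", "resolution"]}
returned = {
    "h1": ["deep", "entity", "resolution", "paper"],
    "h2": ["smart", "crawl"],
}
pairs = match_pairs(returned, local, jaccard_thre=0.7)
```

In this example `l1` and `h1` share three of four distinct tokens, giving a similarity of 0.75, which passes the 0.7 threshold; `h2` shares nothing with `l1`.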

deeperlib.core.utils.updateList(D1index)[source]¶

Stores update information in an update list rather than updating the priority of each query in place in the priority queue.

Parameters:	D1index – inverted index of local database.
Returns:	a dict of update information
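One common way to avoid in-place priority updates is to apply pending changes lazily when a query reaches the top of the queue. The sketch below illustrates that idea with `heapq`; it is an assumption about how an update list can be used, not the library's code. `pop_best` and the `pending` dict are hypothetical, and scores are assumed to only decrease.

```python
import heapq


def pop_best(heap, pending):
    """Pop the query with the highest up-to-date score.

    `heap` holds (-score, query) tuples; `pending` maps a query to a score
    decrease that has been recorded but not yet applied (the update list).
    """
    while heap:
        neg_score, query = heapq.heappop(heap)
        delta = pending.pop(query, 0)
        if delta == 0:
            return query, -neg_score  # score is current, so this is the best
        # Apply the deferred decrease and put the query back in the queue.
        heapq.heappush(heap, (neg_score + delta, query))
    return None


heap = [(-5, "a"), (-4, "b"), (-1, "c")]
heapq.heapify(heap)
pending = {"a": 3}  # the score of "a" should drop from 5 to 2
first = pop_best(heap, pending)
second = pop_best(heap, pending)
```

Because stale entries are corrected only when popped, a query whose score changes many times between selections is re-inserted at most once per pop attempt.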
