deeperlib.core package

Submodules

deeperlib.core.smartcrawl module

deeperlib.core.smartcrawl.smartCrawl(top_k, count, pool_thre, jaccard_thre, threads, budget, api, sampledata, localdata, hiddendata)[source]

Given a budget of b queries, SMARTCRAWL first constructs a query pool based on the local database and then iteratively issues b queries to the hidden database such that the union of the query results covers the maximum number of records in the local database. Finally, it performs entity resolution between the local database and the crawled records. (Source: DeepER: Deep Entity Resolution)
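
The selection loop described above can be pictured as greedy maximum coverage. The following is a minimal sketch under simplifying assumptions, not the library's implementation: `greedy_crawl` and its inputs are hypothetical names, the benefit of a query is taken to be the number of still-uncovered local records it matches, and the top-k limit, sampling-based benefit estimation, threading and entity resolution are all left out.

```python
def greedy_crawl(query_pool, budget):
    """Pick up to `budget` queries, each time taking the query that
    covers the most local records not yet covered.

    query_pool: dict mapping a query (tuple of keywords) to the set of
    local record ids it matches (hypothetical input format).
    """
    covered = set()
    issued = []
    remaining = dict(query_pool)  # copy so the caller's pool is untouched
    for _ in range(budget):
        if not remaining:
            break
        # Benefit of a query = number of uncovered records it would add.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        if not (remaining[best] - covered):
            break  # no remaining query adds anything new
        covered |= remaining.pop(best)
        issued.append(best)
    return issued, covered


query_pool = {
    ("database",): {"r1", "r2", "r3"},
    ("laboratory",): {"r3", "r4"},
    ("systems",): {"r2"},
}
issued, covered = greedy_crawl(query_pool, budget=2)
```

With a budget of 2, the sketch issues `("database",)` first and `("laboratory",)` second, because `("systems",)` adds no uncovered record after the first step.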

Parameters:
  • top_k – top-k constraint of the target API
  • count – size of the hidden database
  • pool_thre – frequency threshold for queries in the query pool
  • jaccard_thre – Jaccard similarity threshold
  • threads – number of queries issued at each iteration
  • budget – maximum number of API calls
  • api – an implementation of simapi for the target API
  • sampledata – SampleData object
  • localdata – LocalData object
  • hiddendata – HiddenData object
Returns:

deeperlib.core.utils module

deeperlib.core.utils.add_naiveIndex(queries, data, index)[source]

To build the index more efficiently, naive queries are added to the query pool and the inverted index only after the queries whose frequency exceeds the threshold have been processed.

Parameters:
  • queries – query pool without naive queries
  • data – local database
  • index – inverted index without naive queries
Returns:

query pool and inverted index with naive queries

deeperlib.core.utils.forwardIndex(D1index)[source]

A forward index maps a local record to all the queries that the record satisfies. Such a list is called a forward list. To build the index, we initialize a hash map F and let F(d) denote the forward list for d.

Parameters:
  • D1index – inverted index of local database
Returns:

a dict holding the forward index
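
As an illustration of the mapping described above, here is a minimal sketch that derives a forward index from an inverted index of the form {query: set(uniqueid)}. The helper name `build_forward_index` and the sample data are hypothetical; the structure that forwardIndex actually returns may differ in detail.

```python
from collections import defaultdict


def build_forward_index(inverted_index):
    """Turn {query: set(record ids)} into {record id: [queries]}."""
    forward = defaultdict(list)
    for query, record_ids in inverted_index.items():
        for record_id in record_ids:
            # F(d) collects every query that record d satisfies.
            forward[record_id].append(query)
    return dict(forward)


inverted_index = {
    ("database",): {"r1", "r2"},
    ("laboratory",): {"r1", "r3"},
    ("database", "laboratory"): {"r1"},
}
forward_index = build_forward_index(inverted_index)
```

Here `forward_index["r1"]` holds all three queries, while `forward_index["r2"]` holds only `("database",)`.
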
deeperlib.core.utils.initScore_biased(sampleindex, k, sr, Dratio, queries)[source]

Biased benefit estimation.

Parameters:
  • sampleindex – inverted index of sample
  • k – top-k restriction
  • sr – sample rate
  • Dratio – local database rate
  • queries – query pool
Returns:

query pool with biased benefit

deeperlib.core.utils.initScore_unbiased(sampleindex, D1index, k, sr, queries)[source]

Unbiased benefit estimation.

Parameters:
  • sampleindex – inverted index of sample
  • D1index – inverted index of local database
  • k – top-k restriction
  • sr – sample rate
  • queries – query pool
Returns:

query pool with unbiased benefit

deeperlib.core.utils.invertedIndex(queries, data)[source]

An inverted index maps each keyword to a list of local records that contain the keyword. Such a list is called an inverted list. To build the index, we initialize a hash map I and let I(w) denote the inverted list of keyword w. For each local record d in D, we enumerate each keyword w in document(d) and add d to I(w). Given a query q, we generate q(D) by intersecting the inverted lists of the keywords in the query.

Parameters:
  • queries – query pool, i.e. the closed frequent itemsets of the local database
  • data – local database or sample database
Returns:

an inverted index {query: set(uniqueid)}
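
The construction described above can be sketched as follows. This is an illustrative sketch only: `build_inverted_index` and `query_result` are hypothetical helpers that index single keywords, whereas invertedIndex itself is documented as returning an index keyed by query.

```python
from collections import defaultdict


def build_inverted_index(data):
    """Map each keyword w to I(w), the set of record ids containing w."""
    index = defaultdict(set)
    for record_id, keywords in data.items():
        for word in keywords:
            index[word].add(record_id)
    return index


def query_result(index, query):
    """Return q(D): the records that contain every keyword of the query."""
    inverted_lists = [index.get(word, set()) for word in query]
    if not inverted_lists:
        return set()
    return set.intersection(*inverted_lists)


local_data = {
    "r1": ["database", "laboratory"],
    "r2": ["database", "systems"],
    "r3": ["data", "laboratory"],
}
inverted = build_inverted_index(local_data)
matches = query_result(inverted, ("database", "laboratory"))
```

Only `r1` contains both keywords, so it is the only record in `matches`.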

deeperlib.core.utils.queryGene(D1, thre)[source]

Use FP-growth to generate a finite query pool.

Parameters:
  • D1 – local database, e.g. {'uniqueid': ['database', 'laboratory']}
  • thre – frequency threshold for queries
Returns:

the closed frequent itemsets of the local database
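
To make "closed frequent itemsets" concrete, the sketch below enumerates itemsets by brute force instead of FP-growth, so it is only suitable for tiny inputs. Everything in it is an assumption for illustration: the helper names, the `max_size` cap, and the reading of the threshold as an absolute record count. Closedness is checked only against the itemsets enumerated up to `max_size`.

```python
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=2):
    """Count every keyword combination up to max_size and keep the
    ones that occur in at least min_support records."""
    counts = {}
    for keywords in records.values():
        unique = sorted(set(keywords))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_support}


def closed_only(itemsets):
    """Drop itemsets that have a proper superset with the same support."""
    closed = {}
    for itemset, support in itemsets.items():
        items = set(itemset)
        has_equal_superset = any(
            items < set(other) and support == other_support
            for other, other_support in itemsets.items()
        )
        if not has_equal_superset:
            closed[itemset] = support
    return closed


records = {
    "r1": ["database", "laboratory"],
    "r2": ["database", "laboratory"],
    "r3": ["database", "systems"],
}
pool = closed_only(frequent_itemsets(records, min_support=2))
```

In this example `("laboratory",)` is frequent but not closed, because `("database", "laboratory")` occurs in exactly the same two records.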

deeperlib.core.utils.results_simjoin(er_result, D1_ER, jaccard_thre)[source]

An adapter for similarity join and smart crawl.

Parameters:
  • er_result – documents returned by the API at each iteration
  • D1_ER – local database
  • jaccard_thre – jaccard threshold
Returns:

match index and pair at each iteration
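
For readers unfamiliar with the Jaccard threshold, the sketch below shows how a threshold on Jaccard similarity can decide which local records match which returned documents. It is a nested-loop illustration with hypothetical names and token-list inputs, not the similarity join that results_simjoin adapts.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / float(len(union))


def match_pairs(local, returned, threshold):
    """Return (local id, returned id) pairs whose similarity reaches
    the threshold, comparing every local record with every document."""
    pairs = []
    for local_id, local_tokens in local.items():
        for hidden_id, hidden_tokens in returned.items():
            if jaccard(local_tokens, hidden_tokens) >= threshold:
                pairs.append((local_id, hidden_id))
    return pairs


local = {"l1": ["thai", "house", "restaurant"]}
returned = {"h1": ["thai", "house"], "h2": ["pizza", "house"]}
pairs = match_pairs(local, returned, threshold=0.5)
```

The pair (l1, h1) shares two of three distinct tokens and passes the 0.5 threshold, while (l1, h2) shares one of four and does not.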

deeperlib.core.utils.updateList(D1index)[source]

Store update information in an update list rather than updating the priority of each query in place in the priority queue.

Parameters:
  • D1index – inverted index of local database
Returns:

a dict of update information

Module contents