deeperlib.data_processing package

Submodules

deeperlib.data_processing.data_process module

deeperlib.data_processing.data_process.alphnum(s)[source]

Filter a raw string, keeping only its letters and numbers.

Parameters:
  • s – a raw string of different kinds of characters
Returns:
  a new string of letters and numbers
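
A minimal usage sketch; the input and output shown are illustrative assumptions based on the description above, not output captured from the library:

    >>> from deeperlib.data_processing.data_process import alphnum
    >>> alphnum('Hello, World! 123')  # punctuation and spaces are dropped
    'HelloWorld123'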
deeperlib.data_processing.data_process.getElement(node_list, data)[source]

Get the specified element according to the node path provided by users.

Parameters:
  • node_list – the node path
  • data – the data in dictionary form
Returns:
  the specified element
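
A behavior sketch, assuming the node path is a list of keys walked through nested dictionaries; the sample data is hypothetical:

    >>> from deeperlib.data_processing.data_process import getElement
    >>> data = {'user': {'name': 'simon', 'affiliation': 'sfu'}}
    >>> getElement(['user', 'name'], data)
    'simon'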

deeperlib.data_processing.data_process.wordset(s, lower_case=True, alphanum_only=True)[source]

Split a raw string into a list of words.

Parameters:
  • s – a raw string of different kinds of characters
  • lower_case – a boolean value denoting whether to convert the raw string to lower case
  • alphanum_only – a boolean value denoting whether to keep only letters and numbers from the raw string
Returns:
  a list of words
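
An illustrative call with both defaults enabled; the exact tokenization shown is an assumption based on the parameter descriptions:

    >>> from deeperlib.data_processing.data_process import wordset
    >>> wordset('Database Laboratory, SFU!')
    ['database', 'laboratory', 'sfu']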

deeperlib.data_processing.hidden_data module

class deeperlib.data_processing.hidden_data.HiddenData(result_dir, uniqueid, matchlist)[source]

A HiddenData object keeps the data crawled from the API, in JSON format, in a dict. It provides methods to manipulate the data, such as defining your own way to pre-process the raw data, and saving the data and matched pairs to files.

Initialize the object. The structures of the messages returned by different APIs vary widely, so users or developers have to define the uniqueid and matchlist of the messages manually.

Parameters:
  • result_dir – the target directory for output files.
  • uniqueid – the uniqueid of returned messages.
  • matchlist – the fields of returned messages for similarity join.
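
A construction sketch; the directory and the field names passed for uniqueid and matchlist are hypothetical and depend on the API being crawled:

    >>> from deeperlib.data_processing.hidden_data import HiddenData
    >>> hidden = HiddenData('./result/', 'id', ['name', 'affiliation'])
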
getMatchList()[source]
getMatchPair()[source]
getMergeResult()[source]
getResultDir()[source]
getUniqueId()[source]
proResult(result_raw)[source]

Merge the raw data and keep it in a dict, then pre-process the raw data for similarity join.

Parameters:
  • result_raw – the raw result returned by the API
Returns:
  a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
Raises:
  KeyError – some messages may be missing some fields
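
Continuing the sketch above; the message structure and the exact return value are hypothetical:

    >>> result_raw = [{'id': '1', 'name': 'Yong Jun He',
    ...                'affiliation': 'Simon Fraser'}]
    >>> hidden.proResult(result_raw)
    [(['yong', 'jun', 'he', 'simon', 'fraser'], '1')]
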
saveMatchPair()[source]

Save the returned messages in the target directory (result_dir):

  • result_file.pkl
  • result_file.csv
  • match_file.pkl
  • match_file.csv

saveResult()[source]

Save the returned messages in the target directory (result_dir):

  • result_file.pkl
  • result_file.csv
  • match_file.pkl
  • match_file.csv
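
Continuing the sketch above; both methods take no arguments, and which of the four files each call writes is an assumption based on the method names:

    >>> hidden.saveResult()     # result_file.pkl / result_file.csv
    >>> hidden.saveMatchPair()  # match_file.pkl / match_file.csv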

setMatchList(matchlist)[source]
setMatchPair(matchpair)[source]
setMergeResult(mergeresult)[source]
setResultDir(result_dir)[source]
setUniqueId(uniqueid)[source]

deeperlib.data_processing.json2csv module

class deeperlib.data_processing.json2csv.Json2csv(jsondata, csv_path)[source]

Convert data in JSON format to CSV.
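
A conversion sketch; whether the CSV is written at construction time or by a separate method is not documented here, so only construction is shown, with hypothetical data:

    >>> from deeperlib.data_processing.json2csv import Json2csv
    >>> records = [{'id': '1', 'name': 'simon'}, {'id': '2', 'name': 'fraser'}]
    >>> converter = Json2csv(records, './result/output.csv')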

deeperlib.data_processing.local_data module

class deeperlib.data_processing.local_data.LocalData(localpath, filetype, uniqueid, querylist, matchlist)[source]

A LocalData object reads the data from an input file and then processes the raw data. Finally, it generates a set of uniqueids, a list for similarity join, and a dict for query pool generation.

Initialize the object. The structures of the messages supplied by users or developers vary widely, so they have to define the uniqueid, querylist and matchlist of the messages manually.

Parameters:
  • localpath – the path of the input file.
  • filetype – the file type.
  • uniqueid – the uniqueid of messages in the file.
  • querylist – the fields of messages for query pool generation.
  • matchlist – the fields of messages for similarity join.
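
A construction sketch with hypothetical paths and field names; filetype is assumed to select between the read_csv and read_pickle loaders below:

    >>> from deeperlib.data_processing.local_data import LocalData
    >>> local = LocalData('./data/local.csv', 'csv', 'id',
    ...                   ['name', 'affiliation'], ['name', 'affiliation'])
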
getFileType()[source]
getLocalPath()[source]
getMatchList()[source]
getQueryList()[source]
getUniqueId()[source]
getlocalData()[source]
read_csv()[source]

Load local data and then generate three important data structures used for smart crawl.

localdata_ids – a set of uniqueids, e.g. ('uniqueid1', 'uniqueid2')

localdata_query – for each message, split the fields defined by querylist into a list of words, filtering out stop words and words shorter than three characters; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
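
Continuing the construction sketch above, and assuming getlocalData() returns the three structures in the order suggested by setlocalData below:

    >>> local.read_csv()
    >>> localdata_ids, localdata_query, localdata_er = local.getlocalData()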

read_pickle()[source]

Load local data and then generate three important data structures used for smart crawl.

localdata_ids – a set of uniqueids, e.g. ('uniqueid1', 'uniqueid2')

localdata_query – for each message, split the fields defined by querylist into a list of words, filtering out stop words and words shorter than three characters; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]

setFileType(filetype)[source]
setLocalPath(localpath)[source]
setMatchList(matchlist)[source]
setQueryList(querylist)[source]
setUniqueId(uniqueid)[source]
setlocalData(localdata_ids, localdata_query, localdata_er)[source]

deeperlib.data_processing.sample_data module

class deeperlib.data_processing.sample_data.SampleData(samplepath, filetype, uniqueid, querylist)[source]

A SampleData object reads the data from an input file and then processes the raw data. Finally, it generates a dict for query pool generation.

Initialize the object. The structures of the messages supplied by users or developers vary widely, so they have to define the uniqueid and querylist of the messages manually.

Parameters:
  • samplepath – the path of the input file.
  • filetype – the file type.
  • uniqueid – the uniqueid of messages in the file.
  • querylist – the fields of messages for query pool generation.
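
A construction and loading sketch, mirroring LocalData but without matchlist; the paths, field names, and the shape of getSample()'s return value are hypothetical:

    >>> from deeperlib.data_processing.sample_data import SampleData
    >>> sample = SampleData('./data/sample.csv', 'csv', 'id',
    ...                     ['name', 'affiliation'])
    >>> sample.read_csv()
    >>> sample.getSample()
    {'uniqueid': ['database', 'laboratory']}
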
getFileType()[source]
getQueryList()[source]
getSample()[source]
getSamplePath()[source]
getUniqueId()[source]
read_csv()[source]

Load sample data and then generate the same data structure as localdata_query used for smart crawl.

sample – for each message, split the fields defined by querylist into a list of words; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

read_pickle()[source]

Load sample data and then generate the same data structure as localdata_query used for smart crawl.

sample – for each message, split the fields defined by querylist into a list of words; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

setFileType(filetype)[source]
setQueryList(querylist)[source]
setSample(sample)[source]
setSamplePath(samplepath)[source]
setUniqueId(uniqueid)[source]

Module contents