deeperlib.data_processing package

Submodules

deeperlib.data_processing.data_process module

deeperlib.data_processing.data_process.alphnum(s)[source]

Filter a raw string, keeping only its letters and numbers.

Parameters:
  • s – a raw string of different kinds of characters
Returns:
  a new string of letters and numbers
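
A minimal usage sketch; the input and output shown are illustrative assumptions based on the description above, not output captured from the library:

    >>> from deeperlib.data_processing.data_process import alphnum
    >>> alphnum('Hello, World! 123')  # punctuation and spaces are dropped
    'HelloWorld123'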
deeperlib.data_processing.data_process.getElement(node_list, data)[source]

Get the specified element according to the node path provided by users.

Parameters:
  • node_list – the node path
  • data – the data in dictionary form
Returns:
  the specified element
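
A behavior sketch, assuming the node path is a list of keys walked through nested dictionaries; the sample data is hypothetical:

    >>> from deeperlib.data_processing.data_process import getElement
    >>> data = {'user': {'name': 'simon', 'affiliation': 'sfu'}}
    >>> getElement(['user', 'name'], data)
    'simon'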

deeperlib.data_processing.data_process.wordset(s, lower_case=True, alphanum_only=True)[source]

Split a raw string into a list of words.

Parameters:
  • s – a raw string of different kinds of characters
  • lower_case – a boolean value denoting whether to convert the raw string to lower case
  • alphanum_only – a boolean value denoting whether to keep only letters and numbers from the raw string
Returns:
  a list of words
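
An illustrative call with both defaults enabled; the exact tokenization shown is an assumption based on the parameter descriptions:

    >>> from deeperlib.data_processing.data_process import wordset
    >>> wordset('Database Laboratory, SFU!')
    ['database', 'laboratory', 'sfu']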

deeperlib.data_processing.hidden_data module

class deeperlib.data_processing.hidden_data.HiddenData(result_dir, uniqueid, matchlist)[source]

A HiddenData object keeps the data crawled from the API, in JSON format, in a dict. It provides methods to manipulate the data, such as defining your own way to pre-process the raw data, and saving the data and matched pairs to files.

Initialize the object. The structures of the messages returned by different APIs vary widely, so users or developers have to define the uniqueid and matchlist of the messages manually.

Parameters:
  • result_dir – the target directory for output files.
  • uniqueid – the uniqueid of returned messages.
  • matchlist – the fields of returned messages for similarity join.
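
A construction sketch; the directory and the field names passed for uniqueid and matchlist are hypothetical and depend on the API being crawled:

    >>> from deeperlib.data_processing.hidden_data import HiddenData
    >>> hidden = HiddenData('./result/', 'id', ['name', 'affiliation'])
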
getMatchList()[source]
getMatchPair()[source]
getMergeResult()[source]
getResultDir()[source]
getUniqueId()[source]
proResult(result_raw)[source]

Merge the raw data and keep it in a dict, then pre-process the raw data for similarity join.

Parameters:
  • result_raw – the raw result returned by the API
Returns:
  a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
Raises:
  KeyError – some messages may be missing some fields
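
Continuing the sketch above; the message structure and the exact return value are hypothetical:

    >>> result_raw = [{'id': '1', 'name': 'Yong Jun He',
    ...                'affiliation': 'Simon Fraser'}]
    >>> hidden.proResult(result_raw)
    [(['yong', 'jun', 'he', 'simon', 'fraser'], '1')]
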
saveMatchPair()[source]

Save the returned messages in the target directory (result_dir):

  • result_file.pkl
  • result_file.csv
  • match_file.pkl
  • match_file.csv

saveResult()[source]

Save the returned messages in the target directory (result_dir):

  • result_file.pkl
  • result_file.csv
  • match_file.pkl
  • match_file.csv
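
Continuing the sketch above; both methods take no arguments, and which of the four files each call writes is an assumption based on the method names:

    >>> hidden.saveResult()     # result_file.pkl / result_file.csv
    >>> hidden.saveMatchPair()  # match_file.pkl / match_file.csv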

setMatchList(matchlist)[source]
setMatchPair(matchpair)[source]
setMergeResult(mergeresult)[source]
setResultDir(result_dir)[source]
setUniqueId(uniqueid)[source]

deeperlib.data_processing.json2csv module

class deeperlib.data_processing.json2csv.Json2csv(jsondata, csv_path)[source]

Convert data in JSON format to CSV.
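
A conversion sketch; whether the CSV is written at construction time or by a separate method is not documented here, so only construction is shown, with hypothetical data:

    >>> from deeperlib.data_processing.json2csv import Json2csv
    >>> records = [{'id': '1', 'name': 'simon'}, {'id': '2', 'name': 'fraser'}]
    >>> converter = Json2csv(records, './result/output.csv')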

deeperlib.data_processing.local_data module

class deeperlib.data_processing.local_data.LocalData(localpath, filetype, uniqueid, querylist, matchlist)[source]

A LocalData object reads the data from an input file and then processes the raw data. Finally, it generates a set of uniqueids, a list for similarity join, and a dict for query pool generation.

Initialize the object. The structures of the messages supplied by users or developers vary widely, so they have to define the uniqueid, querylist and matchlist of the messages manually.

Parameters:
  • localpath – the path of the input file.
  • filetype – the file type.
  • uniqueid – the uniqueid of messages in the file.
  • querylist – the fields of messages for query pool generation.
  • matchlist – the fields of messages for similarity join.
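
A construction sketch with hypothetical paths and field names; filetype is assumed to select between the read_csv and read_pickle loaders below:

    >>> from deeperlib.data_processing.local_data import LocalData
    >>> local = LocalData('./data/local.csv', 'csv', 'id',
    ...                   ['name', 'affiliation'], ['name', 'affiliation'])
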
getFileType()[source]
getLocalPath()[source]
getMatchList()[source]
getQueryList()[source]
getUniqueId()[source]
getlocalData()[source]
read_csv()[source]

Load local data and then generate three important data structures used for smart crawl.

localdata_ids – a set of uniqueids, e.g. ('uniqueid1', 'uniqueid2')

localdata_query – for each message, split the fields defined by querylist into a list of words, filtering out stop words and words shorter than three characters; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
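
Continuing the construction sketch above, and assuming getlocalData() returns the three structures in the order suggested by setlocalData below:

    >>> local.read_csv()
    >>> localdata_ids, localdata_query, localdata_er = local.getlocalData()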

read_pickle()[source]

Load local data and then generate three important data structures used for smart crawl.

localdata_ids – a set of uniqueids, e.g. ('uniqueid1', 'uniqueid2')

localdata_query – for each message, split the fields defined by querylist into a list of words, filtering out stop words and words shorter than three characters; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]

setFileType(filetype)[source]
setLocalPath(localpath)[source]
setMatchList(matchlist)[source]
setQueryList(querylist)[source]
setUniqueId(uniqueid)[source]
setlocalData(localdata_ids, localdata_query, localdata_er)[source]

deeperlib.data_processing.sample_data module

class deeperlib.data_processing.sample_data.SampleData(samplepath, filetype, uniqueid, querylist)[source]

A SampleData object reads the data from an input file and then processes the raw data. Finally, it generates a dict for query pool generation.

Initialize the object. The structures of the messages supplied by users or developers vary widely, so they have to define the uniqueid and querylist of the messages manually.

Parameters:
  • samplepath – the path of the input file.
  • filetype – the file type.
  • uniqueid – the uniqueid of messages in the file.
  • querylist – the fields of messages for query pool generation.
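
A construction and loading sketch, mirroring LocalData but without matchlist; the paths, field names, and the shape of getSample()'s return value are hypothetical:

    >>> from deeperlib.data_processing.sample_data import SampleData
    >>> sample = SampleData('./data/sample.csv', 'csv', 'id',
    ...                     ['name', 'affiliation'])
    >>> sample.read_csv()
    >>> sample.getSample()
    {'uniqueid': ['database', 'laboratory']}
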
getFileType()[source]
getQueryList()[source]
getSample()[source]
getSamplePath()[source]
getUniqueId()[source]
read_csv()[source]

Load sample data and then generate the same data structure as localdata_query used for smart crawl.

sample – for each message, split the fields defined by querylist into a list of words; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

read_pickle()[source]

Load sample data and then generate the same data structure as localdata_query used for smart crawl.

sample – for each message, split the fields defined by querylist into a list of words; the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}

setFileType(filetype)[source]
setQueryList(querylist)[source]
setSample(sample)[source]
setSamplePath(samplepath)[source]
setUniqueId(uniqueid)[source]

Module contents