deeperlib.data_processing.data_process.alphnum(s)[source]
Filter the letters and numbers out of a raw string.

Parameters: s – a raw string containing characters of many different kinds
Returns: a new string containing only letters and numbers
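A minimal sketch of what alphnum could look like. The regex-based filtering below is an assumption; only the function's name and its contract (string in, letters-and-numbers-only string out) come from the entry above.

```python
import re

def alphnum(s):
    # Keep only ASCII letters and digits from the raw string.
    # The regex approach is an assumed implementation of the documented
    # contract, not necessarily the library's own code.
    return re.sub(r'[^a-zA-Z0-9]', '', s)
```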
deeperlib.data_processing.data_process.getElement(node_list, data)[source]
Get the specified element according to the node path provided by users.

Parameters: node_list, data
Returns: the specified element
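A sketch of the path lookup this function describes. The traversal below is an assumption about how the node path is applied to nested data; the entry above only states that the element at the given path is returned.

```python
def getElement(node_list, data):
    # Walk the nested structure `data` (dicts and/or lists) along the
    # path of keys/indices given in `node_list`. Assumed semantics.
    element = data
    for node in node_list:
        element = element[node]
    return element
```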
deeperlib.data_processing.data_process.wordset(s, lower_case=True, alphanum_only=True)[source]
Split a raw string into a list of words.

Parameters: s, lower_case, alphanum_only
Returns: a list of words
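The splitting behavior can be sketched as follows. The exact tokenization rules are assumptions; the entry above only promises a list of words, with optional lower-casing and alphanumeric filtering suggested by the flag names.

```python
import re

def wordset(s, lower_case=True, alphanum_only=True):
    # Split on whitespace; optionally lower-case each word and strip
    # non-alphanumeric characters. Assumed implementation of the
    # documented signature.
    if lower_case:
        s = s.lower()
    words = s.split()
    if alphanum_only:
        words = [re.sub(r'[^a-zA-Z0-9]', '', w) for w in words]
        words = [w for w in words if w]  # drop words that became empty
    return words
```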
deeperlib.data_processing.local_data.LocalData(localpath, filetype, uniqueid, querylist, matchlist)[source]
A LocalData object reads the data from an input file and processes the raw data. It then generates a set of unique ids, a list for similarity join, and a dict for query pool generation.

Initialize the object. The data structures of the messages input by users or developers vary so widely that the uniqueid, querylist, and matchlist of the messages have to be defined manually.

Parameters: localpath, filetype, uniqueid, querylist, matchlist
read_csv()[source]
Load local data and generate the three data structures used for smart crawl:

localdata_ids – a set of unique ids, e.g. ('uniqueid1', 'uniqueid2')
localdata_query – the fields named by querylist are split into a list of words for each message; stop words and words shorter than three characters are filtered out, and the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}
localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
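The three structures above can be sketched like this. The input row layout (list of dicts), the helper name, and the stop-word list are all assumptions for illustration; only the shapes of localdata_ids, localdata_query, and localdata_er come from the entry above.

```python
STOPWORDS = {'the', 'and', 'of'}  # illustrative stop-word list, not the library's

def build_structures(rows, uniqueid, querylist, matchlist):
    # rows: list of dicts, one per message (assumed input shape).
    localdata_ids = set()
    localdata_query = {}
    localdata_er = []
    for row in rows:
        rid = row[uniqueid]
        localdata_ids.add(rid)
        # Words from the query fields, minus stop words and words
        # shorter than three characters -> dict for query pool generation.
        localdata_query[rid] = [
            w for f in querylist for w in row[f].lower().split()
            if w not in STOPWORDS and len(w) >= 3
        ]
        # Words from the match fields, paired with the id, for similarity join.
        localdata_er.append(
            ([w for f in matchlist for w in row[f].lower().split()], rid)
        )
    return localdata_ids, localdata_query, localdata_er
```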
read_pickle()[source]
Load local data and generate the three data structures used for smart crawl:

localdata_ids – a set of unique ids, e.g. ('uniqueid1', 'uniqueid2')
localdata_query – the fields named by querylist are split into a list of words for each message; stop words and words shorter than three characters are filtered out, and the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}
localdata_er – a list for similarity join, e.g. [(['yong', 'jun', 'he', 'simon', 'fraser'], 'uniqueid')]
deeperlib.data_processing.sample_data.SampleData(samplepath, filetype, uniqueid, querylist)[source]
A SampleData object reads the data from an input file and processes the raw data. It then generates a dict for query pool generation.

Initialize the object. The data structures of the messages input by users or developers vary so widely that the uniqueid and querylist of the messages have to be defined manually.

Parameters: samplepath, filetype, uniqueid, querylist
read_csv()[source]
Load sample data and generate the same data structure as localdata_query, used for smart crawl:

sample – the fields named by querylist are split into a list of words for each message, and the result is a dict for query pool generation, e.g. {'uniqueid': ['database', 'laboratory']}
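The sample dict can be sketched as below. The row layout and helper name are assumptions; only the output shape {'uniqueid': ['database', 'laboratory']} comes from the entry above. Note that, per the descriptions here, the sample dict does not apply the stop-word and length filtering that localdata_query does.

```python
def build_sample(rows, uniqueid, querylist):
    # Build the `sample` dict: unique id -> list of words drawn from the
    # fields named by querylist. Assumed input shape: list of dicts.
    sample = {}
    for row in rows:
        sample[row[uniqueid]] = [
            w for f in querylist for w in row[f].lower().split()
        ]
    return sample
```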