sner.scripts package¶
Subpackages¶
Submodules¶
sner.scripts.analysis module¶
-
sner.scripts.analysis.main(data, options)¶
Currently this region of code is not in full working order. It originally analyzed the corpus to look for potential rules for identifying names: it generated every possible spelling and context rule, then assessed each rule's performance.
Parameters: - data (string) – the full text of various tablets.
- options – object containing various configuration information. (This type appears to no longer exist in the code base; it was intended to be replaced with the config object.)
Returns: None
Raises: None
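Although the module is not fully working, the spelling-rule assessment it describes can be sketched: for each syllable gram, compare how often it occurs inside known names with how often it occurs overall. This is a hypothetical reconstruction, not the original implementation; the function name and data shapes are assumptions.

```python
from collections import defaultdict

def assess_spelling_rules(words, names):
    """Score every syllable as a potential name-identification rule.

    words: list of all words in the corpus (syllables dash-separated);
    names: set of known names.  Returns {syllable: fraction of its
    occurrences that fell inside a known name}.
    """
    name_count = defaultdict(int)
    total_count = defaultdict(int)
    for word in words:
        for syll in word.split('-'):
            total_count[syll] += 1
            if word in names:
                name_count[syll] += 1
    return {s: name_count[s] / total_count[s] for s in total_count}
```

A gram scoring near 1.0 occurs almost exclusively inside names and is therefore a strong candidate rule.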
sner.scripts.context module¶
-
sner.scripts.context.main(text, name)¶
Parameters: - text – line from the corpus, split into an array on ' '.
- name – the PN found in this line.
Returns: None (probably fills the left_rules and right_rules dicts)
Raises: None
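The docstring suggests this function records the words to the left and right of each name occurrence as candidate context rules. A minimal sketch of that idea, assuming module-level left_rules/right_rules dictionaries as hinted above:

```python
from collections import defaultdict

# Hypothetical module-level rule dictionaries, as hinted at in the docstring.
left_rules = defaultdict(int)
right_rules = defaultdict(int)

def record_context(text, name):
    """Count the words immediately left and right of each occurrence
    of `name` in `text` (a line already split on ' ')."""
    for i, word in enumerate(text):
        if word != name:
            continue
        if i > 0:
            left_rules[text[i - 1]] += 1
        if i < len(text) - 1:
            right_rules[text[i + 1]] += 1
```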
sner.scripts.export module¶
-
sner.scripts.export.findKnown(config, known_pn, known_gn)¶
Iterates through the attestations file for personal and geographical names and adds the line IDs, in a list, to the known_pn and known_gn dictionaries.
Parameters: - config –
- known_pn – { Personal name : line ID }, to be filled in.
- known_gn – { Geographical name : line ID }, to be filled in.
Returns: Updates the known_pn and known_gn dictionaries with the line IDs.
Raises: None
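The fill-in behavior can be sketched as below. The attestations file format is not specified here, so the csv layout (name, type, line ID per row) and the function name are assumptions for illustration only:

```python
import csv

def find_known(attestations_path, known_pn, known_gn):
    """Fill known_pn / known_gn ({name: [line IDs]}) from an attestations
    file, assumed here to be a csv of (name, type, line_id) rows."""
    with open(attestations_path, newline='') as f:
        for name, name_type, line_id in csv.reader(f):
            if name_type == 'PN':
                known_pn.setdefault(name, []).append(line_id)
            elif name_type == 'GN':
                known_gn.setdefault(name, []).append(line_id)
```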
-
sner.scripts.export.main(config)¶
Finds all names in the options.attestations file, and goes through each word in the main options.corpus file to create a sparse matrix file to be used with scikit-learn.
Parameters: - options.norm_num – True to normalize numbers.
- options.norm_prof – True to normalize professions.
- options.norm_geo – True to normalize geographical names.
Returns: csv file with the format TabletID, LineID, LocationInSentence, Word, Word Type. Word types: ‘-’ for unknowns, ‘PN’ for personal names, ‘GN’ for geographical names, ‘PF’ for professions.
Return type: csv file
Raises: None
-
sner.scripts.export.writeKey(path)¶
-
sner.scripts.export.writeLine(x_index, config, line, out_features, out_target, out_key, known_pn, known_gn, test_run)¶
-
sner.scripts.export.writeSparse(config, out_features, word_left, word_middle, word_right, x_index)¶
Writes a single x vector of features, in a one-hot-inspired representation, to the out_features file.
Parameters: - out_features – output file to write features to.
- word_left – the word to the left of the word being output.
- word_middle – the word in question.
- word_right – the right context of the word being output.
- x_index – the row ID for the feature entry.
Returns: Nothing
Raises: None
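A "one-hot inspired" row over a (left, middle, right) word window can be sketched as follows. The vocabulary index, output format (row, column, value triples), and function name are assumptions; the real file layout may differ:

```python
def write_sparse(out_features, word_left, word_middle, word_right, x_index, vocab):
    """Write one row of a sparse one-hot-style feature matrix: three
    one-hot blocks (left, middle, right word), emitted as
    "row column value" lines.  `vocab` ({word: id}) is a hypothetical
    vocabulary index shared by all three blocks."""
    n = len(vocab)
    for block, word in enumerate((word_left, word_middle, word_right)):
        if word in vocab:
            # column = block offset + word id; value 1 marks presence
            out_features.write(f"{x_index} {block * n + vocab[word]} 1\n")
```

Only the nonzero entries are written, which is what keeps the matrix file sparse even with a large vocabulary.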
-
sner.scripts.export.writeTarget(out_target, isName, isGN)¶
-
sner.scripts.export.writeWord(tablet_id, line_id, i, last_word, word, next_word, x_index, out_key, out_features, out_target, PN, GN, test_run, config)¶
sner.scripts.export_atf module¶
-
sner.scripts.export_atf.main(config)¶
Finds all names in the options.attestations file, and goes through each word in the main options.corpus file to create a sparse matrix file to be used with scikit-learn.
Parameters: - options.norm_num – True to normalize numbers.
- options.norm_prof – True to normalize professions.
- options.norm_geo – True to normalize geographical names.
Returns: csv file with the format TabletID, LineID, LocationInSentence, Word, Word Type. Word types: ‘-’ for unknowns, ‘PN’ for personal names, ‘GN’ for geographical names, ‘PF’ for professions.
Return type: csv file
Raises: None
-
sner.scripts.export_atf.writeLine(config, line, line_id, out_features, out_target, out_key, known_pn, known_gn, tablet_id)¶
-
sner.scripts.export_atf.writeSparse(out_features, word_left, word_middle, word_right, x_index)¶
Writes a single x vector of features, in a one-hot-inspired representation, to the out_features file.
Parameters: - out_features – output file to write features to.
- word_left – the word to the left of the word being output.
- word_middle – the word in question.
- word_right – the right context of the word being output.
- x_index – the row ID for the feature entry.
Returns: Nothing
Raises: None
sner.scripts.formatting module¶
-
sner.scripts.formatting.findKnown(data, options, knownPN, knownGN)¶
Iterates through the attestations file for personal and geographical names and adds the line IDs, in a list, to the knownPN and knownGN dictionaries.
Parameters: - data –
- options –
- knownPN – { Personal name : line ID }, to be filled in.
- knownGN – { Geographical name : line ID }, to be filled in.
Returns: Updates the knownPN and knownGN dictionaries with the line IDs.
Raises: None
-
sner.scripts.formatting.main(data, options)¶
Finds all names in the options.attestations file, and goes through each word in the main options.corpus file to fill a new output file specified by options.output.
Parameters: - options.norm_num – True to normalize numbers.
- options.norm_prof – True to normalize professions.
- options.norm_geo – True to normalize geographical names.
Returns: csv file with the format TabletID, LineID, LocationInSentence, Word, Word Type. Word types: ‘-’ for unknowns, ‘PN’ for personal names, ‘GN’ for geographical names, ‘PF’ for professions.
Return type: csv file
Raises: None
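The word-tagging implied by the Returns description can be sketched as below. The helper names and the use of plain lookup sets for known names are assumptions for illustration:

```python
def tag_word(word, known_pn, known_gn, known_prof):
    """Assign the word-type label used in the output csv:
    'PN', 'GN', 'PF', or '-' for unknowns."""
    if word in known_pn:
        return 'PN'
    if word in known_gn:
        return 'GN'
    if word in known_prof:
        return 'PF'
    return '-'

def format_line(tablet_id, line_id, words, known_pn, known_gn, known_prof):
    """Yield one csv row per word:
    TabletID, LineID, LocationInSentence, Word, Word Type."""
    for i, word in enumerate(words):
        label = tag_word(word, known_pn, known_gn, known_prof)
        yield f"{tablet_id},{line_id},{i},{word},{label}"
```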
sner.scripts.import_corpus module¶
sner.scripts.names module¶
Object representing the various types of names. Formatted as dictionaries mapping plaintext versions of names to the number of occurrences of each name.
sner.scripts.output module¶
-
sner.scripts.output.addNames(file_key, file_target, names)¶
-
sner.scripts.output.add_names_and_unique(file_key, file_target, names, train_names, output_list, index=3)¶
-
sner.scripts.output.main(config)¶
-
sner.scripts.output.numeric_compare(x, y)¶
-
sner.scripts.output.outputATF(config, output_list)¶
sner.scripts.overfit_check module¶
-
sner.scripts.overfit_check.addNames(file_key, file_target, names)¶
-
sner.scripts.overfit_check.add_names_and_unique(file_key, file_target, names, train_names, output_list, index=3)¶
-
sner.scripts.overfit_check.main(config)¶
-
sner.scripts.overfit_check.numeric_compare(x, y)¶
-
sner.scripts.overfit_check.outputATF(config, output_list)¶
sner.scripts.professions module¶
-
sner.scripts.professions.main()¶
Testing function
-
sner.scripts.professions.replaceProfessions(line)¶
Replaces known professions with ‘profession’
sner.scripts.readnames module¶
-
sner.scripts.readnames.getKgrams(names, k)¶
Parameters: - names – dictionary of the form { Name : Occurrences }.
- k – the maximum k-gram order to retrieve. Special values: k = -1 returns a dictionary of monograms, and k = -2 returns a dictionary of monograms followed by a dictionary of bigrams.
- Example – getKgrams(getPNs(), 3) retrieves all monograms, bigrams, and trigrams.
Returns: A dictionary of all grams up to order k
Raises: None
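For k >= 1, the gram collection can be sketched as follows. The dash-separated syllable convention comes from the utilities module; the function name styling and the handling shown here are assumptions, and the special negative values of k are omitted for brevity:

```python
from collections import defaultdict

def get_kgrams(names, k):
    """Collect syllable k-grams from a {name: occurrences} dict.

    For k >= 1, returns one dict mapping every gram of order 1..k to its
    occurrence count, weighted by how often the containing name occurred.
    Syllables within a name are dash-separated.  (The documented special
    values k = -1 and k = -2 are not handled in this sketch.)
    """
    grams = defaultdict(int)
    for name, count in names.items():
        sylls = name.split('-')
        for order in range(1, k + 1):
            for i in range(len(sylls) - order + 1):
                grams['-'.join(sylls[i:i + order])] += count
    return dict(grams)
```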
-
sner.scripts.readnames.getPNs(data)¶
Collects names in the Garshana csv and returns them
-
sner.scripts.readnames.getPersonalNames(csvFile)¶
Parameters: csvFile – a csv file with the name at line[5] and an identifier of its type at line[9], where we are looking for ‘PN’, indicating a personal name.
Returns: The names found in the csv file, in a dictionary of the form { Name : Occurrences }.
Raises: None
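Given the documented column positions (name at index 5, type at index 9), the collection step might look like this; the function name styling and file encoding are assumptions:

```python
import csv
from collections import defaultdict

def get_personal_names(csv_path):
    """Collect personal names from a Garshana-style csv where column 5
    holds the name and column 9 its type ('PN' for personal names).
    Returns {name: occurrences}."""
    names = defaultdict(int)
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            if len(row) > 9 and row[9].strip() == 'PN':
                names[row[5]] += 1
    return dict(names)
```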
-
sner.scripts.readnames.testKgrams()¶
A simple test that prints off trigrams and quadgrams that occur more than once. [Note: duplicate names will trigger these.]
sner.scripts.spelling module¶
Generates dictionaries of monograms, bigrams, and trigrams, mapping each to a list containing the number of times it occurs in names and the number of times it occurs in total. This information is later used to evaluate each gram’s usefulness as a name-recognition rule.
-
sner.scripts.spelling.addBigram(gram, namecount, totalcount)¶
-
sner.scripts.spelling.addMonogram(gram, namecount, totalcount)¶
-
sner.scripts.spelling.addTrigram(gram, namecount, totalcount)¶
-
sner.scripts.spelling.analyzeData(f)¶
Collects percentage statistics from the gram maps and writes them to a .csv output file.
Parameters: f (file) – output csv with the columns N-Gram, Percentage, Occurrence, Total Occurrence.
Returns: None
Raises: None
-
sner.scripts.spelling.gramhelper(gramDict, gram, namecount, totalcount)¶
Utility function to insert arbitrary n-grams into dictionaries that associate each gram with a list containing the number of times that gram occurred in a name as well as the number of times it occurred in total.
Parameters: - gramDict – dictionary to receive an n-gram.
- gram – gram to be inserted into the dict.
- namecount – the occurrences of this n-gram in a name.
- totalcount – how many times this gram has occurred in any word in the corpus.
Returns: Updates gramDict with the information from gram
Raises: None
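A minimal sketch of this insert-or-accumulate helper. The [namecount, totalcount] ordering of the two-item list is an assumption based on the module description above:

```python
def gramhelper(gram_dict, gram, namecount, totalcount):
    """Insert an n-gram into gram_dict, which maps each gram to a
    two-item list [occurrences in names, occurrences overall];
    repeated inserts accumulate the counts."""
    if gram in gram_dict:
        gram_dict[gram][0] += namecount
        gram_dict[gram][1] += totalcount
    else:
        gram_dict[gram] = [namecount, totalcount]
```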
-
sner.scripts.spelling.loadData(data, allgrams, f)¶
Loads name data.
Parameters: - data (string) –
- allgrams (dict) – { gram : [namecount, totalcount] }
- f (file) –
Returns: Fills the allgrams dictionary.
Raises: If grams are found in the name dataset but not in the overall dataset, prints out the type of gram and the oddity.
-
sner.scripts.spelling.main(data, syll_count)¶
-
sner.scripts.spelling.outputAnalysis(k, v, f)¶
Parameters: - k (string) –
- v –
- f – csv file we are writing to.
Returns: None
Raises: If an n-gram has greater than 100% significance, reports the n-gram.
sner.scripts.utilities module¶
-
sner.scripts.utilities.clean_line(line, normNum=True, normProf=True)¶
Cleans a line of data, removing all annotations from the line.
NOTE: The line is expected to be only the TEXT portion of the data files; i.e., the ID and line-number parts of the data files are expected to have been removed previously.
Parameters: line (str) – line of just the text section of a tablet.
Returns: The line, with all annotations removed.
Return type: line (str)
Raises: None
-
sner.scripts.utilities.get_counts(data)¶
This function gets the total occurrences of words and syllables in the original Unicode Garshana corpus. To do this, it opens a .csv file with utf-16 encoding and splits on commas, expecting the line of Sumerian text to be in the 8th column. It filters annotations from each line and tracks the occurrence of each word and syllable. All combinations of unigrams, bigrams, and trigrams are treated as individual syllables.
Parameters: data – filename of the corpus .csv file, consistent with the formatting of the .csv files provided with the Garshana corpus.
Returns: A dictionary of the number of times each unique word occurs, as well as a dictionary of occurrences for syllables.
Raises: IOError
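The documented behavior (utf-16 csv, Sumerian text in the 8th column, dash-separated syllables counted as 1-, 2-, and 3-grams) can be sketched as below; the annotation filtering performed by clean_line is omitted here for brevity:

```python
import csv
from collections import defaultdict

def get_counts(data):
    """Count word and syllable occurrences in a Garshana-style corpus csv
    (utf-16 encoded, text in the 8th column).  Every 1-, 2-, and 3-gram
    of dash-separated syllables is counted as a syllable."""
    word_count = defaultdict(int)
    syll_count = defaultdict(int)
    with open(data, newline='', encoding='utf-16') as f:
        for row in csv.reader(f):
            if len(row) < 8:
                continue
            for word in row[7].split():
                word_count[word] += 1
                sylls = word.split('-')
                for order in (1, 2, 3):
                    for i in range(len(sylls) - order + 1):
                        syll_count['-'.join(sylls[i:i + order])] += 1
    return word_count, syll_count
```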
-
sner.scripts.utilities.main()¶
-
sner.scripts.utilities.update_syllable_count(word, syll_count)¶
Updates the total occurrence counts of each unigram, bigram, and trigram syllable that occurs in the word. Note: syllables are separated by a dash (‘-’).
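A minimal sketch of this per-word update, following the documented dash-separated syllable convention (the exact dict-update style is an assumption):

```python
def update_syllable_count(word, syll_count):
    """Add every unigram, bigram, and trigram of dash-separated
    syllables in `word` to the running syll_count dict."""
    sylls = word.split('-')
    for order in (1, 2, 3):
        for i in range(len(sylls) - order + 1):
            gram = '-'.join(sylls[i:i + order])
            syll_count[gram] = syll_count.get(gram, 0) + 1
```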