sner.scripts package

Submodules

sner.scripts.analysis module

sner.scripts.analysis.main(data, options)
This code is not currently in full working order. It originally analyzed the corpus for potential name-identification rules: it generated every possible spelling and context rule, then assessed each rule's performance.
Parameters:
  • data (string) – the full text of various tablets.
  • options – object containing various configuration information. (I believe this type no longer exists in the code base; it was intended to be replaced with the config object.)
Returns:

None

Raises:

None

sner.scripts.context module

sner.scripts.context.main(text, name)
Parameters:
  • text – line from the corpus, split into an array on ' '.
  • name – the PN (personal name) found in this line.
Returns:

None (probably fills the left_rules and right_rules dicts).

Raises:

None
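
The docstring above suggests context.main tallies the words adjacent to a name. A minimal sketch of that behaviour, assuming left_rules and right_rules are simple occurrence-count dicts (the counting scheme here is an assumption, not the real implementation):

```python
# Module-level rule dicts, as hinted at in the docstring above.
left_rules = {}
right_rules = {}

def collect_context(text, name):
    """Tally the words immediately left and right of each occurrence of
    `name` in `text` (a line already split on ' ')."""
    for i, word in enumerate(text):
        if word != name:
            continue
        if i > 0:
            left = text[i - 1]
            left_rules[left] = left_rules.get(left, 0) + 1
        if i < len(text) - 1:
            right = text[i + 1]
            right_rules[right] = right_rules.get(right, 0) + 1

collect_context(['dumu', 'ur-nigar', 'sipa'], 'ur-nigar')
```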

sner.scripts.export module

sner.scripts.export.findKnown(config, known_pn, known_gn)
Iterates through the attestations file for personal and geographical names
and adds the line IDs, as lists, to the known_pn and known_gn dictionaries.
Parameters:
  • config
  • known_pn – { personal name : line ID } dict, to be filled in.
  • known_gn – { geographical name : line ID } dict, to be filled in.
Returns:

Updates known_pn and known_gn dictionaries with the lineIDs.

Raises:

None
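
A hedged sketch of what findKnown appears to do: scan an attestations file and collect, for each personal (PN) and geographical (GN) name, the list of line IDs where it is attested. The column layout assumed below (name, type, line ID) is illustrative only; the real format is that of the attestations file shipped with the corpus.

```python
import csv
import io

def find_known(attestations, known_pn, known_gn):
    """Fill known_pn and known_gn with { name : [line IDs] } entries,
    assuming rows of the form name,type,line_id."""
    for name, kind, line_id in csv.reader(attestations):
        if kind == 'PN':
            known_pn.setdefault(name, []).append(line_id)
        elif kind == 'GN':
            known_gn.setdefault(name, []).append(line_id)

known_pn, known_gn = {}, {}
sample = io.StringIO("ur-nigar,PN,L0042\nadab,GN,L0007\nur-nigar,PN,L0101\n")
find_known(sample, known_pn, known_gn)
```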

sner.scripts.export.main(config)
Finds all names in the options.attestations file, then goes through each word
in the main options.corpus file to create a sparse matrix file for use with scikit-learn.
Parameters:
  • options.norm_num – True to normalize numbers.
  • options.norm_prof – True to normalize professions.
  • options.norm_geo – True to normalize geographical names.
Returns:

A csv file with the format: TabletID, LineID, LocationInSentence, Word, Word Type.
Word Types: '-' for unknowns, 'PN' for personal names, 'GN' for geographical names, 'PF' for professions.

Raises:

None

sner.scripts.export.writeKey(path)
sner.scripts.export.writeLine(x_index, config, line, out_features, out_target, out_key, known_pn, known_gn, test_run)
sner.scripts.export.writeSparse(config, out_features, word_left, word_middle, word_right, x_index)
Writes a single x vector of features in a one-hot-inspired representation
to the out_features file.
Parameters:
  • out_features – output file to which features are written.
  • word_left – the word to the left of the word being output.
  • word_middle – the word in question.
  • word_right – the word to the right of the word being output.
  • x_index – the row ID for the feature entry.
Returns:

Nothing

Raises:

None
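
A sketch of the "one-hot-inspired" sparse row described above: each context word maps to a feature column, and the row is emitted as (row, column, value) triples suitable for building a scipy/scikit-learn COO matrix. The vocabulary lookup and block offsets below are assumptions, not the real implementation.

```python
# Hypothetical fixed vocabulary for illustration.
VOCAB = {'dumu': 0, 'ur-nigar': 1, 'sipa': 2}
VOCAB_SIZE = len(VOCAB)

def write_sparse(out_rows, word_left, word_middle, word_right, x_index):
    """Emit one sparse feature row: three one-hot blocks laid out as
    [left | middle | right], one column set per known word."""
    for block, word in enumerate((word_left, word_middle, word_right)):
        col = VOCAB.get(word)
        if col is not None:
            out_rows.append((x_index, block * VOCAB_SIZE + col, 1))

rows = []
write_sparse(rows, 'dumu', 'ur-nigar', 'sipa', x_index=0)
```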

sner.scripts.export.writeTarget(out_target, isName, isGN)
sner.scripts.export.writeWord(tablet_id, line_id, i, last_word, word, next_word, x_index, out_key, out_features, out_target, PN, GN, test_run, config)

sner.scripts.export_atf module

sner.scripts.export_atf.main(config)
Finds all names in the options.attestations file, then goes through each word
in the main options.corpus file to create a sparse matrix file for use with scikit-learn.
Parameters:
  • options.norm_num – True to normalize numbers.
  • options.norm_prof – True to normalize professions.
  • options.norm_geo – True to normalize geographical names.
Returns:

A csv file with the format: TabletID, LineID, LocationInSentence, Word, Word Type.
Word Types: '-' for unknowns, 'PN' for personal names, 'GN' for geographical names, 'PF' for professions.

Raises:

None

sner.scripts.export_atf.writeLine(config, line, line_id, out_features, out_target, out_key, known_pn, known_gn, tablet_id)
sner.scripts.export_atf.writeSparse(out_features, word_left, word_middle, word_right, x_index)
Writes a single x vector of features in a one-hot-inspired representation
to the out_features file.
Parameters:
  • out_features – output file to which features are written.
  • word_left – the word to the left of the word being output.
  • word_middle – the word in question.
  • word_right – the word to the right of the word being output.
  • x_index – the row ID for the feature entry.
Returns:

Nothing

Raises:

None

sner.scripts.formatting module

sner.scripts.formatting.findKnown(data, options, knownPN, knownGN)
Iterates through the attestations file for personal and geographical names
and adds the line IDs, as lists, to the knownPN and knownGN dictionaries.
Parameters:
  • data
  • options
  • knownPN – { personal name : line ID } dict, to be filled in.
  • knownGN – { geographical name : line ID } dict, to be filled in.
Returns:

Updates knownPN and knownGN dictionaries with the lineIDs.

Raises:

None

sner.scripts.formatting.main(data, options)
Finds all names in the options.attestations file, then goes through each word
in the main options.corpus file to fill a new output file specified by options.output.
Parameters:
  • options.norm_num – True to normalize numbers.
  • options.norm_prof – True to normalize professions.
  • options.norm_geo – True to normalize geographical names.
Returns:

A csv file with the format: TabletID, LineID, LocationInSentence, Word, Word Type.
Word Types: '-' for unknowns, 'PN' for personal names, 'GN' for geographical names, 'PF' for professions.

Raises:

None

sner.scripts.import_corpus module

sner.scripts.names module

Objects representing the various types of names, formatted as dictionaries mapping plaintext versions of names to the number of occurrences of each name.

sner.scripts.output module

sner.scripts.output.addNames(file_key, file_target, names)
sner.scripts.output.add_names_and_unique(file_key, file_target, names, train_names, output_list, index=3)
sner.scripts.output.main(config)
sner.scripts.output.numeric_compare(x, y)
sner.scripts.output.outputATF(config, output_list)

sner.scripts.overfit_check module

sner.scripts.overfit_check.addNames(file_key, file_target, names)
sner.scripts.overfit_check.add_names_and_unique(file_key, file_target, names, train_names, output_list, index=3)
sner.scripts.overfit_check.main(config)
sner.scripts.overfit_check.numeric_compare(x, y)
sner.scripts.overfit_check.outputATF(config, output_list)

sner.scripts.professions module

sner.scripts.professions.main()

Testing function

sner.scripts.professions.replaceProfessions(line)

Replaces known professions with 'profession'

sner.scripts.readnames module

sner.scripts.readnames.getKgrams(names, k)
Parameters:
  • names – dictionary of the form { name : occurrences }.
  • k – the maximum k-gram order to retrieve.
    Special values: k = -1 returns a dictionary of monograms;
    k = -2 returns a dictionary of monograms,
    followed by a dictionary of bigrams.
  • Example – getKgrams(getPNs(), 3) retrieves all monograms, bigrams, and trigrams.
Returns:

A dictionary of all grams up to order k

Raises:

None
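
The core behaviour of getKgrams can be sketched as follows: break each name into its dash-separated syllables and count every contiguous k-gram up to order k, weighted by the name's occurrence count. The special k = -1 / -2 return modes from the docstring are omitted here for brevity, and the syllable-splitting scheme is an assumption.

```python
def get_kgrams(names, k):
    """Return a dict of all syllable grams up to order k, with counts
    weighted by how often each source name occurs."""
    grams = {}
    for name, occurrences in names.items():
        sylls = name.split('-')
        for n in range(1, k + 1):
            for i in range(len(sylls) - n + 1):
                gram = '-'.join(sylls[i:i + n])
                grams[gram] = grams.get(gram, 0) + occurrences
    return grams

grams = get_kgrams({'ur-nigar': 2}, 2)
```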

sner.scripts.readnames.getPNs(data)

Collects names in Garshana csv and returns them

sner.scripts.readnames.getPersonalNames(csvFile)
Parameters:csvFile – a csv file with the name in line[5] and an identifier of its type in line[9], where we look for 'PN' indicating a personal name.
Returns:
A dictionary of the names found in the csv file, of the form
{ Name : Occurrences }
Raises:None
sner.scripts.readnames.testKgrams()

A simple test that prints trigrams and quadgrams occurring more than once (note: duplicate names will trigger these).

sner.scripts.spelling module

Generates dictionaries of monograms, bigrams, and trigrams, mapping each to a list containing the number of times it occurs in names and the number of times it occurs in total. This information is later used to evaluate each gram's usefulness as a name-recognition rule.

sner.scripts.spelling.addBigram(gram, namecount, totalcount)
sner.scripts.spelling.addMonogram(gram, namecount, totalcount)
sner.scripts.spelling.addTrigram(gram, namecount, totalcount)
sner.scripts.spelling.analyzeData(f)
Collects percentage statistics from the gram maps and writes them to a .csv
output file.
Parameters:f (file) – output csv with columns: N-Gram, Percentage, Occurrence, Total Occurrence
Returns:None
Raises:None
sner.scripts.spelling.gramhelper(gramDict, gram, namecount, totalcount)
Utility function to insert arbitrary n-grams into dictionaries that map each gram to a list containing the number of times the gram occurred in a name as well as the number of times it occurred overall.
Parameters:
  • gramDict – dictionary to receive the n-gram.
  • gram – gram to be inserted into the dict.
  • namecount – the number of occurrences of this n-gram in a name.
  • totalcount – how many times this gram has occurred in the corpus, in any word.
Returns:

Updates gramDict with the information from gram

Raises:

None
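The gramhelper behaviour described above is simple enough to sketch directly: insert an n-gram into a dict that maps grams to a two-element list [occurrences-in-names, occurrences-overall], accumulating the counts if the gram is already present. The list ordering is an assumption based on the module description.

```python
def gramhelper(gram_dict, gram, namecount, totalcount):
    """Add (or accumulate) a gram's [in-name, overall] occurrence counts."""
    if gram in gram_dict:
        gram_dict[gram][0] += namecount
        gram_dict[gram][1] += totalcount
    else:
        gram_dict[gram] = [namecount, totalcount]

monograms = {}
gramhelper(monograms, 'ur', 1, 3)
gramhelper(monograms, 'ur', 2, 5)  # accumulates onto the existing entry
```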
sner.scripts.spelling.loadData(data, allgrams, f)

Loads name data.
Parameters:
  • data (string) –
  • allgrams (dict) – { gram : [namecount, totalcount] }
  • f (file) –

Returns:Fills the allgrams dictionary.
Raises:None. If grams are found in the name dataset but not in the overall dataset, prints the type of gram and the oddity.
sner.scripts.spelling.main(data, syll_count)
sner.scripts.spelling.outputAnalysis(k, v, f)
Parameters:
  • k (string) –
  • v –
  • f – csv file we are writing to.

Returns:

Raises:If an n-gram has greater than 100% significance, reports the n-gram.

sner.scripts.utilities module

sner.scripts.utilities.clean_line(line, normNum=True, normProf=True)

Clean a line of data, removing all annotations from the line.

NOTE: The line is expected to only be the TEXT portion of the data files. I.e. the ID and line number parts of the data files are expected to be previously removed.

Parameters:line (str) – Line of just the text section of a tablet
Returns:The line, with all annotations removed
Return type:line (str)
Raises:None
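A hedged sketch of the kind of cleanup clean_line performs: stripping common transliteration annotations (damage brackets, query and damage marks, editorial insertions) from the text portion of a line. The exact annotation set and regexes handled by the real function may differ; this is illustrative only.

```python
import re

def clean_line(line):
    """Remove illustrative annotation marks from a line of tablet text."""
    line = re.sub(r'[\[\]#?!]', '', line)      # damage brackets, query/damage marks
    line = re.sub(r'<([^>]*)>', r'\1', line)   # editorial insertions: keep contents
    return line.strip()

cleaned = clean_line('[ur]-nigar# <sipa>')
```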
sner.scripts.utilities.get_counts(data)

This function gets the total occurrences of words and syllables in the original Unicode Garshana corpus. To do this, it opens a .csv file with utf-16 encoding and splits on commas, expecting the line of Sumerian text to be in the 8th column. It filters annotations from each line and tracks the occurrence of each word and syllable. All combinations of unigrams, bigrams, and trigrams are treated as individual syllables.

Parameters:data – filename of the corpus .csv file, consistent with the formatting of the .csv files provided with the Garshana corpus.
Returns:A dictionary of the number of times each unique word occurs, as well as a dictionary of occurrences for syllables.
Raises:IOError
sner.scripts.utilities.main()
sner.scripts.utilities.update_syllable_count(word, syll_count)

Update the total occurrence counts of each unigram, bigram, and trigram syllable that occurs in the word. Note: syllables are separated by a dash ('-').
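
The description above maps directly to a short sketch: split the word on '-' and bump the count of every unigram, bigram, and trigram of syllables it contains. The dict-of-counts shape of syll_count is assumed from get_counts' description.

```python
def update_syllable_count(word, syll_count):
    """Count every 1-, 2-, and 3-syllable gram in a dash-separated word."""
    sylls = word.split('-')
    for n in (1, 2, 3):
        for i in range(len(sylls) - n + 1):
            syll = '-'.join(sylls[i:i + n])
            syll_count[syll] = syll_count.get(syll, 0) + 1

counts = {}
update_syllable_count('ur-nigar-gar', counts)
```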

Module contents