The Python API of this tool is divided into the following sections:
- download: Handles the downloading of data files from S3. It matches up with the simeon download and simeon split commands.
- upload: Handles the uploading of data to GCS and BigQuery. It matches up with the simeon push command.
- report: Handles the generation of secondary tables in BigQuery. It matches up with the simeon report command.
Components of the download package¶
AWS module¶
Module of utilities to help with listing and downloading files from S3
- class simeon.download.aws.S3Blob(name, size, last_modified, bucket, local_name=None)¶
A class to represent S3 blobs
- download_file(filename=None)¶
Download the S3Blob to the local file system and return the full path where the file is saved
- Parameters:
filename (Union[None, str]) – Name of the output file
- Return type:
str
- Returns:
Returns the full path where the file is saved
- classmethod from_info(bucket, type_, date, org='mitx', site='edx')¶
Make a list of blobs with the given parameters
- Parameters:
bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob
type_ (str) – “sql”, “email”, or “log”
date (Union[str, datetime]) – A datetime or str object for a threshold date
org (str) – The org whose data will be fetched.
site (str) – The site from which data were generated
- Return type:
List[S3Blob]
- Raises:
AWSException
- classmethod from_prefix(bucket, prefix)¶
Fetch a list of S3Blob objects from AWS whose names have the given prefix.
- Parameters:
bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob
prefix (str) – A string with which to filter the list of objects
- Return type:
List[S3Blob]
- Returns:
A list of S3Blob objects
- Raises:
AWSException
- to_json()¶
Jsonify the Blob
- simeon.download.aws.get_file_date(fname)¶
Get the date in the name of the S3 blob
- simeon.download.aws.make_s3_bucket(bucket, client_id=None, client_secret=None, session_token=None, profile_name=None)¶
Make a simple boto3 Bucket object pointing to S3
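A minimal sketch of how these pieces fit together: make a Bucket object, list blobs by prefix, and download each one. The bucket name, profile name, and prefix below are hypothetical.

```python
from simeon.download import aws

# Hypothetical bucket and AWS profile names; any valid boto3
# credential source (profile, env vars, etc.) should work here.
bucket = aws.make_s3_bucket("edx-course-data", profile_name="simeon")

# List the S3Blob objects whose names start with the given prefix,
# then download each one and print its local path.
for blob in aws.S3Blob.from_prefix(bucket, prefix="mitx"):
    print(blob.download_file())
```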
Email opt-in module¶
Module to process email opt-in data from edX
- simeon.download.emails.compress_email_files(files, ddir, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Generate a GZIP JSON file in the given ddir directory using the contents of the files. NOTE: schema_dir is not used yet, but we may use it to check that the generated records match their destination tables.
- Parameters:
files (Iterable[str]) – An iterable of email opt-in CSV files to process
ddir (str) – A destination directory
schema_dir (Union[None, str]) – Directory where schema files live
- Return type:
None
- Returns:
Writes the contents of files into email_opt_in.json.gz
- simeon.download.emails.parse_date(datestr)¶
Convert datestr to an iso formatted date. If not possible, return None
- simeon.download.emails.process_email_file(fname, verbose=True, logger=None, timeout=None, keepfiles=False)¶
Process an email opt-in file. These files differ from other data files in that they are ZIP archives containing GPG-encrypted files.
- Parameters:
fname (str) – Zip archive containing the email opt-in data file
verbose (bool) – Whether to print stuff when decrypting
logger (logging.Logger) – A Logger object to print messages with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Whether to keep the .gpg files after decrypting them
- Return type:
str
- Returns:
Returns the path to the decrypted file
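A minimal sketch of the email workflow, assuming a downloaded opt-in archive; the file name below is hypothetical.

```python
from simeon.download import emails

# Unzip the archive and GPG-decrypt its contents; the returned value
# is the path to the decrypted file.
decrypted = emails.process_email_file(
    "email-opt-in_mitx_2021-01-01.zip", verbose=True, timeout=60,
)

# Compress the decrypted file(s) into email_opt_in.json.gz under "data"
emails.compress_email_files([decrypted], ddir="data")
```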
Tracking logs module¶
Module to process tracking log files from edX
- simeon.download.logs.batch_split_tracking_logs(filenames, ddir, dynamic_date=False, courses=None, verbose=True, logger=None, size=10, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', debug=False)¶
Call split_tracking_log on each file inside a process or thread pool
- simeon.download.logs.process_line(line: str | bytes, lcount: int, date: None | datetime = None, is_gzip=True, courses: Iterable[str] | None = None) dict ¶
Process the line from a tracking log file and return the reformatted line (deserialized) along with the name of its destination file.
- Parameters:
line (Union[str, bytes]) – A line from the tracking logs
lcount (int) – The line number of the given line
date (Union[None, datetime]) – The date of the file where this line comes from.
is_gzip (bool) – Whether or not this line came from a GZIP file
courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported
- Return type:
Dict[str, Union[Dict[str, str], str]]
- Returns:
Dictionary with both the data and its destination file name
- simeon.download.logs.split_tracking_log(filename: str, ddir: str, dynamic_date: bool = False, courses: Iterable[str] | None = None, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Split the records in the given GZIP tracking log file. This function is very resource hungry because it keeps around a lot of open file handles and writes to them whenever it processes a good record. Some attempts are made to keep records around whenever the process is no longer allowed to open new files. But that will likely lead to the exhaustion of the running process’s allotted memory.
- NOTE:
If you’ve got a better way, please update me.
- Parameters:
filename (str) – The GZIP file to split
ddir (str) – Destination directory of the generated file
dynamic_date (bool) – Use dates from the JSON records to make output file names
courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported
schema_dir (Union[None, str]) – Directory where to find schema files
- Return type:
bool
- Returns:
True if files have been generated. False, otherwise
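A minimal sketch of splitting a single tracking log file; the file name and course ID below are hypothetical.

```python
from simeon.download import logs

# Split the GZIP tracking log into per-course files under "data",
# keeping only records that belong to the listed course IDs.
generated = logs.split_tracking_log(
    "tracklog-2021-01-01.log.gz",
    ddir="data",
    courses=["MITx/6.002x/2021_Spring"],
)
print("files generated" if generated else "nothing to split")
```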
SQL data files module¶
Module to process SQL files from edX
- simeon.download.sqls.batch_decrypt_files(all_files, size=100, verbose=False, logger=None, timeout=None, keepfiles=False, njobs=5)¶
Batch the files by the given size and pass each batch to gpg to decrypt.
- Parameters:
all_files (List[str]) – List of file names
size (int) – The batch size
verbose (bool) – Print the command to be run
logger (logging.Logger) – A logging.Logger object to print the command with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Keep the encrypted files after decrypting them.
njobs (int) – Number of threads to use to call gpg in parallel
- Return type:
None
- Returns:
Nothing, but decrypts the .sql files from the given archive
- simeon.download.sqls.force_delete_files(files, logger=None)¶
Delete the given files without regard for whether or not they exist
- Parameters:
files (Iterable[str]) – Iterable of file names
logger (Union[None, logging.Logger]) – A logger object to log messages
- Return type:
None
- Returns:
Returns nothing, but deletes the given files from the local FS
- simeon.download.sqls.process_sql_archive(archive, ddir=None, include_edge=False, courses=None, size=5, tables_only=False, debug=False)¶
Unpack and decrypt files inside the given archive
- Parameters:
archive (str) – SQL data package (a ZIP archive)
ddir (str) – The destination directory of the unpacked files
include_edge (bool) – Include the files from the edge site
courses (Union[Iterable[str], None]) – A list of course IDs whose data files are unpacked
size (int) – The size of the thread or process pool doing the unpacking
tables_only (bool) – Whether to extract file names only (no unarchiving)
debug (bool) – Show the stacktrace when an error occurs
- Return type:
Set[str]
- Returns:
A set of file names
- simeon.download.sqls.unpacker(fname, names, ddir, cpaths=None, tables_only=False)¶
A worker callable to pass to a Thread or Process pool
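A minimal sketch tying these functions together; the archive name and course ID below are hypothetical.

```python
from simeon.download import sqls

# Unpack the encrypted files for the given courses from the archive;
# the returned set contains the unpacked file names.
files = sqls.process_sql_archive(
    "mitx-2021-01-01.zip", ddir="data",
    courses=["MITx/6.002x/2021_Spring"],
)

# Decrypt the unpacked files in batches of 50, with 4 gpg threads
sqls.batch_decrypt_files(sorted(files), size=50, verbose=True, njobs=4)
```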
Utilities module for the download package¶
Some utility functions for working with the downloaded data files
- simeon.download.utilities.check_for_funny_keys(record, name='toplevel')¶
I am quite frankly not sure what Ike is trying to do here, but there should be a better way. For now, though, we’ll just have to make do.
- Parameters:
record (dict) – Dictionary whose values are modified
name (str) – Name of the level of the dict
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.decrypt_files(fnames, verbose=True, logger=None, timeout=None, keepfiles=False)¶
Decrypt the given file with gpg. This assumes that the gpg command is available in the SHELL running this script.
- Parameters:
fnames (Union[str, List]) – A file name or a list of file names to decrypt
verbose (bool) – Print the command to be run
logger (logging.Logger) – A logging.Logger object to print the command with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Keep the encrypted files after decryption, if True.
- Return type:
bool
- Returns:
Returns True if the decryption does not fail
- Raises:
DecryptionError
- simeon.download.utilities.drop_empties(record, *keys)¶
Recursively drop keys whose corresponding values are empty from the given record.
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.format_sql_filename(fname: str)¶
Reformat the given edX SQL encrypted file name into a name indicative of where the file should end up after the SQL archive is unpacked: site/folder/filename.ext.gpg
- simeon.download.utilities.get_course_id(record: dict, paths=None) str ¶
Given a JSON record, try getting the course_id out of it.
- Parameters:
record (dict) – A deserialized JSON record
paths (Iterable[Iterable[str]]) – Paths to follow to find a matching course ID string
- Return type:
str
- Returns:
A valid edX course ID or an empty string
- simeon.download.utilities.get_file_date(fname)¶
Extract the date in a file name and parse it into a datetime object
- Parameters:
fname (str) – Some file name
- Return type:
Union[None, datetime]
- Returns:
Returns a datetime object or None
- simeon.download.utilities.get_module_id(record: dict, paths=None)¶
Get the module ID of the given record
- Parameters:
record (dict) – A deserialized JSON record
paths (Iterable[Iterable[str]]) – Paths to follow to find a matching module ID string
- Return type:
str
- Returns:
A valid edX module ID or an empty string
- simeon.download.utilities.get_sql_course_id(course_str: str) str ¶
Given a course ID string from the SQL files, pluck out the actual course ID and format it as follows: ORG/COURSE_NUMBER/TERM
- Parameters:
course_str (str) – The course ID string from edX
- Return type:
str
- Returns:
The actual course ID, properly formatted
- simeon.download.utilities.is_float(val)¶
Check that the string can be coerced into a float.
- simeon.download.utilities.make_file_handle(fname: str, mode: str = 'wt', is_gzip: bool = False)¶
Create a file handle pointing to the given file name. If the directory of the file does not exist, create it.
- Parameters:
fname (str) – A file name whose handle needs to be created.
mode (str) – “a[bt]?” for append or “w[bt]?” for write
is_gzip (bool) – Open it as a gzip file handle, if True.
- Return type:
Union[TextIOWrapper, BufferedReader]
- simeon.download.utilities.make_tracklog_path(course_id: str, datestr: str, is_gzip=True) str ¶
Make a local file path name with the given course ID and date string
- Parameters:
course_id (str) – Properly formatted edX course ID
datestr (str) – %Y-%m-%d formatted date associated with the tracking log
is_gzip (bool) – Make a GZIP file, if True.
- Return type:
str
- Returns:
A local FS file path
- simeon.download.utilities.move_field_to_mongoid(record: dict, path: list)¶
Move the values associated with the given path into record[‘mongoid’]
- Parameters:
record (dict) – Dictionary whose values are modified
path (Iterable[str]) – A list of keys to traverse and move
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.move_unknown_fields_to_agent(record, *keys)¶
Move the values associated with the given keys into record[‘agent’]
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.parse_mongo_tstamp(timestamp: str)¶
Try converting a MongoDB timestamp into a stringified datetime
- Parameters:
timestamp (str) – String representing a timestamp. This can be either a unix timestamp or a datetime.
- Return type:
str
- Returns:
A formatted datetime
- simeon.download.utilities.rephrase_record(record: dict)¶
Update the given record in place. The purpose of this function is to turn this record into something with the same schema as that of the target BigQuery table.
- Parameters:
record (dict) – A deserialized JSON record
- Return type:
None
- Returns:
Nothing, but updates the given record in place
- simeon.download.utilities.stringify_dict(record, *keys)¶
Given a dictionary and some keys, JSON stringify the values at those keys in place.
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the dict in place
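A minimal sketch of a couple of these helpers on a toy record.

```python
from simeon.download import utilities

record = {"event": {"answer": "choice_1"}, "time": "2021-01-01T00:00:00"}

# JSON-stringify the value stored under the "event" key, in place
utilities.stringify_dict(record, "event")

# Try to pluck a course ID out of the record; returns "" if none match
course_id = utilities.get_course_id(record)
print(record["event"], course_id or "<no course ID>")
```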
Components of the upload package¶
GCP module¶
Utilities functions and classes to help with loading data to Google Cloud
- class simeon.upload.gcp.BigqueryClient(project=None, credentials=None, _http=None, location=None, default_query_job_config=None, default_load_job_config=None, client_info=None, client_options=None)¶
Subclass bigquery.Client and add convenience methods
- static export_compiled_query(query, table, target_directory)¶
Export a query string to the target directory
- Parameters:
query (str) – Compiled SQL query that is sent to BigQuery
table (str) – Name of the table that is generated by the given query
target_directory (str) – The directory under which the compiled SQL query is stored
- Return type:
None
- Returns:
Stores SQL query under the given target directory
- static extract_error_messages(errors)¶
Return the error messages from given list of error objects (dict)
- get_course_tables(course_id)¶
Get all the tables related to the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
- Return type:
Dict[str, set]
- Returns:
A dict with keys as log and sql, and values as table names
- static get_not_found_object(message)¶
If the given message contains the keywords ‘Not found’, then try and determine the name and type of the object that is not found.
- has_latest_table(course_id, table)¶
Check if the given table name exists in the _latest dataset of the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
table (str) – Name of the table being looked up
- Return type:
bool
- Returns:
True if the table is currently in BigQuery
- has_log_table(course_id, table)¶
Check if the given table name exists in the _logs dataset of the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
table (str) – Name of the table being looked up
- Return type:
bool
- Returns:
True if the table is currently in BigQuery
- load_one_file_to_table(fname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False)¶
Load the given file to a target BigQuery table
- Parameters:
fname (str) – The specific file to load
file_type (str) – One of sql, email, log, rdx
project (str) – Target GCP project
create (bool) – Whether or not to create the destination table
append (bool) – Whether or not to append the records to the table
use_storage (bool) – Whether or not to load the data from GCS
bucket (str) – GCS bucket name to use
max_bad_rows (int) – Max number of bad rows allowed during loading
schema_dir (Union[None, str]) – Directory where schema files are found
format (str) – File format (json or csv)
patch (bool) – Whether or not to patch the description of the table
- Return type:
bigquery.LoadJob
- Returns:
The LoadJob object associated with the work being done
- Raises:
Propagates everything from the underlying package
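A minimal sketch of loading one split-up file, assuming application default credentials; the project name, dataset layout, and file path below are hypothetical.

```python
from simeon.upload import gcp

client = gcp.BigqueryClient(project="my-gcp-project")

# Kick off a load job for one tracking log file
job = client.load_one_file_to_table(
    "data/MITx__6.002x__2021_Spring/tracklog-2021-01-01.json.gz",
    file_type="log",
    project="my-gcp-project",
    create=True,
    append=False,
)
job.result()  # block until the job finishes, raising on hard failures
print(job.errors or "loaded without errors")
```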
- load_tables_from_dir(dirname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False) List[LoadJob] ¶
Load all the files in the given directory.
- Parameters:
dirname (str) – Grandparent or parent directory of split up files
file_type (str) – One of sql, email, log, rdx
project (str) – Target GCP project
create (bool) – Whether or not to create the destination table
append (bool) – Whether or not to append the records to the table
use_storage (bool) – Whether or not to load the data from GCS
bucket (str) – GCS bucket name to use
max_bad_rows (int) – Max number of bad rows allowed during loading
schema_dir (str) – Directory where schema files are found
format (str) – File format (json or csv)
patch (bool) – Whether or not to patch the description of the table
- Return type:
List[bigquery.LoadJob]
- Returns:
List of load jobs
- Raises:
Propagates everything from the underlying package
- make_template(query)¶
Create a Template object whose environment includes some of the client’s methods as filters
- Parameters:
query (str) – SQL query to use with the template being generated
- Return type:
jinja2.Template
- Returns:
Jinja2 Template object with the passed query
- merge_to_table(fname, table, col, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', use_storage=False, patch=False, match_equal_columns=None, match_unequal_columns=None, target_directory='target')¶
Merge the given file to the target table name. If the latter does not exist, create it first. This process waits for all the jobs it needs to complete.
- Parameters:
fname (str) – A local file name or a GCS URI
table (str) – Fully qualified BigQuery table name
col (str) – Column by which to merge
schema_dir (Union[None, str]) – The directory where schema files live
use_storage (bool) – Whether or not the given path is a GCS URI
patch (bool) – Whether or not to patch the description of the table
match_equal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set equality (=) if WHEN MATCH is met during the merge.
match_unequal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set inequality (<>) if WHEN MATCH is met during the merge.
target_directory (str) – Target directory where to store SQL queries
- Return type:
bigquery.QueryJob
- Returns:
The QueryJob object associated with the merge carried out
- Raises:
Propagates everything from the underlying package
- class simeon.upload.gcp.GCSClient(project=<object object>, credentials=None, _http=None, client_info=None, client_options=None, use_auth_w_custom_endpoint=True, extra_headers={})¶
Make a client to load data files to GCS
- load_dir(dirname: str, file_type: str, bucket: str)¶
Load all the files in the given directory or any immediate subdirectories
- Parameters:
dirname (str) – The directory whose files are loaded
file_type (str) – One of sql, email, log, rdx
bucket (str) – GCS bucket name
- Return type:
None
- Returns:
Nothing, but should load file(s) in dirname to GCS
- Raises:
Propagates everything from the underlying package
- load_one_file_to_gcs(fname: str, file_type: str, bucket: str)¶
Load the given file to GCS
- Parameters:
fname (str) – The local file to load to GCS
file_type (str) – One of sql, email, log, rdx
bucket (str) – GCS bucket name
- Return type:
None
- Returns:
Nothing, but should load the given file to GCS
- Raises:
Propagates everything from the underlying package
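A minimal sketch of pushing one file to GCS, assuming application default credentials; the project and bucket names below are hypothetical.

```python
from simeon.upload import gcp

client = gcp.GCSClient(project="my-gcp-project")
client.load_one_file_to_gcs(
    "data/email_opt_in.json.gz",
    file_type="email",
    bucket="my-simeon-bucket",
)
```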
Utilities module for the upload package¶
Utility functions and classes associated with uploading data to GCP.
- simeon.upload.utilities.course_to_bq_dataset(course_id: str, file_type: str, project: str) str ¶
Make a fully qualified BigQuery dataset name with the given info
- Parameters:
course_id (str) – edX course ID to format into a GCS path
file_type (str) – One of sql, log, email, rdx
project (str) – A GCP project ID
- Return type:
str
- Returns:
BigQuery dataset name with components separated by dots
- simeon.upload.utilities.course_to_gcs_folder(course_id: str, file_type: str, bucket: str) str ¶
Use the given course ID to make a Google Cloud Storage path
- Parameters:
course_id (str) – edX course ID to format into a GCS path
file_type (str) – One of sql, log, email, rdx
bucket (str) – A GCS bucket name
- Return type:
str
- Returns:
A nicely formatted GCS path
- simeon.upload.utilities.dict_to_schema_field(schema_dict: dict)¶
Make a SchemaField from a schema dictionary
- Parameters:
schema_dict (dict) – One of the objects in the schema JSON file
- Return type:
bigquery.SchemaField
- Returns:
A SchemaField matching the given dictionary’s name, type, etc.
- simeon.upload.utilities.get_bq_schema(table: str, schema_dir: str = '/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given a bare table name (without leading project or dataset), make a list of bigquery.SchemaField objects to act as the table’s schema.
- Parameters:
table (str) – A BigQuery (bare) table name
schema_dir (str) – Directory where schema JSON file is looked up
- Return type:
Tuple[List[bigquery.SchemaField], str]
- Returns:
A 2-tuple with list of bigquery.SchemaField objects and a description text for the target table
- Raises:
MissingSchemaException
- simeon.upload.utilities.local_to_bq_table(fname: str, file_type: str, project: str) str ¶
Use the given local file to make a fully qualified BigQuery table name
- Parameters:
fname (str) – A local file name
file_type (str) – One of sql, log, email, rdx
project (str) – A GCP project ID
- Return type:
str
- Returns:
BigQuery dataset name with components separated by dots
- simeon.upload.utilities.local_to_gcs_path(fname: str, file_type: str, bucket: str) str ¶
Convert the local file name into a GCS path
- Parameters:
fname (str) – A local file name
file_type (str) – One of sql, log, email, rdx, cold
bucket (str) – A GCS bucket name
- Return type:
str
- Returns:
A nicely formatted GCS path
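A minimal sketch of the naming helpers; the project and bucket names below are hypothetical.

```python
from simeon.upload import utilities

course_id = "MITx/6.002x/2021_Spring"

# Fully qualified BigQuery dataset for the course's SQL data
print(utilities.course_to_bq_dataset(course_id, "sql", "my-gcp-project"))

# GCS folder for the course's tracking log files
print(utilities.course_to_gcs_folder(course_id, "log", "my-simeon-bucket"))
```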
- simeon.upload.utilities.make_bq_load_config(table: str, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', append: bool = False, create: bool = True, file_format: str = 'json', delim=',', max_bad_rows=0)¶
Make a bigquery.LoadJobConfig object and description of a table
- Parameters:
table (str) – Fully qualified table name
schema_dir (str) – The directory where schema files live
append (bool) – Whether to append the loaded data to the table
create (bool) – Whether to create the target table if it does not exist
file_format (str) – One of sql, json, csv, txt
delim (str) – The delimiter of the file being loaded
max_bad_rows (int) – The number of bad rows to tolerate when loading the data
- Return type:
Tuple[bigquery.LoadJobConfig, str]
- Returns:
A 2-tuple with a bigquery.LoadJobConfig object and a description text for the destination table
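A minimal sketch pairing get_bq_schema with make_bq_load_config; the fully qualified table name below is hypothetical.

```python
from simeon.upload import utilities

# Look up the schema and description packaged for a bare table name
schema, description = utilities.get_bq_schema("user_info_combo")

# Build a matching LoadJobConfig for a newline-delimited JSON load
config, description = utilities.make_bq_load_config(
    "my-gcp-project.MITx__6_002x__2021_Spring.user_info_combo",
    append=False, create=True, file_format="json",
)
```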
- simeon.upload.utilities.make_bq_query_config(append: bool = False, plain=True, table=None)¶
Make a bigquery.QueryJobConfig object to tie to a query to be sent to BigQuery for secondary table generation
- Parameters:
append (bool) – Whether to append the query results to the table
plain (bool) – Whether to make a plain (empty) QueryJobConfig object
table (Union[None, str]) – Fully qualified name of a destination table
- Return type:
bigquery.QueryJobConfig
- Returns:
A bigquery.QueryJobConfig object
- simeon.upload.utilities.sqlify_bq_field(field, named=True)¶
Convert a bigquery.SchemaField object into a DDL column definition.
- Parameters:
field (bigquery.SchemaField) – A SchemaField object to convert to a DDL statement
named (bool) – Whether the returned statement starts with the field name
- Return type:
str
- Returns:
A SQL column’s DDL statement
Components of the report package¶
Utilities module for the report package¶
Utility functions and classes to help with making course reports like user_info_combo, person_course, etc.
- simeon.report.utilities.check_record_schema(record, schema, coerce=True, nullify=False)¶
Check that the given record matches the same keys found in the given schema list of fields. The latter is one of the schemas in simeon/upload/schemas/
- Parameters:
record (dict) – Dictionary whose values are modified
schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields
coerce (bool) – Whether or not to coerce values into BigQuery types
nullify (bool) – Whether to set values mapping missing keys to None
- Return type:
None
- Returns:
Modifies the record if needed
- Raises:
SchemaMismatchException
- simeon.report.utilities.course_from_block(block)¶
Extract a course ID from the given block ID
- Parameters:
block (str) – A module item’s block string
- Return type:
str
- Returns:
The course ID extracted from the module’s block string
- simeon.report.utilities.drop_extra_keys(record, schema)¶
Walk through the record and drop key-value pairs that are not in the given schema
- Parameters:
record (dict) – Dictionary whose values are modified
schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields
- Return type:
None
- Returns:
Modifies the record if needed
- simeon.report.utilities.extract_table_query(table, query_dir)¶
Given a table name and a query directory, extract both the query string and the table description. The latter is assumed to be any line in the query file that starts with # or --
- Parameters:
table (str) – BigQuery table name whose query info is extracted
query_dir (str) – The directory where the query file is expected to be
- Return type:
Tuple[str, str]
- Returns:
A tuple of strings (query string, table description)
- Raises:
MissingQueryFileException
- simeon.report.utilities.get_has_solution(record)¶
Extract whether the given record is a problem that has showanswer. If it’s present and its associated value is not “never”, then return True. Otherwise, return False.
- Parameters:
record (dict) – A course_axis record
- Return type:
bool
- Returns:
Whether the course_axis record has a solution in the data
- simeon.report.utilities.get_problem_nitems(record)¶
Get a value for data.num_items in course_axis
- Parameters:
record (dict) – A course_axis record
- Return type:
Union[int, None]
- Returns:
The number of subitems of a problem item
- simeon.report.utilities.get_youtube_id(record)¶
Given a course structure record, extract the YouTube ID associated with the video element.
- Parameters:
record (dict) – A course_axis record
- Return type:
Union[str, None]
- Returns:
The YouTube video ID associated with the record
- simeon.report.utilities.make_course_axis(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='course_axis.json.gz')¶
Given a course’s SQL directory, make a course_axis report
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the outname argument
- simeon.report.utilities.make_forum_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='forum.json.gz')¶
Generate a file to load into the forum table using the given SQL directory
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target file
- simeon.report.utilities.make_grades_persistent(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', first_outname='grades_persistent.json.gz', second_outname='grades_persistent_subsection.json.gz')¶
Given a course’s SQL directory, make the grades_persistent and grades_persistent_subsection reports.
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
first_outname (str) – The filename to give to the grades_persistent report
second_outname (str) – The filename to give to the grades_persistent_subsection report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_grading_policy(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='grading_policy.json.gz')¶
Generate a file to be loaded into the grading_policy table of the given SQL directory.
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (Union[None, str]) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target file
- simeon.report.utilities.make_problem_analysis(state, **extras)¶
Use the state record from a studentmodule record to make a record to load into the problem_analysis table. The state is assumed to be from a record of category “problem”.
- Parameters:
state (dict) – Contents of the state field of studentmodule
extras (keyword arguments) – Things to be added to the generated record
- Return type:
dict
- Returns:
Return a record to be loaded in problem_analysis
- simeon.report.utilities.make_roles_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='roles.json.gz')¶
Generate a file to be loaded into the roles table of a dataset
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (Union[None, str]) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_sql_tables_par(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given a list of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name
- Parameters:
dirnames (List[str]) – Names of SQL directories
verbose (bool) – Print a message when a report is being made
logger (logging.Logger) – A logging.Logger object to print messages with
fail_fast (bool) – Whether or not to bail after the first error
debug (bool) – Show the stacktrace that caused the error
schema_dir (str) – The directory where schema files live
- Return type:
bool
- Returns:
True if the files are generated, and False otherwise.
- simeon.report.utilities.make_sql_tables_seq(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given an iterable of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name
- Parameters:
dirnames (Iterable[str]) – Names of SQL directories
verbose (bool) – Print a message when a report is being made
logger (logging.Logger) – A logging.Logger object to print messages with
fail_fast (bool) – Whether or not to bail after the first error
debug (bool) – Show the stacktrace that caused the error
schema_dir (str) – The directory where schema files live
- Return type:
bool
- Returns:
True if the files are generated, and False otherwise.
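A minimal sketch of generating every SQL-based report for one unpacked course directory; the directory name below is hypothetical.

```python
from simeon.report import utilities

ok = utilities.make_sql_tables_seq(
    ["data/MITx__6.002x__2021_Spring"], verbose=True, fail_fast=True,
)
print("reports generated" if ok else "some reports failed")
```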
- simeon.report.utilities.make_student_module(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='studentmodule.json.gz')¶
Generate files to load into studentmodule and problem_analysis using the given SQL directory
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_table_from_sql(table, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', target_directory='target', **kwargs)¶
Generate a BigQuery table using the given table name, course ID and a matching SQL query file in the query_dir folder. The query file contains placeholders for the course ID, dataset name and other details.
- Parameters:
table (str) – table name
course_id (str) – Course ID whose secondary reports are being generated
client (bigquery.Client) – An authenticated bigquery.Client object
project (str) – GCP project id where the video_axis table is loaded.
query_dir (Union[None, str]) – Directory where query files are saved.
schema_dir (Union[None, str]) – Directory where schema files live
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
wait (bool) – Whether to wait for the query job to finish running
target_directory (str) – Name of a directory where compiled SQL queries are stored
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Returns the errors dictionary from the LoadJob object tied to the query
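A minimal sketch, assuming application default credentials and a query file for the target table in the default query directory; the project name below is hypothetical.

```python
from google.cloud import bigquery
from simeon.report import utilities

client = bigquery.Client(project="my-gcp-project")
errors = utilities.make_table_from_sql(
    table="person_course",
    course_id="MITx/6.002x/2021_Spring",
    client=client,
    project="my-gcp-project",
    wait=True,
)
print(errors or "query completed without errors")
```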
- simeon.report.utilities.make_tables_from_sql(tables, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', parallel=False, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)¶
This is the plural/multiple tables version of make_table_from_sql
- Parameters:
tables (Iterable[str]) – BigQuery table names to create or append to
course_id (str) – Course ID whose secondary reports are being generated
client (bigquery.Client) – An authenticated bigquery.Client object
project (str) – GCP project id where the video_axis table is loaded.
query_dir (Union[None, str]) – Directory where query files are saved.
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
wait (bool) – Whether to wait for the query job to finish running
parallel (bool) – Whether the function is running in a process pool
fail_fast (bool) – Whether to stop processing after the first error
schema_dir (Union[None, str]) – Directory where schema files live
target_directory (str) – Name of the directory where to store compiled SQL queries
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Return a dict mapping table names to their corresponding errors
- simeon.report.utilities.make_tables_from_sql_par(tables, courses, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', safile=None, size=4, logger=None, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)¶
Parallel version of make_tables_from_sql
- Parameters:
tables (Iterable[str]) – An iterable of BigQuery table names
courses (Iterable[str]) – An iterable of course IDs
project (str) – The GCP project against which queries are run
append (bool) – Whether to append query results to the target tables
query_dir (str) – The directories where the SQL query files are found
wait (bool) – Whether to wait for the BigQuery load jobs to complete
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
safile (Union[None, str]) – GCP service account file to use to connect to BigQuery
size (int) – Size of the process pool to run queries in parallel
logger (logging.Logger) – A Logger object with which to report steps carried out
fail_fast (bool) – Whether to stop processing after the first error
schema_dir (Union[None, str]) – Directory where schema files live
target_directory (str) – Directory where compiled SQL queries are stored.
- Return type:
Dict[str, Dict[str, Dict[str, str]]]
- Returns:
A dict mapping course_ids to tables and their query errors
- simeon.report.utilities.make_user_info_combo(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='user_info_combo.json.gz')¶
Given a course’s SQL directory, make a user_info_combo report
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the outname argument
- simeon.report.utilities.module_from_block(block)¶
Extract a module ID from the given block
- Parameters:
block (str) – A module item’s block string
- Return type:
str
- Returns:
The module ID extracted from the module’s block string
- simeon.report.utilities.process_course_structure(data, start, mapping, parent=None)¶
Given the course structure data dictionary and a starting point, loop through it and construct course axis data items
- Parameters:
data (dict) – The data from the course_structure-analytics.json file
start (str) – The key from data to start looking up children
mapping (dict) – A dict mapping child blocks to their parents
parent (Union[None, str]) – Parent of start
- Return type:
List[Dict]
- Returns:
Returns the list of constructed data items
- simeon.report.utilities.wait_for_bq_job_ids(job_list, client)¶
Given a list of BigQuery load or query job IDs, wait for them all to finish.
- Parameters:
job_list (Iterable[str]) – An Iterable of job IDs
client (google.cloud.bigquery.client.Client) – A BigQuery Client object to do the waiting
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Returns a dict of job IDs to job errors
- TODO:
Improve this function to behave a little less like a tight loop
- simeon.report.utilities.wait_for_bq_jobs(job_list)¶
Given a list of BigQuery load or query jobs, wait for them all to finish.
- Parameters:
job_list (Iterable[LoadJob]) – An Iterable of job objects from the bigquery package
- Return type:
None
- Returns:
Nothing
- TODO:
Improve this function to behave a little less like a tight loop
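A minimal sketch pairing the upload client with wait_for_bq_jobs; the project name and directory below are hypothetical.

```python
from simeon.report import utilities
from simeon.upload import gcp

client = gcp.BigqueryClient(project="my-gcp-project")

# Start one load job per file in the directory of split-up logs
jobs = client.load_tables_from_dir(
    "data/MITx__6.002x__2021_Spring",
    file_type="log",
    project="my-gcp-project",
    create=True,
    append=True,
)

# Block until every job has finished
utilities.wait_for_bq_jobs(jobs)
```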
Components of the exceptions package¶
Exceptions module¶
Exception classes for the simeon package
- exception simeon.exceptions.AWSException¶
Raised when an S3 resource can’t be made
- exception simeon.exceptions.BadSQLFileException(message, context_dict=None)¶
Raised when a SQL file is not in its expected format
- exception simeon.exceptions.BigQueryNameException(message, context_dict=None)¶
Raised when a fully qualified table or dataset name can’t be created.
- exception simeon.exceptions.BlobDownloadError¶
Raised when a Blob fails to download. This could be an upstream issue. However, it can also be due to local file system access issues, or due to exhausted system resources
- exception simeon.exceptions.DecryptionError(message, context_dict=None)¶
Raised when the GPG decryption process fails.
- exception simeon.exceptions.EarlyExitError(message, context_dict=None)¶
Raised when an early exit is requested by the end user of the CLI tool
- exception simeon.exceptions.LoadJobException(message, context_dict=None)¶
Raised from a BigQuery data load job
- exception simeon.exceptions.MissingFileException(message, context_dict=None)¶
Raised when a necessary file is missing
- exception simeon.exceptions.MissingQueryFileException(message, context_dict=None)¶
Raised when a report table does not have a query file in the given query directory.
- exception simeon.exceptions.MissingSchemaException(message, context_dict=None)¶
Raised when a schema could not be found for a given BigQuery table name
- exception simeon.exceptions.SQLQueryException(message, context_dict=None)¶
Raised when calling client.query raises an error
- exception simeon.exceptions.SchemaMismatchException(message, context_dict=None)¶
Raised when a record does not match its corresponding schema
- exception simeon.exceptions.SimeonError(message, context_dict=None)¶
Base exception for most simeon issues
- exception simeon.exceptions.SplitException(message, context_dict=None)¶
Raised when an issue happens during a split operation