The Python API of this tool is divided into the following sections:

  • download: Handles the downloading of data files from S3. It matches up with the simeon download and simeon split commands

  • upload: Handles the uploading of data to GCS and BigQuery. It matches up with the simeon push command

  • report: Handles the generation of secondary tables in BigQuery. It matches up with the simeon report command

Components of the download package

AWS module

Module of utilities to help with listing and downloading files from S3

class simeon.download.aws.S3Blob(name, size, last_modified, bucket, local_name=None)

A class to represent S3 blobs

download_file(filename=None)

Download the S3Blob to the local file system and return the full path where the file is saved

Parameters:

filename (Union[None, str]) – Name of the output file

Return type:

str

Returns:

Returns the full path where the file is saved

classmethod from_info(bucket, type_, date, org='mitx', site='edx')

Make a list of blobs with the given parameters

Parameters:
  • bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob

  • type_ (str) – “sql”, “email”, or “log”

  • date (Union[str, datetime]) – A datetime or str object for a threshold date

  • org (str) – The org whose data will be fetched.

  • site (str) – The site from which data were generated

Return type:

List[S3Blob]

Raises:

AWSException

classmethod from_prefix(bucket, prefix)

Fetch a list of S3Blob objects from AWS whose names have the given prefix.

Parameters:
  • bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob

  • prefix (str) – A string with which to filter the list of objects

Return type:

List[S3Blob]

Returns:

A list of S3Blob objects

Raises:

AWSException

to_json()

Jsonify the Blob

simeon.download.aws.get_file_date(fname)

Get the date in the name of the S3 blob

simeon.download.aws.make_s3_bucket(bucket, client_id=None, client_secret=None, session_token=None, profile_name=None)

Make a simple boto3 Bucket object pointing to S3

Email opt-in module

Module to process email opt-in data from edX

simeon.download.emails.compress_email_files(files, ddir, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')

Generate a GZIP JSON file in the given ddir directory using the contents of the files. NOTE: schema_dir is not used yet, but we may use it to check that the generated records match their destination tables.

Parameters:
  • files (Iterable[str]) – An iterable of email opt-in CSV files to process

  • ddir (str) – A destination directory

  • schema_dir (Union[None, str]) – Directory where schema files live

Return type:

None

Returns:

Writes the contents of files into email_opt_in.json.gz

simeon.download.emails.parse_date(datestr)

Convert datestr to an ISO-formatted date. If not possible, return None
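As an illustration, a minimal version of this behavior might look like the sketch below; the date formats tried here are assumptions for the example, not necessarily the ones the real function supports:

```python
from datetime import datetime

def parse_date(datestr):
    # Try a handful of plausible formats; return None when none match.
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y"):
        try:
            return datetime.strptime(datestr, fmt).date().isoformat()
        except (ValueError, TypeError):
            continue
    return None
```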

simeon.download.emails.process_email_file(fname, verbose=True, logger=None, timeout=None, keepfiles=False)

Email opt-in files are somewhat different: they are ZIP archives inside which reside GPG-encrypted files. This function unpacks and decrypts them.

Parameters:
  • fname (str) – Zip archive containing the email opt-in data file

  • verbose (bool) – Whether to print stuff when decrypting

  • logger (logging.Logger) – A Logger object to print messages with

  • timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish

  • keepfiles (bool) – Whether to keep the .gpg files after decrypting them

Return type:

str

Returns:

Returns the path to the decrypted file

Tracking logs module

Module to process tracking log files from edX

simeon.download.logs.batch_split_tracking_logs(filenames, ddir, dynamic_date=False, courses=None, verbose=True, logger=None, size=10, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', debug=False)

Call split_tracking_log on each file inside a process or thread pool

simeon.download.logs.process_line(line: str | bytes, lcount: int, date: None | datetime = None, is_gzip=True, courses: Iterable[str] | None = None) → dict

Process the line from a tracking log file and return the reformatted line (deserialized) along with the name of its destination file.

Parameters:
  • line (Union[str, bytes]) – A line from the tracking logs

  • lcount (int) – The line number of the given line

  • date (Union[None, datetime]) – The date of the file where this line comes from.

  • is_gzip (bool) – Whether or not this line came from a GZIP file

  • courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported

Return type:

Dict[str, Union[Dict[str, str], str]]

Returns:

Dictionary with both the data and its destination file name
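To make the documented contract concrete, here is a simplified sketch of such a line processor. The context.course_id field and the destination naming scheme are assumptions for illustration only, not simeon's actual behavior:

```python
import json

def process_line(line, lcount, date=None):
    # Deserialize the line and pair it with a destination file name.
    # Unparseable lines are routed to a catch-all file.
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {"data": line, "filename": "dead_letters.json"}
    course = record.get("context", {}).get("course_id") or "unknown"
    return {"data": record, "filename": course.replace("/", "__") + ".json"}
```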

simeon.download.logs.split_tracking_log(filename: str, ddir: str, dynamic_date: bool = False, courses: Iterable[str] | None = None, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')

Split the records in the given GZIP tracking log file. This function is very resource hungry because it keeps around a lot of open file handles and writes to them whenever it processes a good record. Some attempts are made to keep records around whenever the process is no longer allowed to open new files. But that will likely lead to the exhaustion of the running process’s allotted memory.

NOTE:

If you’ve got a better way, please update me.

Parameters:
  • filename (str) – The GZIP file to split

  • ddir (str) – Destination directory of the generated file

  • dynamic_date (bool) – Use dates from the JSON records to make output file names

  • courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported

  • schema_dir (Union[None, str]) – Directory where to find schema files

Return type:

bool

Returns:

True if files have been generated. False, otherwise
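The handle-caching pattern described above can be sketched as follows. This is not the simeon implementation, only an illustration of keeping one open handle per destination file and reusing it across records:

```python
import os
from contextlib import ExitStack

def write_records(records, ddir):
    # Keep one open handle per destination file and reuse it for every
    # record routed there; ExitStack closes them all on the way out.
    handles = {}
    with ExitStack() as stack:
        for rec in records:
            dest = os.path.join(ddir, rec["filename"])
            fh = handles.get(dest)
            if fh is None:
                fh = stack.enter_context(open(dest, "a", encoding="utf-8"))
                handles[dest] = fh
            fh.write(rec["data"] + "\n")
    return sorted(handles)
```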

SQL data files module

Module to process SQL files from edX

simeon.download.sqls.batch_decrypt_files(all_files, size=100, verbose=False, logger=None, timeout=None, keepfiles=False, njobs=5)

Batch the files by the given size and pass each batch to gpg to decrypt.

Parameters:
  • all_files (List[str]) – List of file names

  • size (int) – The batch size

  • verbose (bool) – Print the command to be run

  • logger (logging.Logger) – A logging.Logger object to print the command with

  • timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish

  • keepfiles (bool) – Keep the encrypted files after decrypting them.

  • njobs (int) – Number of threads to use to call gpg in parallel

Return type:

None

Returns:

Nothing, but decrypts the given .sql files

simeon.download.sqls.force_delete_files(files, logger=None)

Delete the given files, regardless of whether or not they exist

Parameters:
  • files (Iterable[str]) – Iterable of file names

  • logger (Union[None, logging.Logger]) – A logger object to log messages

Return type:

None

Returns:

Returns nothing, but deletes the given files from the local FS
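A minimal sketch of such a forgiving delete (the real function also reports through the given logger; the message text here is illustrative):

```python
import os

def force_delete_files(files, logger=None):
    # Remove each file; quietly skip the ones that do not exist.
    for fname in files:
        try:
            os.remove(fname)
        except FileNotFoundError:
            if logger is not None:
                logger.warning("File not found: %s", fname)
```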

simeon.download.sqls.process_sql_archive(archive, ddir=None, include_edge=False, courses=None, size=5, tables_only=False, debug=False)

Unpack and decrypt files inside the given archive

Parameters:
  • archive (str) – SQL data package (a ZIP archive)

  • ddir (str) – The destination directory of the unpacked files

  • include_edge (bool) – Include the files from the edge site

  • courses (Union[Iterable[str], None]) – A list of course IDs whose data files are unpacked

  • size (int) – The size of the thread or process pool doing the unpacking

  • tables_only (bool) – Whether to extract file names only (no unarchiving)

  • debug (bool) – Show the stacktrace when an error occurs

Return type:

Set[str]

Returns:

A set of file names

simeon.download.sqls.unpacker(fname, names, ddir, cpaths=None, tables_only=False)

A worker callable to pass a Thread or Process pool

Utilities module for the download package

Some utility functions for working with the downloaded data files

simeon.download.utilities.check_for_funny_keys(record, name='toplevel')

I am quite frankly not sure what Ike is trying to do here, but there should be a better way. For now, though, we’ll just have to make do.

Parameters:
  • record (dict) – Dictionary whose values are modified

  • name (str) – Name of the level of the dict

Return type:

None

Returns:

Modifies the record in place

simeon.download.utilities.decrypt_files(fnames, verbose=True, logger=None, timeout=None, keepfiles=False)

Decrypt the given file with gpg. This assumes that the gpg command is available in the SHELL running this script.

Parameters:
  • fnames (Union[str, List]) – A file name or a list of file names to decrypt

  • verbose (bool) – Print the command to be run

  • logger (logging.Logger) – A logging.Logger object to print the command with

  • timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish

  • keepfiles (bool) – Keep the encrypted files after decryption, if True.

Return type:

bool

Returns:

Returns True if the decryption does not fail

Raises:

DecryptionError

simeon.download.utilities.drop_empties(record, *keys)

Recursively drop keys whose corresponding values are empty from the given record.

Parameters:
  • record (dict) – Dictionary whose values are modified

  • keys (Iterable[str]) – multiple args

Return type:

None

Returns:

Modifies the record in place

simeon.download.utilities.format_sql_filename(fname: str)

Reformat the given edX SQL encrypted file name into a name indicative of where the file should end up after the SQL archive is unpacked: site/folder/filename.ext.gpg

simeon.download.utilities.get_course_id(record: dict, paths=None) → str

Given a JSON record, try getting the course_id out of it.

Parameters:
  • record (dict) – A deserialized JSON record

  • paths (Iterable[Iterable[str]]) – Paths to follow to find a matching course ID string

Return type:

str

Returns:

A valid edX course ID or an empty string
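The path-following behavior can be illustrated with this simplified sketch; the default paths shown are assumptions about where edX events keep their course IDs, not simeon's actual defaults:

```python
def get_course_id(record, paths=None):
    # Walk each candidate key path through the nested record and return
    # the first non-empty value found, or an empty string.
    paths = paths or [("context", "course_id"), ("course_id",)]
    for path in paths:
        value = record
        for key in path:
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        if value:
            return value
    return ""
```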

simeon.download.utilities.get_file_date(fname)

Extract the date in a file name and parse it into a datetime object

Parameters:

fname (str) – Some file name

Return type:

Union[None, datetime]

Returns:

Returns a datetime object or None
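A minimal sketch of this kind of extraction, assuming the date appears as YYYY-MM-DD somewhere in the file name (the real function's pattern may differ):

```python
import re
from datetime import datetime

def get_file_date(fname):
    # Find the first YYYY-MM-DD substring and parse it; None otherwise.
    match = re.search(r"\d{4}-\d{2}-\d{2}", fname)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%d")
```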

simeon.download.utilities.get_module_id(record: dict, paths=None)

Get the module ID of the given record

Parameters:
  • record (dict) – A deserialized JSON record

  • paths (Iterable[Iterable[str]]) – Paths to follow to find a matching module ID string

Return type:

str

Returns:

A valid edX module ID or an empty string

simeon.download.utilities.get_sql_course_id(course_str: str) → str

Given a course ID string from the SQL files, pluck out the actual course ID and format it as follows: ORG/COURSE_NUMBER/TERM

Parameters:

course_str (str) – The course ID string from edX

Return type:

str

Returns:

The actual course ID, formatted properly

simeon.download.utilities.is_float(val)

Check that the string can be coerced into a float.

simeon.download.utilities.make_file_handle(fname: str, mode: str = 'wt', is_gzip: bool = False)

Create a file handle pointing to the given file name. If the directory of the file does not exist, create it.

Parameters:
  • fname (str) – A file name whose handle needs to be created.

  • mode (str) – “a[bt]?” for append or “w[bt]?” for write

  • is_gzip (bool) – Open it as a gzip file handle, if True.

Return type:

Union[TextIOWrapper, BufferedReader]

simeon.download.utilities.make_tracklog_path(course_id: str, datestr: str, is_gzip=True) → str

Make a local file path name with the given course ID and date string

Parameters:
  • course_id (str) – Properly formatted edX course ID

  • datestr (str) – %Y-%m-%d formatted date associated with the tracking log

  • is_gzip (bool) – Make a GZIP file, if True.

Return type:

str

Returns:

A local FS file path
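For illustration, a sketch of such a path builder; the folder and file naming convention used here is an assumption, not necessarily simeon's:

```python
import os

def make_tracklog_path(course_id, datestr, is_gzip=True):
    # One folder per course (slashes flattened), one file per day.
    folder = course_id.replace("/", "__")
    ext = ".json.gz" if is_gzip else ".json"
    return os.path.join(folder, "tracklog-" + datestr + ext)
```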

simeon.download.utilities.move_field_to_mongoid(record: dict, path: list)

Move the values associated with the given path into record[‘mongoid’]

Parameters:
  • record (dict) – Dictionary whose values are modified

  • path (Iterable[str]) – A list of keys to traverse and move

Return type:

None

Returns:

Modifies the record in place

simeon.download.utilities.move_unknown_fields_to_agent(record, *keys)

Move the values associated with the given keys into record[‘agent’]

Parameters:
  • record (dict) – Dictionary whose values are modified

  • keys (Iterable[str]) – multiple args

Return type:

None

Returns:

Modifies the record in place

simeon.download.utilities.parse_mongo_tstamp(timestamp: str)

Try converting a MongoDB timestamp into a stringified datetime

Parameters:

timestamp (str) – String representing a timestamp. This can be either a unix timestamp or a datetime.

Return type:

str

Returns:

A formatted datetime

simeon.download.utilities.rephrase_record(record: dict)

Update the given record in place. The purpose of this function is to turn this record into something with the same schema as that of the target BigQuery table.

Parameters:

record (dict) – A deserialized JSON record

Return type:

None

Returns:

Nothing, but updates the given record in place

simeon.download.utilities.stringify_dict(record, *keys)

Given a dictionary and some keys, JSON stringify the values at those keys in place.

Parameters:
  • record (dict) – Dictionary whose values are modified

  • keys (Iterable[str]) – multiple args

Return type:

None

Returns:

Modifies the dict in place
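The in-place stringification can be sketched as follows:

```python
import json

def stringify_dict(record, *keys):
    # JSON-encode the value stored at each key, modifying record in place.
    # Values that are already strings are left alone.
    for key in keys:
        if key in record and not isinstance(record[key], str):
            record[key] = json.dumps(record[key])
```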

Components of the upload package

GCP module

Utility functions and classes to help with loading data to Google Cloud

class simeon.upload.gcp.BigqueryClient(project=None, credentials=None, _http=None, location=None, default_query_job_config=None, default_load_job_config=None, client_info=None, client_options=None)

Subclass bigquery.Client and add convenience methods

static export_compiled_query(query, table, target_directory)

Export a query string to the target directory

Parameters:
  • query (str) – Compiled SQL query that is sent to BigQuery

  • table (str) – Name of the table that is generated by the given query

  • target_directory (str) – The directory under which the compiled SQL query is stored

Return type:

None

Returns:

Stores SQL query under the given target directory

static extract_error_messages(errors)

Return the error messages from the given list of error objects (dict)

get_course_tables(course_id)

Get all the tables related to the given course ID

Parameters:

course_id (str) – edX course ID in format ORG/NUMBER/TERM

Return type:

Dict[str, set]

Returns:

A dict with ‘log’ and ‘sql’ as keys, and sets of table names as values

static get_not_found_object(message)

If the given message contains the keywords ‘Not found’, then try to determine the name and type of the object that is not found.

has_latest_table(course_id, table)

Check if the given table name exists in the _latest dataset of the given course ID

Parameters:
  • course_id (str) – edX course ID in format ORG/NUMBER/TERM

  • table (str) – Name of the table being looked up

Return type:

bool

Returns:

True if the table is currently in BigQuery

has_log_table(course_id, table)

Check if the given table name exists in the _logs dataset of the given course ID

Parameters:
  • course_id (str) – edX course ID in format ORG/NUMBER/TERM

  • table (str) – Name of the table being looked up

Return type:

bool

Returns:

True if the table is currently in BigQuery

load_one_file_to_table(fname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False)

Load the given file to a target BigQuery table

Parameters:
  • fname (str) – The specific file to load

  • file_type (str) – One of sql, email, log, rdx

  • project (str) – Target GCP project

  • create (bool) – Whether or not to create the destination table

  • append (bool) – Whether or not to append the records to the table

  • use_storage (bool) – Whether or not to load the data from GCS

  • bucket (str) – GCS bucket name to use

  • max_bad_rows (int) – Max number of bad rows allowed during loading

  • schema_dir (Union[None, str]) – Directory where schema files are found

  • format_ (str) – File format (json or csv)

  • patch (bool) – Whether or not to patch the description of the table

Return type:

bigquery.LoadJob

Returns:

The LoadJob object associated with the work being done

Raises:

Propagates everything from the underlying package

load_tables_from_dir(dirname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False) → List[LoadJob]

Load all the files in the given directory.

Parameters:
  • dirname (str) – Grandparent or parent directory of split up files

  • file_type (str) – One of sql, email, log, rdx

  • project (str) – Target GCP project

  • create (bool) – Whether or not to create the destination table

  • append (bool) – Whether or not to append the records to the table

  • use_storage (bool) – Whether or not to load the data from GCS

  • bucket (str) – GCS bucket name to use

  • max_bad_rows (int) – Max number of bad rows allowed during loading

  • schema_dir (str) – Directory where schema files are found

  • format_ (str) – File format (json or csv)

  • patch (bool) – Whether or not to patch the description of the table

Return type:

List[bigquery.LoadJob]

Returns:

List of load jobs

Raises:

Propagates everything from the underlying package

make_template(query)

Create a Template object whose environment includes some of the client’s methods as filters

Parameters:

query (str) – SQL query to use with the template being generated

Return type:

jinja2.Template

Returns:

Jinja2 Template object with the passed query

merge_to_table(fname, table, col, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', use_storage=False, patch=False, match_equal_columns=None, match_unequal_columns=None, target_directory='target')

Merge the given file to the target table name. If the latter does not exist, create it first. This process waits for all the jobs it starts to finish.

Parameters:
  • fname (str) – A local file name or a GCS URI

  • table (str) – Fully qualified BigQuery table name

  • col (str) – Column by which to merge

  • schema_dir (Union[None, str]) – The directory where schema files live

  • use_storage (bool) – Whether or not the given path is a GCS URI

  • patch (bool) – Whether or not to patch the description of the table

  • match_equal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set equality (=) when the WHEN MATCHED clause is met during the merge.

  • match_unequal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set inequality (<>) when the WHEN MATCHED clause is met during the merge.

  • target_directory (str) – Target directory where to store SQL queries

Return type:

bigquery.QueryJob

Returns:

The QueryJob object associated with the merge carried out

Raises:

Propagates everything from the underlying package

class simeon.upload.gcp.GCSClient(project=<object object>, credentials=None, _http=None, client_info=None, client_options=None, use_auth_w_custom_endpoint=True, extra_headers={})

Make a client to load data files to GCS

load_dir(dirname: str, file_type: str, bucket: str)

Load all the files in the given directory or any immediate subdirectories

Parameters:
  • dirname (str) – The directory whose files are loaded

  • file_type (str) – One of sql, email, log, rdx

  • bucket (str) – GCS bucket name

Return type:

None

Returns:

Nothing, but should load file(s) in dirname to GCS

Raises:

Propagates everything from the underlying package

load_one_file_to_gcs(fname: str, file_type: str, bucket: str)

Load the given file to GCS

Parameters:
  • fname (str) – The local file to load to GCS

  • file_type (str) – One of sql, email, log, rdx

  • bucket (str) – GCS bucket name

Return type:

None

Returns:

Nothing, but should load the given file to GCS

Raises:

Propagates everything from the underlying package

Utilities module for the upload package

Utility functions and classes associated with uploading data to GCP, so far.

simeon.upload.utilities.course_to_bq_dataset(course_id: str, file_type: str, project: str) → str

Make a fully qualified BigQuery dataset name with the given info

Parameters:
  • course_id (str) – edX course ID to format into a GCS path

  • file_type (str) – One of sql, log, email, rdx

  • project (str) – A GCP project ID

Return type:

str

Returns:

BigQuery dataset name with components separated by dots

simeon.upload.utilities.course_to_gcs_folder(course_id: str, file_type: str, bucket: str) → str

Use the given course ID to make a Google Cloud Storage path

Parameters:
  • course_id (str) – edX course ID to format into a GCS path

  • file_type (str) – One of sql, log, email, rdx

  • bucket (str) – A GCS bucket name

Return type:

str

Returns:

A nicely formatted GCS path
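A sketch of the kind of path this produces; the exact layout (bucket, then file type, then flattened course ID) is an assumption for illustration:

```python
def course_to_gcs_folder(course_id, file_type, bucket):
    # Assumed layout: gs://<bucket>/<file_type>/<org>__<number>__<term>
    folder = course_id.replace("/", "__")
    return "gs://{}/{}/{}".format(bucket, file_type, folder)
```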

simeon.upload.utilities.dict_to_schema_field(schema_dict: dict)

Make a SchemaField from a schema dictionary

Parameters:

schema_dict (dict) – One of the objects in the schema JSON file

Return type:

bigquery.SchemaField

Returns:

A SchemaField matching the given dictionary’s name, type, etc.

simeon.upload.utilities.get_bq_schema(table: str, schema_dir: str = '/home/runner/work/simeon/simeon/simeon/upload/schemas')

Given a bare table name (without leading project or dataset), make a list of bigquery.SchemaField objects to act as the table’s schema.

Parameters:
  • table (str) – A BigQuery (bare) table name

  • schema_dir (str) – Directory where schema JSON file is looked up

Return type:

Tuple[List[bigquery.SchemaField], str]

Returns:

A 2-tuple with list of bigquery.SchemaField objects and a description text for the target table

Raises:

MissingSchemaException

simeon.upload.utilities.local_to_bq_table(fname: str, file_type: str, project: str) → str

Use the given local file to make a fully qualified BigQuery table name

Parameters:
  • fname (str) – A local file name

  • file_type (str) – One of sql, log, email, rdx

  • project (str) – A GCP project ID

Return type:

str

Returns:

BigQuery dataset name with components separated by dots

simeon.upload.utilities.local_to_gcs_path(fname: str, file_type: str, bucket: str) → str

Convert the local file name into a GCS path

Parameters:
  • fname (str) – A local file name

  • file_type (str) – One of sql, log, email, rdx, cold

  • bucket (str) – A GCS bucket name

Return type:

str

Returns:

A nicely formatted GCS path

simeon.upload.utilities.make_bq_load_config(table: str, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', append: bool = False, create: bool = True, file_format: str = 'json', delim=',', max_bad_rows=0)

Make a bigquery.LoadJobConfig object and a description of a table

Parameters:
  • table (str) – Fully qualified table name

  • schema_dir (str) – The directory where schema files live

  • append (bool) – Whether to append the loaded data to the table

  • create (bool) – Whether to create the target table if it does not exist

  • file_format (str) – One of sql, json, csv, txt

  • delim (str) – The delimiter of the file being loaded

  • max_bad_rows (int) – The number of bad rows to tolerate when loading the data

Return type:

Tuple[bigquery.LoadJobConfig, str]

Returns:

A 2-tuple with a bigquery.LoadJobConfig object and a description text for the destination table

simeon.upload.utilities.make_bq_query_config(append: bool = False, plain=True, table=None)

Make a bigquery.QueryJobConfig object to tie to a query to be sent to BigQuery for secondary table generation

Parameters:
  • append (bool) – Whether to append the loaded data to the table

  • plain (bool) – Make an empty QueryJobConfig object

  • table (Union[None, str]) – Fully qualified name of a destination table

Return type:

bigquery.QueryJobConfig

Returns:

A bigquery.QueryJobConfig object

simeon.upload.utilities.sqlify_bq_field(field, named=True)

Convert a bigquery.SchemaField object into a DDL column definition.

Parameters:
  • field (bigquery.SchemaField) – A SchemaField object to convert to a DDL statement

  • named (bool) – Whether the returned statement starts with the field name

Return type:

str

Returns:

A SQL column’s DDL statement

Components of the report package

Utilities module for the report package

Utility functions and classes to help with making course reports like user_info_combo, person_course, etc.

simeon.report.utilities.check_record_schema(record, schema, coerce=True, nullify=False)

Check that the given record matches the same keys found in the given schema list of fields. The latter is one of the schemas in simeon/upload/schemas/

Parameters:
  • record (dict) – Dictionary whose values are modified

  • schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields

  • coerce (bool) – Whether or not to coerce values into BigQuery types

  • nullify (bool) – Whether to set values mapping missing keys to None

Return type:

None

Returns:

Modifies the record if needed

Raises:

SchemaMismatchException
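A simplified sketch of the key check described above; type coercion is omitted here, and a plain KeyError stands in for SchemaMismatchException:

```python
def check_record_schema(record, schema, coerce=True, nullify=False):
    # Ensure every field named in the schema exists in the record.
    # Coercion of values to BigQuery types is left out of this sketch.
    for field in schema:
        name = field["name"]
        if name not in record:
            if nullify:
                record[name] = None
            else:
                raise KeyError(name)
```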

simeon.report.utilities.course_from_block(block)

Extract a course ID from the given block ID

Parameters:

block (str) – A module item’s block string

Return type:

str

Returns:

The course ID extracted from the module’s block string

simeon.report.utilities.drop_extra_keys(record, schema)

Walk through the record and drop key-value pairs that are not in the given schema

Parameters:
  • record (dict) – Dictionary whose values are modified

  • schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields

Return type:

None

Returns:

Modifies the record if needed

simeon.report.utilities.extract_table_query(table, query_dir)

Given a table name and a query directory, extract both the query string and the table description. The latter is assumed to be any line in the query file that starts with # or --

Parameters:
  • table (str) – BigQuery table name whose query info is extracted

  • query_dir (str) – The directory where the query file is expected to be

Return type:

Tuple[str, str]

Returns:

A tuple of strings (query string, table description)

Raises:

MissingQueryFileException
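A sketch of the comment-splitting logic described above; the <table>.sql file naming is an assumption for this example, and a FileNotFoundError stands in for MissingQueryFileException:

```python
import os

def extract_table_query(table, query_dir):
    # Lines starting with '#' or '--' become the table description;
    # every other line is part of the query itself.
    path = os.path.join(query_dir, table + ".sql")
    query_lines, desc_lines = [], []
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped.startswith("#") or stripped.startswith("--"):
                desc_lines.append(stripped.lstrip("#- "))
            else:
                query_lines.append(line)
    return "".join(query_lines).strip(), " ".join(desc_lines)
```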

simeon.report.utilities.get_has_solution(record)

Extract whether the given record is a problem that has a showanswer attribute. If it’s present and its associated value is not “never”, then return True. Otherwise, return False.

Parameters:

record (dict) – A course_axis record

Return type:

bool

Returns:

Whether the course_axis record has a solution in the data

simeon.report.utilities.get_problem_nitems(record)

Get a value for data.num_items in course_axis

Parameters:

record (dict) – A course_axis record

Return type:

Union[int, None]

Returns:

The number of subitems of a problem item

simeon.report.utilities.get_youtube_id(record)

Given a course structure record, extract the YouTube ID associated with the video element.

Parameters:

record (dict) – A course_axis record

Return type:

Union[str, None]

Returns:

The YouTube video ID associated with the record

simeon.report.utilities.make_course_axis(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='course_axis.json.gz')

Given a course’s SQL directory, make a course_axis report

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the outname argument

simeon.report.utilities.make_forum_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='forum.json.gz')

Generate a file to load into the forum table using the given SQL directory

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • schema_dir (str) – Directory where schema files live

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the target file

simeon.report.utilities.make_grades_persistent(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', first_outname='grades_persistent.json.gz', second_outname='grades_persistent_subsection.json.gz')

Given a course’s SQL directory, make the grades_persistent and grades_persistent_subsection reports.

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the target files

simeon.report.utilities.make_grading_policy(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='grading_policy.json.gz')

Generate a file to be loaded into the grading_policy table using the given SQL directory.

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • schema_dir (Union[None, str]) – Directory where schema files live

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the target file

simeon.report.utilities.make_problem_analysis(state, **extras)

Use the state record from a studentmodule record to make a record to load into the problem_analysis table. The state is assumed to come from a record of category “problem”.

Parameters:
  • state (dict) – Contents of the state field of studentmodule

  • extras (keyword arguments) – Things to be added to the generated record

Return type:

dict

Returns:

Return a record to be loaded in problem_analysis

simeon.report.utilities.make_roles_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='roles.json.gz')

Generate a file to be loaded into the roles table of a dataset

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • schema_dir (Union[None, str]) – Directory where schema files live

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the target files

simeon.report.utilities.make_sql_tables_par(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')

Given a list of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name

Parameters:
  • dirnames (List[str]) – Names of SQL directories

  • verbose (bool) – Print a message when a report is being made

  • logger (logging.Logger) – A logging.Logger object to print messages with

  • fail_fast (bool) – Whether or not to bail after the first error

  • debug (bool) – Show the stacktrace that caused the error

  • schema_dir (str) – The directory where schema files live

Return type:

bool

Returns:

True if the files are generated, and False otherwise.

simeon.report.utilities.make_sql_tables_seq(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')

Given an iterable of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name

Parameters:
  • dirnames (Iterable[str]) – Names of SQL directories

  • verbose (bool) – Print a message when a report is being made

  • logger (logging.Logger) – A logging.Logger object to print messages with

  • fail_fast (bool) – Whether or not to bail after the first error

  • debug (bool) – Show the stacktrace that caused the error

  • schema_dir (str) – The directory where schema files live

Return type:

bool

Returns:

True if the files are generated, and False otherwise.

simeon.report.utilities.make_student_module(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='studentmodule.json.gz')

Generate files to load into studentmodule and problem_analysis using the given SQL directory

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • schema_dir (str) – Directory where schema files live

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the target files

simeon.report.utilities.make_table_from_sql(table, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', target_directory='target', **kwargs)

Generate a BigQuery table using the given table name, course ID, and a matching SQL query file in the query_dir folder. The query file contains placeholders for the course ID, dataset name, and other details.

Parameters:
  • table (str) – table name

  • course_id (str) – Course ID whose secondary reports are being generated

  • client (bigquery.Client) – An authenticated bigquery.Client object

  • project (str) – GCP project ID against which queries are run

  • query_dir (Union[None, str]) – Directory where query files are saved.

  • schema_dir (Union[None, str]) – Directory where schema files live

  • geo_table (str) – Table name in BigQuery with geolocation data for IPs

  • youtube_table (str) – Table name in BigQuery with YouTube video details

  • wait (bool) – Whether to wait for the query job to finish running

  • target_directory (str) – Name of a directory where compiled SQL queries are stored

Return type:

Dict[str, Dict[str, str]]

Returns:

Returns the errors dictionary from the LoadJob object tied to the query
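The placeholder substitution described above can be sketched with plain string formatting. This is a minimal illustration only: the placeholder names (`{course_id}`, `{dataset}`, `{geo_table}`) and the sample query are assumptions, not the actual contents of simeon's query files.

```python
# Sketch of compiling a SQL query template whose placeholders are filled
# in before the query is sent to BigQuery. Placeholder names here are
# hypothetical -- simeon's own query files may use different ones.
QUERY_TEMPLATE = """
SELECT user_id, ip
FROM `{dataset}.person_course`
LEFT JOIN `{geo_table}` USING (ip)
WHERE course_id = '{course_id}'
"""

def compile_query(template, course_id, dataset, geo_table):
    """Fill in the placeholders and return the runnable SQL text."""
    return template.format(
        course_id=course_id, dataset=dataset, geo_table=geo_table,
    )

sql = compile_query(
    QUERY_TEMPLATE, "MITx/6.002x/2013_Spring", "mitx_6002x", "geocode.geoip",
)
print(sql)
```

The compiled text can then be written out to the target_directory and submitted as a query job.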

simeon.report.utilities.make_tables_from_sql(tables, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', parallel=False, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)

This is the plural/multiple tables version of make_table_from_sql

Parameters:
  • tables (Iterable[str]) – BigQuery table names to create or append to

  • course_id (str) – Course ID whose secondary reports are being generated

  • client (bigquery.Client) – An authenticated bigquery.Client object

  • project (str) – GCP project ID against which queries are run

  • query_dir (Union[None, str]) – Directory where query files are saved.

  • geo_table (str) – Table name in BigQuery with geolocation data for IPs

  • youtube_table (str) – Table name in BigQuery with YouTube video details

  • wait (bool) – Whether to wait for the query job to finish running

  • parallel (bool) – Whether the function is running in a process pool

  • fail_fast (bool) – Whether to stop processing after the first error

  • schema_dir (Union[None, str]) – Directory where schema files live

  • target_directory (str) – Name of the directory where compiled SQL queries are stored

Return type:

Dict[str, Dict[str, str]]

Returns:

Return a dict mapping table names to their corresponding errors

simeon.report.utilities.make_tables_from_sql_par(tables, courses, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', safile=None, size=4, logger=None, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)

Parallel version of make_tables_from_sql

Parameters:
  • tables (Iterable[str]) – An iterable of BigQuery table names

  • courses (Iterable[str]) – An iterable of course IDs

  • project (str) – The GCP project against which queries are run

  • append (bool) – Whether to append query results to the target tables

  • query_dir (str) – The directory where the SQL query files are found

  • wait (bool) – Whether to wait for the BigQuery load jobs to complete

  • geo_table (str) – Table name in BigQuery with geolocation data for IPs

  • youtube_table (str) – Table name in BigQuery with YouTube video details

  • safile (Union[None, str]) – GCP service account file to use to connect to BigQuery

  • size (int) – Size of the process pool to run queries in parallel

  • logger (logging.Logger) – A Logger object with which to report steps carried out

  • fail_fast (bool) – Whether to stop processing after the first error

  • schema_dir (Union[None, str]) – Directory where schema files live

  • target_directory (str) – Directory where compiled SQL queries are stored.

Return type:

Dict[str, Dict[str, Dict[str, str]]]

Returns:

A dict mapping course_ids to tables and their query errors
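The fan-out over courses can be sketched with `concurrent.futures`. Everything below is a hypothetical stand-in: `run_queries_for_course` is not a simeon function, and a thread pool is used in place of the process pool the docstring describes, purely to keep the sketch portable and self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries_for_course(course_id, tables):
    """Hypothetical stand-in for the per-course work done by
    make_tables_from_sql; returns a table -> errors mapping."""
    return {table: {} for table in tables}

def fan_out(tables, courses, size=4):
    """Dispatch the per-course work to a pool of `size` workers and
    collect a course_id -> (table -> errors) mapping, matching the
    return shape documented above."""
    results = {}
    with ThreadPoolExecutor(max_workers=size) as pool:
        futures = {
            pool.submit(run_queries_for_course, c, tables): c
            for c in courses
        }
        for future, course_id in futures.items():
            results[course_id] = future.result()
    return results

print(fan_out(["person_course"], ["MITx/6.002x/2013_Spring"], size=2))
```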

simeon.report.utilities.make_user_info_combo(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='user_info_combo.json.gz')

Given a course’s SQL directory, make a user_info_combo report

Parameters:
  • dirname (str) – Name of a course’s directory of SQL files

  • outname (str) – The filename to give to the generated report

Return type:

None

Returns:

Nothing, but writes the generated data to the outname argument

simeon.report.utilities.module_from_block(block)

Extract a module ID from the given block

Parameters:

block (str) – A module item’s block string

Return type:

str

Returns:

The module ID extracted from the given block string
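As a rough illustration, edX block usage keys typically look like `block-v1:ORG+COURSE+RUN+type@TYPE+block@ID`. The sketch below extracts the trailing ID under that assumption; simeon's actual module_from_block may handle other key formats as well.

```python
import re

def module_id_from_block(block):
    """Extract the trailing module ID from an edX block usage key.

    Hypothetical sketch: assumes keys of the form
    block-v1:ORG+COURSE+RUN+type@TYPE+block@ID.
    """
    match = re.search(r"block@([^+/]+)$", block)
    return match.group(1) if match else block

print(module_id_from_block(
    "block-v1:MITx+6.002x+2013_Spring+type@problem+block@a1b2c3"
))  # a1b2c3
```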

simeon.report.utilities.process_course_structure(data, start, mapping, parent=None)

Given the course structure data dictionary and a starting point, loop through it and construct course axis data items

Parameters:
  • data (dict) – The data from the course_structure-analytics.json file

  • start (str) – The key from data to start looking up children

  • mapping (dict) – A dict mapping child blocks to their parents

  • parent (Union[None, str]) – Parent of start

Return type:

List[Dict]

Returns:

Returns the list of constructed data items
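The traversal can be sketched as a recursion over each node's children. This is a simplified stand-in: the field names ("category", "children") follow the edX course_structure export, but the real function also threads through a start date and a child-to-parent mapping, and emits richer records.

```python
def walk_structure(data, start, parent=None):
    """Recursively flatten a course-structure dict into axis-like items.

    Sketch only: simeon's process_course_structure produces fuller
    course axis records than the minimal dicts built here.
    """
    node = data[start]
    items = [{
        "block": start,
        "category": node.get("category"),
        "parent": parent,
    }]
    for child in node.get("children", []):
        items.extend(walk_structure(data, child, parent=start))
    return items

structure = {
    "course/root": {"category": "course", "children": ["chapter/1"]},
    "chapter/1": {"category": "chapter", "children": []},
}
print(walk_structure(structure, "course/root"))
```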

simeon.report.utilities.wait_for_bq_job_ids(job_list, client)

Given a list of BigQuery load or query job IDs, wait for them all to finish.

Parameters:
  • job_list (Iterable[str]) – An Iterable of job IDs

  • client (google.cloud.bigquery.client.Client) – A BigQuery Client object to do the waiting

Return type:

Dict[str, Dict[str, str]]

Returns:

Returns a dict of job IDs to job errors

TODO:

Improve this function to behave a little less like a tight loop
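The waiting behavior can be sketched as a polling loop, which is also why the TODO above calls it a tight loop. The stub job class below is a hypothetical stand-in for the job handles a BigQuery client returns; it is not simeon or google-cloud-bigquery code.

```python
import time

class StubJob:
    """Hypothetical stand-in for a BigQuery job handle."""
    def __init__(self, job_id, ticks_to_done, errors=None):
        self.job_id = job_id
        self._ticks = ticks_to_done
        self.errors = errors
    def done(self):
        # Pretend the job finishes after a fixed number of polls
        self._ticks -= 1
        return self._ticks <= 0

def wait_for_jobs(jobs, poll_interval=0.01):
    """Poll every job until all report done; return job_id -> errors."""
    pending = list(jobs)
    results = {}
    while pending:
        still_pending = []
        for job in pending:
            if job.done():
                results[job.job_id] = job.errors or {}
            else:
                still_pending.append(job)
        pending = still_pending
        if pending:
            time.sleep(poll_interval)
    return results

print(wait_for_jobs([StubJob("load-1", 1), StubJob("query-2", 3)]))
```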

simeon.report.utilities.wait_for_bq_jobs(job_list)

Given a list of BigQuery load or query jobs, wait for them all to finish.

Parameters:

job_list (Iterable[LoadJob]) – An Iterable of job objects from the bigquery package

Return type:

None

Returns:

Nothing

TODO:

Improve this function to behave a little less like a tight loop

Components of the exceptions package

Exceptions module

Exception classes for the simeon package

exception simeon.exceptions.AWSException

Raised when an S3 resource can’t be made

exception simeon.exceptions.BadSQLFileException(message, context_dict=None)

Raised when a SQL file is not in its expected format

exception simeon.exceptions.BigQueryNameException(message, context_dict=None)

Raised when a fully qualified table or dataset name can’t be created.

exception simeon.exceptions.BlobDownloadError

Raised when a Blob fails to download. This could be an upstream issue. However, it can also be due to local file system access issues, or due to exhausted system resources

exception simeon.exceptions.DecryptionError(message, context_dict=None)

Raised when the GPG decryption process fails.

exception simeon.exceptions.EarlyExitError(message, context_dict=None)

Raised when an early exit is requested by the end user of the CLI tool

exception simeon.exceptions.LoadJobException(message, context_dict=None)

Raised from a BigQuery data load job

exception simeon.exceptions.MissingFileException(message, context_dict=None)

Raised when a necessary file is missing

exception simeon.exceptions.MissingQueryFileException(message, context_dict=None)

Raised when a report table does not have a query file in the given query directory.

exception simeon.exceptions.MissingSchemaException(message, context_dict=None)

Raised when a schema could not be found for a given BigQuery table name

exception simeon.exceptions.SQLQueryException(message, context_dict=None)

Raised when calling client.query raises an error

exception simeon.exceptions.SchemaMismatchException(message, context_dict=None)

Raised when a record does not match its corresponding schema

exception simeon.exceptions.SimeonError(message, context_dict=None)

Base exception for most simeon issues

exception simeon.exceptions.SplitException(message, context_dict=None)

Raised when an issue happens during a split operation
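Since most of the exceptions above share the `(message, context_dict=None)` signature and derive from SimeonError, callers can catch the base class and inspect the context. The classes below are local stand-ins mirroring that signature for illustration; real code would import them from simeon.exceptions.

```python
# Local stand-ins mirroring the exception signatures documented above.
class SimeonError(Exception):
    """Base exception carrying a message and an optional context dict."""
    def __init__(self, message, context_dict=None):
        super().__init__(message)
        self.context_dict = context_dict or {}

class MissingSchemaException(SimeonError):
    """Raised when no schema is found for a BigQuery table name."""

try:
    raise MissingSchemaException(
        "No schema file found for table", {"table": "person_course"}
    )
except SimeonError as exc:  # catching the base covers all subclasses
    print(exc, exc.context_dict)
```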