The Python API of this tool is divided into the following sections:
- download: Handles the downloading of data files from S3. It matches up with the simeon download and simeon split commands.
- upload: Handles the uploading of data to GCS and BigQuery. It matches up with the simeon push command.
- report: Handles the generation of secondary tables in BigQuery. It matches up with the simeon report command.
Components of the download package¶
AWS module¶
Module of utilities to help with listing and downloading files from S3
- class simeon.download.aws.S3Blob(name, size, last_modified, bucket, local_name=None)¶
A class to represent S3 blobs
- download_file(filename=None)¶
Download the S3Blob to the local file system and return the full path where the file is saved
- Parameters:
filename (Union[None, str]) – Name of the output file
- Return type:
str
- Returns:
Returns the full path where the file is saved
- classmethod from_info(bucket, type_, date, org='mitx', site='edx')¶
Make a list of blobs with the given parameters
- Parameters:
bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob
type_ (str) – “sql”, “email”, or “log”
date (Union[str, datetime]) – A datetime or str object for a threshold date
org (str) – The org whose data will be fetched.
site (str) – The site from which data were generated
- Return type:
List[S3Blob]
- Raises:
AWSException
- classmethod from_prefix(bucket, prefix)¶
Fetch a list of S3Blob objects from AWS whose names have the given prefix.
- Parameters:
bucket (s3.Bucket) – The boto3.s3.Bucket object to tie to this blob
prefix (str) – A string with which to filter the list of objects
- Return type:
List[S3Blob]
- Returns:
A list of S3Blob objects
- Raises:
AWSException
- to_json()¶
Jsonify the Blob
- simeon.download.aws.get_file_date(fname)¶
Get the date in the name of the S3 blob
- simeon.download.aws.make_s3_bucket(bucket, client_id=None, client_secret=None, session_token=None, profile_name=None)¶
Make a simple boto3 Bucket object pointing to S3
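A minimal sketch of how these pieces fit together: make a Bucket object, list blobs by prefix, and download each one. The bucket name, profile name, and prefix below are hypothetical.

```python
from simeon.download import aws

# Hypothetical bucket and AWS profile names; any valid boto3
# credential source (profile, env vars, etc.) should work here.
bucket = aws.make_s3_bucket("edx-course-data", profile_name="simeon")

# List the S3Blob objects whose names start with the given prefix,
# then download each one and print its local path.
for blob in aws.S3Blob.from_prefix(bucket, prefix="mitx"):
    print(blob.download_file())
```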
Email opt-in module¶
Module to process email opt-in data from edX
- simeon.download.emails.compress_email_files(files, ddir, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Generate a GZIP JSON file in the given ddir directory using the contents of the files. NOTE: schema_dir is not used yet, but we may use it to check that the generated records match their destination tables.
- Parameters:
files (Iterable[str]) – An iterable of email opt-in CSV files to process
ddir (str) – A destination directory
schema_dir (Union[None, str]) – Directory where schema files live
- Return type:
None
- Returns:
Writes the contents of files into email_opt_in.json.gz
- simeon.download.emails.parse_date(datestr)¶
Convert datestr to an iso formatted date. If not possible, return None
- simeon.download.emails.process_email_file(fname, verbose=True, logger=None, timeout=None, keepfiles=False)¶
Process an email opt-in file. These files differ from other data files in that they are ZIP archives containing GPG-encrypted files.
- Parameters:
fname (str) – Zip archive containing the email opt-in data file
verbose (bool) – Whether to print stuff when decrypting
logger (logging.Logger) – A Logger object to print messages with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Whether to keep the .gpg files after decrypting them
- Return type:
str
- Returns:
Returns the path to the decrypted file
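A minimal sketch of the email workflow, assuming a downloaded opt-in archive; the file name below is hypothetical.

```python
from simeon.download import emails

# Unzip the archive and GPG-decrypt its contents; the returned value
# is the path to the decrypted file.
decrypted = emails.process_email_file(
    "email-opt-in_mitx_2021-01-01.zip", verbose=True, timeout=60,
)

# Compress the decrypted file(s) into email_opt_in.json.gz under "data"
emails.compress_email_files([decrypted], ddir="data")
```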
Tracking logs module¶
Module to process tracking log files from edX
- simeon.download.logs.batch_split_tracking_logs(filenames, ddir, dynamic_date=False, courses=None, verbose=True, logger=None, size=10, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', debug=False)¶
Call split_tracking_log on each file inside a process or thread pool
- simeon.download.logs.process_line(line: str | bytes, lcount: int, date: None | datetime = None, is_gzip=True, courses: Iterable[str] | None = None) dict ¶
Process the line from a tracking log file and return the reformatted line (deserialized) along with the name of its destination file.
- Parameters:
line (Union[str, bytes]) – A line from the tracking logs
lcount (int) – The line number of the given line
date (Union[None, datetime]) – The date of the file where this line comes from.
is_gzip (bool) – Whether or not this line came from a GZIP file
courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported
- Return type:
Dict[str, Union[Dict[str, str], str]]
- Returns:
Dictionary with both the data and its destination file name
- simeon.download.logs.split_tracking_log(filename: str, ddir: str, dynamic_date: bool = False, courses: Iterable[str] | None = None, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Split the records in the given GZIP tracking log file. This function is very resource hungry because it keeps around a lot of open file handles and writes to them whenever it processes a good record. Some attempts are made to keep records around whenever the process is no longer allowed to open new files. But that will likely lead to the exhaustion of the running process’s allotted memory.
- NOTE:
If you’ve got a better way, please update me.
- Parameters:
filename (str) – The GZIP file to split
ddir (str) – Destination directory of the generated file
dynamic_date (bool) – Use dates from the JSON records to make output file names
courses (Union[Iterable[str], None]) – A list of course IDs whose records are exported
schema_dir (Union[None, str]) – Directory where to find schema files
- Return type:
bool
- Returns:
True if files have been generated. False, otherwise
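A minimal sketch of splitting a single tracking log file; the file name and course ID below are hypothetical.

```python
from simeon.download import logs

# Split the GZIP tracking log into per-course files under "data",
# keeping only records that belong to the listed course IDs.
generated = logs.split_tracking_log(
    "tracklog-2021-01-01.log.gz",
    ddir="data",
    courses=["MITx/6.002x/2021_Spring"],
)
print("files generated" if generated else "nothing to split")
```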
SQL data files module¶
Module to process SQL files from edX
- simeon.download.sqls.batch_decrypt_files(all_files, size=100, verbose=False, logger=None, timeout=None, keepfiles=False, njobs=5)¶
Batch the files by the given size and pass each batch to gpg to decrypt.
- Parameters:
all_files (List[str]) – List of file names
size (int) – The batch size
verbose (bool) – Print the command to be run
logger (logging.Logger) – A logging.Logger object to print the command with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Keep the encrypted files after decrypting them.
njobs (int) – Number of threads to use to call gpg in parallel
- Return type:
None
- Returns:
Nothing, but decrypts the .sql files from the given archive
- simeon.download.sqls.force_delete_files(files, logger=None)¶
Delete the given files without regard for whether or not they exist
- Parameters:
files (Iterable[str]) – Iterable of file names
logger (Union[None, logging.Logger]) – A logger object to log messages
- Return type:
None
- Returns:
Returns nothing, but deletes the given files from the local FS
- simeon.download.sqls.process_sql_archive(archive, ddir=None, include_edge=False, courses=None, size=5, tables_only=False, debug=False)¶
Unpack and decrypt files inside the given archive
- Parameters:
archive (str) – SQL data package (a ZIP archive)
ddir (str) – The destination directory of the unpacked files
include_edge (bool) – Include the files from the edge site
courses (Union[Iterable[str], None]) – A list of course IDs whose data files are unpacked
size (int) – The size of the thread or process pool doing the unpacking
tables_only (bool) – Whether to extract file names only (no unarchiving)
debug (bool) – Show the stacktrace when an error occurs
- Return type:
Set[str]
- Returns:
A set of file names
- simeon.download.sqls.unpacker(fname, names, ddir, cpaths=None, tables_only=False)¶
A worker callable to pass to a Thread or Process pool
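A minimal sketch tying these functions together; the archive name and course ID below are hypothetical.

```python
from simeon.download import sqls

# Unpack the encrypted files for the given courses from the archive;
# the returned set contains the unpacked file names.
files = sqls.process_sql_archive(
    "mitx-2021-01-01.zip", ddir="data",
    courses=["MITx/6.002x/2021_Spring"],
)

# Decrypt the unpacked files in batches of 50, with 4 gpg threads
sqls.batch_decrypt_files(sorted(files), size=50, verbose=True, njobs=4)
```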
Utilities module for the download package¶
Some utility functions for working with the downloaded data files
- simeon.download.utilities.check_for_funny_keys(record, name='toplevel')¶
I am quite frankly not sure what Ike is trying to do here, but there should be a better way. For now, though, we’ll just have to make do.
- Parameters:
record (dict) – Dictionary whose values are modified
name (str) – Name of the level of the dict
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.decrypt_files(fnames, verbose=True, logger=None, timeout=None, keepfiles=False)¶
Decrypt the given file with gpg. This assumes that the gpg command is available in the SHELL running this script.
- Parameters:
fnames (Union[str, List]) – A file name or a list of file names to decrypt
verbose (bool) – Print the command to be run
logger (logging.Logger) – A logging.Logger object to print the command with
timeout (Union[int, None]) – Number of seconds to wait for the decryption to finish
keepfiles (bool) – Keep the encrypted files after decryption, if True.
- Return type:
bool
- Returns:
Returns True if the decryption does not fail
- Raises:
DecryptionError
- simeon.download.utilities.drop_empties(record, *keys)¶
Recursively drop keys whose corresponding values are empty from the given record.
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.format_sql_filename(fname: str)¶
Reformat the given edX SQL encrypted file name into a name indicative of where the file should end up after the SQL archive is unpacked: site/folder/filename.ext.gpg
- simeon.download.utilities.get_course_id(record: dict, paths=None) str ¶
Given a JSON record, try getting the course_id out of it.
- Parameters:
record (dict) – A deserialized JSON record
paths (Iterable[Iterable[str]]) – Paths to follow to find a matching course ID string
- Return type:
str
- Returns:
A valid edX course ID or an empty string
- simeon.download.utilities.get_file_date(fname)¶
Extract the date in a file name and parse it into a datetime object
- Parameters:
fname (str) – Some file name
- Return type:
Union[None, datetime]
- Returns:
Returns a datetime object or None
- simeon.download.utilities.get_module_id(record: dict, paths=None)¶
Get the module ID of the given record
- Parameters:
record (dict) – A deserialized JSON record
paths (Iterable[Iterable[str]]) – Paths to follow to find a matching module ID string
- Return type:
str
- Returns:
A valid edX module ID or an empty string
- simeon.download.utilities.get_sql_course_id(course_str: str) str ¶
Given a course ID string from the SQL files, pluck out the actual course ID and format it as follows: ORG/COURSE_NUMBER/TERM
- Parameters:
course_str (str) – The course ID string from edX
- Return type:
str
- Returns:
The actual course ID, properly formatted
- simeon.download.utilities.is_float(val)¶
Check that the string can be coerced into a float.
- simeon.download.utilities.make_file_handle(fname: str, mode: str = 'wt', is_gzip: bool = False)¶
Create a file handle pointing to the given file name. If the directory of the file does not exist, create it.
- Parameters:
fname (str) – A file name whose handle needs to be created.
mode (str) – “a[bt]?” for append or “w[bt]?” for write
is_gzip (bool) – Open it as a gzip file handle, if True.
- Return type:
Union[TextIOWrapper, BufferedReader]
- simeon.download.utilities.make_tracklog_path(course_id: str, datestr: str, is_gzip=True) str ¶
Make a local file path name with the given course ID and date string
- Parameters:
course_id (str) – Properly formatted edX course ID
datestr (str) – %Y-%m-%d formatted date associated with the tracking log
is_gzip (bool) – Make a GZIP file, if True.
- Return type:
str
- Returns:
A local FS file path
- simeon.download.utilities.move_field_to_mongoid(record: dict, path: list)¶
Move the values associated with the given path into record[‘mongoid’]
- Parameters:
record (dict) – Dictionary whose values are modified
path (Iterable[str]) – A list of keys to traverse and move
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.move_unknown_fields_to_agent(record, *keys)¶
Move the values associated with the given keys into record[‘agent’]
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the record in place
- simeon.download.utilities.parse_mongo_tstamp(timestamp: str)¶
Try converting a MongoDB timestamp into a stringified datetime
- Parameters:
timestamp (str) – String representing a timestamp. This can be either a unix timestamp or a datetime.
- Return type:
str
- Returns:
A formatted datetime
- simeon.download.utilities.rephrase_record(record: dict)¶
Update the given record in place. The purpose of this function is to turn this record into something with the same schema as that of the target BigQuery table.
- Parameters:
record (dict) – A deserialized JSON record
- Return type:
None
- Returns:
Nothing, but updates the given record in place
- simeon.download.utilities.stringify_dict(record, *keys)¶
Given a dictionary and some keys, JSON stringify the values at those keys in place.
- Parameters:
record (dict) – Dictionary whose values are modified
keys (Iterable[str]) – multiple args
- Return type:
None
- Returns:
Modifies the dict in place
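A minimal sketch of a couple of these helpers on a toy record.

```python
from simeon.download import utilities

record = {"event": {"answer": "choice_1"}, "time": "2021-01-01T00:00:00"}

# JSON-stringify the value stored under the "event" key, in place
utilities.stringify_dict(record, "event")

# Try to pluck a course ID out of the record; returns "" if none match
course_id = utilities.get_course_id(record)
print(record["event"], course_id or "<no course ID>")
```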
Components of the upload package¶
GCP module¶
Utilities functions and classes to help with loading data to Google Cloud
- class simeon.upload.gcp.BigqueryClient(project=None, credentials=None, _http=None, location=None, default_query_job_config=None, default_load_job_config=None, client_info=None, client_options=None)¶
Subclass bigquery.Client and add convenience methods
- static export_compiled_query(query, table, target_directory)¶
Export a query string to the target directory
- Parameters:
query (str) – Compiled SQL query that is sent to BigQuery
table (str) – Name of the table that is generated by the given query
target_directory (str) – The directory under which the compiled SQL query is stored
- Return type:
None
- Returns:
Stores SQL query under the given target directory
- static extract_error_messages(errors)¶
Return the error messages from given list of error objects (dict)
- get_course_tables(course_id)¶
Get all the tables related to the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
- Return type:
Dict[str, set]
- Returns:
A dict with keys as log and sql, and values as table names
- static get_not_found_object(message)¶
If the given message contains the keywords ‘Not found’, then try and determine the name and type of the object that is not found.
- has_latest_table(course_id, table)¶
Check if the given table name exists in the _latest dataset of the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
table (str) – Name of the table being looked up
- Return type:
bool
- Returns:
True if the table is currently in BigQuery
- has_log_table(course_id, table)¶
Check if the given table name exists in the _logs dataset of the given course ID
- Parameters:
course_id (str) – edX course ID in format ORG/NUMBER/TERM
table (str) – Name of the table being looked up
- Return type:
bool
- Returns:
True if the table is currently in BigQuery
- load_one_file_to_table(fname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False)¶
Load the given file to a target BigQuery table
- Parameters:
fname (str) – The specific file to load
file_type (str) – One of sql, email, log, rdx
project (str) – Target GCP project
create (bool) – Whether or not to create the destination table
append (bool) – Whether or not to append the records to the table
use_storage (bool) – Whether or not to load the data from GCS
bucket (str) – GCS bucket name to use
max_bad_rows (int) – Max number of bad rows allowed during loading
schema_dir (Union[None, str]) – Directory where schema files are found
format (str) – File format (json or csv)
patch (bool) – Whether or not to patch the description of the table
- Return type:
bigquery.LoadJob
- Returns:
The LoadJob object associated with the work being done
- Raises:
Propagates everything from the underlying package
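A minimal sketch of loading one split-up file, assuming application default credentials; the project name, dataset layout, and file path below are hypothetical.

```python
from simeon.upload import gcp

client = gcp.BigqueryClient(project="my-gcp-project")

# Kick off a load job for one tracking log file
job = client.load_one_file_to_table(
    "data/MITx__6.002x__2021_Spring/tracklog-2021-01-01.json.gz",
    file_type="log",
    project="my-gcp-project",
    create=True,
    append=False,
)
job.result()  # block until the job finishes, raising on hard failures
print(job.errors or "loaded without errors")
```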
- load_tables_from_dir(dirname: str, file_type: str, project: str, create: bool, append: bool, use_storage: bool = False, bucket: str | None = None, max_bad_rows=0, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', format_='json', patch=False) List[LoadJob] ¶
Load all the files in the given directory.
- Parameters:
dirname (str) – Grandparent or parent directory of split up files
file_type (str) – One of sql, email, log, rdx
project (str) – Target GCP project
create (bool) – Whether or not to create the destination table
append (bool) – Whether or not to append the records to the table
use_storage (bool) – Whether or not to load the data from GCS
bucket (str) – GCS bucket name to use
max_bad_rows (int) – Max number of bad rows allowed during loading
schema_dir (str) – Directory where schema files are found
format (str) – File format (json or csv)
patch (bool) – Whether or not to patch the description of the table
- Return type:
List[bigquery.LoadJob]
- Returns:
List of load jobs
- Raises:
Propagates everything from the underlying package
- make_template(query)¶
Create a Template object whose environment includes some of the client’s methods as filters
- Parameters:
query (str) – SQL query to use with the template being generated
- Return type:
jinja2.Template
- Returns:
Jinja2 Template object with the passed query
- merge_to_table(fname, table, col, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', use_storage=False, patch=False, match_equal_columns=None, match_unequal_columns=None, target_directory='target')¶
Merge the given file to the target table name. If the latter does not exist, create it first. This process waits for all the jobs it needs to complete.
- Parameters:
fname (str) – A local file name or a GCS URI
table (str) – Fully qualified BigQuery table name
col (str) – Column by which to merge
schema_dir (Union[None, str]) – The directory where schema files live
use_storage (bool) – Whether or not the given path is a GCS URI
patch (bool) – Whether or not to patch the description of the table
match_equal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set equality (=) if WHEN MATCH is met during the merge.
match_unequal_columns (Union[List[str], None, Tuple[str]]) – List of column names for which to set inequality (<>) if WHEN MATCH is met during the merge.
target_directory (str) – Target directory where to store SQL queries
- Return type:
bigquery.QueryJob
- Returns:
The QueryJob object associated with the merge carried out
- Raises:
Propagates everything from the underlying package
- class simeon.upload.gcp.GCSClient(project=<object object>, credentials=None, _http=None, client_info=None, client_options=None, use_auth_w_custom_endpoint=True, extra_headers={})¶
Make a client to load data files to GCS
- load_dir(dirname: str, file_type: str, bucket: str)¶
Load all the files in the given directory or any immediate subdirectories
- Parameters:
dirname (str) – The directory whose files are loaded
file_type (str) – One of sql, email, log, rdx
bucket (str) – GCS bucket name
- Return type:
None
- Returns:
Nothing, but should load file(s) in dirname to GCS
- Raises:
Propagates everything from the underlying package
- load_one_file_to_gcs(fname: str, file_type: str, bucket: str)¶
Load the given file to GCS
- Parameters:
fname (str) – The local file to load to GCS
file_type (str) – One of sql, email, log, rdx
bucket (str) – GCS bucket name
- Return type:
None
- Returns:
Nothing, but should load the given file to GCS
- Raises:
Propagates everything from the underlying package
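A minimal sketch of pushing one file to GCS, assuming application default credentials; the project and bucket names below are hypothetical.

```python
from simeon.upload import gcp

client = gcp.GCSClient(project="my-gcp-project")
client.load_one_file_to_gcs(
    "data/email_opt_in.json.gz",
    file_type="email",
    bucket="my-simeon-bucket",
)
```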
Utilities module for the upload package¶
Utility functions and classes associated with uploading data to GCP.
- simeon.upload.utilities.course_to_bq_dataset(course_id: str, file_type: str, project: str) str ¶
Make a fully qualified BigQuery dataset name with the given info
- Parameters:
course_id (str) – edX course ID to format into a GCS path
file_type (str) – One of sql, log, email, rdx
project (str) – A GCP project ID
- Return type:
str
- Returns:
BigQuery dataset name with components separated by dots
- simeon.upload.utilities.course_to_gcs_folder(course_id: str, file_type: str, bucket: str) str ¶
Use the given course ID to make a Google Cloud Storage path
- Parameters:
course_id (str) – edX course ID to format into a GCS path
file_type (str) – One of sql, log, email, rdx
bucket (str) – A GCS bucket name
- Return type:
str
- Returns:
A nicely formatted GCS path
- simeon.upload.utilities.dict_to_schema_field(schema_dict: dict)¶
Make a SchemaField from a schema dictionary
- Parameters:
schema_dict (dict) – One of the objects in the schema JSON file
- Return type:
bigquery.SchemaField
- Returns:
A SchemaField matching the given dictionary’s name, type, etc.
- simeon.upload.utilities.get_bq_schema(table: str, schema_dir: str = '/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given a bare table name (without leading project or dataset), make a list of bigquery.SchemaField objects to act as the table’s schema.
- Parameters:
table (str) – A BigQuery (bare) table name
schema_dir (str) – Directory where schema JSON file is looked up
- Return type:
Tuple[List[bigquery.SchemaField], str]
- Returns:
A 2-tuple with list of bigquery.SchemaField objects and a description text for the target table
- Raises:
MissingSchemaException
- simeon.upload.utilities.local_to_bq_table(fname: str, file_type: str, project: str) str ¶
Use the given local file to make a fully qualified BigQuery table name
- Parameters:
fname (str) – A local file name
file_type (str) – One of sql, log, email, rdx
project (str) – A GCP project ID
- Return type:
str
- Returns:
BigQuery dataset name with components separated by dots
- simeon.upload.utilities.local_to_gcs_path(fname: str, file_type: str, bucket: str) str ¶
Convert the local file name into a GCS path
- Parameters:
fname (str) – A local file name
file_type (str) – One of sql, log, email, rdx, cold
bucket (str) – A GCS bucket name
- Return type:
str
- Returns:
A nicely formatted GCS path
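A minimal sketch of the naming helpers; the project and bucket names below are hypothetical.

```python
from simeon.upload import utilities

course_id = "MITx/6.002x/2021_Spring"

# Fully qualified BigQuery dataset for the course's SQL data
print(utilities.course_to_bq_dataset(course_id, "sql", "my-gcp-project"))

# GCS folder for the course's tracking log files
print(utilities.course_to_gcs_folder(course_id, "log", "my-simeon-bucket"))
```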
- simeon.upload.utilities.make_bq_load_config(table: str, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', append: bool = False, create: bool = True, file_format: str = 'json', delim=',', max_bad_rows=0)¶
Make a bigquery.LoadJobConfig object and description of a table
- Parameters:
table (str) – Fully qualified table name
schema_dir (str) – The directory where schema files live
append (bool) – Whether to append the loaded data to the table
create (bool) – Whether to create the target table if it does not exist
file_format (str) – One of sql, json, csv, txt
delim (str) – The delimiter of the file being loaded
max_bad_rows (int) – The number of bad rows to tolerate when loading the data
- Return type:
Tuple[bigquery.LoadJobConfig, str]
- Returns:
A 2-tuple with a bigquery.LoadJobConfig object and a description text for the destination table
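A minimal sketch pairing get_bq_schema with make_bq_load_config; the fully qualified table name below is hypothetical.

```python
from simeon.upload import utilities

# Look up the schema and description packaged for a bare table name
schema, description = utilities.get_bq_schema("user_info_combo")

# Build a matching LoadJobConfig for a newline-delimited JSON load
config, description = utilities.make_bq_load_config(
    "my-gcp-project.MITx__6_002x__2021_Spring.user_info_combo",
    append=False, create=True, file_format="json",
)
```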
- simeon.upload.utilities.make_bq_query_config(append: bool = False, plain=True, table=None)¶
Make a bigquery.QueryJobConfig object to tie to a query to be sent to BigQuery for secondary table generation
- Parameters:
append (bool) – Whether to append the query results to the table
plain (bool) – Whether to make a plain (empty) QueryJobConfig object
table (Union[None, str]) – Fully qualified name of a destination table
- Return type:
bigquery.QueryJobConfig
- Returns:
A bigquery.QueryJobConfig object
- simeon.upload.utilities.sqlify_bq_field(field, named=True)¶
Convert a bigquery.SchemaField object into a DDL column definition.
- Parameters:
field (bigquery.SchemaField) – A SchemaField object to convert to a DDL statement
named (bool) – Whether the returned statement starts with the field name
- Return type:
str
- Returns:
A SQL column’s DDL statement
Components of the report package¶
Utilities module for the report package¶
Utility functions and classes to help with making course reports like user_info_combo, person_course, etc.
- simeon.report.utilities.check_record_schema(record, schema, coerce=True, nullify=False)¶
Check that the given record matches the same keys found in the given schema list of fields. The latter is one of the schemas in simeon/upload/schemas/
- Parameters:
record (dict) – Dictionary whose values are modified
schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields
coerce (bool) – Whether or not to coerce values into BigQuery types
nullify (bool) – Whether to set values mapping missing keys to None
- Return type:
None
- Returns:
Modifies the record if needed
- Raises:
SchemaMismatchException
- simeon.report.utilities.course_from_block(block)¶
Extract a course ID from the given block ID
- Parameters:
block (str) – A module item’s block string
- Return type:
str
- Returns:
The course ID extracted from the module’s block string
- simeon.report.utilities.drop_extra_keys(record, schema)¶
Walk through the record and drop key-value pairs that are not in the given schema
- Parameters:
record (dict) – Dictionary whose values are modified
schema (Iterable[Dict[str, Union[str, Dict]]]) – A list of dicts with info on BigQuery table fields
- Return type:
None
- Returns:
Modifies the record if needed
- simeon.report.utilities.extract_table_query(table, query_dir)¶
Given a table name and a query directory, extract both the query string and the table description. The latter is assumed to be any line in the query file that starts with # or --
- Parameters:
table (str) – BigQuery table name whose query info is extracted
query_dir (str) – The directory where the query file is expected to be
- Return type:
Tuple[str, str]
- Returns:
A tuple of strings (query string, table description)
- Raises:
MissingQueryFileException
- simeon.report.utilities.get_has_solution(record)¶
Extract whether the given record is a problem that has showanswer. If it’s present and its associated value is not “never”, then return True. Otherwise, return False.
- Parameters:
record (dict) – A course_axis record
- Return type:
bool
- Returns:
Whether the course_axis record has a solution in the data
- simeon.report.utilities.get_problem_nitems(record)¶
Get a value for data.num_items in course_axis
- Parameters:
record (dict) – A course_axis record
- Return type:
Union[int, None]
- Returns:
The number of subitems of a problem item
- simeon.report.utilities.get_youtube_id(record)¶
Given a course structure record, extract the YouTube ID associated with the video element.
- Parameters:
record (dict) – A course_axis record
- Return type:
Union[str, None]
- Returns:
The YouTube video ID associated with the record
- simeon.report.utilities.make_course_axis(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='course_axis.json.gz')¶
Given a course’s SQL directory, make a course_axis report
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the outname argument
- simeon.report.utilities.make_forum_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='forum.json.gz')¶
Generate a file to load into the forum table using the given SQL directory
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target file
- simeon.report.utilities.make_grades_persistent(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', first_outname='grades_persistent.json.gz', second_outname='grades_persistent_subsection.json.gz')¶
Given a course’s SQL directory, make the grades_persistent and grades_persistent_subsection reports.
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
first_outname (str) – The filename to give to the grades_persistent report
second_outname (str) – The filename to give to the grades_persistent_subsection report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_grading_policy(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='grading_policy.json.gz')¶
Generate a file to be loaded into the grading_policy table of the given SQL directory.
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (Union[None, str]) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target file
- simeon.report.utilities.make_problem_analysis(state, **extras)¶
Use the state record from a studentmodule record to make a record to load into the problem_analysis table. The state is assumed to be from a record of category “problem”.
- Parameters:
state (dict) – Contents of the state field of studentmodule
extras (keyword arguments) – Things to be added to the generated record
- Return type:
dict
- Returns:
Return a record to be loaded in problem_analysis
- simeon.report.utilities.make_roles_table(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='roles.json.gz')¶
Generate a file to be loaded into the roles table of a dataset
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (Union[None, str]) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_sql_tables_par(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given a list of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name
- Parameters:
dirnames (List[str]) – Names of SQL directories
verbose (bool) – Print a message when a report is being made
logger (logging.Logger) – A logging.Logger object to print messages with
fail_fast (bool) – Whether or not to bail after the first error
debug (bool) – Show the stacktrace that caused the error
schema_dir (str) – The directory where schema files live
- Return type:
bool
- Returns:
True if the files are generated, and False otherwise.
- simeon.report.utilities.make_sql_tables_seq(dirnames, verbose=False, logger=None, fail_fast=False, debug=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas')¶
Given an iterable of SQL directories, make the SQL tables defined in this module. This convenience function calls all the report generating functions for the given directory name
- Parameters:
dirnames (Iterable[str]) – Names of SQL directories
verbose (bool) – Print a message when a report is being made
logger (logging.Logger) – A logging.Logger object to print messages with
fail_fast (bool) – Whether or not to bail after the first error
debug (bool) – Show the stacktrace that caused the error
schema_dir (str) – The directory where schema files live
- Return type:
bool
- Returns:
True if the files are generated, and False otherwise.
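A minimal sketch of generating every SQL-based report for one unpacked course directory; the directory name below is hypothetical.

```python
from simeon.report import utilities

ok = utilities.make_sql_tables_seq(
    ["data/MITx__6.002x__2021_Spring"], verbose=True, fail_fast=True,
)
print("reports generated" if ok else "some reports failed")
```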
- simeon.report.utilities.make_student_module(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='studentmodule.json.gz')¶
Generate files to load into studentmodule and problem_analysis using the given SQL directory
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the target files
- simeon.report.utilities.make_table_from_sql(table, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', target_directory='target', **kwargs)¶
Generate a BigQuery table using the given table name, course ID and a matching SQL query file in the query_dir folder. The query file contains placeholders for the course ID, dataset name and other details.
- Parameters:
table (str) – table name
course_id (str) – Course ID whose secondary reports are being generated
client (bigquery.Client) – An authenticated bigquery.Client object
project (str) – GCP project id where the video_axis table is loaded.
query_dir (Union[None, str]) – Directory where query files are saved.
schema_dir (Union[None, str]) – Directory where schema files live
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
wait (bool) – Whether to wait for the query job to finish running
target_directory (str) – Name of a directory where compiled SQL queries are stored
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Returns the errors dictionary from the LoadJob object tied to the query
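A minimal sketch, assuming application default credentials and a query file for the target table in the default query directory; the project name below is hypothetical.

```python
from google.cloud import bigquery
from simeon.report import utilities

client = bigquery.Client(project="my-gcp-project")
errors = utilities.make_table_from_sql(
    table="person_course",
    course_id="MITx/6.002x/2021_Spring",
    client=client,
    project="my-gcp-project",
    wait=True,
)
print(errors or "query completed without errors")
```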
- simeon.report.utilities.make_tables_from_sql(tables, course_id, client, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', parallel=False, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)¶
This is the plural/multiple tables version of make_table_from_sql
- Parameters:
tables (Iterable[str]) – BigQuery table names to create or append to
course_id (str) – Course ID whose secondary reports are being generated
client (bigquery.Client) – An authenticated bigquery.Client object
project (str) – GCP project id where the video_axis table is loaded.
query_dir (Union[None, str]) – Directory where query files are saved.
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
wait (bool) – Whether to wait for the query job to finish running
parallel (bool) – Whether the function is running in a process pool
fail_fast (bool) – Whether to stop processing after the first error
schema_dir (Union[None, str]) – Directory where schema files live
target_directory (str) – Name of the directory where to store compiled SQL queries
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Return a dict mapping table names to their corresponding errors
- simeon.report.utilities.make_tables_from_sql_par(tables, courses, project, append=False, query_dir='/home/runner/work/simeon/simeon/simeon/report/queries', wait=False, geo_table='geocode.geoip', youtube_table='videos.youtube', safile=None, size=4, logger=None, fail_fast=False, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', target_directory='target', **kwargs)¶
Parallel version of make_tables_from_sql
- Parameters:
tables (Iterable[str]) – An iterable of BigQuery table names
courses (Iterable[str]) – An iterable of course IDs
project (str) – The GCP project against which queries are run
append (bool) – Whether to append query results to the target tables
query_dir (str) – The directories where the SQL query files are found
wait (bool) – Whether to wait for the BigQuery load jobs to complete
geo_table (str) – Table name in BigQuery with geolocation data for IPs
youtube_table (str) – Table name in BigQuery with YouTube video details
safile (Union[None, str]) – GCP service account file to use to connect to BigQuery
size (int) – Size of the process pool to run queries in parallel
logger (logging.Logger) – A Logger object with which to report steps carried out
fail_fast (bool) – Whether to stop processing after the first error
schema_dir (Union[None, str]) – Directory where schema files live
target_directory (str) – Directory where compiled SQL queries are stored.
- Return type:
Dict[str, Dict[str, Dict[str, str]]]
- Returns:
A dict mapping course_ids to tables and their query errors
- simeon.report.utilities.make_user_info_combo(dirname, schema_dir='/home/runner/work/simeon/simeon/simeon/upload/schemas', outname='user_info_combo.json.gz')¶
Given a course’s SQL directory, make a user_info_combo report
- Parameters:
dirname (str) – Name of a course’s directory of SQL files
schema_dir (str) – Directory where schema files live
outname (str) – The filename to give to the generated report
- Return type:
None
- Returns:
Nothing, but writes the generated data to the outname argument
- simeon.report.utilities.module_from_block(block)¶
Extract a module ID from the given block
- Parameters:
block (str) – A module item’s block string
- Return type:
str
- Returns:
The module ID extracted from the module’s block string
- simeon.report.utilities.process_course_structure(data, start, mapping, parent=None)¶
Given the course structure data dictionary and a starting point, loop through it and construct course axis data items
- Parameters:
data (dict) – The data from the course_structure-analytics.json file
start (str) – The key from data to start looking up children
mapping (dict) – A dict mapping child blocks to their parents
parent (Union[None, str]) – Parent of start
- Return type:
List[Dict]
- Returns:
Returns the list of constructed data items
- simeon.report.utilities.wait_for_bq_job_ids(job_list, client)¶
Given a list of BigQuery load or query job IDs, wait for them all to finish.
- Parameters:
job_list (Iterable[str]) – An Iterable of job IDs
client (google.cloud.bigquery.client.Client) – A BigQuery Client object to do the waiting
- Return type:
Dict[str, Dict[str, str]]
- Returns:
Returns a dict of job IDs to job errors
- TODO:
Improve this function to behave a little less like a tight loop
- simeon.report.utilities.wait_for_bq_jobs(job_list)¶
Given a list of BigQuery load or query jobs, wait for them all to finish.
- Parameters:
job_list (Iterable[LoadJob]) – An Iterable of job objects from the bigquery package
- Return type:
None
- Returns:
Nothing
- TODO:
Improve this function to behave a little less like a tight loop
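A minimal sketch pairing the upload client with wait_for_bq_jobs; the project name and directory below are hypothetical.

```python
from simeon.report import utilities
from simeon.upload import gcp

client = gcp.BigqueryClient(project="my-gcp-project")

# Start one load job per file in the directory of split-up logs
jobs = client.load_tables_from_dir(
    "data/MITx__6.002x__2021_Spring",
    file_type="log",
    project="my-gcp-project",
    create=True,
    append=True,
)

# Block until every job has finished
utilities.wait_for_bq_jobs(jobs)
```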
Components of the exceptions package¶
Exceptions module¶
Exception classes for the simeon package
- exception simeon.exceptions.AWSException¶
Raised when an S3 resource can’t be made
- exception simeon.exceptions.BadSQLFileException(message, context_dict=None)¶
Raised when a SQL file is not in its expected format
- exception simeon.exceptions.BigQueryNameException(message, context_dict=None)¶
Raised when a fully qualified table or dataset name can’t be created.
- exception simeon.exceptions.BlobDownloadError¶
Raised when a Blob fails to download. This could be an upstream issue. However, it can also be due to local file system access issues, or due to exhausted system resources
- exception simeon.exceptions.DecryptionError(message, context_dict=None)¶
Raised when the GPG decryption process fails.
- exception simeon.exceptions.EarlyExitError(message, context_dict=None)¶
Raised when an early exit is requested by the end user of the CLI tool
- exception simeon.exceptions.LoadJobException(message, context_dict=None)¶
Raised from a BigQuery data load job
- exception simeon.exceptions.MissingFileException(message, context_dict=None)¶
Raised when a necessary file is missing
- exception simeon.exceptions.MissingQueryFileException(message, context_dict=None)¶
Raised when a report table does not have a query file in the given query directory.
- exception simeon.exceptions.MissingSchemaException(message, context_dict=None)¶
Raised when a schema could not be found for a given BigQuery table name
- exception simeon.exceptions.SQLQueryException(message, context_dict=None)¶
Raised when calling client.query raises an error
- exception simeon.exceptions.SchemaMismatchException(message, context_dict=None)¶
Raised when a record does not match its corresponding schema
- exception simeon.exceptions.SimeonError(message, context_dict=None)¶
Base exception for most simeon issues
- exception simeon.exceptions.SplitException(message, context_dict=None)¶
Raised when an issue happens during a split operation