2020 ARTG Summer Projects


We will be offering two project opportunities to students during the summer 2020 period.

Please see the details of the two projects below.


Summer Project II: Data Engineering Skill-building Project

Project sponsor: John Moeller, Director | jmoeller@email.arizona.edu | 520-621-4545

 

Background

The MIS Academic and Research Technologies Group (ARTG) seeks to provide guidance and structure for a self-study project over the summer. The design of this project comes from student skill-building surveys as well as input from MSMIS alumni. This project will be owned by an ARTG project team (ARTG-PT).

 

Project Goals and Opportunities

Students will develop an end-to-end data engineering project using modern, free, publicly available tools. They will be required to implement the seven steps of the pipeline (a minimal sketch follows the list):

  1. identify dataset(s),
  2. ingest dataset(s),
  3. cleanse data,
  4. load data into a database technology,
  5. retrieve data from the database,
  6. analyze data, and
  7. visualize data in an interesting way.
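
For reference, a minimal sketch of one pass through these seven steps is shown below, using the Jupyter/Python stack described below. The dataset URL and column names are placeholders, and SQLite simply keeps the sketch self-contained; each student will choose their own dataset and database technology.

    # Minimal end-to-end pipeline sketch (dataset URL and column names are placeholders)
    import sqlite3

    import matplotlib.pyplot as plt
    import pandas as pd

    # Steps 1-2: identify and ingest a dataset
    df = pd.read_csv("https://example.com/city_temperatures.csv")

    # Step 3: cleanse (drop incomplete rows, normalize a column name)
    df = df.dropna().rename(columns={"Temp": "temperature"})

    # Step 4: load into a database (SQLite used here for simplicity)
    conn = sqlite3.connect("pipeline.db")
    df.to_sql("readings", conn, if_exists="replace", index=False)

    # Step 5: retrieve from the database
    monthly = pd.read_sql_query(
        "SELECT month, AVG(temperature) AS avg_temp FROM readings GROUP BY month",
        conn,
    )

    # Steps 6-7: analyze and visualize
    monthly.plot(x="month", y="avg_temp", kind="bar", title="Average temperature by month")
    plt.show()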

 

We will define a technology stack for use: Jupyter notebooks, Python, and git. Students will also select a public git remote repository host (e.g., GitHub) and, optionally, a live notebook hosting provider; Kaggle Kernels is encouraged.

 

After receiving feedback on their initial proposals, students will complete the first iteration of their end-to-end data pipeline notebook. We will assess the notebooks and work with students to iterate on their initial effort, adding complexity and opportunities for self-learning. Notebooks combine code and presentation, so it is important that both are of high quality.

 

Project Timeline (subject to change)

May 26th - Project start; instructions for initial proposals given to students.

June 1st - Project proposals due. Proposals should be 1-2 pages, identify a dataset and the planned analysis and visualizations, and acknowledge the other steps of the data engineering pipeline (ingesting, cleansing, loading). Proposals will be reviewed by the ARTG-PT this week.

June 8th - Students set up their accounts and environments. First iteration begins.

June 15th and beyond - Iterations continue.

July 27th - Project closed; final notebooks shared.

 

Proposed Workflow

Work should take place in an iterative fashion; iteration lengths will vary from individual to individual. Once a student completes the initial pipeline, they can iterate on their original design, adding complexity to build skills and experience where they need or wish to.

 

Each iteration is likely to have the following steps:

  1. Propose. In this step, the student proposes their work for this iteration. The proposals should be clear, complete and to-the-point.
  2. Evaluate. The ARTG-PT evaluates the proposal and gives guidance on what was proposed.
  3. Execute. The student performs and completes the work they proposed, and related work necessary to complete their iteration. Students can contact peers and the ARTG-PT for assistance if they become blocked in their execution phase.
  4. Review. The student submits their work for review by having the ARTG-PT review their hosted notebook. We assess their work, and give them ideas for their next iteration(s).

 

Each interaction with the project team is an opportunity for students to practice effective business communication. The ARTG-PT will provide feedback on their preparedness and the quality of their presentation skills.

 

Deliverables and Outcomes

The hosted Jupyter notebook(s) and other elements are intended to serve as part of the student's professional portfolio and a valuable element of their personal brand. Ideally, these notebooks would exist on a public provider and be referenced in the student's professional materials.

 

In their final notebooks, students should demonstrate not only an understanding of the end-to-end data engineering tasks they worked through, but also a compelling presentation and documentation of what they learned or discovered.

 

By the end of the project, students should have the confidence to pivot to a more complex data science or machine learning project using elements of this technology stack. As a further example, they could use the collaborative tools they learned to team up and participate in a competition.

 

 


Summer Project III: Oracle Database Administration Script Improvement Project

Project sponsor: John Moeller, Director | jmoeller@email.arizona.edu | 520-621-4545

 

Background

The MIS Academic and Research Technologies Group (ARTG) supports database systems used for instruction in the MIS Department. We use Oracle RDBMS servers for courses at the graduate level, and make an additional Oracle database server available for students in the program. The Oracle instances will be upgraded over the summer to version 19c. We have scripts to manage administrative needs on these systems, such as user creation, account migration, and database backup. These scripts are in a mix of Python and Bash shell, with some SQL and RMAN. They are of varying quality and age, having been written over a long period of time by many contributors. They are missing key elements like documentation and unit testing, and are not under version control.

 

I am seeking committed volunteers who would like to contribute to a real-world systems programming project to improve these scripts. In addition to the programming roles, we also need contributors interested in project management and QA/QC to lead those efforts.

 

Project Goals and Opportunities

With the summer implementation of the new Oracle 19c systems, the time is right to improve our Oracle database administration scripts. I am confident that we can develop improved scripts in all functional areas, and that the scripts will be more manageable and better understood.

 

In addition to the benefits that our students and my group receive from improved scripts, this project may have benefits for interested contributors. This is an opportunity to learn Python, as our intention is to refactor as much of the code as possible into Python 3; some SQL and RMAN will also be used. I plan to use a collaborative development and version control system, probably GitHub, and to use the project management tools available within it to manage the project. I will require that quality-centered coding practices be followed, including the Feature Branch Workflow, test-driven development with unit testing, and peer code reviews. I'd like us to use a common IDE, PyCharm Community. Finally, I'd like us to embrace the virtual-team nature of the toolset as much as is helpful; we can certainly meet whenever necessary, but many of these tools foster asynchronous, non-interactive work.
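
As a rough illustration of the coding style these practices point toward (and not a prescription of the real scripts), here is a hypothetical account-creation helper written as a documented Python 3 function. The function name, parameters, SQL, and use of the cx_Oracle driver are illustrative assumptions; the actual scripts and their requirements live in our private repository.

    # Hypothetical example only: a legacy account-creation step as a documented
    # Python 3 function. The driver (cx_Oracle), names, and SQL are illustrative.
    import re

    import cx_Oracle

    def create_student_account(conn: cx_Oracle.Connection,
                               username: str,
                               tablespace: str = "USERS") -> None:
        """Create a student schema with a default tablespace and basic grants.

        Args:
            conn: An open administrative connection to the Oracle instance.
            username: Schema name to create; must be a simple identifier.
            tablespace: Default tablespace for the new schema.

        Raises:
            ValueError: If the username is not a valid simple identifier.
        """
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]{0,29}", username):
            raise ValueError(f"invalid username: {username!r}")
        with conn.cursor() as cursor:
            # DDL cannot use bind variables for identifiers, so the name is
            # validated above before being placed in the statement.
            cursor.execute(
                f"CREATE USER {username} IDENTIFIED EXTERNALLY "
                f"DEFAULT TABLESPACE {tablespace} QUOTA 50M ON {tablespace}"
            )
            cursor.execute(f"GRANT CREATE SESSION, CREATE TABLE TO {username}")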

 

Note: Work produced in this project will be private and the property of the University of Arizona. While you will be able to describe what you worked on in a general sense, you won't be able to share details such as code, documentation, notes, or screenshots.

 

Project Timeline (subject to change)

June 8th - Project kick-off: Organize teams and roles, develop ground rules, discuss style, and set up environments.

June 15th - Asset inventory: Gather legacy scripts, build per-function project boards and fill out tasks and issues. Assign initial tasks to contributors.

June 22nd - Sprint 1

July 6th - Extended retrospective and assessment. Gather input and refine process.

July 13th - Sprint 2

July 27th - Sprint 3

August 10th - Finish project

 

Proposed Workflow

We will plan on an iterative, sprint-based process. I could see sprint periods lasting two weeks and having the following elements:

  1. Task (backlog) inventory, development, prioritization, organization, and assignment (1 day)
  2. Per-task workflow (1-7 days; sequential; multiple tasks per contributor per sprint):
    1. New feature branch created for a task
    2. Create function/object skeletons and docstrings for functions and entry points
    3. Unit tests developed per function/object (see the sketch after this list)
    4. Function/object code developed, docstrings filled out, and code documented where appropriate
    5. Code tested. If necessary, develop additional testing code and function/object code
    6. Once the code is fully tested, it is submitted as a pull request to the master branch, kicking off our QA/QC process:
      1. Peer code review takes place. Special attention should be paid to unit test code completeness, test code results, and docstring/comment correctness.
      2. Additional unit testing code may be developed, or changes to the function/object code or documentation may take place. This essentially repeats steps 3 and/or 4 above.
    7. Once the QA/QC participants are satisfied, the feature branch will be merged into the master branch via the pull request.
  3. Work-in-progress review (1-2 days). Toward the end of the sprint, we will assess whether any in-progress tasks need attention to complete; we may reorganize resources to push them to completion.
  4. Integration and retrospective (1 day). Bring final work together and discuss how our sprint went, with an eye towards addressing challenges and planning our next sprint.
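
To make the unit-testing step of the per-task workflow concrete, here is a sketch of what a test for the hypothetical helper shown earlier could look like. The database connection is mocked with unittest.mock so the test runs without a live Oracle instance; the module path is an assumption for illustration only.

    # Sketch of a unit test for the hypothetical create_student_account helper;
    # the module path artg_scripts.accounts is an assumption for illustration.
    import unittest
    from unittest import mock

    from artg_scripts.accounts import create_student_account

    class CreateStudentAccountTests(unittest.TestCase):
        def test_rejects_invalid_username(self):
            conn = mock.MagicMock()
            with self.assertRaises(ValueError):
                create_student_account(conn, "bad;name")
            conn.cursor.assert_not_called()

        def test_issues_create_user_ddl(self):
            conn = mock.MagicMock()
            cursor = conn.cursor.return_value.__enter__.return_value
            create_student_account(conn, "MIS_STUDENT_01")
            first_sql = cursor.execute.call_args_list[0].args[0]
            self.assertIn("CREATE USER MIS_STUDENT_01", first_sql)

    if __name__ == "__main__":
        unittest.main()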

 

In any sprint phase, new work may be identified and added to the task list, or existing work items may be broken down into smaller tasks. Similarly, we may find opportunities for code reuse across functional areas; such optimization and integration work could be noted as future tasks.

 

The start of a sprint phase is likely the ideal time for contributors to join or exit the project if they become available or unavailable to participate for any reason.

 

Final Deliverables

The final deliverable will include a code base of high-quality scripts to handle our Oracle database administration tasks. The code will be thoroughly tested, modular, and well-documented, and we will have history and metadata in our private repositories and project management boards.

 

When the project is complete, ARTG staff will work to integrate the new code base into the existing ARTG git remote repositories and deploy it onto the new servers for future use.