Project proposal due Friday February 25

Draft report due Monday, March 28

Peer review due Friday, April 1

Final report due Wednesday April 20

Repo and presentation video due Monday April 25 at 9:00am (scheduled exam time)

Comments and group evaluation due Friday April 29

Introduction

The goal of this project is to demonstrate proficiency in data science techniques by conducting a novel analysis of a dataset of your own choosing or creation. The dataset may already exist, or you may collect your own data using a survey or by scraping the web. You will also get practice presenting your results.

Brief project logistics

The final project will be done with your lab groups.

The five deliverables for the final project are:

  • A project proposal describing three datasets of interest
  • A written, reproducible report using R Markdown detailing your analysis
  • A GitHub repository corresponding to your report
  • Slides + a group video presentation
  • A formal peer review of another team’s project

Late projects will not be accepted without prior approval. As per the syllabus, you and your team must complete all components of the final project to pass the course.

The grade breakdown is as follows:

Total 100 pts
Project proposal 5 pts
Written report 50 pts
Slides 10 pts
Repository 5 pts
Peer feedback 10 pts
Video Presentation 20 pts

Data Sources

To perform a successful analysis, it is imperative that you choose a manageable dataset that can be analyzed using the tools we have learned in STA 199. This means that the data should be readily accessible, not contain too many missing values, and be large enough that multiple relationships can be explored. Your dataset must have at least 500 observations and at least ten variables (unless Prof. Fisher has approved an exception). The dataset should include a rich mix of categorical, discrete numeric, and continuous numeric data. If you have any doubts or are having trouble, please reach out to me.
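As a quick sanity check on a candidate dataset, something like the following sketch may help. The helper `check_dataset()` and the simulated `example_data` are hypothetical and only illustrate the size requirements stated above; substitute your own data frame.

```r
# Hypothetical helper: compares a data frame against the size requirements
# stated above (at least 500 observations and at least ten variables).
check_dataset <- function(df, min_rows = 500, min_cols = 10) {
  list(
    n_obs = nrow(df),
    n_vars = ncol(df),
    enough_obs = nrow(df) >= min_rows,
    enough_vars = ncol(df) >= min_cols,
    # Proportion of missing values in each column, largest first
    prop_missing = sort(colMeans(is.na(df)), decreasing = TRUE)
  )
}

# Simulated stand-in for a real dataset (600 rows, 10 columns)
set.seed(199)
example_data <- as.data.frame(matrix(rnorm(6000), nrow = 600, ncol = 10))
example_data$V1[1:30] <- NA  # introduce some missing values for illustration

result <- check_dataset(example_data)
result$enough_obs
result$enough_vars
head(result$prop_missing, 3)
```

A check like this does not replace instructor approval; it only flags datasets that are clearly too small or have columns with many missing values.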

All analyses must be done in RStudio and your final written report and analysis must be reproducible. This means that you must create an R Markdown document attached to a GitHub repository that will create your written report exactly upon knitting.

If you are using a dataset that comes in a format that we haven’t encountered in class (for instance, a .DAT file), make sure that you are able to load it into RStudio as this can be tricky depending on the source. If you are having trouble, google first, then ask for help.

Data sets that cannot be used:

  • Data sets used as examples/homework in this class

  • Data that has been analyzed in another class

  • Kaggle

The resources below may be helpful for finding an interesting dataset but feel free to explore on your own.

Project components

Project Proposal

The first stage of the final project is the project proposal.

The proposal serves two purposes:

  • to get you started early with reading and thinking about a dataset and the questions you want to answer

  • to ensure that the data you wish to analyze, the methods you plan to use, and the scope of your analyses are feasible and maximize your chance of success

Choose three substantially different datasets you are interested in analyzing. For each, identify the components below.

Introduction and Data

  • Identify the source of the data

  • When and how it was originally collected (by the curator, not necessarily how you found the data)

  • A brief description of the observations

Research question

  • Describe the research topic along with a concise statement of the research question and hypotheses.

Data set

Use the glimpse() function to glimpse your data and include the output at the end of your proposal.

Place the file containing your data in the data folder of your project repo.
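A minimal sketch of the glimpse step, assuming the dplyr package is installed. The `starwars` data frame ships with dplyr and is used here only as a stand-in for your own data.

```r
library(dplyr)  # provides glimpse()

# `starwars` ships with dplyr and stands in for your own dataset here.
# For the proposal, you would instead read your file from the data folder
# (for example with readr::read_csv() and a path of your own) and glimpse that.
glimpse(starwars)
```

The output lists the number of rows and columns, followed by each variable's name, type, and first few values.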

Submission

Order your proposals from first choice to third choice.

Submit the PDF of your proposal to Gradescope by Friday February 25 at 11:59 PM. I will provide feedback on your proposal and help you decide which dataset you should use for your final project. Project proposals should have no more than one page of text. (With the glimpse data it is likely to go beyond a page total.)

The project proposal will be graded as follows:

Total 5 pts
Introduction/data 2 pts
Research questions 2 pts
Data glimpses 1 pt

Written report

Your final report must be written using R Markdown. All team members must contribute to the GitHub repository, with regular meaningful commits / pushes. Before you finalize your report, make sure the printing of code chunks is turned off with the option echo = FALSE.
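One common way to turn off code printing for the whole document is to set the chunk default once, rather than adding the option to every chunk. This is a sketch, assuming the line is placed in a setup chunk near the top of your .Rmd file.

```r
# Place this in a setup chunk near the top of the .Rmd file.
# It sets echo = FALSE as the default for every chunk that follows.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still override the default by setting echo = TRUE in its own chunk header.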

Your final report must match your GitHub repository exactly. The mandatory components of the report are as follows, but feel free to expand with additional sections as necessary. Your final written report should not exceed ten pages inclusive of all tables and figures.

The written report is worth 50 points, broken down as follows:

Total 50 pts
Introduction/data 5 pts
Methodology 10 pts
Results 20 pts
Discussion 10 pts
Formatting 5 pts

Introduction and data

The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses.

Then identify the source of the data, when and how it was collected, relevant context and a general description of relevant variables.

Methodology

The methodology section should include justification of the choice of statistical method(s) used to answer your research question. You can also include summary statistics or figures used in exploratory data analysis to investigate your research question.

Results

Place figure(s) here to illustrate the main results from your analysis. One beautiful figure is worth more than several poorly formatted figures. You should not use the default gray-background ggplot theme.
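As an illustration of moving away from the default theme, here is a minimal sketch assuming ggplot2 is installed. The built-in `mtcars` data, the labels, and the choice of `theme_minimal()` are only examples; any non-default theme with clear labels would serve.

```r
library(ggplot2)

# mtcars is a built-in dataset used only for illustration
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Heavier cars tend to have lower fuel efficiency",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  ) +
  theme_minimal()  # replaces the default gray-background theme

p
```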

Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous).

Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and perform every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.

Discussion

This section is a conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. You should critique your own methods and provide suggestions for improving your analysis and future work. Issues pertaining to the reliability and validity of your data and the appropriateness of the statistical analyses should also be discussed. Also include a brief paragraph on what you would do differently if you were able to start over with the project.

Formatting

Your project should be professionally formatted. For example, this means labeling graphs and figures, turning off code chunks, and using tidyverse style.

Repository

In addition to your Gradescope submissions, we will be checking your GitHub repository. The repo should be formatted in the same manner as labs, with a data folder. This repository should have equal contribution by all team members and should include:

  • RMarkdown file (formatted to clearly present all of your code and results) that will output the final report in one document.
  • Meaningful README file on the GitHub repository that contains a codebook for relevant variables
  • Dataset(s) (in csv or RData format, in a /data folder)

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

Peer feedback

Critically reviewing others’ work is a crucial part of the scientific process, and STA 199 is no exception. You will be assigned a team to review. As part of the review process, you must provide your partner team a copy of your current report by Monday, March 28. Once they have received your report, your partner team will have until Friday, April 1 to provide a detailed critique of the written report and data analysis. This feedback is intended to help you create a high-quality final project, as well as give you experience reading and constructively critiquing the work of others.

During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. Provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.

Peer feedback will be graded on the extent to which it comprehensively and constructively addresses the components of the partner team’s report: the research context and motivation, exploratory data analysis, and any inference, modeling, or conclusions. As you work on the draft, the focus should be on the analysis and less on crafting the final report. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, inference or modeling, and drawing initial conclusions.

Lab on Tuesday March 29 will be devoted to allowing your group to write your peer review.

Presentation Slides

For your presentation, you and your team must also create presentation slides that summarize and showcase your project. Introduce your research question and dataset, showcase visualizations, and provide some conclusions. These slides should serve as a brief visual accompaniment to your write-up and will be graded for content and quality.

The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the slide deck.

  • Title Slide

  • Slide 1: Introduce the topic and motivation

  • Slide 2: Introduce the data

  • Slide 3 - 4: Highlights from exploratory data analysis

  • Slide 4 - 5: Inference / modeling

  • Slide 6: Conclusions + future work

Video presentation

By Monday April 25 at 9:00am (the scheduled exam time), you/your group will upload a video presentation of your project to Warpwire. Note that all members must present, and that a ten-minute time limit is strictly enforced.

For the presentation, you can speak over your slide deck, similar to the lecture content videos. I recommend using Zoom to record your presentation; however, you can use whatever platform works best for your group. Below are a few resources to help you record video presentations:

You will post the presentation video in Warpwire, which is accessible from the course Sakai site (bottom of the left-hand tool bar).

To upload your video to Warpwire:

  • Click the Warpwire tab in the course Sakai site.

  • Click the “+” and select “Upload files”.

  • Locate the video on your computer and click to upload.

  • Once you’ve uploaded the video to Warpwire, click to share the video and make a copy of the video’s URL. You will need this when you post the video in the discussion forum.

To post the video to the discussion forum:

  • Click the Presentations tab in the course Sakai site.

  • Click the Presentations topic.

  • Click “Start New Presentation”.

  • Make the title “Your Team Name: Project Title”. For example, “Teaching Team: Analysis of Cars in the US”.

  • Click the Warpwire icon (between the flag and shopping cart icons).

  • Select your video, then click “Insert 1 item.” This will embed your video in the conversation.

  • Under the video, paste the URL to your video.

  • You’re done!

You can see the Teaching team example in Sakai.

Presentation comments

Each student will be assigned 2 presentations to watch.

Watch the group’s video, then click “Reply” to post a question for the group. You may not post a question that’s already been asked on the discussion thread. Additionally, the question should (i) be substantive (i.e., it shouldn’t be “Why did you use a bar plot instead of a pie chart?”), (ii) demonstrate your understanding of the content from the course, and (iii) be relevant to that group’s specific presentation, i.e., demonstrate that you’ve watched the presentation.

You may start posting questions and comments on Monday April 25. All questions must be posted by Friday April 29.

This portion of the project will be assessed individually.

Tips

  • Ask questions if any of the expectations are unclear.

  • Code: In your write-up, your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all of your code, so that if I re-knit your Rmd file I can obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.

  • Merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.

  • The project is very open ended. For instance, in creating compelling visualization of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations.

  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).

  • All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade penalized.

While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

  • Before you finalize your write up, make sure the printing of code chunks is turned off with the option echo = FALSE.

  • Pay attention to details in your write-up. Neatness, coherency, and clarity will count.

  • Write all R code according to the style guidelines discussed in class.

Grading

Grading of the project will take into account the following:

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Peer evaluation

You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member with justification. This will contribute to your final project grade. Final survey evaluations will be due Friday April 29.

Late work policy

No late work is accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.