Research and Discovery Group Project
- Overview
- [Milestone 1] Reading Data Science and Social Science Literature
- [Milestone 2] Exploratory Data Analysis
- [Milestone 3] Exploratory Data Analysis, Python, and independent research question
- [Presentation] Final Presentation
Overview
This summer seminar gives scholars an opportunity, early in their data science careers, to explore independent projects centered on social science contexts.
Unlike a traditional open-ended research project, students will understand, explore, and reproduce existing contexts and findings of particular datasets; through reproducibility, students will build research skills and bridge interdisciplinary fields of study.
Components
There are several components to this seminar:
- Lecture: Lecture series on how to read semi-technical data science articles, consider ethical and social implications when studying a dataset, and do exploratory data analysis.
- Group: Group work series where scholars work on projects together.
- Project: Project-based exploration of parts 1 and 2 in teams of 3-4 Tuskegee students and 1 UC Berkeley student. Each group will focus on one social context.
Learning Goals and Workload
After this seminar, scholars will have exposure to the following:
- How to read data science articles
- How to consider ethical and social implications when studying a dataset
- How to do Exploratory Data Analysis
We expect there to be little commitment outside of the scheduled 4 hours.
[Milestone 1] Reading Data Science and Social Science Literature
Due: Wednesday, June 26, as a presentation
Project links:
- Project one-pager: One-Pager
- Group/Topic assignments: Group sheet
- Project adapted from Stanford CS197 Assignment 1 link
- Guided Reading Form link
Read a paper and outline its argument and structure.
Research gets transmitted in many forms (demos, code, talks, and more), but its formal report most often occurs through a written paper. The paper explains the problem, the approach, and an evaluation. When ready, the academic submits the paper for review by other academics; once it passes this peer review process, the paper appears at a venue such as a conference or journal. Since most research is disseminated in the form of papers, it’s critical to be able to read research papers and make sense of them.
Each group has one starter research paper. Find yourself a quiet place and work your way through it. Don’t worry if you can’t understand every detail; focus on understanding the paper’s big ideas and how they are argued. It can take time to read a paper — don’t feel discouraged if it takes you a long time.
Deliverable: Google Slides and Presentation
Your Google Slide deck should outline the paper and provide additional research context. While these slides will not be presented anywhere, the Faculty Director will review all slides as part of the Project Checkoff and ask you questions.
Outline of paper (one slide per bullet)
- Overview: What paper did you read? Who are its authors, and what institution(s) are they affiliated with?
- Problem: What problem is it solving? Why does this problem matter?
- Assumption in prior work: What was the assumption that prior research made when solving this problem? Why was that assumption inadequate?
- Insight: What is the novel idea that this paper introduces, breaking from that prior assumption? Alternatively, how does this paper synthesize prior research?
- Proof: How did the paper evaluate or prove that its insight is correct, and better than holding onto the old assumption?
- Visualization: What visualizations did the paper use to illustrate its proof? Share a figure that is meaningful to the thesis of the paper.
- Impact: What are the implications of this paper? How will it change how we think about the problem?
Additional research context (one slide per bullet):
- Summaries of recent current event articles relevant to this work
- Bio of a prominent researcher in the field at the intersection of society and data science
Additional readings (one slide):
- Suggest and summarize three (3) additional readings that provide additional context to your research question. At least one should be a research paper; the other two can be news articles or blog posts.
Submission
Share the Google Slides with Faculty Director Lisa Yan by Wednesday Week 2.
[Milestone 2] Exploratory Data Analysis
Conduct initial Exploratory Data Analysis (EDA), settle on a few datasets, and propose a more precise research question and an associated visualization.
Project links:
- Project one-pager: One-Pager
- Group/Topic assignments: Group sheet
- D-Lab Consultation form: https://dlab.berkeley.edu/scholar-consult
Due: Wednesday, July 10, as a presentation
By the end of Week 3 (July 3): Explore and understand how to obtain your data. Document this process.
- Create a Google Drive that is shared with Lisa and your group. Title the Google Drive with your team name.
- Create a Google Slides slidedeck that is also stored in your Google Drive. This will be your eventual deliverable.
- Find at least 2 datasets that you would like to analyze as part of your research. For each dataset you pick, put your answers to the following EDA questions (based on this week’s seminar slides) in your slidedeck:
  - Structure: Is the dataset readily available in a rectangular shape?
  - Granularity: What does each record represent? A person, a group of people? If a group, how is the group defined? Do all records capture granularity at the same level?
  - Scope: Does the dataset cover your area of interest? Consider both physical and temporal definitions of scope.
  - Temporality: When was the data collected/last updated? What is the meaning of time/date fields, if they exist?
  - Faithfulness: Are there specific missing values in your data? Spot-check with a few records: are there unrealistic/incorrect values? Could data be potentially falsified or entered incorrectly (e.g., people’s names)? What data is missing that you may want to additionally collect?
- Write 2-3 questions that may require visualizations to answer. List these questions in your slidedeck.
- Data wrangling: Store any structured rectangular data you have as Google Spreadsheets/CSV files within your Google Drive. If you need to wrangle your data into a rectangular form, do so, and document your process as needed.
- Schedule a consultation with D-Lab. Each individual should submit their own form; D-Lab consultants may choose to schedule consultations individually or with the entire group. Form link: https://dlab.berkeley.edu/scholar-consult
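Some of the EDA questions above can be partly checked in code. The following is a minimal sketch using only Python's standard library, not a required workflow. The sample data, the column names (county, date, count), and the helper summarize_csv are all hypothetical, and it assumes dates are written as YYYY-MM-DD; your own datasets will likely need different checks.

```python
import csv
import io
from datetime import date

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """county,date,count
Alameda,2024-06-01,12
Macon,2024-06-03,
Alameda,2024-06-08,7
Macon,2024-06-10,4
"""


def summarize_csv(text, date_column):
    """Return basic structure, faithfulness, and temporality facts for CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]

    # Structure: a rectangular table has the same number of fields in every row.
    is_rectangular = all(len(record) == len(header) for record in records)

    # Faithfulness: count empty cells in each column.
    missing = {name: 0 for name in header}
    for record in records:
        for name, value in zip(header, record):
            if value.strip() == "":
                missing[name] += 1

    # Temporality: earliest and latest dates (assumes ISO format, YYYY-MM-DD).
    idx = header.index(date_column)
    dates = [date.fromisoformat(r[idx]) for r in records if r[idx].strip()]

    return {
        "columns": header,
        "num_records": len(records),
        "is_rectangular": is_rectangular,
        "missing_per_column": missing,
        "date_range": (min(dates), max(dates)),
    }


summary = summarize_csv(SAMPLE_CSV, date_column="date")
for key, value in summary.items():
    print(key, "->", value)
```

Granularity and scope still need human judgment: code can count records and report date ranges, but only reading the dataset's documentation tells you what each record represents.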
By the end of Week 4 (July 10): Prepare your slidedeck for checkoff.
- Complete the D-Lab consultation, and take notes about the meeting.
- Answer the 2-3 questions you proposed that required visualizations. Include your visualizations in your slidedeck.
- Compile the deliverable (see below).
Deliverable: Google Slides
A slidedeck is a good way to communicate your ideas clearly to others who may occasionally check in on your work. You should compile at least 5 Google Slides:
- One slide shares your methodology, including how you found your datasets.
- One slide shares any takeaways from your meeting with D-Lab consultants.
- One slide answers the initial (non-visualization) questions you proposed.
- The remaining 2 slides are different visualizations along with a sentence or two describing the takeaways.
Your visualizations can be created in Google Sheets or with Python (which is part of Milestone 3).
[Milestone 3] Exploratory Data Analysis, Python, and independent research question
Continue EDA, use Python to create visualizations, and make progress towards your research question.
Due: Monday, July 29
Expected Work Time:
- Monday 7/15: Finish Colab setup; Reproduce 2 figures from Google Sheets.
- Wednesday 7/17: Continue work. Start exploring 2 new figures or tables in Python.
- Monday 7/22: Sociology activity with Dave Harding
- Wednesday 7/24: Identify an additional research question.
- Monday 7/29: Get checked off.
Note: The datascience library has different plotting styles from Google Sheets. When “reproducing” a figure/plot, we expect that you will spend considerable time getting the right tables and columns for plotting, then choosing the right arguments for datascience library functions. Here are the function reference sheets for Data 8 and Data 6.
It is less important to reproduce the formatting of the plot; in fact, doing so requires advanced plotting knowledge beyond the scope of Data 6/Data 8. UC Berkeley students can explain pandas as needed.
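As a rough illustration of "getting the right tables and columns" before plotting, the sketch below reshapes raw records into two aligned columns (categories and counts) using only Python's standard library. The records and the names records, count_by, and region are hypothetical. In the datascience library, a comparable result would typically come from grouping a table by a column before calling a bar chart method; check the function reference sheets for the exact arguments.

```python
from collections import Counter

# Hypothetical raw records: one row per survey response.
records = [
    {"region": "South", "year": 2020},
    {"region": "West", "year": 2020},
    {"region": "South", "year": 2021},
    {"region": "Midwest", "year": 2021},
    {"region": "South", "year": 2022},
    {"region": "West", "year": 2022},
]


def count_by(rows, column):
    """Return (categories, counts), sorted by descending count, then by name."""
    counts = Counter(row[column] for row in rows)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    categories = [name for name, _ in ordered]
    values = [n for _, n in ordered]
    return categories, values


categories, values = count_by(records, "region")

# A text bar chart is enough to sanity-check the table before real plotting.
for name, n in zip(categories, values):
    print(f"{name:<8} {'#' * n} ({n})")
```

Once the two columns look right, the remaining work is mostly choosing plotting arguments, which is where matching the Google Sheets figure usually takes the most trial and error.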
[Presentation] Final Presentation
Construct a 60-second elevator pitch that summarizes your research project.
Build and edit your slides from Project 1 to include EDA findings and social context discussion questions. The final presentation should be a standalone slide deck that can be shared with others.
Expected Work Time:
- Monday 7/29: Explore additional research direction.
- Wednesday 7/31: Continue exploring additional research direction.
- Outside of class: Work on research slides.
- Monday 8/5: Final Presentations.
Timing and slide limit:
- Max time per presentation: 12 minutes. Please be sure to distribute speaking parts equally among project members.
- Max slide count: Your presentations should be no longer than 15 slides, plus extra reference slides as needed.
- Audience Q&A: After your presentation, audience members (in-person or over Zoom) will ask questions.
Final Presentation Components:
- Introduction
- Include as the first content slide your elevator pitch. This is a two-sentence description about your project and what you did.
- Everyone is expected to have memorized their elevator pitch by Monday 8/7 at the latest.
- One person should deliver the elevator pitch out loud during the Wednesday presentations, but everyone is expected to remember and be able to recite the group elevator pitch by end of class on Monday 8/7.
- Please include anything else in your introduction that would provide a powerful motivation for your research problem.
- Literature review: A summary of your Project 1.
- Exploratory Data Analysis: A summary of your Project 2.
- Include only Python figures that answer the questions originally posed to you.
- Independent research question and paired visualization.
- Clearly state what you wanted to explore, and what takeaways you drew from either your visualization or the process of creating this visualization.
- Conclusion, thoughts, and reflections: At most 1-2 slides.
- How was your experience exploring this dataset and context this summer?
- What did you like, and what did you learn?
- Reference slides (not covered, but included in the presentation)
- Required readings for students and instructors
- Anything you read that you think an instructor or researcher would find useful, but would possibly be too in-depth for a student