STAT 19000: Project 8 — Spring 2021
Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code. There are some pretty powerful features of functions that we have yet to explore that aren’t necessarily present in R. In this project we will continue to learn about and harness the power of functions to solve data-driven problems.
Context: We are taking a small hiatus from our pandas
and numpy
focused series to learn about and write our own functions in Python!
Scope: python, functions, pandas
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/yelp/data/parquet
Questions
Question 1
The company you work for is assigning you to the task of building out some functions for the new API they’ve built. Please load these two pandas
DataFrames:
users = pd.read_parquet("/class/datamine/data/yelp/data/parquet/users.parquet")
reviews = pd.read_parquet("/class/datamine/data/yelp/data/parquet/reviews.parquet")
You do not need these four DataFrames in this project.
photos = pd.read_parquet("/class/datamine/data/yelp/data/parquet/photos.parquet")
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
checkins = pd.read_parquet("/class/datamine/data/yelp/data/parquet/checkins.parquet")
tips = pd.read_parquet("/class/datamine/data/yelp/data/parquet/tips.parquet")
You would expect that friends may have a similar taste in restaurants or businesses. Write a function called get_friends_data
that accepts a user_id
as an argument, and returns a pandas
DataFrame with the information in the users
DataFrame for each friend of user_id
. Look at the solutions from the previous project, as well as this page. Add type hints for your function. You should have a type hint for our argument, user_id
, as well as a type hint for the returned data. In addition to type hints, make sure to document your function with a docstring.
Every function in the solutions for last week’s projects has a docstring. You can use this as a reference. |
You should get the same number of friends for the following code: |
print(get_friends_data("ntlvfPzc8eglqvk92iDIAw").shape) # (13,22)
print(get_friends_data("AY-laIws3S7YXNl_f_D6rQ").shape) # (1, 22)
print(get_friends_data("xvu8G900tezTzbbfqmTKvA").shape) # (193,22)
It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting). |
-
Python code used to solve the problem.
-
Output from running your code.
Question 2
Write a function called calculate_avg_business_stars
that accepts a business_id
and returns the average number of stars that business received in reviews. Like in question (1) make sure to add type hints and docstrings. In addition, add comments when and if they are necessary.
There is a really cool method that gives us the same "powers" that tapply
gives us in R. Use the groupby
method from pandas
to calculate the average stars for all businesses. Index the result to confirm that your calculate_avg_business_stars
function worked properly.
You should get the same average number of start value for the following code:
|
-
Python code used to solve the problem.
-
Output from running your code.
Question 3
Write a function called visualize_stars_over_time
that accepts a business_id
and returns a line plot that shows the average number of stars for each year the business has review data. Like in previous questions, make sure to add type hints and docstrings. In addition, add comments when (and if) necessary. You can test your function with some of these:
visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g')
-
Python code used to solve the problem.
-
Output from running your code.
Question 4
Modify question (3), and add an argument called granularity
that dictates whether the plot will show the average rating over years, or months. granularity
should accept one of two strings: "years", or "months". By default, if granularity
isn’t specified, it should be "years".
visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g', "months")
-
Python code used to solve the problem.
-
Output from running your code.
Question 5
Modify question (4) to accept multiple business_id’s, and create a line for each id. Each of the following should work:
visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "months")
visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", "months")
visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", granularity="years")
Use |
It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting). |
-
Python code used to solve the problem.
-
Output from running your code.
Question 6
After some thought, your boss decided that using the function from question (5) would get pretty tedious when there are a lot of businesses to include in the plot. You disagree. You think there is a way to pass a list of `business_id`s without modifying your function, but rather how you pass the arguments to the function. Demonstrate how to do this with the list provided:
our_businesses = ["RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg"]
# modify something below to make this work:
visualize_stars_over_time(our_businesses, granularity="years")
Google "python packing unpacking arguments". |
-
Python code used to solve the problem.
-
Output from running your code.