Eric Leung: Code and Data Learnings

Everything I googled in a week as a professional data scientist

I ran across this blog post from a software engineer who decided to document what they googled in a week of work.

Their goal was to dispel the idea that “if you have to google stuff you’re not a software engineer.” I wanted to do something similar, but from the perspective of a data scientist.

Disclaimer: “data science” is a broad field, and my account won’t be representative of all data workers out there, but I thought it would be a useful data point for understanding what can go on in our day-to-day. This week was apparently full of package development with {pkgdown}, plotting results with {ggplot2} and making small aesthetic changes, and making a table with {gt}.

Monday

pkgdown Failed to parse example for topic - Turned out some code in a function example was invalid

git ammend specific commit message - Wanted to be more clear with a commit message

pkgdown Topics missing from index - A function was missing from my references page so I just put it back in the _pkgdown.yml file and all was good

roxygen2 documentation - Needed an overall page on roxygen2 syntax
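For reference, a minimal sketch of a roxygen2 block whose example parses and runs. The function to_thousands() is hypothetical and only here for illustration; the tags (@param, @return, @examples, @export) are standard roxygen2 tags. Code under @examples has to be valid R, which is the kind of thing the "Failed to parse example" message above points at.

```r
#' Convert counts to thousands
#'
#' @param x A numeric vector of counts.
#' @return A numeric vector: `x` divided by 1000.
#' @examples
#' to_thousands(c(1500, 32000))
#' @export
to_thousands <- function(x) {
  # Fail early on non-numeric input
  stopifnot(is.numeric(x))
  x / 1000
}
```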

Tuesday

gt add table header - Found the official website and just took a look at the introduction page

gt change header color - Wanted to change the color, found tab_options() and found the parameter column_labels.background.color to change the color

forcats relevel factors - To have more control over how a plot is drawn, I needed more control over my factors

forcats relevel by other variable - Self-explanatory, this page was useful
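A small sketch of both releveling searches, using made-up data: fct_relevel() sets the level order by hand, and fct_reorder() orders levels by another variable (by its median, by default).

```r
library(forcats)

# Factor levels are alphabetical by default
size <- factor(c("medium", "small", "large", "small"))

# Manually put the levels in a meaningful order
size_manual <- fct_relevel(size, "small", "medium", "large")

# Reorder one variable's levels by the values of another
df <- data.frame(group = c("a", "b", "c"), value = c(3, 1, 2))
df$group <- fct_reorder(df$group, df$value)
levels(df$group)
```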

r get just file name of file path - Stack Overflow to the rescue with basename() and also dirname() here
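In base R this looks roughly like the following. The path is a made-up example, and tools::file_path_sans_ext() is an extra that was not part of the original search.

```r
# Hypothetical path, used only for illustration
path <- "data/raw/results.csv"

file_name <- basename(path)  # file name without the directories
dir_name <- dirname(path)    # directories without the file name

# Optional extra: drop the extension as well
stem <- tools::file_path_sans_ext(file_name)
```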

gt left align columns - Eventually got me to find the cols_align() function
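Putting the three {gt} searches from this day together, a minimal sketch might look like this. The data frame, title text, and color are made up; tab_header(), tab_options(), and cols_align() are the functions mentioned above. The native pipe assumes R 4.1 or later.

```r
library(gt)

# Hypothetical summary table, only for illustration
sales <- data.frame(
  region = c("North", "South", "West"),
  units = c(1200, 850, 1430)
)

tbl <- gt(sales) |>
  # Title and subtitle above the column labels
  tab_header(title = "Units sold by region", subtitle = "Made-up numbers") |>
  # Background color of the column label row
  tab_options(column_labels.background.color = "steelblue") |>
  # Left-align one column
  cols_align(align = "left", columns = region)
```

Printing tbl in an interactive session renders the table.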

ggplot2 change order of legend - Need to change order of the factor levels with help here

ggplot2 change order of stacked bar - Again, factor reorder

r scales change axis to thousands - This question was good enough because it led me to a comment about unit_format(), which brings me to the next search…

r scales unit_format - Which brings me to the official documentation page and what I needed was the unit and scale parameters
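A short sketch of those two parameters, with made-up numbers: scale multiplies the values before formatting, and unit is appended as a suffix. Newer releases of {scales} point toward label_number() for the same job, so treat this as one option rather than the recommended one.

```r
library(scales)

# Build a labelling function: divide by 1000, then append "K"
to_thousands_label <- unit_format(unit = "K", scale = 1e-3)

labels <- to_thousands_label(c(1000, 25000))
labels
```

The resulting function can be passed to the labels argument of a ggplot2 scale, for example scale_y_continuous(labels = to_thousands_label).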

r ggplot2 add numbers to bar plot - Needed geom_text() and passing in a label aesthetic

ggplot2 add two labels to bar plot - I ended up back at my previous search, but figured because of the power of ggplot2, I can simply have two geom_text() calls with two different aesthetic mappings, one to each kind of label I wanted and adjust them accordingly to fix the plot
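A minimal sketch of the two-geom_text() idea, assuming a made-up data frame with a count column and a pre-formatted percentage column. The vjust values are arbitrary starting points for nudging each label.

```r
library(ggplot2)

# Hypothetical data: one count label and one percentage label per bar
df <- data.frame(
  team = c("A", "B", "C"),
  count = c(120, 80, 150),
  pct = c("34%", "23%", "43%")
)

p <- ggplot(df, aes(x = team, y = count)) +
  geom_col() +
  # First label: the count, just above the bar
  geom_text(aes(label = count), vjust = -0.5) +
  # Second label: the percentage, just inside the top of the bar
  geom_text(aes(label = pct), vjust = 1.5, colour = "white")
```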

Wednesday

ggplot2 stacked bar - This site helped

ggplot2 legend on top - Possible with + theme(legend.position = "top")

ggplot2 empty space - I wanted to make an empty space between certain bars in my bar plot, but I figured it might be easier to add an empty factor level instead. So…

forcats add factor - Just the documentation page
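One way the empty-space idea could be sketched: add an unused level, move it to where the gap should be, and tell the x scale not to drop it. The data and the " " placeholder level are assumptions for illustration, not necessarily what I ended up with.

```r
library(ggplot2)
library(forcats)

df <- data.frame(item = factor(c("a", "b", "c")), value = c(4, 7, 5))

# Add a blank placeholder level (hypothetical name: a single space)
df$item <- fct_expand(df$item, " ")
# Move the placeholder so it sits after the first level
df$item <- fct_relevel(df$item, " ", after = 1)

p <- ggplot(df, aes(x = item, y = value)) +
  geom_col() +
  # Keep the unused level on the axis so it leaves a gap
  scale_x_discrete(drop = FALSE)
```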

ggplot2 format x-axis labels - A solid general resource

ggplot2 change ordering of legend - I found this site, but the answer seems outdated because it doesn’t work

ggplot2 change labels with one function - I kind of didn’t search for this one exactly, rather, I used my Twitter to find the answer that uses the labs() function

ggplot2 color code geom_text - You can simply pass in a color aesthetic and manually color it

ggplot2 change number of rows in legend - I can use guides(colour = guide_legend(nrow = 1))
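The two legend tweaks from this day (legend on top, laid out in a single row) combine like this in a minimal sketch with made-up data. The guide has to match the aesthetic that creates the legend, which is colour here; a fill legend would need guides(fill = ...) instead.

```r
library(ggplot2)

# Hypothetical data with three groups
df <- data.frame(
  x = 1:6,
  y = c(2, 4, 3, 5, 1, 6),
  group = rep(c("g1", "g2", "g3"), times = 2)
)

p <- ggplot(df, aes(x = x, y = y, colour = group)) +
  geom_point() +
  # Move the legend above the plot
  theme(legend.position = "top") +
  # Lay the legend keys out in one row
  guides(colour = guide_legend(nrow = 1))
```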

gghighlight - Didn’t end up using it, but still a useful package to know about

ggplot2 format y-axis - The {scales} package is absolutely wonderful, but I keep on forgetting which function to use

ggplot2 geom_col side by side bars - I always forget the position = "dodge"

ggplot2 match geom text with dodged bars - With position_dodge() within geom_text()
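A minimal sketch of side-by-side bars with matching labels, on made-up data. The width of 0.9 in position_dodge() is an assumption chosen to match the default bar width, so that each label sits over its own bar.

```r
library(ggplot2)

# Hypothetical data: two products per quarter
df <- data.frame(
  quarter = rep(c("Q1", "Q2"), each = 2),
  product = rep(c("x", "y"), times = 2),
  sales = c(10, 14, 12, 9)
)

p <- ggplot(df, aes(x = quarter, y = sales, fill = product)) +
  # Bars next to each other instead of stacked
  geom_col(position = "dodge") +
  # Dodge the labels by the same width as the bars
  geom_text(
    aes(label = sales),
    position = position_dodge(width = 0.9),
    vjust = -0.5
  )
```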

Thursday

ggplot2 bar width - Looks like a simple width = X in your geom_bar()

ggplot2 scales label_number - Good documentation is the best
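As a note to my future self, a minimal sketch of label_number() on a y-axis, with made-up revenue figures. The scale and suffix arguments play the same role as scale and unit in unit_format().

```r
library(ggplot2)
library(scales)

# Hypothetical data
df <- data.frame(team = c("A", "B", "C"), revenue = c(12000, 45000, 30500))

# Labelling function: divide by 1000 and append "K"
in_thousands <- label_number(scale = 1e-3, suffix = "K")

p <- ggplot(df, aes(x = team, y = revenue)) +
  geom_col() +
  scale_y_continuous(labels = in_thousands)
```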

ggplot2 change text size - This is such a common task that I’d imagine it would be easier. I was in a time crunch, so maybe there’s a better way for another time

?geom_vline - I remembered this is to generate a vertical line, but I have forgotten the parameters, so I ran this one right in RStudio

ggplot2 add textbox - Ah with the annotate() function
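The last two searches fit together in a minimal sketch like the one below. The data, the cutoff position, and the label text are made up. annotate("label", ...) draws text with a box behind it, while annotate("text", ...) draws plain text.

```r
library(ggplot2)

# Hypothetical data
df <- data.frame(x = 1:10, y = (1:10)^2)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  # Vertical reference line at x = 5
  geom_vline(xintercept = 5, linetype = "dashed") +
  # Boxed text placed at fixed data coordinates
  annotate("label", x = 5, y = 80, label = "Hypothetical cutoff")
```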

Friday

ggplot2 better spacing of geom_text stacked bar plot - This brought me to learn about the lineheight parameter, but ultimately, I wanted the text not to overlap, and after looking at the documentation, geom_text has a built-in parameter check_overlap for just this.

ggrepl for stacked bar plot - …But after using the solution above, I realized that check_overlap actually removes text that overlaps, which I didn’t want. I then found this post using ggrepel. I knew about this package but wasn’t sure if it was useful for stacked bar plots. The example here kind of works, except it changes the location of text I don’t want moving, like in the larger bars. I abandoned this and simply removed “bars” with zero values.
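A rough sketch of that last step, under the assumption that the overlapping labels came from zero-height segments: filter out the zero rows, then center each label within its segment using position_stack(vjust = 0.5). The data frame and column names are made up.

```r
library(ggplot2)

# Hypothetical data with some zero-valued segments
df <- data.frame(
  group = rep(c("A", "B"), each = 3),
  part = rep(c("x", "y", "z"), times = 2),
  n = c(5, 0, 3, 2, 4, 0)
)

# Drop the zero-valued segments so they get neither a bar nor a label
df_nonzero <- df[df$n > 0, ]

p <- ggplot(df_nonzero, aes(x = group, y = n, fill = part)) +
  geom_col() +
  # Center each label vertically within its stacked segment
  geom_text(aes(label = n), position = position_stack(vjust = 0.5))
```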

ggplot2 show all factors in legend - Added a drop = FALSE in there, found here

ggplot2 stacked bar plot position dodge with change in x - I was frustrated with where the text annotations for my columns were. This solution here didn’t exactly solve it outright for me, but it did show me what’s possible for moving the column labels around. What I was looking for was simply the x and y aesthetics, which let me fine-tune where my text labels are. In hindsight, this makes sense.

Guess I was wanting to be a bit more verbose on my thoughts on these challenges. At this point, I was doing some very custom changes to my plots.

Reflection

My conclusion is similar to that of the software engineering post I linked at the beginning: data scientists still need to search and look things up. Regularly.

I’ve never really thought too much about what I’ve had to search for during my job. This turned out to be a really fun exercise in mindfulness. Ideally, I would keep track of these kinds of searches and then find ways to write helper functions to do these things for me. Alas, a low priority for now. But a possible side project idea.

Altogether, thank you Stack Overflow solutions, the whole ggplot2 system, and the countless volunteers out there writing out their solutions on the web for making my work possible.