Data Sci Adventures - part 2, something fishy

Published on 2021-02-23 21:28

Doing courses and making notes is good. Going on a solo adventure is better. In this post I'll take my notes from Python and R courses and test what I have learned so far by analyzing a fish-related data set.

My goal was to explore the data set, which for starters means looking at the minimum, maximum, mean, median and difference (maximum minus minimum) values. When I get more practice there will be deeper analysis in both Python and R. For now my goal is to get comfortable using both of them.

Python needs pandas imported before anything can begin, while R has everything I needed built in. Loading the data is quite similar in both languages:

Python:

import pandas as pd

fish = pd.read_csv('Fish.csv')

R:

data <- read.csv("Fish.csv", sep=',', fileEncoding="UTF-8-BOM")

# fileEncoding is necessary; otherwise the first column name will start with 'ï..'
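For comparison, the same byte-order-mark (BOM) issue can be reproduced on the Python side. The sketch below builds a small in-memory CSV instead of reading Fish.csv; the rows and the Weight column are made up for illustration. It reads the data with encoding="utf-8-sig", the Python codec that drops a leading BOM.

```python
import io

import pandas as pd

# Made-up CSV content prefixed with a UTF-8 byte order mark (BOM).
raw = "\ufeffSpecies,Weight\nBream,242.0\nRoach,40.0\n".encode("utf-8")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present,
# so the first column name comes out clean.
fish_sample = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

print(list(fish_sample.columns))
```

If a real file shows a strange prefix on its first column name, passing the same encoding argument to pd.read_csv is one thing worth trying.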

Python

From here the two languages diverge a bit, so let's cover Python first. The first step was to get the unique names of the fish species:

fish["Species"].unique()

This allowed me to iterate through the species, because a for loop lets me select one group of fish at a time.

for specie in fish["Species"].unique():
    group = fish[fish["Species"] == specie]
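To see what the loop produces, here is a minimal sketch on a tiny hand-made data frame. The rows and the Weight column are assumptions for illustration, not values from Fish.csv, and the groups dictionary is a hypothetical addition to keep each subset around.

```python
import pandas as pd

# Made-up rows for illustration; "Weight" is an assumed column name.
fish_sample = pd.DataFrame({
    "Species": ["Bream", "Bream", "Roach", "Pike"],
    "Weight": [242.0, 290.0, 40.0, 200.0],
})

groups = {}
for specie in fish_sample["Species"].unique():
    # The boolean mask keeps only the rows of the current species.
    groups[specie] = fish_sample[fish_sample["Species"] == specie]

for specie, group in groups.items():
    print(specie, len(group))
```

Each value in groups is a regular data frame, so all the usual data frame methods work on it.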

With these smaller data sets I am able to use the data frame's built-in functions on them. Their names usually match the English terms.

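# Mean (numeric_only=True skips the string Species column; pandas 2.0+ raises a TypeError without it)

group.mean(axis=0, numeric_only=True)

# Median

group.median(axis=0, numeric_only=True)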

# Minimum

group.min(axis=0)

# Maximum

group.max(axis=0)
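As a side note, pandas can also compute these statistics for every species in one call with groupby and agg. This is a different technique from the loop used in this post, shown only as a sketch; the sample rows and the Weight column are made up.

```python
import pandas as pd

# Made-up rows for illustration; "Weight" is an assumed column name.
fish_sample = pd.DataFrame({
    "Species": ["Bream", "Bream", "Roach", "Roach"],
    "Weight": [240.0, 300.0, 40.0, 60.0],
})

# One row per species, one column per statistic.
summary = fish_sample.groupby("Species")["Weight"].agg(
    ["mean", "median", "min", "max"]
)

print(summary)
```

The loop is easier to follow while learning; the groupby version is handy once the per-species results need to end up in a single table.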

I couldn't find a function that calculates the difference between the maximum and the minimum, so the last calculation is built from the previous two. Because the first column holds strings, I had to pass the numeric_only parameter to filter it out.

# Difference

group.max(axis=0, numeric_only=True) - group.min(axis=0, numeric_only=True)
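Here is a small sketch of that calculation on a single made-up group. The Weight and Length columns and their values are assumptions for illustration.

```python
import pandas as pd

# One made-up species group; column names and values are assumptions.
group = pd.DataFrame({
    "Species": ["Bream", "Bream", "Bream"],
    "Weight": [240.0, 300.0, 290.0],
    "Length": [23.0, 26.5, 25.0],
})

# numeric_only=True leaves the string column out of both results,
# so the subtraction only ever sees numbers.
value_range = group.max(axis=0, numeric_only=True) - group.min(axis=0, numeric_only=True)

print(value_range)
```

Without numeric_only, both results would still carry the Species string, and subtracting one string from another is not defined in Python.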

R

The R language is still a bit harder for me because I haven't used it in over six years, since I took a statistics class in graduate school where we had to use it. My R solution is not a 1:1 copy of the Python one because the two languages work a bit differently.

R has a built-in function, aggregate, which did a lot of the heavy lifting. One issue I ran into was the column with the fish species names, which threw a bunch of errors because it is not numeric. I fixed that by creating helper methods; in the future I plan to explore whether there's a better solution.

# helper method template

myFunction <- function (i) {
  if (!is.numeric(i)) { return(NA) }
  return(realFunction(i))
}

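In the aggregate function I had to specify three parameters: the data frame itself, the grouping (by = list(data$Species)) and the function to apply to each column of each group (FUN).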
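For readers more at home in Python, the helper pattern can be sketched roughly as follows. This is an analogue written for illustration, not a translation of how R's aggregate works internally. The numeric_or_nan wrapper is a hypothetical name, and the sample rows and Weight column are made up.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def numeric_or_nan(func):
    """Hypothetical helper: wrap func so non-numeric columns yield NaN."""
    def wrapper(column):
        if not is_numeric_dtype(column):
            return np.nan
        return func(column)
    return wrapper


# Made-up rows for illustration; "Weight" is an assumed column name.
fish_sample = pd.DataFrame({
    "Species": ["Bream", "Bream", "Roach", "Roach"],
    "Weight": [240.0, 300.0, 40.0, 60.0],
})

safe_min = numeric_or_nan(min)

# Apply the wrapped function to every column of every species group,
# then turn the per-species results into one table (species as rows).
per_species = {
    specie: group.apply(safe_min)
    for specie, group in fish_sample.groupby("Species")
}
table = pd.DataFrame(per_species).T

print(table)
```

As with the R helpers, the Species column ends up as a missing value while the numeric columns get the real result.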

Next

In the next post I'll explore the same data set from a visual standpoint with D3.js.

Full code

Python:

import pandas as pd

fish = pd.read_csv('Fish.csv')

for specie in fish["Species"].unique():
    # Rows belonging to the current species
    group = fish[fish["Species"] == specie]
    print("Mean")
    # numeric_only=True skips the string Species column
    print(group.mean(axis=0, numeric_only=True))
    print("Median")
    print(group.median(axis=0, numeric_only=True))
    print("Min")
    print(group.min(axis=0))
    print("Max")
    print(group.max(axis=0))
    print("diff")
    print(group.max(axis=0, numeric_only=True) - group.min(axis=0, numeric_only=True))

R:

data <- read.csv("Fish.csv", sep=',', fileEncoding="UTF-8-BOM")
# fileEncoding is necessary; otherwise the first column name will start with 'ï..'

print("min")
myMin <- function (i) {
  if (!is.numeric(i)) { return(NA) }
  return(min(i))
}
aggregate(data, by = list(data$Species), FUN = myMin)

print("max")
myMax <- function (i) {
  if (!is.numeric(i)) { return(NA) }
  return(max(i))
}
aggregate(data, by = list(data$Species), FUN = myMax)

print("mean")
myMean <- function (i) {
  if (!is.numeric(i)) { return(NA) }
  return(mean(i))
}
aggregate(data, by = list(data$Species), FUN = myMean)

print("median")
myMedian <- function (i) {
  if (!is.numeric(i)) { return(NA) }
  return(median(i))
}
aggregate(data, by = list(data$Species), FUN = myMedian)

print("difference")
myDifference <- function(i) {
  if (!is.numeric(i)) { return(NA) }
  return(max(i) - min(i))
}
aggregate(data, by = list(data$Species), FUN = myDifference)