# Data Sci Adventures - part 2, something fishy

Published on 2021-02-23 21:28

Doing courses and making notes is good. Going on a solo adventure is better. In this post I'll take my notes from Python and R courses and test what I have learned so far by analyzing a fish-related data set.

My goal was to explore the data set, so for starters that means looking at minimum, maximum, mean, median and difference (maximum minus minimum) values. When I get more practice there will be deeper analysis in both Python and R. For now my goal is to get comfortable using both of them.

Python needs to have pandas imported before anything can begin. R already has everything I needed built in. Loading the file is quite similar in both languages:

Python:

```
import pandas as pd
fish = pd.read_csv('Fish.csv')
```

R:

```
data <- read.csv("Fish.csv", sep=',', fileEncoding="UTF-8-BOM")
# fileEncoding is necessary, otherwise the first column name will start with 'ï..'
```
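The stray prefix on the first column name is most likely a byte-order mark (BOM), three invisible bytes that some editors put at the start of a UTF-8 file. Here is a small Python sketch of the same effect, using only the standard library. The sample bytes and the first_header helper are made up for illustration and are not taken from Fish.csv.

```python
import csv
import io

# Hypothetical two-line CSV saved with a UTF-8 byte-order mark (BOM) in front.
raw = b"\xef\xbb\xbfSpecies,Weight\nBream,242\n"


def first_header(encoding):
    """Decode the bytes with the given encoding and return the first column name."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return next(csv.reader(text))[0]


plain = first_header("utf-8")        # the BOM survives as an invisible character
cleaned = first_header("utf-8-sig")  # the BOM is stripped while decoding

print(repr(plain))
print(repr(cleaned))
```

The "utf-8-sig" codec plays the same role here as fileEncoding="UTF-8-BOM" does in the R call.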

## Python

After this the two languages diverge a bit, so let's cover Python first. The first step was to get the unique names of the fish species:

```
fish["Species"].unique()
```

This allowed me to iterate through the species one at a time, because I could select a specific group with the help of a for loop.

```
for specie in fish["Species"].unique():
    # keep only the rows that belong to the current species
    group = fish[fish["Species"] == specie]
```
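pandas can also do the splitting itself. As a minimal sketch, assuming a small made-up sample in place of Fish.csv, groupby hands back each species name together with its rows, so the manual filter is not needed. The group_sizes dictionary is only there to have something to look at.

```python
import pandas as pd

# Made-up sample rows; the real Fish.csv has more columns and rows.
fish = pd.DataFrame({
    "Species": ["Bream", "Bream", "Perch", "Perch"],
    "Weight": [242.0, 290.0, 5.9, 32.0],
})

group_sizes = {}
# groupby yields (key, sub-frame) pairs, one per distinct Species value.
for species, group in fish.groupby("Species"):
    group_sizes[species] = len(group)
    print(species, len(group))
```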

With these smaller data sets I am able to use the data frame's builtin functions on them. Their names usually match the plain English name of the statistic.

```
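# Mean (numeric_only=True skips the string Species column;
# pandas 2.0+ raises a TypeError without it)
group.mean(axis=0, numeric_only=True)
# Median
group.median(axis=0, numeric_only=True)
# Minimum
group.min(axis=0)
# Maximum
group.max(axis=0)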
```

I couldn't find a data frame function that would calculate the difference between maximum and minimum, so the last calculation is constructed from the previous two. Because the first column contains strings, I also had to use the numeric_only parameter to filter it out; strings cannot be subtracted from each other.

```
# Difference
group.max(axis=0, numeric_only=True) - group.min(axis=0, numeric_only=True)
```
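pandas does not seem to have a dedicated method for this, but NumPy ships a function called ptp ("peak to peak") that returns maximum minus minimum. Here is a minimal sketch, assuming a made-up group of three fish and selecting the numeric columns with select_dtypes first:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for one species group.
group = pd.DataFrame({
    "Species": ["Bream", "Bream", "Bream"],
    "Weight": [242.0, 290.0, 340.0],
    "Height": [11.52, 12.48, 12.38],
})

# Drop the string column, then take max - min down each remaining column.
numeric = group.select_dtypes(include="number")
spread = pd.Series(np.ptp(numeric.to_numpy(), axis=0), index=numeric.columns)
print(spread)
```

The result should match the max minus min subtraction; which one reads better is a matter of taste.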

## R

The R language is still a bit harder for me because I haven't used it in over six years, since I took a Statistics class in graduate school and we had to use it. My R solution is not a 1:1 copy of the Python one because the two languages work a bit differently.

R has a builtin function, aggregate, which did a lot of the heavy lifting. One issue I ran into was the column with the fish species names, which threw a bunch of errors because it is not numeric. I fixed that by creating helper methods; in the future I plan to explore whether there's a better solution.

```
# helper method template
myFunction <- function (i) {
  # skip non-numeric columns such as Species
  if (!is.numeric(i)) {
    return(NA)
  }
  # realFunction is a placeholder for min, max, mean, ...
  return(realFunction(i))
}
```
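The helpers built from this template differ only in the function they call, so the guard could be written once and reused. Here is a sketch of that idea in Python rather than R, to stay with the other language of this post. The numeric_guard name and the two-row frame are made up for illustration.

```python
import math

import pandas as pd
from pandas.api.types import is_numeric_dtype


def numeric_guard(func):
    """Wrap func so it returns NaN for non-numeric columns (hypothetical helper)."""
    def guarded(column):
        if not is_numeric_dtype(column):
            return math.nan
        return func(column)
    return guarded


# One guard, many statistics: no need to repeat the if-block for each of them.
guarded_min = numeric_guard(min)
guarded_max = numeric_guard(max)

frame = pd.DataFrame({"Species": ["Bream", "Perch"], "Weight": [242.0, 5.9]})
result = {name: guarded_min(frame[name]) for name in frame.columns}
print(result)
```

The same pattern, a function that takes a function and returns a guarded version of it, should carry over to R, although I have not tried it there yet.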

In the aggregate function I had to specify three parameters:

- data, which contains all of my fish data
- by, where I put the first column wrapped in a list: list(data$Species)
  - data$Species and data['Species'] select the same column (the first as a vector, the second as a one-column data frame)
- FUN, where I set my actual function, without brackets
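For comparison, the closest pandas counterpart I know of is groupby followed by agg: the grouping column plays the role of by, and the functions handed to agg play the role of FUN. This is a minimal sketch on made-up rows; the difference function is my own hypothetical helper, passed by name without brackets just like in R.

```python
import pandas as pd

# Made-up sample rows; the real Fish.csv has more columns and rows.
fish = pd.DataFrame({
    "Species": ["Bream", "Bream", "Perch", "Perch"],
    "Weight": [242.0, 290.0, 5.9, 32.0],
    "Height": [11.52, 12.48, 2.112, 3.528],
})


def difference(column):
    """Max minus min of one column within one group."""
    return column.max() - column.min()


# by -> groupby("Species"); FUN -> the functions handed to agg (no brackets).
summary = fish.groupby("Species").agg(["min", "max", "mean", "median", difference])
print(summary)
```

Because Species becomes the index of the result, no guard against the string column is needed here.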

## Next

In the next post I'll be exploring the same data set, but from a visual standpoint with D3.js.

## Full code

```
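import pandas as pd

fish = pd.read_csv('Fish.csv')

for specie in fish["Species"].unique():
    # keep only the rows that belong to the current species
    group = fish[fish["Species"] == specie]
    # numeric_only=True skips the string Species column
    # (pandas 2.0+ raises a TypeError for mean/median without it)
    print("Mean")
    print(group.mean(axis=0, numeric_only=True))
    print("Median")
    print(group.median(axis=0, numeric_only=True))
    print("Min")
    print(group.min(axis=0))
    print("Max")
    print(group.max(axis=0))
    print("diff")
    print(group.max(axis=0, numeric_only=True) - group.min(axis=0, numeric_only=True))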
```

```
data <- read.csv("Fish.csv", sep=',', fileEncoding="UTF-8-BOM")
# fileEncoding is necessary, otherwise the first column name will start with 'ï..'
print("min")
myMin <- function (i) {
  if (!is.numeric(i)) {
    return(NA)
  }
  return(min(i))
}
aggregate(data, by = list(data$Species), FUN = myMin)
print("max")
myMax <- function (i) {
  if (!is.numeric(i)) {
    return(NA)
  }
  return(max(i))
}
aggregate(data, by = list(data$Species), FUN = myMax)
print("mean")
myMean <- function (i) {
  if (!is.numeric(i)) {
    return(NA)
  }
  return(mean(i))
}
aggregate(data, by = list(data$Species), FUN = myMean)
print("median")
myMedian <- function (i) {
  if (!is.numeric(i)) {
    return(NA)
  }
  return(median(i))
}
aggregate(data, by = list(data$Species), FUN = myMedian)
print("difference")
myDifference <- function(i) {
  if (!is.numeric(i)) {
    return(NA)
  }
  return(max(i) - min(i))
}
aggregate(data, by = list(data$Species), FUN = myDifference)
```