2  Using Python

Published

2022-11-07

For this chapter, I will use Python via R’s reticulate package. Code blocks that do not use Python are marked with a comment at the top, like this one, which uses R:

# r
library("reticulate")
use_virtualenv("renv/python/virtualenvs/renv-python-3.10")

from pins import board_folder, board_urls
from pyarrow import feather
import pandas as pd

2.1 Folder board

board_here = board_folder("pins")
board_here.pin_list()
['pins.txt', 'time-csv', 'time-json', 'penguins-csv', 'penguins-json', '_pins.yaml', 'time-arrow', 'penguins-arrow']

Should this be listing a text file (pins.txt)? The manifest, _pins.yaml, shows up in the list as well.
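If the goal is to see only the pins, one workaround is to list the subdirectories of the board folder directly. This is a rough sketch, not how the pins package does it: it assumes that each pin lives in its own subdirectory and that pins.txt and _pins.yaml are plain files. The helper list_pin_dirs and the temporary folder layout are hypothetical, built here only so the example runs on its own.

```python
from pathlib import Path
import tempfile


def list_pin_dirs(board_path):
    """Return the sorted names of subdirectories, skipping plain files.

    Hypothetical helper; assumes one subdirectory per pin.
    """
    return sorted(p.name for p in Path(board_path).iterdir() if p.is_dir())


# Build a throwaway folder that mimics part of the board layout (assumed)
with tempfile.TemporaryDirectory() as tmp:
    board = Path(tmp)
    (board / "penguins-csv").mkdir()
    (board / "time-csv").mkdir()
    (board / "pins.txt").write_text("")
    (board / "_pins.yaml").write_text("")
    pin_names = list_pin_dirs(board)

print(pin_names)
```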

2.1.1 Read

penguins_csv = board_here.pin_read("penguins-csv")
penguins_csv
       species     island  bill_length_mm  ...  body_mass_g     sex  year
0       Adelie  Torgersen            39.1  ...       3750.0    male  2007
1       Adelie  Torgersen            39.5  ...       3800.0  female  2007
2       Adelie  Torgersen            40.3  ...       3250.0  female  2007
3       Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4       Adelie  Torgersen            36.7  ...       3450.0  female  2007
..         ...        ...             ...  ...          ...     ...   ...
339  Chinstrap      Dream            55.8  ...       4000.0    male  2009
340  Chinstrap      Dream            43.5  ...       3400.0  female  2009
341  Chinstrap      Dream            49.6  ...       3775.0    male  2009
342  Chinstrap      Dream            50.8  ...       4100.0    male  2009
343  Chinstrap      Dream            50.2  ...       3775.0  female  2009

[344 rows x 8 columns]

2.1.2 Timeseries

time_csv = board_here.pin_read("time-csv")
time_csv
          date             datetime  value
0   2010-01-01  2020-09-01 00:00:00      0
1   2010-01-02  2020-09-01 00:00:01      1
2   2010-01-03  2020-09-01 00:00:02      2
3   2010-01-04  2020-09-01 00:00:03      3
4   2010-01-05  2020-09-01 00:00:04      4
5   2010-01-06  2020-09-01 00:00:05      5
6   2010-01-07  2020-09-01 00:00:06      6
7   2010-01-08  2020-09-01 00:00:07      7
8   2010-01-09  2020-09-01 00:00:08      8
9   2010-01-10  2020-09-01 00:00:09      9
10  2010-01-11  2020-09-01 00:00:10     10
time_csv.dtypes
date        object
datetime    object
value        int64
dtype: object

The CSV reader does not parse dates or datetimes; both columns come back as strings (object dtype). The timezone is missing as well.
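One possible workaround is to convert the columns after reading. The sketch below uses a small stand-in frame (time_csv_demo) rather than the real pin, and the America/Denver timezone is an assumption: the CSV does not record it, so it has to be supplied from outside knowledge of the data.

```python
import pandas as pd

# Stand-in for the first rows of time_csv: dates and datetimes arrive as strings
time_csv_demo = pd.DataFrame(
    {
        "date": ["2010-01-01", "2010-01-02"],
        "datetime": ["2020-09-01 00:00:00", "2020-09-01 00:00:01"],
        "value": [0, 1],
    }
)

parsed = time_csv_demo.assign(
    # parse the strings, then keep only the calendar date
    date=pd.to_datetime(time_csv_demo["date"]).dt.date,
    # parse the strings, then attach the (assumed) timezone
    datetime=pd.to_datetime(time_csv_demo["datetime"]).dt.tz_localize(
        "America/Denver"
    ),
)

print(parsed.dtypes)
```

Note that tz_localize() attaches a timezone to naive timestamps without shifting the clock time, which fits here only if the strings were written as local times.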

There’s no driver yet for JSON, so let’s try Arrow:

# this seems hacky, but I'm sure it will get sorted out
time_arrow = pd.read_feather(board_here.pin_read("time-arrow")[0])
time_arrow
          date                  datetime  value
0   2010-01-01 2020-09-01 00:00:00-06:00      0
1   2010-01-02 2020-09-01 00:00:01-06:00      1
2   2010-01-03 2020-09-01 00:00:02-06:00      2
3   2010-01-04 2020-09-01 00:00:03-06:00      3
4   2010-01-05 2020-09-01 00:00:04-06:00      4
5   2010-01-06 2020-09-01 00:00:05-06:00      5
6   2010-01-07 2020-09-01 00:00:06-06:00      6
7   2010-01-08 2020-09-01 00:00:07-06:00      7
8   2010-01-09 2020-09-01 00:00:08-06:00      8
9   2010-01-10 2020-09-01 00:00:09-06:00      9
10  2010-01-11 2020-09-01 00:00:10-06:00     10
time_arrow.dtypes
date                                object
datetime    datetime64[ns, America/Denver]
value                                int32
dtype: object
time_arrow['date'].values
array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2),
       datetime.date(2010, 1, 3), datetime.date(2010, 1, 4),
       datetime.date(2010, 1, 5), datetime.date(2010, 1, 6),
       datetime.date(2010, 1, 7), datetime.date(2010, 1, 8),
       datetime.date(2010, 1, 9), datetime.date(2010, 1, 10),
       datetime.date(2010, 1, 11)], dtype=object)

2.2 Remote board

I don’t think I will create a Python package; maybe I can just put together a quick script to convert a pins manifest into a dictionary.

import requests
import yaml

# read file, parse into manifest
url_root = "https://ijlyttle.github.io/pins-three-ways/pins"
req = requests.get(f"{url_root}/_pins.yaml")
manifest = yaml.safe_load(req.text)
# use the most-recent version
pin_paths = {}
for key, value in manifest.items():
    pin_paths[key] = max(value)
 
pin_paths   
{'penguins-arrow': 'penguins-arrow/20220811T170224Z-ef034/', 'penguins-csv': 'penguins-csv/20220811T170157Z-809e9/', 'penguins-json': 'penguins-json/20220811T170152Z-fa33e/', 'time-arrow': 'time-arrow/20220811T224202Z-b1900/', 'time-csv': 'time-csv/20220811T224202Z-06d53/', 'time-json': 'time-json/20220811T224202Z-70d59/'}
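Taking max() of the version paths works because each version name starts with a timestamp of the form YYYYMMDDTHHMMSSZ, so sorting the strings lexicographically also sorts them chronologically. Here is a minimal sketch with a made-up manifest; the second version path is invented for illustration and does not exist on the board.

```python
# Hypothetical manifest: pin name -> list of version paths.
# The first path appears in the output above; the second is invented.
manifest_demo = {
    "penguins-csv": [
        "penguins-csv/20220811T170157Z-809e9/",
        "penguins-csv/20210101T000000Z-abcde/",
    ],
}

# max() compares strings character by character, so the latest timestamp wins
latest = {name: max(versions) for name, versions in manifest_demo.items()}

print(latest)
```

This relies on all versions of a pin sharing the same path prefix and timestamp format; if that ever changed, the timestamps would need to be parsed explicitly.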

The Python version of pins has a board constructor, board_urls(), which takes the root URL and the dictionary of pin paths:

board_remote = board_urls(url_root, pin_paths)
board_remote.pin_list()
['penguins-arrow', 'penguins-csv', 'penguins-json', 'time-arrow', 'time-csv', 'time-json']

2.2.1 Read

We can read the CSV pin:

board_remote.pin_read("penguins-csv")
       species     island  bill_length_mm  ...  body_mass_g     sex  year
0       Adelie  Torgersen            39.1  ...       3750.0    male  2007
1       Adelie  Torgersen            39.5  ...       3800.0  female  2007
2       Adelie  Torgersen            40.3  ...       3250.0  female  2007
3       Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4       Adelie  Torgersen            36.7  ...       3450.0  female  2007
..         ...        ...             ...  ...          ...     ...   ...
339  Chinstrap      Dream            55.8  ...       4000.0    male  2009
340  Chinstrap      Dream            43.5  ...       3400.0  female  2009
341  Chinstrap      Dream            49.6  ...       3775.0    male  2009
342  Chinstrap      Dream            50.8  ...       4100.0    male  2009
343  Chinstrap      Dream            50.2  ...       3775.0  female  2009

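[344 rows x 8 columns]

The Arrow pin needs the same workaround as before. Note that this call reads from the local board_here rather than from board_remote:

penguins_arrow = pd.read_feather(board_here.pin_read("penguins-arrow")[0])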
penguins_arrow
       species     island  bill_length_mm  ...  body_mass_g     sex  year
0       Adelie  Torgersen            39.1  ...       3750.0    male  2007
1       Adelie  Torgersen            39.5  ...       3800.0  female  2007
2       Adelie  Torgersen            40.3  ...       3250.0  female  2007
3       Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4       Adelie  Torgersen            36.7  ...       3450.0  female  2007
..         ...        ...             ...  ...          ...     ...   ...
339  Chinstrap      Dream            55.8  ...       4000.0    male  2009
340  Chinstrap      Dream            43.5  ...       3400.0  female  2009
341  Chinstrap      Dream            49.6  ...       3775.0    male  2009
342  Chinstrap      Dream            50.8  ...       4100.0    male  2009
343  Chinstrap      Dream            50.2  ...       3775.0  female  2009

[344 rows x 8 columns]
penguins_arrow.dtypes
species              category
island               category
bill_length_mm        float64
bill_depth_mm         float64
flipper_length_mm     float64
body_mass_g           float64
sex                  category
year                    int32
dtype: object
penguins_arrow['species'].values
['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', ..., 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap']
Length: 344
Categories (3, object): ['Adelie', 'Chinstrap', 'Gentoo']