Notes on HDF5 Computation in Python (Updating)

Introduction

This post contains my personal notes on working with HDF5 in Python.

As we all know, an HDF5 file contains two kinds of objects: groups and datasets. Groups are folder-like containers which hold the datasets. Every HDF5 file has a root group, whose name is ‘/'. Datasets are array-like containers of data.

Every dataset can be split into two parts: raw data values and metadata.

  • Raw data are just arrays.
  • Metadata is more interesting: it is data that describes and gives information about the raw data (a short inspection sketch follows this list). Metadata contains:
    • Dataspace: gives the raw data’s rank and dimensions
    • Datatype: gives the raw data’s datatype, such as integer or float
    • Properties: describe whether the dataset is chunked or compressed
      • Chunked: a chunked dataset has better access times for subsets and is extendible
      • Compressed: a compressed dataset improves storage efficiency and transmission speed
    • Attributes: user-defined attributes that provide extra information about the dataset
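
Here’s a quick sketch of where each piece of metadata shows up in h5py (installing the package is covered in the next section). The file name, dataset name, and chunk/compression settings below are just placeholders I picked for illustration:

import h5py
import numpy as np

# Create a chunked, gzip-compressed dataset and inspect its metadata.
with h5py.File("metadata_demo.hdf5", "w") as f:
    dset = f.create_dataset(
        "demo",
        data=np.arange(10000).reshape(100, 100),
        chunks=(10, 10),         # property: chunked storage layout
        compression="gzip",      # property: compression filter
    )
    dset.attrs["description"] = "demo dataset"   # a user-defined attribute

    print(dset.shape)        # dataspace: rank and dimensions -> (100, 100)
    print(dset.dtype)        # datatype -> int64 (platform dependent)
    print(dset.chunks)       # chunk shape -> (10, 10)
    print(dset.compression)  # compression filter -> 'gzip'
    print(dict(dset.attrs))  # attributes -> {'description': 'demo dataset'}
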
Basic Computation

Below is some code showing how I use Python to work with (read, write, and append to) an HDF5 file. I also use the magic function timeit to see how much an HDF5 file can improve read/write speed. Of course, before working with HDF5 files in Python, you have to install the required packages. After installing them in the terminal, I imported all the packages.

$ brew install hdf5
$ pip3 install h5py

import h5py
import numpy as np
import pandas as pd
import timeit

Here are some basic h5py operations to warm up with:

  • Create a new HDF5 file by opening it in “w” (write) mode.
f = h5py.File("christine.hdf5", "w")
  • Check the root group’s name
f.name
'/'
  • Create a new dataset under the root group.
dset = f.create_dataset('chris_dataset', (100, 100), dtype='i')
  • Check the dataset’s name under the root group
dset.name
'/chris_dataset'
  • In order to create a subgroup under the root group, we should first reopen the file in append mode (“a”).
f = h5py.File("christine.hdf5", "a")
group_sub = f.create_group('subgroup')
  • Check the subgroup’s name
group_sub.name
'/subgroup'
  • Under the subgroup, we can also create a dataset with “create_dataset”.
dset2 = group_sub.create_dataset("under_group_sub", (50,), dtype='i')
dset2.name
'/subgroup/under_group_sub'
  • Groups work like dictionaries, with keys and values. By calling the keys() method, we can list everything stored directly under a group (a sketch of retrieving datasets by their keys follows this warm-up list).
group_sub.keys()
<KeysViewHDF5 ['under_group_sub', 'under_group_sub_2']>
f.keys()
<KeysViewHDF5 ['chris_dataset', 'group_sub', 'subgroup']>
  • Also, once we have created an HDF5 file, we can store any array-like data in it at any time. The most convenient part is that, by setting a key when storing the data, just like in a dictionary, every dataset is saved in the file under its key, and we can read any dataset back from the HDF5 file simply by passing its key.
s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.hdf', key='s')

pd.read_hdf('data.hdf', 's')
0    1
1    2
2    3
3    4
dtype: int64
  • We can set whatever attributes we want on a dataset.
dset.attrs['temp'] = 20
dset.attrs['temp']
20
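
Putting the warm-up together, here is a minimal sketch of looking datasets up by key and reading/writing values with NumPy-style slicing. It assumes the christine.hdf5 file created above, and the slice values are arbitrary, picked only for illustration:

import h5py
import numpy as np

with h5py.File("christine.hdf5", "a") as f:
    dset = f["chris_dataset"]               # look a dataset up by its key
    dset2 = f["subgroup/under_group_sub"]   # keys can also be full paths

    # Datasets behave much like NumPy arrays: write and read with slicing.
    dset[0, :] = np.arange(100)             # write one row of the 100x100 dataset
    print(dset[0, :5])                      # read part of it back -> [0 1 2 3 4]
    print(dset2[:10])                       # first ten entries (still all zeros)
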
Comparison

In this section, using timeit, I compare the efficiency of writing a DataFrame to an HDF5 file versus to a CSV file, and also the efficiency of reading each one back.

data = np.random.uniform(0, 1, size=(1000000, 100))
df = pd.DataFrame(data)
%%timeit -n1
df.to_hdf('data.hdf', key='1')
16.4 ms ± 6.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit -n1
df.to_csv('data.csv')
1.79 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit -n1
pd.read_hdf('data.hdf', '1')
22.1 ms ± 7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit -n1
pd.read_csv('data.csv')
248 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Clearly, when dealing with large datasets, using an HDF5 file is far more efficient than using a CSV file.