1. Getting started

One important aspect of proposing new machine learning/statistical estimators and methods is the performance testing phase. With that in mind, we present here a short introduction to the sstudy package.

If you use this software, please cite it as:

@misc{2004.14479,
Author = {Marco H A Inácio},
Title = {Simulation studies on Python using sstudy package with SQL databases as storage},
Year = {2020},
Eprint = {arXiv:2004.14479},
}

We start by installing the package:

[1]:
!pip install sstudy
Requirement already satisfied: sstudy in /home/marco/Documents/projects/sstudy (0.0.5)
Requirement already satisfied: peewee in /home/marco/miniforge3/lib/python3.7/site-packages (from sstudy) (3.10.0)

Let us first define the structure of our dataset and create it:

[2]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result.create_table()

Now, let’s run the simulations (which will be stored in results.sqlite3):

[3]:
import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study

no_simulations = 5

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
)

def func(
    data_distribution,
    no_instances,
    method,
    ):

    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result, max_count=no_simulations)

8 combinations left
Result:
{'score': 0.8575248931182969, 'elapsed_time': 0.0013320446014404297}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8905348838183208, 'elapsed_time': 0.008800506591796875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7945695968134672, 'elapsed_time': 0.009409189224243164}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7287316399844865, 'elapsed_time': 0.0032854080200195312}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.2564334250136251, 'elapsed_time': 0.016244173049926758}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8975782712376218, 'elapsed_time': 0.0028858184814453125}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5373575169302656, 'elapsed_time': 0.008737564086914062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8178943078163557, 'elapsed_time': 0.0012826919555664062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9144324333131287, 'elapsed_time': 0.0013051033020019531}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8218113034053461, 'elapsed_time': 0.0015320777893066406}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.769481073379765, 'elapsed_time': 0.00133514404296875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.6275486131272321, 'elapsed_time': 0.0014030933380126953}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8691754007727437, 'elapsed_time': 0.0018489360809326172}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8410425486682132, 'elapsed_time': 0.002569913864135742}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8451526407461323, 'elapsed_time': 0.004086971282958984}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.542794259807144, 'elapsed_time': 0.0062961578369140625}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8173641561035503, 'elapsed_time': 0.0030364990234375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8189176038766482, 'elapsed_time': 0.0016186237335205078}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8768221999784449, 'elapsed_time': 0.002727031707763672}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8425073925649836, 'elapsed_time': 0.0018978118896484375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7132189468128055, 'elapsed_time': 0.001692056655883789}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9266474169971257, 'elapsed_time': 0.0032205581665039062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8522948743811272, 'elapsed_time': 0.0015649795532226562}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5756438633125094, 'elapsed_time': 0.014742612838745117}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7313379763922424, 'elapsed_time': 0.002069234848022461}
Result successfully stored in the database
8 combinations left
7 combinations left
Result:
{'score': 0.8442770360636538, 'elapsed_time': 0.001718282699584961}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.6811604644181054, 'elapsed_time': 0.002541780471801758}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8743938809791751, 'elapsed_time': 0.002440214157104492}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8477776526527044, 'elapsed_time': 0.0017156600952148438}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.930162530219549, 'elapsed_time': 0.0019092559814453125}
Result successfully stored in the database
7 combinations left
6 combinations left
Result:
{'score': 0.6561026610417788, 'elapsed_time': 0.00104522705078125}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.6303432542515472, 'elapsed_time': 0.0014503002166748047}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.7963865338492976, 'elapsed_time': 0.022031784057617188}
Result successfully stored in the database
6 combinations left
5 combinations left
Result:
{'score': 0.5984406975347196, 'elapsed_time': 0.0017325878143310547}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.6946891656361566, 'elapsed_time': 0.0030274391174316406}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.8483846489543649, 'elapsed_time': 0.002885580062866211}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.36985368772169913, 'elapsed_time': 0.002604961395263672}
Result successfully stored in the database
5 combinations left
4 combinations left
3 combinations left
Result:
{'score': 0.8868214247864545, 'elapsed_time': 0.0029604434967041016}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.572918197397766, 'elapsed_time': 0.003069639205932617}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.8566470616687658, 'elapsed_time': 0.0018100738525390625}
Result successfully stored in the database
3 combinations left
2 combinations left
1 combinations left

The good news is that SQLite works through atomic transactions, so a commit (i.e.: adding a result to the database) will either happen entirely or not happen at all.

Therefore, you can kill the simulation study process without fear that the database will become corrupted, even if you happen to kill it while it's committing results.

sstudy chooses the test cases randomly and independently; therefore, you can spawn multiple simulation study processes that will work independently and can be terminated at any moment, as illustrated below.
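For example, on a Linux shell you could launch several background workers running the same study (here your_script.py stands for whatever script calls do_simulation_study) and kill any of them whenever you like:

python your_script.py &
python your_script.py &
python your_script.py &

All workers sample configurations from the same database, so terminating one of them at most discards the single simulation it was computing at that moment.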

[4]:
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

df = pd.DataFrame(list(Result.select().dicts()))
del(df['id'])
df.groupby(['data_distribution', 'no_instances', 'method']).mean()
[4]:
score elapsed_time
data_distribution no_instances method
complete 100 lasso 0.803092 0.003834
ols 0.863425 0.002493
1000 lasso 0.856177 0.003833
ols 0.846198 0.001619
sparse 100 lasso 0.608317 0.002824
ols 0.742987 0.007308
1000 lasso 0.746540 0.001939
ols 0.584299 0.007723
[5]:
def mpse(data):
    mean = data.mean()
    std_error = np.std(data) / np.sqrt(len(data))
    return "{0:.3f} ({1:.3f})".format(mean, std_error)

df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[5]:
score elapsed_time
data_distribution no_instances method
complete 100 lasso 0.803 (0.035) 0.004 (0.001)
ols 0.863 (0.016) 0.002 (0.000)
1000 lasso 0.856 (0.011) 0.004 (0.001)
ols 0.846 (0.024) 0.002 (0.000)
sparse 100 lasso 0.608 (0.073) 0.003 (0.001)
ols 0.743 (0.054) 0.007 (0.004)
1000 lasso 0.747 (0.049) 0.002 (0.000)
ols 0.584 (0.084) 0.008 (0.003)

2. Sample filter

Suppose that we are not interested in testing some configurations of our simulation study. For instance, if we are not interested in testing method lasso with 10000 instances, we can use a sample_filter function:

[6]:
%%capture
# Note: the %%capture line is only here to suppress output on jupyter notebooks.
# You can remove it on your application.

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000, 10000],
    method = ['ols', 'lasso'],
)

def sample_filter(
    data_distribution,
    no_instances,
    method,
    ):

    if method == 'lasso' and no_instances == 10000:
        return False

    return True

do_simulation_study(to_sample, func, db, Result,
    max_count=no_simulations,
    sample_filter=sample_filter)

df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]

Suppose now that for some configuration(s) we want to increase or decrease the number of simulations to be performed. For instance, for method ols, we want to run 50 simulations for each configuration:

[7]:
%%capture

def sample_filter(
    data_distribution,
    no_instances,
    method,
    ):

    if method == 'lasso' and no_instances == 1000:
        return False

    if method == 'ols':
        return 50

    return True

do_simulation_study(to_sample, func, db, Result,
    max_count=no_simulations,
    sample_filter=sample_filter)

df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]

3. Deleting or updating results

Suppose you committed a programming mistake while coding distribution 1. Here's how to delete the results related to it while preserving the results for the other distributions:

[8]:
query = Result.delete().where(Result.data_distribution==1)
query.execute()
[8]:
0

Note that this call returns the number of affected (i.e.: deleted) rows (i.e.: simulations) in the database.

After that, you could then fix your code for distribution 1, and run do_simulation_study again to generate new results for it.

Updating works similarly. For instance, let's change the rows with data_distribution 0 to 3.

[9]:
query = Result.update(data_distribution=3).where(Result.data_distribution==0)
query.execute()
[9]:
0

4. PostgreSQL database

You can also use PostgreSQL (or MySQL or CockroachDB, which are also supported by the peewee package) by installing the Python PostgreSQL driver: the psycopg2 package. The greatest advantage of using a managed database server is the ability to easily run sstudy on many machines at the same time, sharing the workload of the simulations.

As the database hosting server, you can install a free server on your local computer or use a third-party one such as ElephantSQL, Amazon AWS or Google Cloud, and change the db configuration accordingly:

from peewee import *
import os

pgdb = 'database_name'
pguser = 'username'
pgpassword = 'password'
pghost = 'host_address'

db = PostgresqlDatabase(pgdb, user=pguser, password=pgpassword, host=pghost)

Ideally though, you should not hardcode your credentials; they should instead be passed as environment variables:

from peewee import *
import os

try:
    pgdb = os.environ['pgdb']
    pguser = os.environ['pguser']
    pgpass = os.environ['pgpass']
    pghost = os.environ['pghost']
    pgport = os.environ['pgport']

    db = PostgresqlDatabase(pgdb, user=pguser, password=pgpass,
    host=pghost, port=pgport)
except KeyError:
    db = SqliteDatabase('results.sqlite3')

e.g.: run

pgdb='databasename' pguser='username' pgpass='password' pghost='host_address' pgport='5432' ipython your_script.py

5. Remote SQLite access

An alternative to using a remote database server is to work with multiple SQLite databases and merge them for analysis.

Suppose you want to merge a results.sqlite3 stored on the remote host 192.168.1.100, to which you have SSH access. You could then use the following set of commands on Linux:

scp 192.168.1.100:path_to_remote_database/results.sqlite3 db2.sqlite3

cp results.sqlite3 combined.sqlite3

sqlite3 combined.sqlite3 "ATTACH DATABASE 'db2.sqlite3' AS toMerge; INSERT INTO result (data_distribution, no_instances, method, score, elapsed_time) SELECT data_distribution, no_instances, method, score, elapsed_time FROM toMerge.result; DETACH toMerge;"
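Alternatively, if you prefer to stay in Python, here is a minimal sketch of the same merge using pandas (assuming both files contain the result table of the example above):

import sqlite3
import pandas as pd

con1 = sqlite3.connect('results.sqlite3')
con2 = sqlite3.connect('db2.sqlite3')

# Stack the rows of both result tables into a single DataFrame for analysis
df = pd.concat(
    [pd.read_sql('SELECT * FROM result', con1),
     pd.read_sql('SELECT * FROM result', con2)],
    ignore_index=True,
)
del df['id']  # the per-file ids are not unique across files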

The disadvantage of this method compared to PostgreSQL is that sstudy will not be able to track the progress of the study globally and allocate new simulations accordingly (e.g.: one node might finish all of its scheduled simulations while others still have many more to do).

Another possibility, which does not have this shortcoming, is to mount the remote server's folder locally using the Linux tool sshfs and, from there, access the SQLite database file directly.
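For example (the mount point /mnt/remote is illustrative):

sshfs 192.168.1.100:path_to_remote_database /mnt/remote

after which SqliteDatabase('/mnt/remote/results.sqlite3') can be used as usual.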

6. Storage of binary data

Storage of binary data (e.g.: lists, numpy arrays, etc.) is also supported using a BlobField:

long_data = BlobField()

Once the data is requested to be stored, sstudy will automatically run pickle.dumps on it (unless the data is already of a binary type). You can then reload your data later using pickle.loads.
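As a minimal sketch (the Result4 model and its fields are illustrative, not part of the package):

import pickle
from peewee import Model, TextField, BlobField

class Result4(Model):
    # Illustrative model for storing array-valued results
    method = TextField()
    predictions = BlobField()  # sstudy pickles non-binary values before storing them

    class Meta:
        database = db  # the SqliteDatabase defined earlier

Result4.create_table()

# Later, when exploring the results, restore the original object:
row = Result4.select().first()
if row is not None:
    predictions = pickle.loads(row.predictions)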

7. Real data

It's also possible to use the package to evaluate your methods on real datasets, as in the example below:

[10]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result2(Model):
    # Data settings
    dataset = TextField()
    method = TextField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result2.create_table()
[11]:
%%capture

import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study
from sklearn import datasets

no_simulations = 10

to_sample = dict(
    dataset = ["boston", "diabetes"],
    method = ['ols', 'lasso'],
)

def func(
    dataset,
    method,
    ):

    if dataset == 'diabetes':
        rdata = datasets.load_diabetes()
    elif dataset == 'boston':
        rdata = datasets.load_boston()
    else:
        raise ValueError

    x = rdata["data"]
    y = rdata["target"]
    no_instances = round(len(y)*.9)

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result2, max_count=1)
[12]:
df2 = pd.DataFrame(list(Result2.select().dicts()))
df2.sort_values(list(df2.columns))
[12]:
id dataset method score elapsed_time
0 1 boston ols 0.685685 0.001097
1 2 diabetes lasso 0.670936 0.001186
2 3 diabetes ols 0.685685 0.001005
3 4 boston lasso 0.670936 0.000773

8. Deterministic results

If it's important to have deterministic results in the simulation study, one possibility is to set the random seed as a variable of the experiment, as in the example below.

In this case, it's useful and recommended to mark the set of parameters that identify an experiment, i.e. (data_distribution, method, no_instances, random_seed), with a unique constraint on the database, so that the database system itself will enforce such a uniqueness constraint.

See more about constraints at http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-constraints

[13]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result3(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()
    random_seed = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db
        indexes = (
            (('data_distribution', 'method', 'no_instances', 'random_seed'), True),
        )

Result3.create_table()
[14]:
%%capture

import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
    random_seed = range(30),
)

def func(
    data_distribution,
    no_instances,
    method,
    random_seed,
    ):
    np.random.seed(random_seed)

    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result3, max_count=1)
[15]:
%%capture
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')

Let us now delete the results, run the simulation again, and show that the results do not change.

[16]:
%%capture
Result3.delete().execute()
do_simulation_study(to_sample, func, db, Result3, max_count=1)
[17]:
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')
[17]:
id data_distribution method no_instances random_seed score elapsed_time
15 16 sparse ols 100 25 0.161505 0.001216
43 44 sparse lasso 100 25 0.166967 0.001260
59 60 sparse ols 100 12 0.359669 0.001216
213 214 sparse lasso 100 12 0.368209 0.001302
128 129 sparse ols 100 6 0.371189 0.001080
... ... ... ... ... ... ... ...
58 59 complete lasso 1000 17 0.920788 0.002146
109 110 complete ols 1000 9 0.922379 0.001233
158 159 complete lasso 1000 9 0.922590 0.001191
155 156 complete lasso 1000 22 0.947963 0.001266
198 199 complete ols 1000 22 0.948076 0.002316

240 rows × 7 columns

9. Miscellaneous hints

[18]:
gdf = df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[19]:
# You can export your table to latex using `to_latex` pandas method:
print(gdf.to_latex())
# Bonus hint: to enable a multirow layout (see the multirow LaTeX package),
# try print(gdf.to_latex(multirow=True))
\begin{tabular}{llllll}
\toprule
       &       &     &                id &          score &   elapsed\_time \\
data\_distribution & no\_instances & method &                   &                &                \\
\midrule
complete & 100   & lasso &    14.600 (4.065) &  0.803 (0.035) &  0.004 (0.001) \\
       &       & ols &  177.780 (13.974) &  0.838 (0.008) &  0.001 (0.000) \\
       & 1000  & lasso &    20.600 (5.174) &  0.856 (0.011) &  0.004 (0.001) \\
       &       & ols &  155.320 (10.350) &  0.834 (0.010) &  0.001 (0.000) \\
       & 10000 & lasso &    99.800 (7.623) &  0.842 (0.016) &  0.002 (0.000) \\
       &       & ols &  193.620 (13.599) &  0.825 (0.012) &  0.003 (0.000) \\
sparse & 100   & lasso &    29.200 (3.242) &  0.608 (0.073) &  0.003 (0.001) \\
       &       & ols &  164.900 (12.723) &  0.660 (0.022) &  0.002 (0.000) \\
       & 1000  & lasso &    29.400 (4.540) &  0.747 (0.049) &  0.002 (0.000) \\
       &       & ols &  195.120 (13.249) &  0.702 (0.022) &  0.002 (0.000) \\
       & 10000 & lasso &    71.400 (5.001) &  0.688 (0.040) &  0.002 (0.000) \\
       &       & ols &  179.060 (11.890) &  0.695 (0.021) &  0.003 (0.000) \\
\bottomrule
\end{tabular}

[20]:
# Add number of simulations as a column
to_group = ['data_distribution', 'no_instances', 'method']
count = df.groupby(to_group).count().iloc[:,-1]
gdf['no simulations'] = count
gdf
[20]:
id score elapsed_time no simulations
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5
ols 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5
ols 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5
ols 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5
ols 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5
ols 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5
ols 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50
[21]:
# Add the standard deviation
# and the standard error of the standard deviation
def stdpse(data):
    size = len(data)
    mu = data.mean()
    var = data.var()
    std = np.sqrt(var)

    mu4 = ((data - mu)**4).mean()
    se_of_std = mu4 - (size-3)/(size-1)*var**2
    se_of_std = np.sqrt(se_of_std/size) / 2 / std
    return "{0:.3f} ({1:.3f})".format(std, se_of_std)
score_std_se = df.groupby(to_group).agg(stdpse).score
gdf['score std'] = score_std_se
gdf
[21]:
id score elapsed_time no simulations score std
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5 0.089 (0.015)
ols 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50 0.054 (0.007)
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5 0.028 (0.005)
ols 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50 0.074 (0.008)
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5 0.041 (0.006)
ols 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50 0.089 (0.012)
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5 0.183 (0.040)
ols 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50 0.159 (0.025)
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5 0.122 (0.016)
ols 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50 0.158 (0.021)
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5 0.101 (0.013)
ols 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50 0.148 (0.020)
[22]:
# Renaming values in a grouped pandas DataFrame
gdf = gdf.reset_index()
gdf.loc[gdf.method=="ols", "method"] = 'linear'
gdf.index = pd.MultiIndex.from_frame(gdf[to_group])
gdf.drop(to_group, axis=1, inplace=True)

gdf
[22]:
id score elapsed_time no simulations score std
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5 0.089 (0.015)
linear 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50 0.054 (0.007)
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5 0.028 (0.005)
linear 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50 0.074 (0.008)
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5 0.041 (0.006)
linear 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50 0.089 (0.012)
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5 0.183 (0.040)
linear 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50 0.159 (0.025)
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5 0.122 (0.016)
linear 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50 0.158 (0.021)
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5 0.101 (0.013)
linear 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50 0.148 (0.020)

10. Managing multiple files

In the examples so far, we have managed everything in a single file, but our recommended design is to separate each setup into 3 files: one for the database structure, another for running the simulations, and another to explore/plot the results, as sketched below. As a skeleton for this, we have an examples folder distributed together with the package, also available at: https://gitlab.com/marcoinacio/sstudy/-/tree/master/examples.
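A hypothetical layout following that design (the file names are illustrative; see the linked examples folder for the actual skeleton):

db_structure.py       # peewee Model definitions and create_table() calls
run_simulations.py    # imports the models, defines to_sample and func, calls do_simulation_study
explore_results.py    # loads the results into pandas for tables and plots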