1. Getting started

One important aspect of proposing new machine learning/statistical estimators and methods is the performance testing phase. With that in mind, we present here a short introduction to the sstudy package.

If you use this software, please cite it as:

@misc{2004.14479,
Author = {Marco H A Inácio},
Title = {Simulation studies on Python using sstudy package with SQL databases as storage},
Year = {2020},
Eprint = {arXiv:2004.14479},
}

We start by installing the package:

[1]:
!pip install sstudy
Requirement already satisfied: sstudy in /home/marco/Documents/projects/sstudy (0.0.5)
Requirement already satisfied: peewee in /home/marco/miniforge3/lib/python3.7/site-packages (from sstudy) (3.10.0)

Let us first define the structure of our dataset and create it:

[2]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result.create_table()

Now, let’s run the simulations (which will be stored in results.sqlite3):

[3]:
import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study

no_simulations = 5

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
)

def func(
    data_distribution,
    no_instances,
    method,
    ):

    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result, max_count=no_simulations)

8 combinations left
Result:
{'score': 0.8575248931182969, 'elapsed_time': 0.0013320446014404297}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8905348838183208, 'elapsed_time': 0.008800506591796875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7945695968134672, 'elapsed_time': 0.009409189224243164}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7287316399844865, 'elapsed_time': 0.0032854080200195312}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.2564334250136251, 'elapsed_time': 0.016244173049926758}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8975782712376218, 'elapsed_time': 0.0028858184814453125}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5373575169302656, 'elapsed_time': 0.008737564086914062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8178943078163557, 'elapsed_time': 0.0012826919555664062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9144324333131287, 'elapsed_time': 0.0013051033020019531}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8218113034053461, 'elapsed_time': 0.0015320777893066406}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.769481073379765, 'elapsed_time': 0.00133514404296875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.6275486131272321, 'elapsed_time': 0.0014030933380126953}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8691754007727437, 'elapsed_time': 0.0018489360809326172}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8410425486682132, 'elapsed_time': 0.002569913864135742}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8451526407461323, 'elapsed_time': 0.004086971282958984}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.542794259807144, 'elapsed_time': 0.0062961578369140625}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8173641561035503, 'elapsed_time': 0.0030364990234375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8189176038766482, 'elapsed_time': 0.0016186237335205078}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8768221999784449, 'elapsed_time': 0.002727031707763672}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8425073925649836, 'elapsed_time': 0.0018978118896484375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7132189468128055, 'elapsed_time': 0.001692056655883789}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9266474169971257, 'elapsed_time': 0.0032205581665039062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8522948743811272, 'elapsed_time': 0.0015649795532226562}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5756438633125094, 'elapsed_time': 0.014742612838745117}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7313379763922424, 'elapsed_time': 0.002069234848022461}
Result successfully stored in the database
8 combinations left
7 combinations left
Result:
{'score': 0.8442770360636538, 'elapsed_time': 0.001718282699584961}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.6811604644181054, 'elapsed_time': 0.002541780471801758}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8743938809791751, 'elapsed_time': 0.002440214157104492}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8477776526527044, 'elapsed_time': 0.0017156600952148438}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.930162530219549, 'elapsed_time': 0.0019092559814453125}
Result successfully stored in the database
7 combinations left
6 combinations left
Result:
{'score': 0.6561026610417788, 'elapsed_time': 0.00104522705078125}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.6303432542515472, 'elapsed_time': 0.0014503002166748047}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.7963865338492976, 'elapsed_time': 0.022031784057617188}
Result successfully stored in the database
6 combinations left
5 combinations left
Result:
{'score': 0.5984406975347196, 'elapsed_time': 0.0017325878143310547}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.6946891656361566, 'elapsed_time': 0.0030274391174316406}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.8483846489543649, 'elapsed_time': 0.002885580062866211}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.36985368772169913, 'elapsed_time': 0.002604961395263672}
Result successfully stored in the database
5 combinations left
4 combinations left
3 combinations left
Result:
{'score': 0.8868214247864545, 'elapsed_time': 0.0029604434967041016}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.572918197397766, 'elapsed_time': 0.003069639205932617}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.8566470616687658, 'elapsed_time': 0.0018100738525390625}
Result successfully stored in the database
3 combinations left
2 combinations left
1 combinations left

The good news is that SQLite works through atomic transactions, so a commit (i.e.: adding a result to the database) will either happen entirely or not happen at all.

Therefore, you can kill the simulation study process without fear that the database will become corrupted, even if you happen to kill it while it's committing results.

sstudy chooses the test cases randomly and independently; therefore, you can spawn multiple simulation study processes that will work independently and can be terminated at any moment, as illustrated below.
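For example, on a Linux shell you could launch several background workers running the same study (here your_script.py stands for whatever script calls do_simulation_study) and kill any of them whenever you like:

python your_script.py &
python your_script.py &
python your_script.py &

All workers sample configurations from the same database, so terminating one of them at most discards the single simulation it was computing at that moment.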

[4]:
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

df = pd.DataFrame(list(Result.select().dicts()))
del(df['id'])
df.groupby(['data_distribution', 'no_instances', 'method']).mean()
[4]:
score elapsed_time
data_distribution no_instances method
complete 100 lasso 0.803092 0.003834
ols 0.863425 0.002493
1000 lasso 0.856177 0.003833
ols 0.846198 0.001619
sparse 100 lasso 0.608317 0.002824
ols 0.742987 0.007308
1000 lasso 0.746540 0.001939
ols 0.584299 0.007723
[5]:
def mpse(data):
    mean = data.mean()
    std_error = np.std(data) / np.sqrt(len(data))
    return "{0:.3f} ({1:.3f})".format(mean, std_error)

df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[5]:
score elapsed_time
data_distribution no_instances method
complete 100 lasso 0.803 (0.035) 0.004 (0.001)
ols 0.863 (0.016) 0.002 (0.000)
1000 lasso 0.856 (0.011) 0.004 (0.001)
ols 0.846 (0.024) 0.002 (0.000)
sparse 100 lasso 0.608 (0.073) 0.003 (0.001)
ols 0.743 (0.054) 0.007 (0.004)
1000 lasso 0.747 (0.049) 0.002 (0.000)
ols 0.584 (0.084) 0.008 (0.003)

2. Sample filter

Suppose that we are not interested in testing some configurations of our simulation study. For instance, if we are not interested in testing method lasso with 10000 instances, we can use a sample_filter function:

[6]:
%%capture
# Note: the %%capture line is only here to suppress output on jupyter notebooks.
# You can remove it on your application.

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000, 10000],
    method = ['ols', 'lasso'],
)

def sample_filter(
    data_distribution,
    no_instances,
    method,
    ):

    if method == 'lasso' and no_instances == 10000:
        return False

    return True

do_simulation_study(to_sample, func, db, Result,
    max_count=no_simulations,
    sample_filter=sample_filter)

df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]

Suppose now that for some configuration(s) we want to increase or decrease the number of simulations to be performed. For instance, for method ols, we want to run 50 simulations for each configuration:

[7]:
%%capture

def sample_filter(
    data_distribution,
    no_instances,
    method,
    ):

    if method == 'lasso' and no_instances == 1000:
        return False

    if method == 'ols':
        return 50

    return True

do_simulation_study(to_sample, func, db, Result,
    max_count=no_simulations,
    sample_filter=sample_filter)

df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]

3. Deleting or updating results

Suppose you committed a programming mistake while coding distribution 1. Here's how to delete the results related to it while preserving the results for the other distributions:

[8]:
query = Result.delete().where(Result.data_distribution==1)
query.execute()
[8]:
0

Note that this call returns the number of affected (i.e.: deleted) rows (i.e.: simulations) in the database.

After that, you could then fix your code for distribution 1, and run do_simulation_study again to generate new results for it.

Updating works similarly. For instance, let's change the rows with data_distribution 0 to 3.

[9]:
query = Result.update(data_distribution=3).where(Result.data_distribution==0)
query.execute()
[9]:
0

4. PostgreSQL database

You can also use PostgreSQL (or MySQL or CockroachDB, which are also supported by the peewee package) by installing the Python PostgreSQL driver: the psycopg2 package. The greatest advantage of using a managed database server is the ability to easily run sstudy on many machines at the same time, sharing the workload of the simulations.

As the database hosting server, you can install a free server on your local computer or use a third-party one such as ElephantSQL, Amazon AWS or Google Cloud, and change the db configuration accordingly:

from peewee import *
import os

pgdb = 'database_name'
pguser = 'username'
pgpassword = 'password'
pghost = 'host_address'

db = PostgresqlDatabase(pgdb, user=pguser, password=pgpassword, host=pghost)

Ideally though, you should not hardcode your credentials; they should instead be passed as environment variables:

from peewee import *
import os

try:
    pgdb = os.environ['pgdb']
    pguser = os.environ['pguser']
    pgpass = os.environ['pgpass']
    pghost = os.environ['pghost']
    pgport = os.environ['pgport']

    db = PostgresqlDatabase(pgdb, user=pguser, password=pgpass,
    host=pghost, port=pgport)
except KeyError:
    db = SqliteDatabase('results.sqlite3')

e.g.: run

pgdb='databasename' pguser='username' pgpass='password' pghost='host_address' pgport='5432' ipython your_script.py

5. Remote SQLite access

An alternative to using a remote database server is to work with multiple SQLite databases and merge them for analysis.

Suppose you want to merge a results.sqlite3 stored on the remote host 192.168.1.100, to which you have SSH access. You could then use the following set of commands on Linux:

scp 192.168.1.100:path_to_remote_database/results.sqlite3 db2.sqlite3

cp results.sqlite3 combined.sqlite3

sqlite3 combined.sqlite3 "ATTACH DATABASE 'db2.sqlite3' AS toMerge; INSERT INTO result (data_distribution, no_instances, method, score, elapsed_time) SELECT data_distribution, no_instances, method, score, elapsed_time FROM toMerge.result; DETACH toMerge;"
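Alternatively, if you prefer to stay in Python, here is a minimal sketch of the same merge using pandas (assuming both files contain the result table of the example above):

import sqlite3
import pandas as pd

con1 = sqlite3.connect('results.sqlite3')
con2 = sqlite3.connect('db2.sqlite3')

# Stack the rows of both result tables into a single DataFrame for analysis
df = pd.concat(
    [pd.read_sql('SELECT * FROM result', con1),
     pd.read_sql('SELECT * FROM result', con2)],
    ignore_index=True,
)
del df['id']  # the per-file ids are not unique across files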

The disadvantage of this method compared to PostgreSQL is that sstudy will not be able to track the progress of the study globally and allocate new simulations accordingly (e.g.: one node might finish all of its scheduled simulations while others still have many more to do).

Another possibility, which does not have this shortcoming, is to mount the remote server's folder locally using the Linux tool sshfs and, from there, access the SQLite database file directly.
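For example (the mount point /mnt/remote is illustrative):

sshfs 192.168.1.100:path_to_remote_database /mnt/remote

after which SqliteDatabase('/mnt/remote/results.sqlite3') can be used as usual.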

6. Storage of binary data

Storage of binary data (e.g.: lists, numpy arrays, etc.) is also supported using a BlobField:

long_data = BlobField()

Once the data is requested to be stored, sstudy will automatically run pickle.dumps on it (unless the data is already of a binary type). You can then reload your data later using pickle.loads.
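As a minimal sketch (the Result4 model and its fields are illustrative, not part of the package):

import pickle
from peewee import Model, TextField, BlobField

class Result4(Model):
    # Illustrative model for storing array-valued results
    method = TextField()
    predictions = BlobField()  # sstudy pickles non-binary values before storing them

    class Meta:
        database = db  # the SqliteDatabase defined earlier

Result4.create_table()

# Later, when exploring the results, restore the original object:
row = Result4.select().first()
if row is not None:
    predictions = pickle.loads(row.predictions)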

7. Real data

It's also possible to use the package to evaluate your methods on real datasets, as in the example below:

[10]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result2(Model):
    # Data settings
    dataset = TextField()
    method = TextField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result2.create_table()
[11]:
%%capture

import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study
from sklearn import datasets

no_simulations = 10

to_sample = dict(
    dataset = ["boston", "diabetes"],
    method = ['ols', 'lasso'],
)

def func(
    dataset,
    method,
    ):

    if dataset == 'diabetes':
        rdata = datasets.load_diabetes()
    elif dataset == 'boston':
        rdata = datasets.load_boston()
    else:
        raise ValueError

    x = rdata["data"]
    y = rdata["target"]
    no_instances = round(len(y)*.9)

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result2, max_count=1)
[12]:
df2 = pd.DataFrame(list(Result2.select().dicts()))
df2.sort_values(list(df2.columns))
[12]:
id dataset method score elapsed_time
0 1 boston ols 0.685685 0.001097
1 2 diabetes lasso 0.670936 0.001186
2 3 diabetes ols 0.685685 0.001005
3 4 boston lasso 0.670936 0.000773

8. Deterministic results

If it's important to have deterministic results in the simulation study, one possibility is to set the random seed as a variable of the experiment, as in the example below.

In this case, it's useful and recommended to mark the set of parameters that identify an experiment, i.e. (data_distribution, method, no_instances, random_seed), with a unique constraint on the database, so that the database system itself will enforce such a uniqueness constraint.

See more about constraints at http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-constraints

[13]:
from peewee import *
import os

db = SqliteDatabase('results.sqlite3')

class Result3(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()
    random_seed = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db
        indexes = (
            (('data_distribution', 'method', 'no_instances', 'random_seed'), True),
        )

Result3.create_table()
[14]:
%%capture

import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study

to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
    random_seed = range(30),
)

def func(
    data_distribution,
    no_instances,
    method,
    random_seed,
    ):
    np.random.seed(random_seed)

    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )

do_simulation_study(to_sample, func, db, Result3, max_count=1)
[15]:
%%capture
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')

Let us now delete the results, run the simulation again, and show that the results do not change.

[16]:
%%capture
Result3.delete().execute()
do_simulation_study(to_sample, func, db, Result3, max_count=1)
[17]:
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')
[17]:
id data_distribution method no_instances random_seed score elapsed_time
15 16 sparse ols 100 25 0.161505 0.001216
43 44 sparse lasso 100 25 0.166967 0.001260
59 60 sparse ols 100 12 0.359669 0.001216
213 214 sparse lasso 100 12 0.368209 0.001302
128 129 sparse ols 100 6 0.371189 0.001080
... ... ... ... ... ... ... ...
58 59 complete lasso 1000 17 0.920788 0.002146
109 110 complete ols 1000 9 0.922379 0.001233
158 159 complete lasso 1000 9 0.922590 0.001191
155 156 complete lasso 1000 22 0.947963 0.001266
198 199 complete ols 1000 22 0.948076 0.002316

240 rows × 7 columns

9. Miscellaneous hints

[18]:
gdf = df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[19]:
# You can export your table to latex using `to_latex` pandas method:
print(gdf.to_latex())
# Bonus hint: to enable a multirow layout (see the multirow LaTeX package),
# try print(gdf.to_latex(multirow=True))
\begin{tabular}{llllll}
\toprule
       &       &     &                id &          score &   elapsed\_time \\
data\_distribution & no\_instances & method &                   &                &                \\
\midrule
complete & 100   & lasso &    14.600 (4.065) &  0.803 (0.035) &  0.004 (0.001) \\
       &       & ols &  177.780 (13.974) &  0.838 (0.008) &  0.001 (0.000) \\
       & 1000  & lasso &    20.600 (5.174) &  0.856 (0.011) &  0.004 (0.001) \\
       &       & ols &  155.320 (10.350) &  0.834 (0.010) &  0.001 (0.000) \\
       & 10000 & lasso &    99.800 (7.623) &  0.842 (0.016) &  0.002 (0.000) \\
       &       & ols &  193.620 (13.599) &  0.825 (0.012) &  0.003 (0.000) \\
sparse & 100   & lasso &    29.200 (3.242) &  0.608 (0.073) &  0.003 (0.001) \\
       &       & ols &  164.900 (12.723) &  0.660 (0.022) &  0.002 (0.000) \\
       & 1000  & lasso &    29.400 (4.540) &  0.747 (0.049) &  0.002 (0.000) \\
       &       & ols &  195.120 (13.249) &  0.702 (0.022) &  0.002 (0.000) \\
       & 10000 & lasso &    71.400 (5.001) &  0.688 (0.040) &  0.002 (0.000) \\
       &       & ols &  179.060 (11.890) &  0.695 (0.021) &  0.003 (0.000) \\
\bottomrule
\end{tabular}

[20]:
# Add number of simulations as a column
to_group = ['data_distribution', 'no_instances', 'method']
count = df.groupby(to_group).count().iloc[:,-1]
gdf['no simulations'] = count
gdf
[20]:
id score elapsed_time no simulations
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5
ols 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5
ols 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5
ols 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5
ols 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5
ols 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5
ols 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50
[21]:
# Add the standard deviation
# and the standard error of the standard deviation
def stdpse(data):
    size = len(data)
    mu = data.mean()
    var = data.var()
    std = np.sqrt(var)

    mu4 = ((data - mu)**4).mean()
    se_of_std = mu4 - (size-3)/(size-1)*var**2
    se_of_std = np.sqrt(se_of_std/size) / 2 / std
    return "{0:.3f} ({1:.3f})".format(std, se_of_std)
score_std_se = df.groupby(to_group).agg(stdpse).score
gdf['score std'] = score_std_se
gdf
[21]:
id score elapsed_time no simulations score std
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5 0.089 (0.015)
ols 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50 0.054 (0.007)
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5 0.028 (0.005)
ols 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50 0.074 (0.008)
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5 0.041 (0.006)
ols 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50 0.089 (0.012)
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5 0.183 (0.040)
ols 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50 0.159 (0.025)
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5 0.122 (0.016)
ols 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50 0.158 (0.021)
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5 0.101 (0.013)
ols 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50 0.148 (0.020)
[22]:
# Renaming values in a grouped pandas DataFrame
gdf = gdf.reset_index()
gdf.loc[gdf.method=="ols", "method"] = 'linear'
gdf.index = pd.MultiIndex.from_frame(gdf[to_group])
gdf.drop(to_group, axis=1, inplace=True)

gdf
[22]:
id score elapsed_time no simulations score std
data_distribution no_instances method
complete 100 lasso 14.600 (4.065) 0.803 (0.035) 0.004 (0.001) 5 0.089 (0.015)
linear 177.780 (13.974) 0.838 (0.008) 0.001 (0.000) 50 0.054 (0.007)
1000 lasso 20.600 (5.174) 0.856 (0.011) 0.004 (0.001) 5 0.028 (0.005)
linear 155.320 (10.350) 0.834 (0.010) 0.001 (0.000) 50 0.074 (0.008)
10000 lasso 99.800 (7.623) 0.842 (0.016) 0.002 (0.000) 5 0.041 (0.006)
linear 193.620 (13.599) 0.825 (0.012) 0.003 (0.000) 50 0.089 (0.012)
sparse 100 lasso 29.200 (3.242) 0.608 (0.073) 0.003 (0.001) 5 0.183 (0.040)
linear 164.900 (12.723) 0.660 (0.022) 0.002 (0.000) 50 0.159 (0.025)
1000 lasso 29.400 (4.540) 0.747 (0.049) 0.002 (0.000) 5 0.122 (0.016)
linear 195.120 (13.249) 0.702 (0.022) 0.002 (0.000) 50 0.158 (0.021)
10000 lasso 71.400 (5.001) 0.688 (0.040) 0.002 (0.000) 5 0.101 (0.013)
linear 179.060 (11.890) 0.695 (0.021) 0.003 (0.000) 50 0.148 (0.020)

10. Managing multiple files

In the examples so far, we have managed everything in a single file, but our recommended design is to separate each setup into 3 files: one for the database structure, another for running the simulations, and another to explore/plot the results, as sketched below. As a skeleton for this, we have an examples folder distributed together with the package, also available at: https://gitlab.com/marcoinacio/sstudy/-/tree/master/examples.
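A hypothetical layout following that design (the file names are illustrative; see the linked examples folder for the actual skeleton):

db_structure.py       # peewee Model definitions and create_table() calls
run_simulations.py    # imports the models, defines to_sample and func, calls do_simulation_study
explore_results.py    # loads the results into pandas for tables and plots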