1. Getting started¶
One important aspect of proposing new machine learning/statistical estimators and methods is the performance testing phase. With that in mind, we present here a short introduction to the sstudy package.
If you use this software, please cite it as:
@misc{2004.14479,
Author = {Marco H A Inácio},
Title = {Simulation studies on Python using sstudy package with SQL databases as storage},
Year = {2020},
Eprint = {arXiv:2004.14479},
}
We start by installing the package:
[1]:
!pip install sstudy
Requirement already satisfied: sstudy in /home/marco/Documents/projects/sstudy (0.0.5)
Requirement already satisfied: peewee in /home/marco/miniforge3/lib/python3.7/site-packages (from sstudy) (3.10.0)
Let us first define the structure of our results table and create it:
[2]:
from peewee import *
import os
db = SqliteDatabase('results.sqlite3')
class Result(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result.create_table()
Now, let’s run the simulations (which will be stored in results.sqlite3):
[3]:
import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study
no_simulations = 5
to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
)

def func(
    data_distribution,
    no_instances,
    method,
):
    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )
do_simulation_study(to_sample, func, db, Result, max_count=no_simulations)
8 combinations left
Result:
{'score': 0.8575248931182969, 'elapsed_time': 0.0013320446014404297}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8905348838183208, 'elapsed_time': 0.008800506591796875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7945695968134672, 'elapsed_time': 0.009409189224243164}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7287316399844865, 'elapsed_time': 0.0032854080200195312}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.2564334250136251, 'elapsed_time': 0.016244173049926758}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8975782712376218, 'elapsed_time': 0.0028858184814453125}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5373575169302656, 'elapsed_time': 0.008737564086914062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8178943078163557, 'elapsed_time': 0.0012826919555664062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9144324333131287, 'elapsed_time': 0.0013051033020019531}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8218113034053461, 'elapsed_time': 0.0015320777893066406}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.769481073379765, 'elapsed_time': 0.00133514404296875}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.6275486131272321, 'elapsed_time': 0.0014030933380126953}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8691754007727437, 'elapsed_time': 0.0018489360809326172}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8410425486682132, 'elapsed_time': 0.002569913864135742}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8451526407461323, 'elapsed_time': 0.004086971282958984}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.542794259807144, 'elapsed_time': 0.0062961578369140625}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8173641561035503, 'elapsed_time': 0.0030364990234375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8189176038766482, 'elapsed_time': 0.0016186237335205078}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8768221999784449, 'elapsed_time': 0.002727031707763672}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8425073925649836, 'elapsed_time': 0.0018978118896484375}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7132189468128055, 'elapsed_time': 0.001692056655883789}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.9266474169971257, 'elapsed_time': 0.0032205581665039062}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.8522948743811272, 'elapsed_time': 0.0015649795532226562}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.5756438633125094, 'elapsed_time': 0.014742612838745117}
Result successfully stored in the database
8 combinations left
Result:
{'score': 0.7313379763922424, 'elapsed_time': 0.002069234848022461}
Result successfully stored in the database
8 combinations left
7 combinations left
Result:
{'score': 0.8442770360636538, 'elapsed_time': 0.001718282699584961}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.6811604644181054, 'elapsed_time': 0.002541780471801758}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8743938809791751, 'elapsed_time': 0.002440214157104492}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.8477776526527044, 'elapsed_time': 0.0017156600952148438}
Result successfully stored in the database
7 combinations left
Result:
{'score': 0.930162530219549, 'elapsed_time': 0.0019092559814453125}
Result successfully stored in the database
7 combinations left
6 combinations left
Result:
{'score': 0.6561026610417788, 'elapsed_time': 0.00104522705078125}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.6303432542515472, 'elapsed_time': 0.0014503002166748047}
Result successfully stored in the database
6 combinations left
Result:
{'score': 0.7963865338492976, 'elapsed_time': 0.022031784057617188}
Result successfully stored in the database
6 combinations left
5 combinations left
Result:
{'score': 0.5984406975347196, 'elapsed_time': 0.0017325878143310547}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.6946891656361566, 'elapsed_time': 0.0030274391174316406}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.8483846489543649, 'elapsed_time': 0.002885580062866211}
Result successfully stored in the database
5 combinations left
Result:
{'score': 0.36985368772169913, 'elapsed_time': 0.002604961395263672}
Result successfully stored in the database
5 combinations left
4 combinations left
3 combinations left
Result:
{'score': 0.8868214247864545, 'elapsed_time': 0.0029604434967041016}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.572918197397766, 'elapsed_time': 0.003069639205932617}
Result successfully stored in the database
3 combinations left
Result:
{'score': 0.8566470616687658, 'elapsed_time': 0.0018100738525390625}
Result successfully stored in the database
3 combinations left
2 combinations left
1 combinations left
The good news is that SQLite works through atomic transactions, so a commit (i.e.: adding a result to the database) either happens entirely or does not happen at all.
Therefore, you can kill the simulation study process without fear of corrupting the database, even if you happen to kill it while it is committing results.
sstudy chooses the test cases randomly and independently, so you can spawn multiple simulation study processes that work independently and can be terminated at any moment.
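As a minimal sketch of this (reusing the to_sample, func, db and Result objects defined above; the number of workers is arbitrary), several such processes can also be spawned from a single script:

from multiprocessing import Process

def run_worker():
    # Each worker draws its own configurations at random and stores its
    # results in the shared database; closing the inherited connection
    # lets each process open its own.
    db.close()
    do_simulation_study(to_sample, func, db, Result, max_count=no_simulations)

if __name__ == '__main__':
    workers = [Process(target=run_worker) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

Equivalently, you can simply start the same simulation script from several terminals (or several machines, see the Postgresql section below) pointing to the same database.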
[4]:
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
df = pd.DataFrame(list(Result.select().dicts()))
del(df['id'])
df.groupby(['data_distribution', 'no_instances', 'method']).mean()
[4]:
| data_distribution | no_instances | method | score | elapsed_time |
|---|---|---|---|---|
| complete | 100 | lasso | 0.803092 | 0.003834 |
| complete | 100 | ols | 0.863425 | 0.002493 |
| complete | 1000 | lasso | 0.856177 | 0.003833 |
| complete | 1000 | ols | 0.846198 | 0.001619 |
| sparse | 100 | lasso | 0.608317 | 0.002824 |
| sparse | 100 | ols | 0.742987 | 0.007308 |
| sparse | 1000 | lasso | 0.746540 | 0.001939 |
| sparse | 1000 | ols | 0.584299 | 0.007723 |
[5]:
def mpse(data):
    mean = data.mean()
    std_error = np.std(data) / np.sqrt(len(data))
    return "{0:.3f} ({1:.3f})".format(mean, std_error)

df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[5]:
| data_distribution | no_instances | method | score | elapsed_time |
|---|---|---|---|---|
| complete | 100 | lasso | 0.803 (0.035) | 0.004 (0.001) |
| complete | 100 | ols | 0.863 (0.016) | 0.002 (0.000) |
| complete | 1000 | lasso | 0.856 (0.011) | 0.004 (0.001) |
| complete | 1000 | ols | 0.846 (0.024) | 0.002 (0.000) |
| sparse | 100 | lasso | 0.608 (0.073) | 0.003 (0.001) |
| sparse | 100 | ols | 0.743 (0.054) | 0.007 (0.004) |
| sparse | 1000 | lasso | 0.747 (0.049) | 0.002 (0.000) |
| sparse | 1000 | ols | 0.584 (0.084) | 0.008 (0.003) |
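Since matplotlib was imported above, here is a small illustrative sketch (not part of the study itself; the output file name is arbitrary) of how the scores per configuration could be visualized and saved to a PDF:

fig, ax = plt.subplots(figsize=(8, 4))
# One boxplot of the scores for each (data_distribution, method) pair
df.boxplot(column='score', by=['data_distribution', 'method'], ax=ax, rot=45)
ax.set_ylabel('score')
fig.suptitle('')  # drop the automatic pandas title
fig.tight_layout()
with PdfPages('scores.pdf') as pdf:
    pdf.savefig(fig)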
2. Sample filter¶
Suppose that we are not interested in testing some configurations of our simulation study. For instance, if we are not interested in testing method lasso with 10000 instances, we can use a sample_filter function:
[6]:
%%capture
# Note: the %%capture line is only here to suppress output in Jupyter notebooks.
# You can remove it in your application.
to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000, 10000],
    method = ['ols', 'lasso'],
)

def sample_filter(
    data_distribution,
    no_instances,
    method,
):
    if method == 'lasso' and no_instances == 10000:
        return False
    return True

do_simulation_study(to_sample, func, db, Result,
                    max_count=no_simulations,
                    sample_filter=sample_filter)
df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]
Suppose now that for some configuration(s) we want to increase or decrease the number of simulations to be performed. For instance, for method ols, we want to run 50 simulations for each configuration:
[7]:
%%capture
def sample_filter(
    data_distribution,
    no_instances,
    method,
):
    if method == 'lasso' and no_instances == 1000:
        return False
    if method == 'ols':
        return 50
    return True

do_simulation_study(to_sample, func, db, Result,
                    max_count=no_simulations,
                    sample_filter=sample_filter)
df = pd.DataFrame(list(Result.select().dicts()))
df.groupby(['data_distribution', 'no_instances', 'method']).count().iloc[:,0]
3. Deleting or updating results¶
Suppose you committed a programming mistake while coding distribution 1. Here's how to delete the results related to it while preserving the results for the other distributions:
[8]:
query = Result.delete().where(Result.data_distribution==1)
query.execute()
[8]:
0
Note that this call returns the number of affected (i.e.: deleted) rows (i.e.: simulations) of the database.
After that, you could fix your code for distribution 1 and run do_simulation_study again to generate new results for it.
Updating works similarly. For instance, let's change the rows with data_distribution 0 to 3.
[9]:
query = Result.update(data_distribution=3).where(Result.data_distribution==0)
query.execute()
[9]:
0
See more possibilities in peewee's documentation: http://docs.peewee-orm.com/en/latest/peewee/querying.html
You can also browse, update and delete your SQLite database using tools such as "DB Browser for SQLite".
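For illustration, a couple of standard peewee queries one could use to inspect the results programmatically (these are plain peewee calls, not sstudy-specific):

# Number of stored lasso simulations
n_lasso = Result.select().where(Result.method == 'lasso').count()

# Row with the highest score
best = Result.select().order_by(Result.score.desc()).first()
print(n_lasso, best.score)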
4. Postgresql database¶
You can also use Postgresql (or MySQL or CockroachDB, as they are supported by the peewee package) by installing the Python Postgresql driver: the psycopg2 package. The greatest advantage of using a managed database server is the ability to easily run sstudy on many machines at the same time, sharing the workload of the simulations.
As the database hosting server, you can install a free server on your local computer or use a third-party one such as ElephantSQL, Amazon AWS or Google Cloud, and change the db configuration:
from peewee import *
import os
pgdb = 'database_name'
pguser = 'username'
pgpassword = 'password'
pghost = 'host_address'
db = PostgresqlDatabase(pgdb, user=pguser, password=pgpassword, host=pghost)
Ideally though, you should not hardcode your credentials; they should instead be passed as environment variables:
from peewee import *
import os
try:
    pgdb = os.environ['pgdb']
    pguser = os.environ['pguser']
    pgpass = os.environ['pgpass']
    pghost = os.environ['pghost']
    pgport = os.environ['pgport']
    db = PostgresqlDatabase(pgdb, user=pguser, password=pgpass,
                            host=pghost, port=pgport)
except KeyError:
    db = SqliteDatabase('results.sqlite3')
e.g.: run
pgdb='database_name' pguser='username' pgpass='password' pghost='host_address' pgport='host_port' ipython your_script.py
5. Remote SQLite access¶
An alternative to using a remote database is to use multiple SQLite databases and merge them for analysis.
Suppose you want to merge results.sqlite3 from the remote host 192.168.1.100, to which you have SSH access. You could then use the following set of commands on Linux:
scp 192.168.1.100:path_to_remote_database/results.sqlite3 db2.sqlite3
cp results.sqlite3 combined.sqlite3
sqlite3 combined.sqlite3 "BEGIN; ATTACH DATABASE 'db2.sqlite3' AS toMerge; insert into result (data_distribution, no_instances, method, score, elapsed_time) select data_distribution, no_instances, method, score, elapsed_time from toMerge.result; COMMIT; detach toMerge;"
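Alternatively, assuming the remote file has already been copied to db2.sqlite3 as above, a minimal sketch of merging the two result tables in memory with pandas instead of at the SQL level:

import sqlite3
import pandas as pd

frames = []
for path in ['results.sqlite3', 'db2.sqlite3']:
    conn = sqlite3.connect(path)
    # 'result' is the table name peewee derives from the Result model
    frames.append(pd.read_sql_query('SELECT * FROM result', conn))
    conn.close()
df_all = pd.concat(frames, ignore_index=True)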
The disadvantage of the remote SQLite approach over Postgresql is that sstudy will not be able to track the progress of the study globally and allocate new simulations accordingly (e.g.: one node might finish all its scheduled simulations while others still have many more simulations to do).
Another possibility, which does not have this shortcoming, is to mount the remote server on a local folder using the Linux tool sshfs and, from there, access the SQLite database file.
6. Storage of binary data¶
Storage of binary data (e.g.: lists, numpy arrays, etc) is also supported using a BlobField:
long_data = BlobField()
Once the data is requested to be stored, sstudy will automatically run pickle.dumps (unless the data is already of a binary type). You can then reload your data later using pickle.loads.
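As a minimal sketch (the model name ResultArray and its fields are illustrative, not part of the examples above), storing a numpy array and reading it back could look like:

import pickle
import numpy as np
from peewee import Model, TextField, BlobField

class ResultArray(Model):  # hypothetical model, for illustration only
    method = TextField()
    long_data = BlobField()

    class Meta:
        database = db

ResultArray.create_table()

# When the value comes from func's return dict, sstudy runs pickle.dumps for you;
# here we store it manually just to show the round trip.
ResultArray.create(method='ols', long_data=pickle.dumps(np.arange(10)))

row = ResultArray.select().first()
arr = pickle.loads(row.long_data)  # back to a numpy array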
7. Real data¶
It's also possible to use the package to organize your results on real datasets, as in the example below:
[10]:
from peewee import *
import os
db = SqliteDatabase('results.sqlite3')
class Result2(Model):
    # Data settings
    dataset = TextField()
    method = TextField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db

Result2.create_table()
[11]:
%%capture
import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study
from sklearn import datasets
no_simulations = 10
to_sample = dict(
    dataset = ["boston", "diabetes"],
    method = ['ols', 'lasso'],
)

def func(
    dataset,
    method,
):
    if dataset == 'diabetes':
        rdata = datasets.load_diabetes()
    elif dataset == 'boston':
        rdata = datasets.load_boston()
    else:
        raise ValueError

    x = rdata["data"]
    y = rdata["target"]
    no_instances = round(len(y)*.9)

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )
do_simulation_study(to_sample, func, db, Result2, max_count=1)
[12]:
df2 = pd.DataFrame(list(Result2.select().dicts()))
df2.sort_values(list(df2.columns))
[12]:
| | id | dataset | method | score | elapsed_time |
|---|---|---|---|---|---|
| 0 | 1 | boston | ols | 0.685685 | 0.001097 |
| 1 | 2 | diabetes | lasso | 0.670936 | 0.001186 |
| 2 | 3 | diabetes | ols | 0.685685 | 0.001005 |
| 3 | 4 | boston | lasso | 0.670936 | 0.000773 |
8. Deterministic results¶
If it's important to have deterministic results in the simulation study, one possibility is to set the random seed as a variable of the experiment, as in the example below.
In this case, it's useful and recommended to mark the set of parameters (data_distribution, method, no_instances, random_seed) with a unique constraint in the database, so that the database system itself enforces their uniqueness.
See more about constraints at http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-constraints
[13]:
from peewee import *
import os
db = SqliteDatabase('results.sqlite3')
class Result3(Model):
    # Data settings
    data_distribution = TextField()
    method = TextField()
    no_instances = IntegerField()
    random_seed = IntegerField()

    # Results
    score = DoubleField()
    elapsed_time = DoubleField()

    class Meta:
        database = db
        indexes = (
            (('data_distribution', 'method', 'no_instances', 'random_seed'), True),
        )

Result3.create_table()
[14]:
%%capture
import numpy as np
import time
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso
from sstudy import do_simulation_study
to_sample = dict(
    data_distribution = ["complete", "sparse"],
    no_instances = [100, 1000],
    method = ['ols', 'lasso'],
    random_seed = range(30),
)

def func(
    data_distribution,
    no_instances,
    method,
    random_seed,
):
    np.random.seed(random_seed)
    x = stats.norm.rvs(0, 2, size=(no_instances + 10000, 10))
    beta = stats.norm.rvs(0, 2, size=(10, 1))
    eps = stats.norm.rvs(0, 5, size=(no_instances + 10000, 1))
    if data_distribution == "complete":
        y = np.matmul(x, beta) + eps
    elif data_distribution == "sparse":
        y = np.matmul(x[:,:5], beta[:5]) + eps
    else:
        raise ValueError

    y_train = y[:no_instances]
    y_test = y[no_instances:]
    x_train = x[:no_instances]
    x_test = x[no_instances:]

    start_time = time.time()
    if method == 'ols':
        reg = LinearRegression()
    elif method == 'lasso':
        reg = Lasso(alpha=0.1)
    reg.fit(x_train, y_train)
    score = reg.score(x_test, y_test)
    elapsed_time = time.time() - start_time

    return dict(
        score = score,
        elapsed_time = elapsed_time,
    )
do_simulation_study(to_sample, func, db, Result3, max_count=1)
[15]:
%%capture
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')
Let us now delete the results, run the simulation again, and show that the results do not change.
[16]:
%%capture
Result3.delete().execute()
do_simulation_study(to_sample, func, db, Result3, max_count=1)
[17]:
df3 = pd.DataFrame(list(Result3.select().dicts()))
df3.sort_values('score')
[17]:
| | id | data_distribution | method | no_instances | random_seed | score | elapsed_time |
|---|---|---|---|---|---|---|---|
| 15 | 16 | sparse | ols | 100 | 25 | 0.161505 | 0.001216 |
| 43 | 44 | sparse | lasso | 100 | 25 | 0.166967 | 0.001260 |
| 59 | 60 | sparse | ols | 100 | 12 | 0.359669 | 0.001216 |
| 213 | 214 | sparse | lasso | 100 | 12 | 0.368209 | 0.001302 |
| 128 | 129 | sparse | ols | 100 | 6 | 0.371189 | 0.001080 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 58 | 59 | complete | lasso | 1000 | 17 | 0.920788 | 0.002146 |
| 109 | 110 | complete | ols | 1000 | 9 | 0.922379 | 0.001233 |
| 158 | 159 | complete | lasso | 1000 | 9 | 0.922590 | 0.001191 |
| 155 | 156 | complete | lasso | 1000 | 22 | 0.947963 | 0.001266 |
| 198 | 199 | complete | ols | 1000 | 22 | 0.948076 | 0.002316 |
240 rows × 7 columns
9. Miscellaneous hints¶
[18]:
gdf = df.groupby(['data_distribution', 'no_instances', 'method']).agg(mpse)
[19]:
# You can export your table to LaTeX using the pandas `to_latex` method:
print(gdf.to_latex())
# Bonus hint: to enable the multirow layout (see the multirow LaTeX package),
# try print(gdf.to_latex(multirow=True))
\begin{tabular}{llllll}
\toprule
& & & id & score & elapsed\_time \\
data\_distribution & no\_instances & method & & & \\
\midrule
complete & 100 & lasso & 14.600 (4.065) & 0.803 (0.035) & 0.004 (0.001) \\
& & ols & 177.780 (13.974) & 0.838 (0.008) & 0.001 (0.000) \\
& 1000 & lasso & 20.600 (5.174) & 0.856 (0.011) & 0.004 (0.001) \\
& & ols & 155.320 (10.350) & 0.834 (0.010) & 0.001 (0.000) \\
& 10000 & lasso & 99.800 (7.623) & 0.842 (0.016) & 0.002 (0.000) \\
& & ols & 193.620 (13.599) & 0.825 (0.012) & 0.003 (0.000) \\
sparse & 100 & lasso & 29.200 (3.242) & 0.608 (0.073) & 0.003 (0.001) \\
& & ols & 164.900 (12.723) & 0.660 (0.022) & 0.002 (0.000) \\
& 1000 & lasso & 29.400 (4.540) & 0.747 (0.049) & 0.002 (0.000) \\
& & ols & 195.120 (13.249) & 0.702 (0.022) & 0.002 (0.000) \\
& 10000 & lasso & 71.400 (5.001) & 0.688 (0.040) & 0.002 (0.000) \\
& & ols & 179.060 (11.890) & 0.695 (0.021) & 0.003 (0.000) \\
\bottomrule
\end{tabular}
[20]:
# Add number of simulations as a column
to_group = ['data_distribution', 'no_instances', 'method']
count = df.groupby(to_group).count().iloc[:,-1]
gdf['no simulations'] = count
gdf
[20]:
| data_distribution | no_instances | method | id | score | elapsed_time | no simulations |
|---|---|---|---|---|---|---|
| complete | 100 | lasso | 14.600 (4.065) | 0.803 (0.035) | 0.004 (0.001) | 5 |
| complete | 100 | ols | 177.780 (13.974) | 0.838 (0.008) | 0.001 (0.000) | 50 |
| complete | 1000 | lasso | 20.600 (5.174) | 0.856 (0.011) | 0.004 (0.001) | 5 |
| complete | 1000 | ols | 155.320 (10.350) | 0.834 (0.010) | 0.001 (0.000) | 50 |
| complete | 10000 | lasso | 99.800 (7.623) | 0.842 (0.016) | 0.002 (0.000) | 5 |
| complete | 10000 | ols | 193.620 (13.599) | 0.825 (0.012) | 0.003 (0.000) | 50 |
| sparse | 100 | lasso | 29.200 (3.242) | 0.608 (0.073) | 0.003 (0.001) | 5 |
| sparse | 100 | ols | 164.900 (12.723) | 0.660 (0.022) | 0.002 (0.000) | 50 |
| sparse | 1000 | lasso | 29.400 (4.540) | 0.747 (0.049) | 0.002 (0.000) | 5 |
| sparse | 1000 | ols | 195.120 (13.249) | 0.702 (0.022) | 0.002 (0.000) | 50 |
| sparse | 10000 | lasso | 71.400 (5.001) | 0.688 (0.040) | 0.002 (0.000) | 5 |
| sparse | 10000 | ols | 179.060 (11.890) | 0.695 (0.021) | 0.003 (0.000) | 50 |
[21]:
# Add the standard deviation
# and the standard error of the standard deviation
def stdpse(data):
    size = len(data)
    mu = data.mean()
    var = data.var()
    std = np.sqrt(var)
    mu4 = ((data - mu)**4).mean()
    se_of_std = mu4 - (size-3)/(size-1)*var**2
    se_of_std = np.sqrt(se_of_std/size) / 2 / std
    return "{0:.3f} ({1:.3f})".format(std, se_of_std)

score_std_se = df.groupby(to_group).agg(stdpse).score
gdf['score std'] = score_std_se
gdf
[21]:
| data_distribution | no_instances | method | id | score | elapsed_time | no simulations | score std |
|---|---|---|---|---|---|---|---|
| complete | 100 | lasso | 14.600 (4.065) | 0.803 (0.035) | 0.004 (0.001) | 5 | 0.089 (0.015) |
| complete | 100 | ols | 177.780 (13.974) | 0.838 (0.008) | 0.001 (0.000) | 50 | 0.054 (0.007) |
| complete | 1000 | lasso | 20.600 (5.174) | 0.856 (0.011) | 0.004 (0.001) | 5 | 0.028 (0.005) |
| complete | 1000 | ols | 155.320 (10.350) | 0.834 (0.010) | 0.001 (0.000) | 50 | 0.074 (0.008) |
| complete | 10000 | lasso | 99.800 (7.623) | 0.842 (0.016) | 0.002 (0.000) | 5 | 0.041 (0.006) |
| complete | 10000 | ols | 193.620 (13.599) | 0.825 (0.012) | 0.003 (0.000) | 50 | 0.089 (0.012) |
| sparse | 100 | lasso | 29.200 (3.242) | 0.608 (0.073) | 0.003 (0.001) | 5 | 0.183 (0.040) |
| sparse | 100 | ols | 164.900 (12.723) | 0.660 (0.022) | 0.002 (0.000) | 50 | 0.159 (0.025) |
| sparse | 1000 | lasso | 29.400 (4.540) | 0.747 (0.049) | 0.002 (0.000) | 5 | 0.122 (0.016) |
| sparse | 1000 | ols | 195.120 (13.249) | 0.702 (0.022) | 0.002 (0.000) | 50 | 0.158 (0.021) |
| sparse | 10000 | lasso | 71.400 (5.001) | 0.688 (0.040) | 0.002 (0.000) | 5 | 0.101 (0.013) |
| sparse | 10000 | ols | 179.060 (11.890) | 0.695 (0.021) | 0.003 (0.000) | 50 | 0.148 (0.020) |
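For reference, the standard error of the standard deviation computed by stdpse above can be justified (a sketch of the reasoning, not part of the package) by the classical formula for the variance of the unbiased sample variance together with the delta method:

\operatorname{Var}(s^2) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right), \qquad \operatorname{se}(s) \approx \frac{\operatorname{se}(s^2)}{2\,s},

where \mu_4 is the fourth central moment, \sigma^2 the variance and s the sample standard deviation; the code plugs in the corresponding sample estimates.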
[22]:
# Renaming grouped pandas datasets
gdf = gdf.reset_index()
gdf.loc[gdf.method=="ols", "method"] = 'linear'
gdf.index = pd.MultiIndex.from_frame(gdf[to_group])
gdf.drop(to_group, axis=1, inplace=True)
gdf
[22]:
| data_distribution | no_instances | method | id | score | elapsed_time | no simulations | score std |
|---|---|---|---|---|---|---|---|
| complete | 100 | lasso | 14.600 (4.065) | 0.803 (0.035) | 0.004 (0.001) | 5 | 0.089 (0.015) |
| complete | 100 | linear | 177.780 (13.974) | 0.838 (0.008) | 0.001 (0.000) | 50 | 0.054 (0.007) |
| complete | 1000 | lasso | 20.600 (5.174) | 0.856 (0.011) | 0.004 (0.001) | 5 | 0.028 (0.005) |
| complete | 1000 | linear | 155.320 (10.350) | 0.834 (0.010) | 0.001 (0.000) | 50 | 0.074 (0.008) |
| complete | 10000 | lasso | 99.800 (7.623) | 0.842 (0.016) | 0.002 (0.000) | 5 | 0.041 (0.006) |
| complete | 10000 | linear | 193.620 (13.599) | 0.825 (0.012) | 0.003 (0.000) | 50 | 0.089 (0.012) |
| sparse | 100 | lasso | 29.200 (3.242) | 0.608 (0.073) | 0.003 (0.001) | 5 | 0.183 (0.040) |
| sparse | 100 | linear | 164.900 (12.723) | 0.660 (0.022) | 0.002 (0.000) | 50 | 0.159 (0.025) |
| sparse | 1000 | lasso | 29.400 (4.540) | 0.747 (0.049) | 0.002 (0.000) | 5 | 0.122 (0.016) |
| sparse | 1000 | linear | 195.120 (13.249) | 0.702 (0.022) | 0.002 (0.000) | 50 | 0.158 (0.021) |
| sparse | 10000 | lasso | 71.400 (5.001) | 0.688 (0.040) | 0.002 (0.000) | 5 | 0.101 (0.013) |
| sparse | 10000 | linear | 179.060 (11.890) | 0.695 (0.021) | 0.003 (0.000) | 50 | 0.148 (0.020) |
10. Multiple files management¶
In the examples so far, we have managed everything in a single file, but our recommended design is to have each setup separated into 3 files: one for the database structure, another for running the simulations, and another to explore/plot the results (see the sketch below). As a skeleton for this, we have an example folder distributed together with the package and also available at: https://gitlab.com/marcoinacio/sstudy/-/tree/master/examples.
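A minimal sketch of such a layout (file and module names here are illustrative, not necessarily the ones used in the examples folder):

# --- db_structure.py: database and result table definition ---
from peewee import SqliteDatabase, Model, TextField, DoubleField

db = SqliteDatabase('results.sqlite3')

class Result(Model):
    method = TextField()
    score = DoubleField()

    class Meta:
        database = db

Result.create_table()

# --- run_simulation.py: defines to_sample/func and runs the study ---
from sstudy import do_simulation_study
from db_structure import db, Result

to_sample = dict(method=['ols', 'lasso'])

def func(method):
    # compute and return the actual results here
    return dict(score=0.0)

do_simulation_study(to_sample, func, db, Result, max_count=10)

# --- explore_results.py: loads the results into pandas for tables/plots ---
import pandas as pd
from db_structure import Result

df = pd.DataFrame(list(Result.select().dicts()))
print(df.groupby('method').mean())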