Introduce dataframe to ExperimentData (step1) #1133
raise AttributeError(f"'ThreadSafeDataFrame' object has no attribute '{item}'")

def __json_encode__(self) -> Dict[str, Any]:
    return {
The entire dataframe is JSON serializable with the experiment encoder/decoder, so you can call the save API for the batch experiment results in one piece. This drastically improves the data saving performance once the experiment service updates its API.
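The round trip described here can be sketched with a dependency-free stand-in (the real class wraps a pandas DataFrame and uses the Qiskit Experiments JSON encoder/decoder; the class and method bodies below are illustrative assumptions, not the PR's actual implementation):

```python
import json


class AnalysisTable:
    """Minimal stand-in for a table that round-trips through JSON in one piece.

    Illustration only: the actual ThreadSafeDataFrame wraps a pandas DataFrame
    and is handled by the experiment encoder/decoder.
    """

    def __init__(self, records=None):
        # records: {index -> {column -> value}}
        self._records = dict(records or {})

    def add(self, index, **fields):
        self._records[index] = fields

    def __json_encode__(self):
        # Serialize the entire container as one dictionary so a batch of
        # results can be saved with a single call.
        return {"class": type(self).__name__, "data": self._records}

    @classmethod
    def __json_decode__(cls, value):
        return cls(value["data"])


table = AnalysisTable()
table.add("1abc2def", name="T1", value=0.0001)
payload = json.dumps(table.__json_encode__())
restored = AnalysisTable.__json_decode__(json.loads(payload))
```

Because the whole table is one JSON object, saving N results costs one serialization call instead of N.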
Force-pushed from 5824117 to c073705
Thanks for the nice work, here's my first round of comments. I have some general thoughts as well:
- I think it would be helpful to add a label for the short analysis ID index column, otherwise it may look like a random string to a new user.
- Should the display style for `CurveFitResults` be updated? Right now the newlines are showing as `\n` literals in the dataframe. If we keep the string representation the same, the default dataframe display style can be updated to render the newlines (but then we should truncate the output or the cell will be very long).
- This may be outside the scope of this PR, but when I run a BatchExperiment of ParallelExperiments of T1s, the results are difficult to distinguish from each other because the experiment IDs are all the same. What do you think about including an extra field that has the experiment ID of the innermost child experiment?
if results:
    for result in results:
        supplementary = result.extra
Right now, if the user sets one of these fields to a different value in the extra parameters, for example by adding `experiment=my_experiment`, this will override the default value. Should we implement a check to disallow this, since these should be reserved fields?
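Such a check could look like the following sketch. The field names and the warn-instead-of-raise choice are assumptions for illustration only (the maintainer's reply notes the override is intentional for composite analysis, so a hard error would be too strict):

```python
import warnings

# Hypothetical list of fields the table fills in automatically.
RESERVED_FIELDS = {"experiment", "experiment_id", "result_id", "created_time"}


def merge_extra(defaults, extra):
    """Merge user-supplied extra fields over defaults, warning (rather than
    raising) when a reserved field is overridden."""
    merged = dict(defaults)
    for key, value in extra.items():
        if key in RESERVED_FIELDS and key in merged:
            warnings.warn(f"Overriding reserved field '{key}'.", UserWarning)
        merged[key] = value  # override still takes effect, as in the PR
    return merged


row = merge_extra({"experiment": "T1"}, {"experiment": "my_experiment", "note": "x"})
```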
This is intentional; composite analysis requires this logic to update the entry's experiment name. Since we return `AnalysisResultData` from the child experiment data, it updates the experiment value through the extra field. Alternatively, we could completely remove child data generation in the composite analysis and generate child data on the fly when users access `ExperimentData.child_data`, for backward compatibility. This would drastically simplify the code stack, but it should be done in another PR because I don't want to mix multiple logical changes in a single PR.
    package_name="qiskit-experiments",
    pending=True,
    predicate=lambda dataframe: not dataframe,
)
def analysis_results(
    self,
    index: Optional[Union[int, slice, str]] = None,
Are we planning on keeping `index` in the future? The user can retrieve specific rows from the dataframe themselves, though the syntax to retrieve from the dataframe by name would be more complicated than using `index`, so maybe it's good to keep from the user's perspective.
Yeah, that's what I discussed with Chris. We concluded to keep `index`, just as syntactic sugar. We can probably deprecate it once users are familiar with handling the dataframe, e.g.

df = exp_data.analysis_results()
df[df.name == "T1"]

I don't think this is complicated, and we could remove hundreds of lines of code by removing `index`. Let's set up a user feedback session after 0.6.
Force-pushed from c073705 to a6a0eab
Thank you, @nkanazawa1989. This is the first step of great upgrades in experiment result data handling. The design doc is well written and helped me a lot to understand the background of this PR.

- Why don't we add a new method like `add_analysis_result` (without the `s`) and deprecate `add_analysis_results`, instead of upgrading `add_analysis_results`? (The new API seems to support only the addition of a single analysis result and has completely different arguments.)
- In my understanding, `result_id` is a unique key of the table. Do we really need to have `index` as yet another key? (Note: I'm talking about `index` in `pd.DataFrame`, not the `index` argument of the `analysis_results` method.)
This PR changes the data structure of analysis results from a list of dictionaries (JSON) to a flat table (pandas DataFrame). Table data should be easier to manipulate in most cases and should greatly increase the (re)usability of analysis result data. In exchange, information on the structure of experiments will be dropped from AnalysisResult, on the expectation that it is not usually used during analysis. That information is still reconstructable from other data stored in an ExperimentData (is that correct? @nkanazawa1989) if necessary.
I also added several minor comments inline.
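The list-of-dictionaries to flat-table change can be illustrated with a dependency-free sketch (field names are hypothetical; the real implementation builds a pandas DataFrame, but the flattening idea is the same — nested `extra` keys simply become additional columns):

```python
def flatten_results(results):
    """Flatten nested result dictionaries into uniform rows, one per result.

    Keys under "extra" are promoted to top-level columns; rows missing a
    column are padded with None so the table stays rectangular.
    """
    columns = []
    rows = []
    for res in results:
        flat = {k: v for k, v in res.items() if k != "extra"}
        flat.update(res.get("extra", {}))
        for key in flat:
            if key not in columns:  # preserve first-seen column order
                columns.append(key)
        rows.append(flat)
    return columns, [[row.get(c) for c in columns] for row in rows]


cols, data = flatten_results([
    {"name": "T1", "value": 1.2e-4, "extra": {"unit": "s"}},
    {"name": "T2", "value": 8.0e-5},
])
```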
# Convert Dataframe Series back into AnalysisResultData.
# This is due to the limitation that _run_analysis must return List[AnalysisResultData],
# and some composite analyses such as TphiAnalysis override this method to
# return extra quantities computed from sub analysis results.
# This produces unnecessary data conversion.
# The _run_analysis mechanism seems to just complicate the entire logic.
# Since it's impossible to deprecate the usage of this protected method,
# we should implement a new CompositeAnalysis class with much more efficient
# internal logic. Note that the child data structure is no longer necessary
# because the dataframe offers more efficient data filtering mechanisms.
analysis_table = sub_expdata.analysis_results(verbosity=3, dataframe=True)
for _, series in analysis_table.iterrows():
    data_dict = series.to_dict()
    primary_info = {
        "name": data_dict.pop("name"),
        "value": data_dict.pop("value"),
        "quality": data_dict.pop("quality"),
        "device_components": data_dict.pop("components"),
    }
    chisq = data_dict.pop("chisq", np.nan)
    if chisq:
        primary_info["chisq"] = chisq
    data_dict["experiment"] = sub_expdata.experiment_type
    if "experiment_id" in data_dict:
        # Use the experiment ID of the parent experiment data.
        # Sub experiment data is merged and discarded.
        del data_dict["experiment_id"]
    analysis_result = AnalysisResultData(**primary_info, extra=data_dict)
    analysis_results.append(analysis_result)
I'm expecting this is a temporary conversion that will be removed after the breaking API change of the return type of `_run_analysis` planned in one of the following PRs. (I know the change is not easy...)
These are removed with the overhaul in 6ebd975.
Force-pushed from 6e30147 to 6ebd975
Thanks @coruscating @itoko for your review. I updated the PR in response to your comments. Please read the following replies to your review comments.

The short UUID corresponds to the pandas dataframe index, and the index doesn't have a column name. We could update the stylesheet to show some name, but I feel this is overengineering. By default the index is an incremental number, but I like the current format, by analogy with the GitHub commit hash.

Yes, this object is very noisy in the table, but it will be moved to an artifact in the followup.

Good point. If I understand your request correctly, the user doesn't need to distinguish the entries themselves. However, if an experiment generates multiple results, there is no convenient label to group results from the same experiment instance together. Since this is sort of an edge case, I'll probably go with the current implementation and update it in a followup if there is any user feedback.

I was thinking of the same thing, but I couldn't come up with a reasonable method name. I think analysis_results vs. analysis_result is very confusing, and I can imagine a user always needing to check the code or docstring to choose the proper method. I think the current implementation with kwargs is much more intuitive.
Yes, the result ID is unique, but the full hexadecimal UUID is too long. I was thinking of using result_id as the dataframe row index, but this creates a very wide table, which may hurt the user experience (i.e. browsers are not friendly to horizontal scrolling).

Yes, this is what I intended. Because this would be a massive logic change in composite analysis on top of the introduction of the dataframe, I decided to do this cleanup in a followup PR.
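The short, commit-hash-style index discussed here can be sketched as follows (the 8-character truncation length and the function name are assumptions for illustration, not the PR's actual values):

```python
import uuid


def short_index(existing, length=8):
    """Generate a short row index by truncating a fresh UUID, retrying on
    the (unlikely) collision with indices already present in the table."""
    while True:
        candidate = uuid.uuid4().hex[:length]
        if candidate not in existing:
            return candidate


taken = set()
for _ in range(1000):
    taken.add(short_index(taken))
```

With 8 hex characters there are about 4.3 billion possible values, so collisions are negligible at the ~1K-entry table sizes discussed later in the thread, and the retry loop covers them anyway.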
Force-pushed from f5c2bab to ba431c8
Thanks for the nice update @nkanazawa1989. There seems to be a bug right now where analysis results are not saved to ResultsDB. I ran a simple T1 experiment:
exp=T1(physical_qubits=[0], delays=np.arange(1e-6, 2e-4, 3e-5))
exp.run(backend).block_for_results()
exp.save()
and the resulting analysis result could not be serialized:
{'data': (AnalysisResultData(result_id='bb822f9486344a03bd7e4d00a145766f', experiment_id='b4c968e8-ab44-4e48-823e-b31180a5f87a', result_type='@Parameters_T1Analysis', result_data={'_value': <qiskit_experiments.curve_analysis.curve_data.CurveFitResult object at 0x298618280>, '_chisq': nan, '_extra': {'unit': None}, '_source': {'class': 'qiskit_experiments.framework.analysis_result.AnalysisResult', 'data_version': 1, 'qiskit_version': '0.42.1'}, 'value': '(CurveFitResult)'}, device_components=[<Qubit(Q0)>], quality='bad', verified=False, tags=[], backend_name='ibm_nazca', creation_datetime=Timestamp('2023-07-17 19:49:08.242003+0000', tz='UTC'), updated_datetime=None, chisq=nan),)
This yields the error `Unexpected token N in JSON at position 1792`, but the save actually fails silently, in that the experiment saves without the analysis results and the user is not notified. I wonder, after the bug is fixed, if we should also change the default behavior so the user is aware of cases where the experiment did not fully save.
@property
def end_datetime(self) -> datetime:
    """Return the end datetime of this experiment data.

    The end datetime is the time the latest job data was
`end_datetime` now seems almost redundant with `running_time`. `running_time` is the timestamp of the successful job completion from the job side, while `end_datetime` sets the timestamp to when the successful job is returned, so the only difference seems to be networking and processing latency. Perhaps we should change the behavior of `end_datetime` to be when the experiment actually finishes all its processing?
Hmmm this is indeed misleading... I was thinking that running_time is the time that the first circuit is triggered on the backend. Do you know where this is publicly defined?
Ah, I misunderstood, my bad. It looks like currently `running_time` is the time the latest successful job started to run, and `end_datetime` is when the latest successful job is returned, is that right? Maybe `running_time` could be changed to the time the first job of the experiment started to run? But I think we can also keep it as-is.
Thanks. Done in f6e7294
    This may return the wrong datetime if server and client are in different timezones.
    """
    return utc_to_local(self._running_time)
According to @kt474, `job.time_per_step()` handles conversion and returns the timestamp in the user's local time, so I think we can keep all timestamps in the user's local time without the warning and conversion on our end.
Thanks for pointing this out. I just removed all redundant conversions and used tzlocal for the creation time of the analysis entry. Done in cc5cf41.
My comments are about using sets: I think you can attach a set to the class for uniqueness. I only reviewed the push that doesn't have the outdated label.
""" | ||
with self._lock: | ||
# Order sensitive | ||
new_columns = [c for c in new_columns if c not in self.get_columns()] |
If we convert the result of self.get_columns() from a list to a set, this will be faster. Also, using a set won't be a problem because column names are unique.
I agree a set is faster, but column order matters here -- it's not only about uniqueness. The table is designed so that important information comes on the left side of the pandas HTML table (and I also believe a user assumes keys are added to the table in the same order as the add method calls). Unfortunately, Python doesn't provide an ordered set in the builtins (and I don't want to add an extra dependency without a drastic performance gain), so I gave up on using a set here.
Here is the result of a very casual performance check on my laptop (typically the column count is at most 20, and the new keys added by a heavily customized analysis might be ~3):
Indeed the set operation is faster, but the difference is less than 1 µs. Since this line runs only when the analysis class generates an extra key, even when you run a batch of 10 experiments (a really rare case) the method cannot be called more than 10 times. So the difference a user may experience in a heavy setting would be a few microseconds.
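For reference, Python's insertion-ordered dicts (guaranteed since Python 3.7) give set-like O(1) membership while preserving order, without any extra dependency. A sketch of the column-merging step using this trick (function name is illustrative, not the code in the PR):

```python
def add_new_columns(current, new_columns):
    """Append only unseen column names while preserving insertion order.

    dict.fromkeys provides O(1) membership checks with guaranteed ordering,
    acting as a stdlib "ordered set".
    """
    merged = dict.fromkeys(current)
    merged.update(dict.fromkeys(new_columns))  # existing keys keep position
    return list(merged)


cols = add_new_columns(["name", "value"], ["value", "unit", "name", "chisq"])
```

This keeps the "important columns on the left" property: existing columns retain their positions and new ones are appended in first-seen order.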
I added a new test case for key order preservation in 2cda8dc.
    ValueError: When index is not in this table.
"""
with self._lock:
    if index not in self._container.index:
As mentioned in my first review, you can use a set here too, because indexes must be unique. I think you can attach a set to the class for unique indexes; in my opinion it would not take much space.
Indeed, this is a pandas `Index` object and this check is sufficiently performant; the typecast overhead is much more expensive.
Note that indexes are not only added but also deleted. Adding a new instance member to track the existing indices might make the code slightly faster, but it would add lines of code in multiple places and increase the maintenance overhead.
Thank you for the nice update @nkanazawa1989. Deferring logic changes in composite analysis to a followup PR makes sense to me. Also I'm fine with upgrading

Regarding

To me, current design of

What do you think? @nkanazawa1989 (I'm happy to discuss these points offline if you want.)
This change decouples AnalysisResult from ExperimentData. Since AnalysisResult not only defines the data model but also provides the API for the IBM experiment service, this coupling limits the capability of experiment data analysis. ExperimentData.analysis_results still returns AnalysisResult for backward compatibility, but these are not identical objects: the returned object is an AnalysisResult newly generated from the dataframe.
These test cases used MagicMock for convenience, rather than a mocked AnalysisResult. However, an added result entry is now internally converted into the dataframe format and the input object is discarded after the add_analysis_results call, so it's no longer possible to track method calls or evaluate the identity of the input object. Since users will never store a MagicMock in experiment data, this change to the unit tests should not introduce any edge cases and should not decrease the practical test coverage.
Co-authored-by: Helena Zhang <[email protected]>
…ization." This reverts commit 6ebd975.
…, and it copies the container into each process. Saving data in the copied container just discards the added data when the process finishes. For now, Qiskit Experiments doesn't use multiprocessing, so we don't need to worry about these tests immediately.
…o provide convenient access to the internal dataframe container so that ThreadSafeDataFrame becomes comparable with a pandas DataFrame (indeed, pandas implements lots of convenient methods to manipulate data). However, this is dangerous in a multithreaded environment, because the lock is released after the acquired item (which may be a method that manipulates data) is returned: the holder can still mutate the internal container asynchronously. To avoid this problem, __getattr__ is dropped and all necessary methods are implemented explicitly with a reentrant lock.
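The hazard described in this commit message, and its fix, can be shown with a minimal sketch: instead of delegating attribute access to the inner container (which hands back a bound method after the lock is already released), each operation holds a reentrant lock for its full duration. Class and method names here are illustrative, not the actual ThreadSafeDataFrame API:

```python
import threading


class ThreadSafeTable:
    """Every operation acquires a reentrant lock for its entire duration,
    so no bound method of the inner container ever escapes the lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._rows = {}

    def add_entry(self, index, **fields):
        with self._lock:
            self._rows[index] = fields

    def drop_entry(self, index):
        with self._lock:
            self._rows.pop(index, None)

    def __len__(self):
        with self._lock:
            return len(self._rows)


table = ThreadSafeTable()
threads = [
    threading.Thread(target=lambda i=i: table.add_entry(f"id{i}", value=i))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

By contrast, a `__getattr__` that returned `self._rows.pop` under the lock would release the lock before `pop` actually runs, so the mutation would happen unsynchronized.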
- result id must be str(uuid) including "-"
- np.nan must be replaced with None
…added to the extra field since the IBM Experiment Service doesn't have data fields for them.
Force-pushed from ba431c8 to 2cda8dc
…me with its experiment type, qubits, and experiment id.
@coruscating Thanks for testing save with the real service. The fix is here: c2cd090. This is not a serialization issue. The code generated the analysis result id with

It would be great if you could also check 665ff8f. This is related to figure name generation.

@itoko We cannot create a dataframe without an index, and we cannot remove the index column without hacking the HTML representation of the dataframe. Adding the short id as a new column just consumes extra horizontal space in the web browser, which doesn't make sense. One thing we could do would be creating the index with

Here is the result of a very casual performance test. Time consumption in the unique index generation is just 1.75%, so this logic is not critical to the entire performance.

exp_data = ExperimentData()

def add_100_entries():
    for i in range(100):
        exp_data.add_analysis_results(name=f"data_{i}", value=i)

Regarding the base class design, I plan to use this dataframe for curve data. Currently the curve data is represented by an unwieldy CurveData object. This takes arbitrary circuit metadata in addition to the fixed columns of xdata, ydata, yerr, etc., so the current base class implementation also fits this case.
…ary id as long as they don't use the save feature. The validation error is replaced with a user warning so as not to stop this workflow.
LGTM, thanks!
I thought it was sufficient to tweak

FYI: During my profiling, I encountered a weird performance regression in pandas DataFrame when adding rows that include a None entry in a column. Script I used:
Good to know you designed it with two concrete subclasses; now I'm fine with having it. I see the doc says
Thank you @itoko san! <1K entries is indeed a target of the current implementation, i.e. running a parallel experiment on a few hundred qubits and batching them (e.g. combining T1 and T2) would generate ~1K entries. In the future I'm also considering switching to Polars. Just FYI:

Main branch (I cannot run 3000 entries, only 500):

import time
from qiskit_experiments.framework import ExperimentData
from qiskit_experiments.framework import AnalysisResult

def add_n_rows(nrows, none_quality=True):
    exp_data = ExperimentData()
    data_list = []
    for i in range(nrows):
        data = AnalysisResult(
            name=f"data_{i}",
            value=i,
            quality=None if none_quality else i,
            device_components=["Q0"],
            experiment_id="",
            result_id="",
            tags="",
        )
        data_list.append(data)
    exp_data.add_analysis_results(data_list)

%timeit add_n_rows(500, none_quality=True)
# 25.5 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this branch with your code:

%timeit add_n_rows(3000, none_quality=True)
# 8.83 s ± 391 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is already a drastic performance improvement (more than expected).
LGTM, thank you! 🚀
### Summary

Follow-up of #1133 for data frame.

### Details and comments

I realized I had made a mistake when I wrote the base class of `ThreadSafeDataFrame`. We plan to move the curve data (XY data) to the artifact for better reusability of the data points. In addition, the [CurveData](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c66034c90dad73d705af25be7e9ed9617e7eb2ef/qiskit_experiments/curve_analysis/curve_data.py#L88-L113) object, which is a container for the XY data, will also be replaced with the data frame in a follow-up PR. Since this table should contain predefined columns such as `xval`, `yval`, `yerr`, I was assuming this container could be a subclass of `ThreadSafeDataFrame` -- but this was not a right assumption. The curve analysis always runs on a single thread (analysis callbacks might run on multiple threads, and thus `AnalysisResultTable` must be thread-safe), so this curve data frame doesn't need to be a thread-safe object. In this PR, the functionality to define default columns of the table is reimplemented as a Python mixin class so that it can also be used for the curve data frame without introducing unnecessary thread-safe mechanisms. `AnalysisResultTable` is a thread-safe container, but the functionality to manage default columns is delegated to the mixin.
### Summary

Executing the approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. Introduction of the `Artifact` will be done in a followup. This PR introduces the dataframe to `ExperimentData` and replaces the conventional [AnalysisResult](https://github.com/Qiskit/qiskit-experiments/blob/main/qiskit_experiments/framework/analysis_result.py) with it. With this PR, experimentalists can visualize all analysis results in a single HTML table with `dataframe=True`. ![image](https://user-images.githubusercontent.com/39517270/230959307-d5a4d471-1659-4cf4-974e-cdb21e625ad8.png) One can control the columns shown in the HTML table with the `columns` option.

### Details and comments

Experiment analysis is a local operation on the client side, and data generated there must be agnostic to the payload format of the remote service. In the current implementation, `AnalysisResult` appears in `BaseAnalysis` and the data generated by the analysis is implicitly typecast to this class. Note that `AnalysisResult` *not only* defines the canonical data format *but also* implements the API for the IBM Experiment service. This tight coupling to the IBM Experiment service may limit data expressibility, and an analysis class author always needs to be aware of the data structure of the remote service. In this PR, the internal data structure is replaced with a flat dataframe, and one can add arbitrary key-value pairs to an analysis result without worrying about the data model. An added key creates a new column in the table, which is internally formatted and absorbed by the extra field of the payload later when the results are saved. This is implemented by `qiskit_experiments.framework.experiment_data._series_to_service_result`. Bypassing the generation of `AnalysisResult` objects also drastically improves the performance of heavily parallelized analyses that generate huge amounts of data.

---------

Co-authored-by: Helena Zhang <[email protected]>
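The `_series_to_service_result` idea described above — reserved columns map to service fields, while any user-added columns are absorbed into the extra field at save time — can be sketched with plain dicts. The field list and function name below are assumptions for illustration, not the actual service schema:

```python
# Hypothetical set of fields the service payload knows about; anything
# else found in a table row is absorbed into "extra" at save time.
SERVICE_FIELDS = ("name", "value", "quality", "components", "experiment_id")


def row_to_service_payload(row):
    """Format one flat table row into a service-style payload (sketch)."""
    row = dict(row)  # avoid mutating the caller's row
    payload = {key: row.pop(key) for key in SERVICE_FIELDS if key in row}
    payload["extra"] = row  # user-added columns land here
    return payload


payload = row_to_service_payload(
    {"name": "T1", "value": 1.0e-4, "quality": "good", "unit": "s"}
)
```

This is why adding a new column to the table never breaks saving: unknown keys are never sent as top-level service fields.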
### Summary

#1133 introduced a bug in `MitigatedTomographyAnalysis` that accidentally drops extra analysis metadata. This PR fixes the bug.

### Details and comments

In the new data storage implementation with the dataframe, `ExperimentData.analysis_results` returns a copy of the protected dataframe, and mutating the returned object doesn't update the source. This PR introduces a new option to `TomographyAnalysis` to inject extra metadata. <img width="993" alt="screenshot" src="https://github.com/Qiskit-Extensions/qiskit-experiments/assets/39517270/d8674af1-78ad-4870-bfea-d441f9dfd1e8"> (edit) Note that a reno is not necessary because 1133 is not released yet.