Introduce dataframe to ExperimentData (step1) #1133
raise AttributeError(f"'ThreadSafeDataFrame' object has no attribute '{item}'")

def __json_encode__(self) -> Dict[str, Any]:
    return {
The entire dataframe is JSON serializable with the experiment encoder/decoder, so you can call the save API for the batch experiment results in one piece. This drastically improves the data saving performance once the experiment service updates its API.
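The round trip described here can be sketched with a dependency-free stand-in (the real class wraps a pandas DataFrame and uses the Qiskit Experiments JSON encoder/decoder; the class and method bodies below are illustrative assumptions, not the PR's actual implementation):

```python
import json


class AnalysisTable:
    """Minimal stand-in for a table that round-trips through JSON in one piece.

    Illustration only: the actual ThreadSafeDataFrame wraps a pandas DataFrame
    and is handled by the experiment encoder/decoder.
    """

    def __init__(self, records=None):
        # records: {index -> {column -> value}}
        self._records = dict(records or {})

    def add(self, index, **fields):
        self._records[index] = fields

    def __json_encode__(self):
        # Serialize the entire container as one dictionary so a batch of
        # results can be saved with a single call.
        return {"class": type(self).__name__, "data": self._records}

    @classmethod
    def __json_decode__(cls, value):
        return cls(value["data"])


table = AnalysisTable()
table.add("1abc2def", name="T1", value=0.0001)
payload = json.dumps(table.__json_encode__())
restored = AnalysisTable.__json_decode__(json.loads(payload))
```

Because the whole table is one JSON object, saving N results costs one serialization call instead of N.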
Force-pushed from 5824117 to c073705
Thanks for the nice work, here's my first round of comments. I have some general thoughts as well:
- I think it would be helpful to add a label for the short analysis ID index column, otherwise it may look like a random string to a new user.
- Should the display style for `CurveFitResults` be updated? Right now the newlines are showing as `\n` literals in the dataframe. If we keep the string representation the same, the default dataframe display style can be updated to render the newlines (but then we should truncate the output or the cell will be very long).
- This may be outside the scope of this PR, but when I run a BatchExperiment of ParallelExperiments of T1s, the results are difficult to distinguish from each other because the experiment IDs are all the same. What do you think about including an extra field that has the experiment ID of the innermost child experiment?
if results:
    for result in results:
        supplementary = result.extra
Right now, if the user sets one of these fields to a different value in the extra parameters, for example by adding `experiment=my_experiment`, this will override the default value. Should we implement a check to disallow this, since these should be reserved fields?
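Such a check could look like the following sketch. The field names and the warn-instead-of-raise choice are assumptions for illustration only (the maintainer's reply notes the override is intentional for composite analysis, so a hard error would be too strict):

```python
import warnings

# Hypothetical list of fields the table fills in automatically.
RESERVED_FIELDS = {"experiment", "experiment_id", "result_id", "created_time"}


def merge_extra(defaults, extra):
    """Merge user-supplied extra fields over defaults, warning (rather than
    raising) when a reserved field is overridden."""
    merged = dict(defaults)
    for key, value in extra.items():
        if key in RESERVED_FIELDS and key in merged:
            warnings.warn(f"Overriding reserved field '{key}'.", UserWarning)
        merged[key] = value  # override still takes effect, as in the PR
    return merged


row = merge_extra({"experiment": "T1"}, {"experiment": "my_experiment", "note": "x"})
```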
This is intentional; composite analysis requires this logic to update the entry's experiment name. Since we return `AnalysisResultData` from the child experiment data, it updates the experiment value through the extra field. Alternatively, we could completely remove child data generation in the composite analysis and generate child data on the fly when users access `ExperimentData.child_data`, for backward compatibility. This would drastically simplify the code stack, but it should be done in another PR because I don't want to mix multiple logical changes in a single PR.
    package_name="qiskit-experiments",
    pending=True,
    predicate=lambda dataframe: not dataframe,
)
def analysis_results(
    self,
    index: Optional[Union[int, slice, str]] = None,
Are we planning on keeping `index` in the future? The user can retrieve specific rows from the dataframe themselves, though the syntax to retrieve from the dataframe by name would be more complicated than using `index`, so maybe it's good to keep from the user's perspective.
Yeah, that's what I discussed with Chris. We concluded to keep `index`, just as syntactic sugar. We can probably deprecate it once users are familiar with handling the dataframe, e.g.

df = exp_data.analysis_results()
df[df.name == "T1"]

I don't think this is complicated, and we could remove hundreds of lines of code by removing `index`. Let's set up a user feedback session after 0.6.
Force-pushed from c073705 to a6a0eab
Thank you, @nkanazawa1989. This is the first step of great upgrades in experiment result data handling. The design doc is well written and helped me a lot to understand the background of this PR.

- Why don't we add a new method like `add_analysis_result` (without the `s`) and deprecate `add_analysis_results`, instead of upgrading `add_analysis_results`? (The new API seems to support only the addition of a single analysis result and has completely different arguments.)
- In my understanding, `result_id` is a unique key of the table. Do we really need to have `index` as yet another key? (Note: I'm talking about `index` in `pd.DataFrame`, not the `index` argument of the `analysis_results` method.)
This PR changes the data structure of analysis results from a list of dictionaries (JSON) to a flat table (pandas DataFrame). Table data should be easier to manipulate in most cases and should greatly increase the (re)usability of analysis result data. In exchange, information on the structure of experiments will be dropped from AnalysisResult, on the expectation that it is not usually used during analysis. That information is still reconstructable from other data stored in an ExperimentData (is that correct? @nkanazawa1989) if necessary.
I also added several minor comments inline.
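The list-of-dictionaries to flat-table change can be illustrated with a dependency-free sketch (field names are hypothetical; the real implementation builds a pandas DataFrame, but the flattening idea is the same — nested `extra` keys simply become additional columns):

```python
def flatten_results(results):
    """Flatten nested result dictionaries into uniform rows, one per result.

    Keys under "extra" are promoted to top-level columns; rows missing a
    column are padded with None so the table stays rectangular.
    """
    columns = []
    rows = []
    for res in results:
        flat = {k: v for k, v in res.items() if k != "extra"}
        flat.update(res.get("extra", {}))
        for key in flat:
            if key not in columns:  # preserve first-seen column order
                columns.append(key)
        rows.append(flat)
    return columns, [[row.get(c) for c in columns] for row in rows]


cols, data = flatten_results([
    {"name": "T1", "value": 1.2e-4, "extra": {"unit": "s"}},
    {"name": "T2", "value": 8.0e-5},
])
```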
# Convert Dataframe Series back into AnalysisResultData.
# This is due to the limitation that _run_analysis must return List[AnalysisResultData],
# and some composite analyses such as TphiAnalysis override this method to
# return extra quantities computed from sub analysis results.
# This produces unnecessary data conversion.
# The _run_analysis mechanism seems to just complicate the entire logic.
# Since it's impossible to deprecate the usage of this protected method,
# we should implement a new CompositeAnalysis class with much more efficient
# internal logic. Note that the child data structure is no longer necessary
# because the dataframe offers more efficient data filtering mechanisms.
analysis_table = sub_expdata.analysis_results(verbosity=3, dataframe=True)
for _, series in analysis_table.iterrows():
    data_dict = series.to_dict()
    primary_info = {
        "name": data_dict.pop("name"),
        "value": data_dict.pop("value"),
        "quality": data_dict.pop("quality"),
        "device_components": data_dict.pop("components"),
    }
    chisq = data_dict.pop("chisq", np.nan)
    if chisq:
        primary_info["chisq"] = chisq
    data_dict["experiment"] = sub_expdata.experiment_type
    if "experiment_id" in data_dict:
        # Use the experiment ID of the parent experiment data.
        # Sub experiment data is merged and discarded.
        del data_dict["experiment_id"]
    analysis_result = AnalysisResultData(**primary_info, extra=data_dict)
    analysis_results.append(analysis_result)
I'm expecting this is a temporary conversion that will be removed after the breaking API change of the return type of `_run_analysis` planned in one of the following PRs. (I know the change is not easy...)
These are removed with the overhaul in 6ebd975.
Force-pushed from 6e30147 to 6ebd975
Thanks @coruscating @itoko for your review. I updated the PR in response to your comments. Please read the following replies to your review comments.

The short UUID corresponds to the pandas dataframe index, and the index doesn't have a column name. We could update the stylesheet to show some name, but I feel this is overengineering. By default the index is an incremental number, but I like the current format, by analogy with the GitHub commit hash.

Yes, this object is very noisy in the table, but it will be moved to an artifact in the followup.

Good point. If I understand your request correctly, the user doesn't need to distinguish the entries themselves. However, if an experiment generates multiple results, there is no convenient label to group results from the same experiment instance together. Since this is sort of an edge case, I'll probably go with the current implementation and update it in a followup if there is any user feedback.

I was thinking of the same thing, but I couldn't come up with a reasonable method name. I think analysis_results vs. analysis_result is very confusing, and I can imagine a user always needing to check the code or docstring to choose the proper method. I think the current implementation with kwargs is much more intuitive.
Yes, the result ID is unique, but the full hexadecimal UUID is too long. I was thinking of using result_id as the dataframe row index, but this creates a very wide table, which may hurt the user experience (i.e. browsers are not friendly to horizontal scrolling).

Yes, this is what I intended. Because this would be a massive logic change in composite analysis on top of the introduction of the dataframe, I decided to do this cleanup in a followup PR.
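The short, commit-hash-style index discussed here can be sketched as follows (the 8-character truncation length and the function name are assumptions for illustration, not the PR's actual values):

```python
import uuid


def short_index(existing, length=8):
    """Generate a short row index by truncating a fresh UUID, retrying on
    the (unlikely) collision with indices already present in the table."""
    while True:
        candidate = uuid.uuid4().hex[:length]
        if candidate not in existing:
            return candidate


taken = set()
for _ in range(1000):
    taken.add(short_index(taken))
```

With 8 hex characters there are about 4.3 billion possible values, so collisions are negligible at the ~1K-entry table sizes discussed later in the thread, and the retry loop covers them anyway.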
Force-pushed from f5c2bab to ba431c8
Thanks for the nice update @nkanazawa1989. There seems to be a bug right now where analysis results are not saved to ResultsDB. I ran a simple T1 experiment:
exp=T1(physical_qubits=[0], delays=np.arange(1e-6, 2e-4, 3e-5))
exp.run(backend).block_for_results()
exp.save()
and the resulting analysis result could not be serialized:
{'data': (AnalysisResultData(result_id='bb822f9486344a03bd7e4d00a145766f', experiment_id='b4c968e8-ab44-4e48-823e-b31180a5f87a', result_type='@Parameters_T1Analysis', result_data={'_value': <qiskit_experiments.curve_analysis.curve_data.CurveFitResult object at 0x298618280>, '_chisq': nan, '_extra': {'unit': None}, '_source': {'class': 'qiskit_experiments.framework.analysis_result.AnalysisResult', 'data_version': 1, 'qiskit_version': '0.42.1'}, 'value': '(CurveFitResult)'}, device_components=[<Qubit(Q0)>], quality='bad', verified=False, tags=[], backend_name='ibm_nazca', creation_datetime=Timestamp('2023-07-17 19:49:08.242003+0000', tz='UTC'), updated_datetime=None, chisq=nan),)
This yields the error `Unexpected token N in JSON at position 1792`, but the save actually fails silently, in that the experiment saves without the analysis results and the user is not notified. I wonder, after the bug is fixed, if we should also change the default behavior so the user is aware of cases where the experiment did not fully save.
@property
def end_datetime(self) -> datetime:
    """Return the end datetime of this experiment data.

    The end datetime is the time the latest job data was
`end_datetime` now seems almost redundant with `running_time`. `running_time` is the timestamp of the successful job completion from the job side, while `end_datetime` sets the timestamp to when the successful job is returned, so the only difference seems to be networking and processing latency. Perhaps we should change the behavior of `end_datetime` to be when the experiment actually finishes all its processing?
Hmmm this is indeed misleading... I was thinking that running_time is the time that the first circuit is triggered on the backend. Do you know where this is publicly defined?
Ah, I misunderstood, my bad. It looks like currently `running_time` is the time the latest successful job started to run, and `end_datetime` is when the latest successful job is returned, is that right? Maybe `running_time` could be changed to the time the first job of the experiment started to run? But I think we can also keep it as-is.
Thanks. Done in f6e7294
    This may return the wrong datetime if server and client are in different timezones.
    """
    return utc_to_local(self._running_time)
According to @kt474, `job.time_per_step()` handles conversion and returns the timestamp in the user's local time, so I think we can keep all timestamps in the user's local time without the warning and conversion on our end.
Thanks for pointing this out. I just removed all redundant conversions and used tzlocal for the creation time of the analysis entry. Done in cc5cf41.
My comments are about using sets: I think you can attach a set to the class for uniqueness. I only reviewed the push that doesn't have the outdated label.
""" | ||
with self._lock: | ||
# Order sensitive | ||
new_columns = [c for c in new_columns if c not in self.get_columns()] |
If we convert the result of self.get_columns() from a list to a set, this will be faster. Also, using a set won't be a problem because column names are unique.
I agree a set is faster, but column order matters here -- it's not only about uniqueness. The table is designed so that important information comes on the left side of the pandas HTML table (and I also believe a user assumes keys are added to the table in the same order as the add method calls). Unfortunately, Python doesn't provide an ordered set in the builtins (and I don't want to add an extra dependency without a drastic performance gain), so I gave up on using a set here.
Here is the result of a very casual performance check on my laptop (typically the column count is at most 20, and the new keys added by a heavily customized analysis might be ~3):
Indeed the set operation is faster, but the difference is less than 1 µs. Since this line runs only when the analysis class generates an extra key, even when you run a batch of 10 experiments (a really rare case) the method cannot be called more than 10 times. So the difference a user may experience in a heavy setting would be a few microseconds.
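For reference, Python's insertion-ordered dicts (guaranteed since Python 3.7) give set-like O(1) membership while preserving order, without any extra dependency. A sketch of the column-merging step using this trick (function name is illustrative, not the code in the PR):

```python
def add_new_columns(current, new_columns):
    """Append only unseen column names while preserving insertion order.

    dict.fromkeys provides O(1) membership checks with guaranteed ordering,
    acting as a stdlib "ordered set".
    """
    merged = dict.fromkeys(current)
    merged.update(dict.fromkeys(new_columns))  # existing keys keep position
    return list(merged)


cols = add_new_columns(["name", "value"], ["value", "unit", "name", "chisq"])
```

This keeps the "important columns on the left" property: existing columns retain their positions and new ones are appended in first-seen order.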
I added a new test case for key order preservation in 2cda8dc.
    ValueError: When index is not in this table.
"""
with self._lock:
    if index not in self._container.index:
As mentioned in my first review, you can use a set here too, because indexes must be unique. I think you can attach a set to the class for unique indexes; in my opinion it would not take much space.
Indeed, this is a pandas `Index` object and this check is sufficiently performant; the typecast overhead is much more expensive.
Note that indexes are not only added but also deleted. Adding a new instance member to track the existing indices might make the code slightly faster, but it would add lines of code in multiple places and increase the maintenance overhead.
Thank you for the nice update @nkanazawa1989. Deferring logic changes in composite analysis to a followup PR makes sense to me. Also I'm fine with upgrading

Regarding

To me, current design of

What do you think? @nkanazawa1989 (I'm happy to discuss these points offline if you want.)
This change decouples AnalysisResult from ExperimentData. Since AnalysisResult not only defines the data model but also provides the API for the IBM experiment service, this coupling limits the capability of experiment data analysis. ExperimentData.analysis_results still returns AnalysisResult for backward compatibility, but these are not identical objects: the returned object is an AnalysisResult newly generated from the dataframe.
These test cases used MagicMock for convenience, rather than a mocked AnalysisResult. However, an added result entry is now internally converted into the dataframe format and the input object is discarded after the add_analysis_results call, so it's no longer possible to track method calls or evaluate the identity of the input object. Since users will never store a MagicMock in experiment data, this change to the unit tests should not introduce any edge cases and should not decrease the practical test coverage.
Co-authored-by: Helena Zhang <[email protected]>
…ization." This reverts commit 6ebd975.
…, and it copies the container into each process. Saving data in the copied container just discards the added data when the process finishes. For now, Qiskit Experiments doesn't use multiprocessing, so we don't need to worry about these tests immediately.
…o provide convenient access to the internal dataframe container so that ThreadSafeDataFrame becomes comparable with a pandas DataFrame (indeed, pandas implements lots of convenient methods to manipulate data). However, this is dangerous in a multithreaded environment, because the lock is released after the acquired item (which may be a method that manipulates data) is returned: the holder can still mutate the internal container asynchronously. To avoid this problem, __getattr__ is dropped and all necessary methods are implemented explicitly with a reentrant lock.
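The hazard described in this commit message, and its fix, can be shown with a minimal sketch: instead of delegating attribute access to the inner container (which hands back a bound method after the lock is already released), each operation holds a reentrant lock for its full duration. Class and method names here are illustrative, not the actual ThreadSafeDataFrame API:

```python
import threading


class ThreadSafeTable:
    """Every operation acquires a reentrant lock for its entire duration,
    so no bound method of the inner container ever escapes the lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._rows = {}

    def add_entry(self, index, **fields):
        with self._lock:
            self._rows[index] = fields

    def drop_entry(self, index):
        with self._lock:
            self._rows.pop(index, None)

    def __len__(self):
        with self._lock:
            return len(self._rows)


table = ThreadSafeTable()
threads = [
    threading.Thread(target=lambda i=i: table.add_entry(f"id{i}", value=i))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

By contrast, a `__getattr__` that returned `self._rows.pop` under the lock would release the lock before `pop` actually runs, so the mutation would happen unsynchronized.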
- result id must be str(uuid) including "-"
- np.nan must be replaced with None
…added to the extra field since the IBM Experiment Service doesn't have data fields for them.
Force-pushed from ba431c8 to 2cda8dc
…me with its experiment type, qubits, and experiment id.
@coruscating Thanks for testing save with the real service. The fix is here: c2cd090. This is not a serialization issue. The code generated the analysis result id with

It would be great if you could also check 665ff8f. This is related to figure name generation.

@itoko We cannot create a dataframe without an index, and we cannot remove the index column without hacking the HTML representation of the dataframe. Adding the short id as a new column just consumes extra horizontal space in the web browser, which doesn't make sense. One thing we could do would be creating the index with

Here is the result of a very casual performance test. Time consumption in the unique index generation is just 1.75%, so this logic is not critical to the entire performance.

exp_data = ExperimentData()

def add_100_entries():
    for i in range(100):
        exp_data.add_analysis_results(name=f"data_{i}", value=i)

Regarding the base class design, I plan to use this dataframe for curve data. Currently the curve data is represented by an unwieldy CurveData object. This takes arbitrary circuit metadata in addition to the fixed columns of xdata, ydata, yerr, etc., so the current base class implementation also fits this case.
…ary id as long as they don't use the save feature. The validation error is replaced with a user warning so as not to stop this workflow.
LGTM, thanks!
I thought it was sufficient to tweak

FYI: During my profiling, I encountered a weird performance regression in pandas DataFrame when adding rows that include a None entry in a column. Script I used:
Good to know you designed it with two concrete subclasses; now I'm fine with having it. I see the doc says
Thank you @itoko san! <1K entries is indeed a target of the current implementation, i.e. running a parallel experiment on a few hundred qubits and batching them (e.g. combining T1 and T2) would generate ~1K entries. In the future I'm also considering switching to Polars. Just FYI:

Main branch (I cannot run 3000 entries, only 500):

import time
from qiskit_experiments.framework import ExperimentData
from qiskit_experiments.framework import AnalysisResult

def add_n_rows(nrows, none_quality=True):
    exp_data = ExperimentData()
    data_list = []
    for i in range(nrows):
        data = AnalysisResult(
            name=f"data_{i}",
            value=i,
            quality=None if none_quality else i,
            device_components=["Q0"],
            experiment_id="",
            result_id="",
            tags="",
        )
        data_list.append(data)
    exp_data.add_analysis_results(data_list)

%timeit add_n_rows(500, none_quality=True)
# 25.5 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this branch with your code:

%timeit add_n_rows(3000, none_quality=True)
# 8.83 s ± 391 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is already a drastic performance improvement (more than expected).
LGTM, thank you! 🚀
### Summary

Follow-up of #1133 for data frame.

### Details and comments

I realized I had made a mistake when I wrote the base class of `ThreadSafeDataFrame`. We plan to move the curve data (XY data) to the artifact for better reusability of the data points. In addition, the [CurveData](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c66034c90dad73d705af25be7e9ed9617e7eb2ef/qiskit_experiments/curve_analysis/curve_data.py#L88-L113) object, which is a container for the XY data, will also be replaced with the data frame in a follow-up PR. Since this table should contain predefined columns such as `xval`, `yval`, `yerr`, I was assuming this container could be a subclass of `ThreadSafeDataFrame` -- but this was not a right assumption. The curve analysis always runs on a single thread (analysis callbacks might run on multiple threads, and thus `AnalysisResultTable` must be thread-safe), so this curve data frame doesn't need to be a thread-safe object. In this PR, the functionality to define default columns of the table is reimplemented as a Python mixin class so that it can also be used for the curve data frame without introducing unnecessary thread-safe mechanisms. `AnalysisResultTable` is a thread-safe container, but the functionality to manage default columns is delegated to the mixin.
### Summary

Executing the approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. Introduction of the `Artifact` will be done in a followup. This PR introduces the dataframe to `ExperimentData` and replaces the conventional [AnalysisResult](https://github.com/Qiskit/qiskit-experiments/blob/main/qiskit_experiments/framework/analysis_result.py) with it. With this PR, experimentalists can visualize all analysis results in a single HTML table with `dataframe=True`. ![image](https://user-images.githubusercontent.com/39517270/230959307-d5a4d471-1659-4cf4-974e-cdb21e625ad8.png) One can control the columns shown in the HTML table with the `columns` option.

### Details and comments

Experiment analysis is a local operation on the client side, and data generated there must be agnostic to the payload format of the remote service. In the current implementation, `AnalysisResult` appears in `BaseAnalysis` and the data generated by the analysis is implicitly typecast to this class. Note that `AnalysisResult` *not only* defines the canonical data format *but also* implements the API for the IBM Experiment service. This tight coupling to the IBM Experiment service may limit data expressibility, and an analysis class author always needs to be aware of the data structure of the remote service. In this PR, the internal data structure is replaced with a flat dataframe, and one can add arbitrary key-value pairs to an analysis result without worrying about the data model. An added key creates a new column in the table, which is internally formatted and absorbed by the extra field of the payload later when the results are saved. This is implemented by `qiskit_experiments.framework.experiment_data._series_to_service_result`. Bypassing the generation of `AnalysisResult` objects also drastically improves the performance of heavily parallelized analyses that generate huge amounts of data.

---------

Co-authored-by: Helena Zhang <[email protected]>
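The `_series_to_service_result` idea described above — reserved columns map to service fields, while any user-added columns are absorbed into the extra field at save time — can be sketched with plain dicts. The field list and function name below are assumptions for illustration, not the actual service schema:

```python
# Hypothetical set of fields the service payload knows about; anything
# else found in a table row is absorbed into "extra" at save time.
SERVICE_FIELDS = ("name", "value", "quality", "components", "experiment_id")


def row_to_service_payload(row):
    """Format one flat table row into a service-style payload (sketch)."""
    row = dict(row)  # avoid mutating the caller's row
    payload = {key: row.pop(key) for key in SERVICE_FIELDS if key in row}
    payload["extra"] = row  # user-added columns land here
    return payload


payload = row_to_service_payload(
    {"name": "T1", "value": 1.0e-4, "quality": "good", "unit": "s"}
)
```

This is why adding a new column to the table never breaks saving: unknown keys are never sent as top-level service fields.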
### Summary

#1133 introduced a bug in `MitigatedTomographyAnalysis` that accidentally drops extra analysis metadata. This PR fixes the bug.

### Details and comments

In the new data storage implementation with the dataframe, `ExperimentData.analysis_results` returns a copy of the protected dataframe, and mutating the returned object doesn't update the source. This PR introduces a new option to `TomographyAnalysis` to inject extra metadata. <img width="993" alt="screenshot" src="https://github.com/Qiskit-Extensions/qiskit-experiments/assets/39517270/d8674af1-78ad-4870-bfea-d441f9dfd1e8"> (edit) Note that a reno is not necessary because 1133 is not released yet.