A PROOF interface to the AliEn file catalog
===========================================

Datasets were introduced to give PROOF users cleaner access to sets of
uniform data: each dataset has a name that helps identify the kind of
data stored, plus some meta-information, such as:

- number of events in the default tree

- integrity information: *is my file corrupted?*

- locality information: *is my remote file available locally?*

Datasets are also used by the [staging daemon
afdsmgrd](http://afdsmgrd.googlecode.com/) to trigger data staging,
*i.e.* to request that some data be transferred from a remote storage
to the local analysis facility disks.

PROOF datasets are handled by the *dataset manager*, a generic catalog of
datasets which has historically been implemented by the class
`TDataSetManagerFile`, which stores each dataset inside a ROOT file.

This dataset manager was conceived for a small number *(i.e., hundreds)*
of datasets reflecting data stored on the local analysis facility
disks. As the PROOF analysis model became popular in ALICE, the number
of datasets grew, posing many problems:

- To make it possible to process remote data, current datasets
  mimic file catalog functionalities by also including lists of files
  currently not staged on the local analysis facility.

- Since users can create their own datasets, in many cases containing
  duplicate data, it has become demanding to provide maintenance.

- Locality information in datasets is static: this means that, if a
  file gets deleted from a disk, the corresponding dataset(s) must be
  synchronized manually.

### An interface to the AliEn file catalog

The `TDataSetManagerAliEn` class is a new dataset manager which acts
as an intermediate layer between PROOF datasets and the AliEn file
catalog.

A dataset name no longer represents a static list of files:
instead, it represents a **query string** to the AliEn file catalog that
creates a dataset dynamically.

**Locality information** is also filled on the fly by contacting the local
file server: for instance, when an *xrootd* pool of disks is used,
fresh online information, along with the exact host (endpoint) where each
file is located, is provided dynamically in a reasonable amount of time.

Both file catalog queries and locality information are cached in ROOT
files: the cache is shared between users and its expiration time is
configurable.

Since dataset information is now volatile, there is also a separate and
more straightforward method for issuing staging requests.

Using the new dataset manager requires the `xpd.datasetsrc` directive in
the xproofd configuration file:

    xpd.datasetsrc alien cache:/path/to/dataset/cache urltemplate:http://myserver:1234/data<path> cacheexpiresecs:86400

`alien`
:   Tells PROOF that the dataset manager is the AliEn interface.

`cache`
:   Specifies the cache directory as a path *on the local filesystem* of
    the host running the user's master.

    > This path is not a URL but just a local path. Moreover, the path
    > must be visible from the host that will run each user's master,
    > since a separate dataset manager instance is created per user.

    > If the cache directory does not exist, it is created, if possible,
    > with open permissions (`rwxrwxrwx`). In a production environment
    > it is advisable to create the cache directory manually beforehand
    > with the same permissions.

`urltemplate`
:   Template used for translating `alien://` URLs.

    `<path>` is written literally and will be substituted with the full
    AliEn path without the protocol.

    > An example of how URL translation works. The template
    >
    >     root://alice-caf.cern.ch/<path>
    >
    > applied to the AliEn URL
    >
    >     alien:///alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
    >
    > gives
    >
    >     root://alice-caf.cern.ch//alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root

`cacheexpiresecs`
:   Number of seconds before cached information is considered expired
    and refetched *(e.g., 86400 for one day)*.

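
The URL translation rule can be sketched in a few lines of standard C++. This is only an illustration of the substitution described above, not the code used by `TDataSetManagerAliEn`: the function name `TranslateAliEnUrl` is hypothetical, and the sketch assumes that the protocol prefix to strip is exactly `alien://`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ROOT): replaces the literal "<path>"
// placeholder of the template with the AliEn path stripped of "alien://".
std::string TranslateAliEnUrl(const std::string &urlTemplate,
                              const std::string &alienUrl) {
  const std::string proto = "alien://";
  const std::string placeholder = "<path>";
  if (alienUrl.compare(0, proto.size(), proto) != 0)
    throw std::invalid_argument("not an alien:// URL: " + alienUrl);
  const std::string::size_type pos = urlTemplate.find(placeholder);
  if (pos == std::string::npos)
    throw std::invalid_argument("template has no <path> placeholder");
  std::string out = urlTemplate;
  // The AliEn path keeps its leading slash, so a template ending in "/"
  // yields a double slash, as in the example above.
  out.replace(pos, placeholder.size(), alienUrl.substr(proto.size()));
  return out;
}
```

Called with the template and the AliEn URL of the example above, the helper returns the `root://alice-caf.cern.ch//alice/...` URL shown there.
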
One of the advantages of such a dynamic AliEn catalog interface is that
it can be used with PROOF-Lite.

By default, PROOF-Lite creates a file-based dataset manager on the client
session (which acts as a master as well). To enable the AliEn
dataset manager in a PROOF-Lite session, run:

    gEnv->SetValue("Proof.DataSetManager",
                   "alien cache:/path/to/dataset/cache "
                   "urltemplate:root://alice-caf.cern.ch/<path> "
                   "cacheexpiresecs:86400");

where the meaning of the parameters is described in the previous section.

> Please note that the environment must be set **before** opening the
> PROOF-Lite session!

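
When the cache directory or the URL template are not fixed, the configuration string can be assembled programmatically. The sketch below uses only the standard library; `MakeAliEnDataSetManagerConfig` is a hypothetical helper name, and the only thing it assumes is the `alien cache:... urltemplate:... cacheexpiresecs:...` layout shown above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of ROOT): builds the value expected by
// the Proof.DataSetManager setting from its three parameters.
std::string MakeAliEnDataSetManagerConfig(const std::string &cacheDir,
                                          const std::string &urlTemplate,
                                          long cacheExpireSecs) {
  return "alien cache:" + cacheDir +
         " urltemplate:" + urlTemplate +
         " cacheexpiresecs:" + std::to_string(cacheExpireSecs);
}
```

In a ROOT session the resulting string would then be passed as `gEnv->SetValue("Proof.DataSetManager", cfg.c_str())`, still before the PROOF-Lite session is opened.
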
The new dataset manager is backwards-compatible with the legacy
interface: each time you want to process or obtain a dataset, instead of
specifying a string containing a dataset name you will specify a query
string to the file catalog.

### Query string format

The query string is the string you will use in place of the dataset
name. It does not correspond to a static dataset: instead it represents
a virtual dataset whose information is filled in on the fly.

There are two different formats you can use:

- specify data features (such as period and run numbers) for **official
  data or Monte Carlo**

- specify the **AliEn find** command parameters directly

In the query string it is also possible to specify whether you want to
process data from AliEn, only staged data, or data from AliEn in "cache
mode".

#### Official data and Monte Carlo format

These are the string formats to be used for official data and official
Monte Carlo productions, respectively:

    Data;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;Pass=<PASS>

    Sim;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;

Period
:   Example of valid values: `LHC10h`, `LHC11h_2`, `LHC11f_Technical`

Variant
:   Data variant, which might be `ESDs` (or `ESD`) for ESDs and `AODXXX`
    for AODs corresponding to the *XXX* set.

    Example of valid values: `ESDs`, `AOD073`, `AOD086`

Run
:   Runs to be processed, in the form of a single run (`130831`), an
    inclusive range (`130831-130833`), or a list of runs and/or ranges
    (`130831-130835,130840,130842`).

    Duplicate runs are automatically removed, so in case you specify
    `130831-130835,130833` run number 130833 will be processed only
    once.

Pass *(only for data, not for Monte Carlo)*
:   The pass number or name. In case you specify only a number `X`, it
    will be expanded to `passX`.

    Example of valid values: `1`, `pass1`, `pass2`, `cpass1_muon`

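
The run list and pass rules can be illustrated with a short standard C++ sketch. It is not the parser used by the dataset manager: `ExpandRunList` and `NormalizePass` are hypothetical names, and returning the runs in ascending order is an assumption of this sketch (the text above only states that duplicates are removed).

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Expands a list such as "130831-130835,130840" into unique run numbers.
// Ranges are inclusive; std::set drops duplicates and keeps runs sorted.
std::set<int> ExpandRunList(const std::string &runList) {
  std::set<int> runs;
  std::istringstream in(runList);
  std::string token;
  while (std::getline(in, token, ',')) {
    if (token.empty()) continue;
    const std::string::size_type dash = token.find('-');
    const int first = std::stoi(token.substr(0, dash));
    const int last = (dash == std::string::npos)
                         ? first
                         : std::stoi(token.substr(dash + 1));
    if (last < first) throw std::invalid_argument("bad run range: " + token);
    for (int run = first; run <= last; ++run) runs.insert(run);
  }
  return runs;
}

// A purely numeric pass "X" becomes "passX"; names are left untouched.
std::string NormalizePass(const std::string &pass) {
  bool numeric = !pass.empty();
  for (unsigned char c : pass) numeric = numeric && (std::isdigit(c) != 0);
  return numeric ? "pass" + pass : pass;
}
```

With these helpers, `130831-130835,130833` expands to five distinct runs, and both `1` and `pass1` end up as `pass1`.
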
This is an example of a full valid string:

    Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1

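
To make the structure of such a string explicit, the following sketch splits it into its kind and its `Key=Value` fields. It is an illustration under stated assumptions, not the dataset manager's own parser: the `Query` struct and `ParseQuery` are hypothetical, and the sketch assumes that values never contain a semicolon.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical representation of a query string.
struct Query {
  std::string kind;                           // first token, e.g. "Data"
  std::map<std::string, std::string> fields;  // e.g. fields["Period"]
};

// Splits on ';'. The first token is the kind; each following token is
// split at its first '='. A token without '=' is stored as a bare
// keyword with an empty value.
Query ParseQuery(const std::string &text) {
  Query q;
  std::istringstream in(text);
  std::string token;
  bool first = true;
  while (std::getline(in, token, ';')) {
    if (token.empty()) continue;
    if (first) {
      q.kind = token;
      first = false;
      continue;
    }
    const std::string::size_type eq = token.find('=');
    if (eq == std::string::npos)
      q.fields[token] = "";
    else
      q.fields[token.substr(0, eq)] = token.substr(eq + 1);
  }
  return q;
}
```

Applied to the example string above, the sketch yields the kind `Data` and the four fields `Period`, `Variant`, `Run` and `Pass`.
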
#### AliEn find format

Whenever a user would like to process data which has not been produced
officially, or whose directory structure in the AliEn file catalog is
non-standard, an interface to the AliEn shell's `find` command is
available.

This is the command format:

    Find;BasePath=<BASEPATH>;FileName=<FILENAME>;Anchor=<ANCHOR>;TreeName=<TREENAME>;Regexp=<REGEXP>

Parameters `BasePath` and `FileName` are passed as-is to the AliEn [find
command](http://alien2.cern.ch/index.php?option=com_content&view=article&id=53&Itemid=99#Searching_for_files).

Parameters `Anchor`, `TreeName` and `Regexp` are optional.

Here's a detailed description of the parameters.

BasePath
:   Start search under the specified path on the AliEn file catalog.

    Wildcards are supported: the asterisk (`*`) and the
    percent sign (`%`) are interchangeable.

    Examples of valid values are:

        /alice/data/2010/LHC10h/000123456/*.*
        /alice/cern.ch/user/d/dummy/my_pp_production/%.%

FileName
:   File name to look for.

    Examples of valid values are: `root_archive.zip`, `aod_archive.zip`,
    `custom_archive.zip`, `AliAOD.root`.

Anchor *(optional)*
:   In case `FileName` is a zip archive, the anchor is the name of a
    ROOT file inside the archive to point to.

    Examples of valid values are: `AliAOD.root`, `AliESDs.root`.

    > Using the AliEn file catalog it is possible to point directly to a
    > ROOT file stored in an archive without using the anchor.

    > There is however a substantial difference in how data is
    > retrieved, especially during staging: auxiliary ROOT files
    > *(friends)* are stored inside the archive along with the "main"
    > file, so that when you use the archive as `FileName` with the
    > proper `Anchor` you are still referencing the same file, but
    > you are giving the instruction to download the archive.

    > Using the ROOT file name directly should be done only in very
    > special cases (*e.g.*, to save space) and only when one is
    > completely sure that no other files in the archive are required
    > for analysis.

TreeName *(optional)*
:   Name of each file's default tree.

    Examples of valid values are: `/aodTree`, `/esdTree`, `/myCustomTree`,
    `/TheDirectory/TheTree`.

Regexp *(optional)*
:   Additional extended regular expression applied after the find command
    is run, to refine search results.

    Only `alien://` paths matching the regular expression are
    considered; the others are discarded.

    Examples of valid values are:

    > [TPMERegexp](http://root.cern.ch/root/html/TPMERegexp.html) is
    > used to perform regular expression matching.

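
The effect of `Regexp` on the results of the find command can be sketched as a plain filter. Note the assumptions: this sketch uses `std::regex` (ECMAScript syntax) instead of `TPMERegexp`, so some patterns may behave differently, and it assumes an unanchored search rather than a full match. `FilterByRegexp` is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the URLs in which the pattern is found
// somewhere (unanchored search); all the others are discarded.
std::vector<std::string> FilterByRegexp(const std::vector<std::string> &urls,
                                        const std::string &pattern) {
  const std::regex re(pattern);
  std::vector<std::string> kept;
  for (const std::string &url : urls)
    if (std::regex_search(url, re)) kept.push_back(url);
  return kept;
}
```

For instance, given a list of `alien://` paths ending in `root_archive.zip` and `aod_archive.zip`, the pattern `aod_archive\.zip$` would keep only the latter.
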
Example of an AliEn raw find dataset string:

    Find;BasePath=/alice/data/2010/LHC10h/000139505/ESDs/pass1/*.*;FileName=root_archive.zip;Anchor=AliESDs.root

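
A find string like the one above can also be composed from its parts, leaving out the optional parameters that are not needed. The helper below is a minimal sketch: `MakeFindQuery` is a hypothetical name, and treating an empty string as "parameter not given" is a convention of this sketch only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: BasePath and FileName are always written, while
// Anchor, TreeName and Regexp are appended only when non-empty.
std::string MakeFindQuery(const std::string &basePath,
                          const std::string &fileName,
                          const std::string &anchor = "",
                          const std::string &treeName = "",
                          const std::string &regexp = "") {
  std::string query = "Find;BasePath=" + basePath + ";FileName=" + fileName;
  if (!anchor.empty()) query += ";Anchor=" + anchor;
  if (!treeName.empty()) query += ";TreeName=" + treeName;
  if (!regexp.empty()) query += ";Regexp=" + regexp;
  return query;
}
```

Passing the base path, `root_archive.zip` and `AliESDs.root` from the example reproduces the string shown above.
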
#### Data access modes

It is possible to append to the query string the `Mode` specifier, which
affects the way URLs are generated.

    Mode=[local|remote|cache]

This parameter is optional and defaults to `local`. A description of each
possible value follows:

`local`
:   Local storage is checked for the presence of the data you requested.
    Output URLs will be relative to your local storage. Also, locality
    information *(i.e., is your file staged?)* is filled.

    If you run a PROOF analysis on a dataset with this mode specified,
    only data marked as "staged" will be processed.

    This method is the preferred one, since it does not overload the
    remote storage, and it enables users to process partially-staged
    datasets, or partially-reconstructed runs, without the need to
    manually update static datasets.

    > This is the default if no mode is specified, and it is also the
    > most efficient one.

    > Although it might take some time (up to a couple of minutes to
    > locate ~4000 files), returned information is always reliable
    > (because it's dynamic) and speeds up analysis (because analysis
    > will always be run only on files having local copies).

    > Moreover this information is cached for a configurable period of
    > time, so that subsequent calls to the same dataset will be faster.

`remote`
:   Only AliEn URLs are returned.

    A PROOF analysis run on a dataset with this mode specified will
    always obtain data from a remote storage, according to the AliEn
    file catalog.

    > Tasks run on remote data are usually much slower than tasks run
    > on local data.

`cache`
:   URLs pointing to local copies of files are returned, but no check is
    made on whether each file is locally present or not.

    If local storage is configured to retrieve from AliEn the files that
    are not available locally (which is the case of xrootd with vMSS),
    then data will be downloaded *while analysis is running*.

    It is called *cache mode* because it treats the local storage as a
    cache for the remote storage.

    > This mode is usually very slow on a busy analysis facility since
    > retrieving data in real time without any kind of scheduling is
    > inefficient. It also conflicts with the preferred method, which is
    > to stage data asynchronously using the [stager
    > daemon](http://afdsmgrd.googlecode.com/).

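
The defaulting rule for `Mode` can be summarized in a small sketch. It only illustrates the documented behaviour (no `Mode` means `local`): the `AccessMode` enum and `ModeOf` are hypothetical, and case-sensitive matching of the three values is an assumption.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class AccessMode { Local, Remote, Cache };

// Hypothetical helper: extracts the Mode specifier from a query string,
// falling back to the documented default when it is absent.
AccessMode ModeOf(const std::string &query) {
  const std::string key = ";Mode=";
  const std::string::size_type pos = query.find(key);
  if (pos == std::string::npos) return AccessMode::Local;
  const std::string::size_type begin = pos + key.size();
  // The value ends at the next ';' or at the end of the string.
  const std::string value = query.substr(begin, query.find(';', begin) - begin);
  if (value == "local") return AccessMode::Local;
  if (value == "remote") return AccessMode::Remote;
  if (value == "cache") return AccessMode::Cache;
  throw std::invalid_argument("unknown Mode: " + value);
}
```

A query without the specifier is therefore handled exactly like one ending in `;Mode=local`.
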
#### Force cache refresh

If the cached information for a certain AliEn file catalog query is wrong,
it is possible to force querying the catalog again by using the keyword
`ForceUpdate`:

    Data;ForceUpdate;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1

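
The interplay between the cache expiration time and `ForceUpdate` can be pictured as a simple decision. This is a sketch of the documented behaviour, not the actual implementation: `HasForceUpdate` and `MustQueryCatalog` are hypothetical names, and treating an entry as expired when its age reaches `cacheexpiresecs` is an assumption.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if the bare keyword ForceUpdate appears among the ';' tokens.
bool HasForceUpdate(const std::string &query) {
  std::istringstream in(query);
  std::string token;
  while (std::getline(in, token, ';'))
    if (token == "ForceUpdate") return true;
  return false;
}

// The catalog is queried again when the refresh is forced or when the
// cached entry is at least cacheExpireSecs old.
bool MustQueryCatalog(const std::string &query, long cacheAgeSecs,
                      long cacheExpireSecs) {
  return HasForceUpdate(query) || cacheAgeSecs >= cacheExpireSecs;
}
```

With a one-day expiration, a ten-second-old cache entry is reused unless the query string contains `ForceUpdate`.
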
Issuing staging requests and keeping track of them requires an auxiliary
database that can be read and updated by the [data stager
daemon](http://afdsmgrd.googlecode.com/).

Whenever a staging request is issued, a ROOT file containing the dataset
is saved in a special directory on the master's filesystem, monitored by
the stager daemon.

#### PROOF configuration

In the xproofd configuration file, there is a directive to specify the
directory used as repository for staging requests:

    xpd.stagereqrepo [dir:]/path/to/local/directory

> The literal `dir:` prefix is optional.

This directive is shared between PROOF and the stager daemon, so that the
same configuration file can be used for both.

Permissions on this directory must be kept open.

> Versions of the stager daemon prior to v1.0.7 do not support open
> permissions and the staging repository directive.

#### Request and monitor staging

Staging requests and monitoring can be done from within a PROOF session.

`gProof->RequestStagingDataSet("QueryString")`
:   Requests staging of the dataset specified via the query string.

    The staging request is honored if the stager daemon is running.

    > In order to avoid requesting to stage undesired data, it is
    > advisable to check in advance the results of your query string:
    >
    > `gProof->ShowDataSet("QueryString")`

`gProof->ShowStagingStatusDataSet("QueryString"[, "opts"])`
:   Shows the progress status of a previously issued staging request for
    the data specified by the query string.

    The options string is optional and is passed as-is to the
    `::Print()` method.

    > It is possible to show all the files marked as corrupted:
    >
    >     gProof->ShowStagingStatusDataSet("QueryString", "C")
    >
    > Or all the files successfully staged and not corrupted:
    >
    >     gProof->ShowStagingStatusDataSet("QueryString", "Sc")

`gProof->GetStagingStatusDataSet("QueryString")`
:   Gets a `TFileCollection` containing information on the staging
    request specified by the query string.

    Works exactly like `ShowStagingStatusDataSet()` but returns an
    object instead of displaying information on the screen.

`gProof->CancelStagingDataSet("QueryString")`
:   Removes a dataset from the list of staging requests. Datasets used
    as staging requests are usually removed automatically by the staging
    daemon if everything went right, so this command is used mostly to
    purge a completed staging request when it has some corrupted files.

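
The two option strings used with `ShowStagingStatusDataSet()` (`"C"` and `"Sc"`) suggest a simple reading, sketched below in standard C++. This is an interpretation inferred from those two examples only, not the documented semantics of `::Print()`: an uppercase letter is assumed to require a flag, a lowercase letter to require its absence, and the `s` case is added here purely by symmetry. `FileStatus` and `MatchesStatusFilter` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-file record; it stands in for the information that
// a staging status listing shows for each file.
struct FileStatus {
  std::string url;
  bool staged;
  bool corrupted;
};

// Assumed reading of the options: 'S' staged, 's' not staged,
// 'C' corrupted, 'c' not corrupted. Other characters are ignored.
bool MatchesStatusFilter(const FileStatus &file, const std::string &opts) {
  for (char c : opts) {
    if (c == 'S' && !file.staged) return false;
    if (c == 's' && file.staged) return false;
    if (c == 'C' && !file.corrupted) return false;
    if (c == 'c' && file.corrupted) return false;
  }
  return true;
}
```

Under this reading, `"Sc"` selects the files that are ready for analysis, while `"C"` selects the ones that would justify cancelling and reissuing a staging request.
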