Hex, Bugs and More Physics | Emre S. Tasci

a blog about physics, computation, computational physics and materials…

Specification of an extensible and portable file format for electronic structure and crystallographic data

X. Gonzea, C.-O. Almbladh, A. Cucca, D. Caliste, C. Freysoldt, M.A.L. Marques, V. Olevano, Y. Pouillon and M.J. Verstraet

Comput. Mater. Sci. (2008) doi:10.1016/j.commatsci.2008.02.023

Abstract

In order to allow different software applications, in constant evolution, to interact and exchange data, flexible file formats are needed. A file format specification for different types of content has been elaborated to allow communication of data for the software developed within the European Network of Excellence “NANOQUANTA”, focusing on first-principles calculations of materials and nanosystems. It might be used by other software as well, and is described here in detail. The format relies on the NetCDF binary input/output library, already used in many different scientific communities, that provides flexibility as well as portability across languages and platforms. Thanks to NetCDF, the content can be accessed by keywords, ensuring the file format is extensible and backward compatible.

Keywords

Electronic structure; File format standardization; Crystallographic datafiles; Density datafiles; Wavefunctions datafiles; NetCDF

PACS classification codes

61.68.+n; 63.20.dk; 71.15.Ap

Problems with the binary representations:

  1. lack of portability between big-endian and little-endian platforms (and vice-versa), or between 32-bit and 64-bit platforms;
  2. difficulties to read the files written by F77/90 codes from C/C++ software (and vice-versa)
  3. lack of extensibility, as one file produced for one version of the software might not be readable by a past/forthcoming version

(Network Common Data Form) library solves these issues. (Alas, it is binary and in the examples presented in  http://www.unidata.ucar.edu/software/netcdf/examples/files.html, it seems like the CDF representations are larger in size than those of the CDL metadata representations)

The idea of standardization of file formats is not new in the electronic structure community [X. Gonze, G. Zerah, K.W. Jakobsen, and K. Hinsen. Psi-k Newsletter 55, February 2003, pp. 129-134. URL: <http://psi-k.dl.ac.uk>] (This article is a must read article, by the way. It’s a very insightful and concerning overview of the good natured developments in the atomistic/molecular software domain. It is purifying in the attempt to produce something good..). However, it proved difficult to achieve without a formal organization, gathering code developers involved in different software project, with a sufficient incentive to realize effective file exchange between such software.

HDF (Hierarchical Data Format) is also an alternative. NetCDF is simpler to use if the data formats are flat, while HDF has definite advantages if one is dealing with hierarchical formats. Typically, we will need to describe many different multi-dimensional arrays of real or complex numbers, for which NetCDF is an adequate tool.

Although a data specification might be presented irrespective of the library actually used for the implementation, such a freedom might lead to implementations using incompatible file formats (like NetCDF and XML for instance). This possibility would in effect block the expected standardization gain. Thus, as a part of the standardization, we require the future implementations of our specification to rely on the NetCDF library only. (Alternative to independent schemas is presented as a mandatory format which is against the whole idea of development. The reason for NetCDF being incompatible with XML lies solely on the inflexibility/inextensibility of the prior format. Although such a readily defined and built format is adventageous considering the huge numerical data and parameters the ab inito software uses, it is very immobile and unnecessary for compound and element data.)

The compact representation brought by NetCDF can be by-passed in favour of text encoding (very agreable given the usage purposes: XML type extensible schema usage is much more adequate for structure/composition properties). We are aware of several standardization efforts [J. Junquera, M. Verstraete, X. Gonze, unpublished] and [J. Mortensen, F. Jollet, unpublished. Also check Minutes of the discussion on XML format for PAW setups], relying on XML, which emphasize addressing by content to represent such atomic data.

4 types of data types are distinguished:

  • (A) The actual numerical data (which defines whether a file contains wavefunctions, a density, etc), for which a name must have been agreed in the specification.
  • (B) The auxiliary data that is mandatory to make proper usage of the actual numerical data of A-type. The name and description of this auxiliary information is also agreed.
  • (C) The auxiliary data that is not mandatory to make proper usage of the A-type numerical data, but for which a name and description has been agreed in the specification.
  • (D) Other data, typically code-dependent, whose availability might help the use of the file for a specific code. The name of these variables should be different from the names chosen for agreed variables of A–C types. Such type D data might even be redundant with type A–C data.

The NetCDF interface adapts the dimension ordering to the programming language used. The notation here is C-like, i.e. row-major storage, the last index varying the fastest. In FORTRAN, a similar memory mapping is obtained by reversing the order of the indices. (So, the ordering/reverse-ordering is handled by the interface/library)

Concluding Remarks

We presented the specifications for a file format relying on the NetCDF I/O library with a content related to electronic structure and crystallographic data. This specification takes advantage of all the interesting properties of NetCDF-based files, in particular portability and extensibility. It is designed for both serial and distributed usage, although the latter characteristics was not presented here.

Several software in the Nanoquanta and ETSF [15] context can produce or read this file format: ABINIT [16] and [17], DP [18], GWST, SELF [19], V_Sim [20]. In order to further encourage its use, a library of Fortran routines [5] has been set up, and is available under the GNU LGPL licence.

Additional Information:

Announcement for a past event
( http://www.tddft.org/pipermail/fsatom/2003-February/000004.html )

CECAM – psi-k – SIMU joint Tutorial

1) Title : Software solutions for data exchange and code gluing.

Location : Lyon Dates : 8-10 october, 2003

Purpose : In this tutorial, we will teach software tools and standards that have recently emerged in view of the exchange of data (text and binary) and gluing of codes : (1) Python, as scripting language, its interfaces with C and FORTRAN ; (2) XML, a standard for representing structured data in text files ; (3) netCDF, a library and file format for the exchange and storage of binary data, and its interfaces with C, Fortran, and Python

Organizers : X. Gonze gonze@pcpm.ucl.ac.be

K. Hinsen hinsen@cnrs-orleans.fr

2) Scientific content

Recent discussions, related to the CECAM workshop on "Open Source Software for Microscopic Simulations", June 19-21, 2002, to the GRID concept (http://www.gridcomputing.com), as well as to the future Integrated Infrastructure Initiative proposal linked to the European psi-k network (http://psi-k.dl.ac.uk), have made clear that one challenge for the coming years is the ability to establish standards for accessing codes, transferring data between codes, testing codes against each other, and become able to "glue" them (this being facilitated by the Free Software concept).

In the present tutorial, we would like to teach three "software solutions" to face this challenge : Python, XML and netCDF.

Python is now the de facto "scripting langage" standard in the computational physics and chemistry community. XML (eXtended Markup Language) is a framework for building mark-up languages, allowing to set-up self-describing documents, readable by humans and machines. netCDF allows binary files to be portable accross platforms. It is not our aim to cover all possible solutions to the above-mentioned challenges (e.g. PERL, Tcl, or HDF), but these three have proven suitable for atomic-scale simulations, in the framework of leading projects like CAMPOS (http://www.fysik.dtu.dk/campos), MMTK (http://dirac.cnrs-orleans.fr/MMTK), and GROMACS (http://www.gromacs.org). Other software projects like ABINIT (http://www.abinit.org) and PWSCF (http://www.pwscf.org – in the DEMOCRITOS context), among others, have made clear their interest for these. All of these software solutions can be used without having to buy a licence.

Tentative program of the tutorial. Lectures in the morning, hands-on training in the afternoon.

1st day ——- 2h Python basics 1h Interface : Python/C or FORTRAN 1h XML basics Afternoon Training with Python, and interfaces with C and FORTRAN

2nd day ——- 2h Python : object oriented (+ an application to GUI and Tk) 1h Interface : Python/XML 1h Interface : XML + C or FORTRAN Afternoon Training with XML + interfaces

3rd day ——- 1h Python : numerical 1h netCDF basics 1h Interface : netCDF/Python 1h Interface : netCDF/C or FORTRAN Afternoon Training with netCDF + interfaces

3) List of lecturers

K. Hinsen (Orleans, France), organizer X. Gonze (Louvain-la-Neuve, Belgium), organizer K. Jakobsen (Lyngby, Denmark), instructor J. Schiotz (Lyngby, Denmark), instructor J. Van Der Spoel (Groningen, The Netherlands), instructor M. van Loewis (Berlin, Germany), instructor

4) Number of participants : around 20, Most of the participants should be PhD students, postdoc or young permanent scientists, involved in code development. It is assumed that the attendants have a good knowledge of UNIX, and C or FORTRAN.

Our budget will allow contributing to travel and local expenses of up to 20 participants.

XML and NetCDF:

from Specification of file formats for NANOQUANTA/ETSF Specification – www.etsf.eu/research/software/nq_specff_v2.1_final.pdf

Section 1. General considerations concerning the present file format specifications.

One has to consider separately the set of data to be included in each of different types of files, from their representation. Concerning the latter, one encounters simple text files, binary files, XML-structured files, NetCDF files, etc … It was already decided previously (Nanoquanta meeting Maratea Sept. 2004) to evolve towards formats that deal appropriately with the self- description issue, i.e. XML and NetCDF. The inherent flexibility of these representations will also allow to evolve specific versions of each type of files progressively, and refine earlier working proposals. The same direction has been adopted by several groups of code developers that we know of.

Information on NetCDF and XML can be obtained from the official Web sites,

http://www.unidata.ucar.edu/software/netcdf/ and

http://www.w3.org/XML/

There are numerous other presentations of these formats on the Web, or in books.

The elaboration of file formats based on NetCDF has advanced a lot during the Louvain-la- Neuve mini-workshop. There has been also some remarks about XML.

Concerning XML :

(A) The XML format is most adapted for the structured representation of relatively small quantity of data, as it is not compressed.

(B) It is a very flexible format, but hard to read in Fortran (no problem in C, C++ or Python). Recently, Alberto Garcia has set up a XMLF90 library of routines to read XML from Fortran. http://lcdx00.wm.lc.ehu.es/~wdpgaara/xml/index.html Other efforts exists in this direction http://nn-online.org/code/xml/

Concerning NetCDF

  • (A) Several groups of developers inside NQ have already a good experience of using it, for the representation of binary data (large files).
  • (B) Although there is no clear advantage of NetCDF compared to HDF (another possibility for large binary files), this experience inside the NQ network is the main reason for preferring it. By the way, NetCDF and HDF are willing to merge (this might take a few years, though).
  • (C) File size limitations of NetCDF exist, see appendix D, but should be overcome in the future.

Thanks to the flexibility of NetCDF, the content of a NetCDF file format suitable for use for NQ softwares might be of four different types :

(1) The actual numerical data (that defines a file for wavefunctions, or a density file, etc …), whose NetCDF description would have been agreed.

(2) The auxiliary data that are mandatory to make proper usage of the actual numerical data. The NetCDF description of these auxiliary data should also be agreed.

(3) The auxiliary data that are not mandatory, but whose NetCDF description has been agreed, in a larger context.

(4) Other data, typically code-dependent, whose existence might help the use of the file for a specific code.

References:
[5] URL: <http://www.etsf.eu/index.php?page=tools>.

[15] URL: <http://www.etsf.eu/>.

[16] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs, G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jollet, M. Torrent, A. Roy, M. Mikami, Ph. Ghosez, J.-Y. Raty and D.C. Allan, Comput. Mater. Sci. 25 (2002), pp. 478–492.

[17] X. Gonze, G.-M. Rignanese, M. Verstraete, J.-M. Beuken, Y. Pouillon, R. Caracas, F. Jollet, M. Torrent, G. Zérah, M. Mikami, Ph. Ghosez, M. Veithen, J.-Y. Raty, V. Olevano, F. Bruneval, L. Reining, R. Godby, G. Onida, D.R. Hamann and D.C. Allan, Zeit. Kristall. 220 (2005), pp. 558–562.

[18] URL: <http://dp-code.org>.

[19] URL: <http://www.bethe-salpeter.org>.

[20] URL: http://www-drfmc.cea.fr/sp2m/L_Sim/V_Sim/index.en.html.

Lewis “Powering the Planet” MRS Bulletin 32 808 2007

November 27, 2007 Posted by Emre S. Tasci

Powering the Planet / Nathan S. Lewis
MRS Bulletin 32 808 2007

(Recommended by Prof. Thijsse)

This article is an edited transcript based on the plenary presentation given by Nathan S. Lewis (California Institute of Technology) on April 11, 2007, at the Materials Research Society Spring Meeting in San Fransisco.

  • Richard Smalley’s collaborator.
  • The problem is not the extinction of fossil based fuels, they will be available for the future (roughly 50-150 years for oil / 200-600 years of natural gas / 2000 years of coal). "This does not even include the methane clathrates, off the continental shelves, which are estimated to exist in comparable quantities to all of the oil, coal and gas on our planet combined."
  • The article focuses on the planet in the year 2050:

    I chose 2050 for two reasons. First, achieving results in the energy industry is a much longer-term endeavor than, say, achieving results in the information technology business. In IT, for example, you can build a Web site and only a few years later become a Google. If you build a coalfired power plant, however, it will take about 40 years to pay itself off and deliver a reasonable return on investment. The energy infrastructure that we build in the next 10 years, therefore, is going to determine, by and large, what our planet’s energy mix is going to look like in 2050. The second reason for choosing 2050 is that today’s population wants to know what our planet’s energy picture is going to look like within a timeframe meaningful to them—the next 30 to 40 years.

  • The real problem is the CO2 emission. If we stop emitting carbondioxide as of this moment, it will be about 350 ppm by 2050. Even 550 ppm rate looks pretty hard to obtain. "We do not know, except through climate models, what the implications of driving the atmospheric CO2 concentrations to any of these levels will be. There are about six major climate models, all differing from each other in detail."
  • What can be done? 
    1 – Nuclear Power – Insufficient. 
    2 – Carbon sequestration – Very insufficient. "The U.S. department of Energy is doing work on carbon sequestration, with the goal of creating 1 gigaton of storage by 2025 and 4 gigatons total by 2050. Since the United State’s annual carbon emissions are 1.5 gigaton per year, the total DOE goal for 50 years from now is commensurate with a few years’ worth of current emissions.
    3 – Renewable Carbon-Neutral Energy Sources – Highly insufficient. Not the time, nor the areas are available for this to work. But we should head to the sun. "To put that another way, more energy from the sun hits the earth in one hour than all of the energy consumed on our planet in an entire year." Also there is some debate about using geothermal energy but the the low temperature difference that will practically let us extract much less than a few terrawats sustainably over the 11.6 TW of sustainable global heat energy.
  • Solar cells / sun energy collectors must be cheaper about 1-10$/m2. Also the energy can not be stored efficiently. We can store it by pumping water uphill for example, but by comparison, we will need to pump 50000 gallons of water uphill for 100 m to replace 1 gallon of gasoline.
  • Energy solutions for the future: 
    1 – Photovoltaics – Efficient but highly expensive. 
    2 – Semiconductor/Liquid Junctions – Promising.
    3 – Photosynthesis – Must be researched for better efficiency.
  • Energy saving : "Any realistic energy program would start with energy efficiency, because saving energy costs much less than making energy. Because of all the inefficiencies in the energy supply chain, for every 1 J of energy that is saved at the end, 4-5 J is avoided from being produced."
  • The problem is severe and critical and there isn’t much time to act:

    Advocates of developing carbon-free energy alternatives believe that this is a project at which we cannot afford to fail because there is only one chance to get it right. For them, the question is whether or not, if the project went ahead, it could be completed in the time we have remaining. Because CO2 is extremely long-lived, there are not actually 50 years left to deal with the problem. To put this in perspective, consider the following comparisons. If we do not build the next "nano-widget," the world is going to stay the same overt he next 50 years – it will not be better, perhaps, but it will not be worse, either. Even if we do not develop a cure for cancer in 50 years, the world is going to stay basically the same, in spite of the tragedy caused by that disease. If we do not fix our energy problem within the next 20 years, however, we can, as scientists, say with absolute certainty that the world will simply not be the same, and that it will change in a way that, to our best knowledge, will affect life on our planet for the next 3000 years. What this cange will be, we do not precisely know. That is a risk management question. We simply know that no human will ever have experienced what we will within those 50 years, and the unmitigated results will last for a time scale comparable to modern human history.

    If, on the other hand, we decided to do something about our energy problem, I am fairly optimistic we could succeed. As I have outlined, there are no new principles at play here. This challenge is not like trying to figure out how to build an atomic bomb, when we did not know the physics of bomb-building in the first place—which was the situation at the start of the Manhattan Project. We know how to build solar cells; they have a 30-year warranty. We have an existence proof with photosynthesis. We know the components of how to capture and store sunlight. We simply do not yet know how to make these processes cost-effective, over this scale. 

    Here, our funding priorities also come into the picture. In the United States, we spend $28 billion on health, but only about $28 million on basic solar research. Currently, we spend more money buying gas at the pump in one hour than we spend funding basic solar research in our country over an entire year. Yet, in that same hour, more energy from the sun is hitting the Earth than all of the energy consumed on our planet in that year. The same cannot be said of any other energy source. On the other hand, we need to explore all credible energy options that we believe could work at scale because we do not know which ones will work yet. In the end, we will need a mix of energy sources to meet the 10–20 TW demand, and we should be doing all we can to see that it works and works at scale, now and in the future. We have established that, as time goes on, we are going to require energy and we are going to require it in increasing amounts. I can say with confidence therefore, as Dr. Smalley did, that energy is the biggest scientific and technological problem facing our planet in the next 50 years.