Submitted by:
Bireshewar Roy
M.Sc.(1st year), Physics
Indian Institute of Technology, Kanpur – 208016
Guided by:
Prof. Prasenjit Sen
Harish-Chandra Research Institute, Allahabad – 211019
MAY-JULY, 2018
CERTIFICATE
DEDICATED TO MY PARENTS
CONTENTS
1. Abstract
2. Introduction
3. Methods
3.1 Materials Fingerprints
3.2 Theory of Similarity and Distance Measures
3.3 Similarity Search in the Materials Space
3.4 AFLOWLIB Material Repository and Data
5. Programming Algorithm
6. Acknowledgement
7. References
8. APPENDIX
1. Abstract:
Cheminformatics is the generation of data, or its retrieval from
repositories, in order to transform data into information and
information into knowledge, with the aim of making better decisions
faster when identifying and optimizing promising compounds. As
high-throughput computing proliferates in materials science and
increases the wealth of data in the field, the gap between accumulated
information and derived knowledge widens. We address the issue of
discovery in materials science by introducing novel analytic
approaches based on electronic materials fingerprints. The framework
is employed to (i) query large databases of materials using the
similarity concept and (ii) map the connectivity of the materials space
(i.e., as materials cartograms) to rapidly identify regions with unique
organizations/properties. In this study, we have used only the “band
structure symmetry-dependent fingerprint (B-fingerprint)” and the
“density of states symmetry-independent fingerprint (D-fingerprint)”
to study the similarity of materials. These materials fingerprinting and
materials cartography approaches contribute to the emerging field of
materials informatics by enabling effective computational tools to
analyse, visualize, model, and design new materials.
A large number of molecular representations exist, and there are
several methods (similarity and distance metrics) to quantify the
similarity of material representations. The “Tanimoto similarity index”
is an appropriate choice for fingerprint-based similarity calculations.
So, for our purpose, we have used only the “Tanimoto similarity index”
to compare the similarity of a few materials and the diversity of the
materials space. First, we chose ‘Gallium Arsenide’ (GaAs) as the
reference material and compared its B-fingerprints and D-fingerprints
with those of some elements and binary compounds, using a Python
script that we developed. We also searched the AFLOWLIB database
for materials similar to ‘Ytterbium Selenide’ (YbSe).
2. Introduction:
Quantifying the similarity of two materials is a key concept and routine
task in cheminformatics. The design of materials with desired physical
and chemical properties is a vital challenge in the field of materials
research [1-3]. Material properties depend directly on a large number
of key variables, which often makes property prediction complex. These
variables include constitutive elements, crystal forms, and
geometrical and electronic characteristics, among others. The rapid
growth of materials research has led to the accumulation of vast
amounts of data. For example, the Inorganic Crystal Structure Database
(ICSD) includes more than 170,000 entries [4]. Experimental data are
also included in other databases, such as Matweb and Matbase. In
addition, there are several large databases, such as AFLOWLIB,
Materials Project, Nomad Repository, and Harvard Clean Energy
Project, that contain thousands of unique materials and their
theoretically calculated (using DFT) properties [6]. These properties
include electronic structure profiles estimated with quantum
mechanical methods. The latter databases have great potential to
serve as a source of novel functional materials. Promising candidates
from these databases may in turn be selected for experimental
confirmation using rational design approaches.
The rapidly growing compendium of experimental and theoretical
materials data offers a unique opportunity for scientific discovery in
materials databases. Specialized data mining and data visualization
methods are being developed within the nascent field of materials
informatics.
Similar approaches have been extensively used in cheminformatics
with resounding success. For example, in many cases these
approaches have helped to identify and design small organic
molecules with desired biological activity and acceptable
environmental/human-health safety profiles. Application of
cheminformatics approaches to materials science would allow
researchers to (i) define, visualize, and navigate through the materials
space, (ii) analyze and model structural and electronic characteristics
of materials with regard to a particular physical or chemical property,
and (iii) employ predictive materials informatics models to forecast
the experimental properties of untested materials. Thus, rational
design approaches in materials science constitute a rapidly growing
field.
Herein, we use a novel materials fingerprinting approach recently
proposed in the literature [4]. We use fingerprints that encode
information about the band structure and density of states (DOS), i.e.,
electronic structure of the materials. We show that known materials
with similar properties turn out to have high similarity in their
electronic fingerprints, thus suggesting that this method can be used
to scout for materials with desired properties in existing materials
databases.
3. Methods:
3.1 Materials Fingerprints:
It is well known that material properties depend on geometrical
and electronic structure. In comparing the properties of materials,
the present approach makes two important assumptions:
(i) properties of materials are direct functions of their ‘structures’;
(ii) materials with similar ‘structures’ (as determined by constitutional,
topological, spatial and electronic structures) are likely to have similar
physical and chemical properties.
Thus, encoding material characteristics in the form of numerical arrays
of descriptors, or fingerprints, enables the use of classical
cheminformatics and machine-learning approaches to mine, visualize,
and model any set of materials. We have encoded the electronic
structure diagram for each material as two distinct types of arrays: a
symmetry-dependent fingerprint (the band-structure-based
B-fingerprint) and a symmetry-independent fingerprint (the
density-of-states-based D-fingerprint).
B-Fingerprint: At every special high-symmetry point of the
Brillouin zone (BZ), the band energies within a window from -10 eV to
10 eV around the band gap (or the Fermi energy for metals) are
discretized into 32 bins to serve as our fingerprint array. The set of
high-symmetry k-points in the BZ depends on the crystal symmetry. For
example, the BZ of a simple cubic crystal has four high-symmetry points
(Γ, M, R, X) and gives a B-fingerprint array of length 4 × 32 = 128. The
body-centered orthorhombic lattice, on the other hand, has 13
high-symmetry k-points (Γ, L, L1, L2, R, S, T, W, X, X1, Y, Y1, Z) and
leads to a B-fingerprint array of length 13 × 32 = 416. The Γ-point is
common to all lattice types and does not depend on the symmetry of
the crystal structure. Therefore, to keep the analysis simple, we have
calculated and compared the B-fingerprints of materials only at the
Γ-point, as in Ref. [4]. The construction of the B-fingerprint from a
band structure (Figure 1) is shown as a histogram plot (Figure 2).
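The binning behind the B-fingerprint can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script used in this work: the function name `b_fingerprint_counts`, the half-open energy window, and the sample band energies are all hypothetical.

```python
def b_fingerprint_counts(band_energies, e_ref, e_min=-10.0, e_max=10.0, n_bins=32):
    """Count the bands falling in each energy bin at a single k-point.

    band_energies : band energies (eV) at the chosen k-point (here Gamma)
    e_ref         : reference energy (eV) taken as zero, e.g. the top of the
                    valence band, or the Fermi energy for a metal
    """
    width = (e_max - e_min) / n_bins            # 20 eV / 32 bins = 0.625 eV
    counts = [0] * n_bins
    for energy in band_energies:
        shifted = energy - e_ref
        # Assumption: half-open window [e_min, e_max); bands outside are ignored.
        if e_min <= shifted < e_max:
            index = min(int((shifted - e_min) // width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical band energies (eV) at Gamma; the top valence band sits at 2.0 eV.
gamma_bands = [-8.1, 2.0, 2.0, 2.0, 3.5, 6.1, 14.0]
counts = b_fingerprint_counts(gamma_bands, e_ref=2.0)

# Array lengths when every high-symmetry point contributes 32 bins.
length_simple_cubic = 4 * 32     # Gamma, M, R, X
length_bco = 13 * 32             # body-centered orthorhombic
```

In this made-up example the three degenerate bands at the reference energy land in the same bin, and the two bands lying outside the 20 eV window are dropped.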
D-Fingerprint: A similar idea can be implemented for the DOS of
materials, which is sampled in 256 bins (from -10 eV to 10 eV). Each
bin contains the average value of the DOS in the energy interval of
that bin. Because of the complexity and limitations of the
symmetry-dependent B-fingerprints, the symmetry-independent
D-fingerprints are suggested as an alternative. The length of these
fingerprints is adjustable depending on the objects, applications, and
other factors. The energy window and length of these fingerprints have
been carefully chosen to avoid enhancing boundary effects or
discarding important features. The construction of the D-fingerprint
is shown in Figure 3.
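The per-bin averaging of the DOS can be sketched in the same way. Again this is a minimal illustration rather than the script used in this work: the function name `d_fingerprint_values`, the choice of 0.0 for bins that contain no sampled energy, and the sample data are assumptions.

```python
def d_fingerprint_values(energies, dos, e_ref, e_min=-10.0, e_max=10.0, n_bins=256):
    """Average the DOS over each energy bin.

    energies : energy grid (eV) on which the DOS is tabulated
    dos      : DOS values on that grid
    e_ref    : reference energy (eV) taken as zero
    """
    width = (e_max - e_min) / n_bins            # 20 eV / 256 bins = 0.078125 eV
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    for energy, value in zip(energies, dos):
        shifted = energy - e_ref
        if e_min <= shifted < e_max:
            index = min(int((shifted - e_min) // width), n_bins - 1)
            sums[index] += value
            hits[index] += 1
    # Assumption: a bin with no sampled energy is given the value 0.0.
    return [s / h if h else 0.0 for s, h in zip(sums, hits)]


# Hypothetical DOS samples, with energies already measured from e_ref = 0.
sample_energies = [-0.05, -0.03, 0.01, 0.02, 5.0]
sample_dos = [1.0, 3.0, 2.0, 4.0, 0.5]
values = d_fingerprint_values(sample_energies, sample_dos, e_ref=0.0)
```

Here the first two samples share one bin and the next two share its neighbour, so each of those bins holds the mean of two DOS values.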
Figure 1. Band structure of Sb2Te3 (ICSD No. 262171). (taken from
www.aflow.org)
Figure 3. Construction of D-fingerprint (shown by colour diagram) from the
density of states of Bi1I1Te1 (ICSD No. 10500). We illustrate the idea of
D-fingerprint with 256 bins. (taken from ref.- [4])
$$ \text{Similarity } (S) = \frac{1}{1+d}, \quad \text{where } d = \text{distance} \tag{1} $$
i.e., every similarity metric corresponds to a distance metric and vice
versa. Since distances are always non-negative (d ∈ [0, +∞)), similarity
values calculated with this equation always lie between 0 and 1, with 1
corresponding to identical objects (where the distance is 0) and values
approaching 0 as the distance grows.
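As a small worked illustration of Equation (1), the sketch below converts a distance into a similarity. The Euclidean distance is used only as an example of a distance metric, and both function names are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity_from_distance(d):
    """Equation (1): S = 1 / (1 + d), defined for non-negative distances."""
    if d < 0:
        raise ValueError("a distance cannot be negative")
    return 1.0 / (1.0 + d)


identical = similarity_from_distance(euclidean_distance([1.0, 2.0], [1.0, 2.0]))
apart = similarity_from_distance(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Identical vectors have distance 0 and therefore similarity 1, while the second pair is 5 units apart and has similarity 1/6.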
Some commonly used similarity/distance metrics are the Manhattan
distance, Euclidean distance, Cosine coefficient, Dice coefficient,
Tanimoto index, and Soergel distance [5]. For our purpose, we will only
discuss the “Tanimoto Similarity Index”.
$$ S(X,Y) = \frac{X \cdot Y}{|X|^2 + |Y|^2 - X \cdot Y} \tag{2} $$
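A minimal Python sketch of Equation (2) for vectors of real numbers is given below. The function name `tanimoto` and the convention of returning 1.0 for two all-zero vectors are assumptions made for this illustration.

```python
def tanimoto(x, y):
    """Tanimoto similarity X.Y / (|X|^2 + |Y|^2 - X.Y) of two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denominator == 0:
        return 1.0   # assumption: two all-zero vectors are treated as identical
    return dot / denominator


same = tanimoto([1, 2, 3], [1, 2, 3])       # identical vectors
disjoint = tanimoto([1, 0], [0, 1])         # no overlap at all
partial = tanimoto([1, 1, 0], [1, 0, 1])    # one shared component
```

For identical vectors the numerator and denominator coincide, and for vectors with no overlapping components the dot product (and hence the similarity) is zero.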
Now, if the vectors X and Y are bit-vectors (where the value in each
dimension is a binary digit, i.e., either 0 or 1), then the ‘Tanimoto
Similarity Index’ takes the simple form:
$$ S(X,Y) = \frac{c}{a + b - c} \tag{4} $$
Where, ‘S’ denotes the similarity between two bit-vectors X and Y.
‘𝑎’ is the number of on bits (i.e. 1) in X.
‘𝑏’ is the number of on bits (i.e. 1) in Y.
‘𝑐 ’ is the number of bits that are on in both X and Y.
In our case, the threshold value of the ‘Tanimoto Similarity Index’ is
ST = 0.7, i.e., if the electronic fingerprint similarity between any
two chosen compounds is greater than or equal to 0.7, the two
compounds are considered to have similar properties. Otherwise, they
are considered to have dissimilar properties.
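The bit-vector form of the index and the threshold test can be sketched as follows. The names `tanimoto_bits`, `is_similar` and `SIMILARITY_THRESHOLD`, as well as the short example vectors, are hypothetical; real fingerprints are much longer.

```python
SIMILARITY_THRESHOLD = 0.7   # the threshold ST quoted in the text


def tanimoto_bits(x_bits, y_bits):
    """Tanimoto index c / (a + b - c) for two equal-length lists of 0/1 values."""
    if len(x_bits) != len(y_bits):
        raise ValueError("bit-vectors must have the same length")
    a = sum(x_bits)                                                        # on bits in X
    b = sum(y_bits)                                                        # on bits in Y
    c = sum(1 for p, q in zip(x_bits, y_bits) if p == 1 and q == 1)        # on in both
    if a + b - c == 0:
        return 1.0   # assumption: two all-zero vectors are treated as identical
    return c / (a + b - c)


def is_similar(x_bits, y_bits, threshold=SIMILARITY_THRESHOLD):
    """True when the similarity is greater than or equal to the threshold."""
    return tanimoto_bits(x_bits, y_bits) >= threshold


below = tanimoto_bits([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])  # a=4, b=4, c=3
above = tanimoto_bits([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])                    # a=4, b=5, c=4
```

The first pair gives 3/5 = 0.6 and would be classed as dissimilar, whereas the second pair gives 4/5 = 0.8 and passes the 0.7 threshold.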
3.4 AFLOWLIB Material Repository and Data:
AFLOWLIB is a materials repository of density functional theory
(DFT) calculations managed by the software package AFLOW. At the
time of the study, the AFLOWLIB.org database contained nearly 1.8
million compounds, each characterized by about 100 different
properties. Of the characterized systems, roughly half are metallic and
half are insulating. AFLOW leverages the VASP package to calculate
the total energy of a given crystal structure with PAW
pseudopotentials and the PBE exchange-correlation functional.
For our purpose, we need two files for each compound from the
AFLOWLIB material repository: the OUTCAR file, which contains the
band energies, and the DOSCAR file, which contains the density of
states.
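Reading the total DOS from a DOSCAR file might look like the sketch below. It assumes the commonly documented non-spin-polarized layout (five header lines; a sixth line whose third field is the number of energy points and whose fourth field is the Fermi energy; then one line per energy point with the energy and the total DOS in the first two columns). This layout should be checked against the actual files, and the function name `parse_total_dos` and the sample text are hypothetical.

```python
def parse_total_dos(doscar_text):
    """Return (energies, dos, e_fermi) from DOSCAR-style text.

    Assumed layout: 5 header lines, then a line 'EMAX EMIN NEDOS EFERMI weight',
    then NEDOS lines 'energy  DOS  integrated-DOS' (non-spin-polarized).
    """
    lines = doscar_text.splitlines()
    header = lines[5].split()
    n_points = int(header[2])            # number of energy grid points
    e_fermi = float(header[3])           # Fermi energy (eV)
    energies, dos = [], []
    for line in lines[6:6 + n_points]:   # stop before any per-atom blocks
        columns = line.split()
        energies.append(float(columns[0]))
        dos.append(float(columns[1]))
    return energies, dos, e_fermi


# Tiny made-up sample in the assumed layout (not real AFLOWLIB data).
sample = "\n".join([
    "   2   2   0   0",
    "  0.4E+02  0.4E-09  0.4E-09  0.4E-09  0.5E-15",
    "  1.0E-04",
    "  CAR",
    " example",
    "     10.0    -10.0   3   1.5   1.0",
    "   -10.0  0.0  0.0",
    "     0.0  2.5  1.0",
    "    10.0  0.5  4.0",
])
energies, dos, e_fermi = parse_total_dos(sample)
```

The energies and DOS values returned this way are the inputs needed for the D-fingerprint; a spin-polarized file would have additional columns and would need a different column choice.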
As our first test case, we carried out pairwise similarity searches
between GaAs and some semiconductor materials (GaP, GaSb,
Si, SnP, GeAs, InTe, InP, InAs, Ge).
The B-fingerprint and D-fingerprint similarity values are tabulated
below:
Reference Compound: GaAs (ICSD No. 41674) [FCC]
TABLE - I
We then took ‘GaAs’ (ICSD No. 41674) as the reference material and
binary compounds with chemical formula A1B1 as test materials.
Here, A = alkali metals, alkaline earth metals, transition metals (3d and
4d only), Group 3A (excluding boron); B = Group 5A, Group 6A
(chalcogens), halogens.
We found 157 different compounds of this type in the AFLOWLIB
database. We ran our code for pairwise B- and D-fingerprint similarity
searches between GaAs and these A1B1-type compounds. As a result,
we found 16 different compounds having high similarity values with
GaAs.
These compounds are tabulated below:
TABLE - II
Serial No.   Compound Name   ICSD No.   Structure   B-Fingerprint Similarity   D-Fingerprint Similarity
14.          InS             409645     MCL         0.212                      0.821
                             660105     ORC         0.212                      0.771
15.          InTe            169425     FCC         0.056                      0.842
                             640610     CUB         0.162                      0.769
16.          TlBi            53967      CUB         0.132                      0.743
4.2 Discussion:
Both the B-fingerprint and D-fingerprint similarity values between
GaAs and GaP are very high (Table-I).
Moreover, the pairwise similarity values (based on D-fingerprints)
between GaAs and each of the semiconductor materials (GaP, GaSb, Si,
SnP, GeAs, InTe, InP, InAs, Ge) are very high (S > 0.7).
So, when searching for a material with properties similar to those of a
semiconductor (e.g. band gap), we can initially choose test materials
whose D-fingerprint has a high similarity value (S > 0.7) with that of a
known semiconductor (e.g. GaAs).
The band structures, B-fingerprint histogram plots and density of
states vs. energy plots for GaAs and GaP are shown in Figures 4, 5, 6
and 7.
Figure 6. Construction of B-fingerprints (at Γ point) from the band structures of
GaAs (Left histogram) and GaP (Right histogram). The B-fingerprint similarity
between these two materials is 0.833.
Figure 7. The “DOS vs. Energy” plot for GaAs (Red Curve) and GaP (Green
Curve) using the AFLOWLIB data. The D-fingerprint similarity between these
two materials is 0.894.
From Table-I, we see that, across different crystal structures of a
particular compound, the variation of the B-fingerprint similarity is
high whereas the variation of the D-fingerprint similarity is very
small. So, it is clear that the B-fingerprint is highly symmetry
dependent, whereas the D-fingerprint is almost independent of the
symmetry of the crystal structure.
From Table-I and Table-II, there are 20 compounds very similar to
GaAs. Out of these 20 materials, 17 are used as semiconductors. We
did not find experimental band gap values for TlP, TlAs and TlBi in our
search of the literature.
The experimental band gaps of some of these compounds are given below.
TABLE- III
Serial No.   Compound Name   Structure   Band Gap at 300 K (eV)
1.           BeTe            FCC         3.0
2.           AlAs            FCC         2.16
3.           GaP             FCC         2.26
4.           GaSb            FCC         0.72
5.           InP             FCC         1.35
6.           InAs            FCC         0.36
7.           InS             ORC         2.0
8.           InTe            FCC         0.6
9.           Si              FCC         1.12
10.          GeAs            BCT         1.64
11.          Ge              FCC         0.66
fingerprint similarity (S > 0.7) for EuS (ICSD No. 631599) and YbS (ICSD
No. 651441) with cubic YbSe. One can therefore formulate a testable
hypothesis suggesting that these two materials may also be
ferroelectric or piezoelectric.
5. Programming Algorithm:
We developed a Python script to create the B-fingerprint and
D-fingerprint, and to calculate the B-fingerprint similarity and
D-fingerprint similarity between any two chosen materials, using
the AFLOWLIB data. The algorithm is described below:
STEP 1: Download the OUTCAR and DOSCAR files of two chosen
materials from the AFLOWLIB.org database.
STEP 2: Read the OUTCAR files of both materials.
STEP 3: Store all the Band Energies at Γ-point (K-point 1) for both
materials’ OUTCAR file.
STEP 4: Find the Maximum Valence Band Energy for each material and
store it.
STEP 5: Set the Maximum Valence Band Energy of each material as
the zero energy level and shift all the Γ-point (K-point 1) Band
Energies of each material by subtracting the respective Maximum
Valence Band Energy.
STEP 6: Choose the Band Energy range from -10.0 eV to 10.0 eV.
STEP 7: Divide the Band Energy range (-10.0 eV, 10.0 eV) in 32 bins, so
that the energy interval of each bin becomes 0.625 eV.
STEP 8: Find the number of bands in each bin using the Band Energy
data (after shifting in Step-5) for each material and store it.
STEP 9: Convert the number of bands in each bin from decimal to an
8-bit binary number. So, for the 32 bins of each material, we get a
256-bit binary number, which is the B-Fingerprint of the material.
Store the 256-bit binary number (B-Fingerprint) in an array. Do this
step for both materials.
STEP 10: Use the definition of Tanimoto Similarity Index (given by
equation-3) to find the similarity between two fingerprints. It will give
the B-Fingerprint Similarity between two chosen materials.
STEP 11: Plot the Histogram of ‘Number of bands vs. Bin energy’ for
both the materials.
STEP 12: Read the DOSCAR files of both materials.
STEP 13: Store all the energies and the corresponding density of states
from the DOSCAR file. Do this step for each material.
STEP 14: Shift the energies by subtracting the Maximum Valence Band
Energy (obtained in Step-4) for each material.
STEP 15: Choose the Energy range from -10.0 eV to 10.0 eV.
STEP 16: Divide the Energy range (-10.0 eV, 10.0 eV) in 256 bins, so
that the energy interval of each bin becomes 0.078125 eV.
STEP 17: Find the Density of States in each bin. Do this for each
material and store it.
STEP 18: Convert the Density of States of each bin from decimal to a
16-bit binary number. So, for the 256 bins of each material, we get a
4096-bit binary number, which is the D-Fingerprint of the material.
Store the 4096-bit binary number (D-Fingerprint) in an array. Do this
step for both materials.
STEP 19: Use the definition of Tanimoto Similarity Index (given by
equation-3) to find the similarity between two fingerprints. It will give
the D-Fingerprint Similarity between two chosen materials.
STEP 20: Plot the ‘Energy vs. Density of states’ for each material in a
single frame.
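The binary encoding in Steps 9 and 18 and the comparison in Steps 10 and 19 can be sketched as below. This is an illustration under assumptions, not the script itself: all function names are hypothetical, the bit-vector form of the Tanimoto index (c / (a + b - c)) is assumed for the comparison, and, since the steps do not say how a real-valued DOS is turned into an integer, `quantize_dos` simply scales, rounds, and caps the value at the largest 16-bit number.

```python
def to_bits(value, width):
    """Fixed-width binary digits (most significant first) of a non-negative integer."""
    if not 0 <= value < 2 ** width:
        raise ValueError("value does not fit in the requested number of bits")
    return [int(digit) for digit in format(value, "0{}b".format(width))]


def encode_fingerprint(bin_values, bits_per_bin):
    """Concatenate the binary form of every bin value into one bit list."""
    bits = []
    for value in bin_values:
        bits.extend(to_bits(value, bits_per_bin))
    return bits


def quantize_dos(dos_values, scale=1.0):
    """Assumption: scale, round, and cap each DOS value to a 16-bit integer."""
    return [min(int(round(v * scale)), 2 ** 16 - 1) for v in dos_values]


def tanimoto_bits(x_bits, y_bits):
    """Bit-vector Tanimoto index c / (a + b - c)."""
    a, b = sum(x_bits), sum(y_bits)
    c = sum(p & q for p, q in zip(x_bits, y_bits))
    return c / (a + b - c) if (a + b - c) else 1.0


# Hypothetical band counts for two materials (32 bins each).
counts_1 = [0] * 32
counts_1[16] = 3                      # 3 -> 00000011
counts_2 = [0] * 32
counts_2[16] = 1                      # 1 -> 00000001
b_fp_1 = encode_fingerprint(counts_1, 8)      # 32 bins x 8 bits
b_fp_2 = encode_fingerprint(counts_2, 8)
b_similarity = tanimoto_bits(b_fp_1, b_fp_2)

# Hypothetical binned DOS (256 bins), quantized and encoded with 16 bits per bin.
d_fp = encode_fingerprint(quantize_dos([0.0] * 255 + [2.6]), 16)
```

One consequence of this encoding is worth keeping in mind when reading the similarity values: the index compares bit patterns rather than magnitudes, so neighbouring counts such as 3 (00000011) and 4 (00000100) share no on bits.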
Acknowledgement
7. References:
[1] Rajan, K. Mater. Today 2005; 8: 38−45.
[2] Curtarolo, S.; Hart, G. L. W.; Buongiorno Nardelli, M.; Mingo, N.; Sanvito,
S.; Levy, O. Nat. Mater. 2013; 12: 191−201.
[3] Potyrailo, R.; Rajan, K.; Takeuchi, I.; Chisholm, B.; Lam, H. ACS Comb. Sci.
2011; 13: 579−633.
[4] Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; Tropsha, A.;
Curtarolo, S. Materials Cartography: Representing and Mining Materials Space
Using Structural and Electronic Fingerprints. Chem. Mater. 2015; 27: 735−743.
[5] Bajusz, D.; Rácz, A.; Héberger, K. Why is Tanimoto index an
appropriate choice for fingerprint-based similarity calculations? Journal of
Cheminformatics 2015; 7: 20.
[6] Oses, C.; Toher, C.; Curtarolo, S. Autonomous data-driven design of
inorganic materials with AFLOW, submitted, arXiv:1803.05035v1
[cond-mat.mtrl-sci], 2018.
[7] Maggiora, G.; Vogt, M.; Stumpfe, D.; Bajorath, J. J. Med. Chem. 2014; 57:
3186−3204.
[8] Bhalla, A. S.; Guo, R.; Roy, R. Mater. Res. Innovat. 2000; 4: 3−26.
[9] Rabe, K. M.; Ahn, C. H.; Triscone, J. M. Physics of Ferroelectrics: A
Modern Perspective; Springer: New York, 2010.
[10] Ashcroft, N. W.; Mermin, N. D. Solid State Physics; Harcourt
College Publishers: San Diego, U.S.A., 1976.
8. APPENDIX: