DIGITIZATION OF GOVERNMENT INFORMATION RESOURCES: A CASE STUDY OF CSL

 

Dr. S. MAJUMDAR

DIRECTOR,

CENTRAL SECRETARIAT LIBRARY

NEW DELHI

 

In the history of world civilizations, the course of human development has taken on a new dimension with the introduction of information and communication technology (ICT). Almost all activities, especially in the academic environment, have become much more focused, and the work of science has been comprehensively affected. Undoubtedly, ICT has changed the course of the century and will remain a major force in the centuries to come. It has changed human life in every aspect, be it communications, trade, manufacturing, services, culture, entertainment, education, research, national defense or library services. It is breaking old barriers and building new interconnections in the emerging globalisation. ICT has also become the chief determinant of the progress of nations, communities and individuals.

 

For India, the rise of Information Technology is an opportunity to overcome historical disabilities and once again become the master of its own national destiny. IT is a tool that will enable India to achieve the goal of becoming a strong, prosperous and self-confident nation. In doing so, IT promises to compress the time it would otherwise take for India to advance in the march of development and occupy a position of honor and pride in the community of nations.

 

The Government of India has recognised the potential of Information Technology for rapid and all-round national development. The National Agenda for Governance, which is the Government's policy blueprint, has taken due note of the Information and Communication Revolution that is sweeping the globe. Accordingly, it has mandated the Government to take necessary policy and programmatic initiatives that would facilitate India's emergence as an Information Technology Superpower in the shortest possible time.

 

This commitment to Information and Communication Technology in the National Agenda for Governance has been forcefully articulated by Hon’ble Prime Minister Shri Atal Bihari Vajpayee on a number of occasions. 

 

Taken together, the recommendations in the Information Technology Action Plan significantly broaden and deepen the process of economic reforms by encouraging competition, entrepreneurship and innovation -- the three principles which are cardinal for India’s progress in the emerging knowledge-driven global economy.

 

Indian libraries are also being geared to take up the new challenges that other sectors of society are taking up. The major issues before Indian libraries are to:

 

1.      Develop the machine readable catalogue and make it available on the network  for wider accessibility;

2.      Develop machine readable full text documents to provide greater accessibility to full text databases of different nature;

3.      Standardize such development by adopting international standards;

4.      Explore ways and means to preserve machine readable information in a way that can sustain the future requirements of ICT;

5.      Change the focus of government libraries from providing library services to fulfilling the long-pending demand for the right to information.

 

Though concerted efforts are being made in university and college libraries to introduce the ICT culture, the government libraries have lagged behind. However, the government libraries in India are making every effort to achieve the goal, and the guiding principles have been that:

 

·        Indians have the right to information.

·        Libraries facilitate equitable access to information by converting it into standardised machine readable form that is well organized, comprehensive and accurate, and that is provided in a timely, affordable and efficient manner.

·        Indian libraries need to play a significant role in providing access to an Information Gateway or subject based Portal with Indian content on it.

·        Indian libraries must work with other information generators by outsourcing activities so that technological potential is utilised optimally. This will ensure the principles of equality, universality and affordability.

India has to oblige the international community by making its information resources available through networks.

 

GOVERNMENTAL PUBLICATIONS IN INDIA

 

The sphere of governmental activity in India has expanded considerably. The government has intervened in every field, including burning issues such as population control, health management, the economic and social condition of the rural and urban masses, education and basic requirements. This can be seen from a glance through the current Allocation of Business Rules brought out by the Central Government. Such expansion has resulted in a large flow of information, both textual and statistical. The documents published by the government can be categorized as follows:

 

1.      Administrative Reports in the form of Annual Reports, which provide summary and sometimes detailed information about the policy initiatives, functioning and activities of an organization;

2.      Governmental Notifications through Office Memorandums, Gazette Notifications, Circulars, Notices etc.;

3.      Statistical Reports providing data on different segments of the population and their socio-economic and cultural condition;

4.      Budget Documents in the form of the Speech of the Finance Minister on the floor of the Parliament, Demands for Grants, Performance Budget etc.;

5.      Committee and Commission Reports on all the issues on which the governmental machinery has sought expert opinion or enquired into various problems; and

6.      Research Reports, which provide an instrument to develop policies and plan activities.

 

In addition, there is a large body of judicial and legislative information in the form of Bills, Acts, Laws, Codes, Rules and Regulations, Law Reports, Digests, Parliamentary Debates of the upper and lower houses of Parliament, and Reports of various Parliamentary Committees. In a nutshell, a plethora of primary information is being generated by various government organizations, and it needs to be brought to the attention of a large segment of civil society.

 

There is no dearth of secondary publications brought out by the government. These take the form of Yearbooks, Digests, Compendiums of Statistical Information and Major Reports, biographical accounts of national personalities and freedom fighters, and booklets and leaflets providing information on government initiatives for society. Periodicals, maps and charts, catalogues and bibliographies are also brought out by some governmental organizations. These publications are made available in printed form and are distributed by individual departments; no centralized effort has been made so far. One such initiative has been taken by CSL, which is to bring out a Report on this issue and make it a centralized venture with total bibliographical control.

 

TECHNOLOGICAL INITIATIVES

 

Information on the activities of the government is conventionally available in printed form, though scattered (I have not seen a single institution that holds all the government published resources under one roof), and there is a need to catch up with the 21st century environment. This century is considered to be the century of information and communication technology (ICT), and access to information has been its buzzword. The governmental machinery therefore also needs to be geared up accordingly.

 

Initiatives have been taken by the National Informatics Centre (NIC) to provide most of this information through portals offering different kinds of digital documents on governmental activities. But these efforts have not been properly coordinated either, and digital documents brought out by different departments do not form part of the Web site. A study has been made in this direction and its results are placed as Annexure I to this paper.

 

Analyzing the NIC Web site with regard to the availability of digital documents on the sites of the various Government Departments, we reach the conclusion that many of the Web sites do not contain the Annual Report, which is considered the most important document depicting the objectives and activities of a Government Department in a particular year. None of the Survey Reports brought out by government departments has found a place in the electronic environment. Many of the Departments that are well known for their statistical publications have not provided this information on their Web sites. Budget documents are also made available on the Web sites without any consistent pattern. However, some Departments have provided the full text of their study reports, handbooks, magazines and journals, policy documents, monographs, guidelines for various schemes launched and news bulletins, as well as bibliographical information about electronic documents offered as priced publications.

 

Developing full text documents in digital form requires total coordination within an organization, or a single-window department that could convert the government information generated by various departments into digital documents and launch them on the respective web sites.

 

The Central Secretariat Library, which is the largest resource centre in the government library sector, has undertaken several new technological initiatives to achieve its Mission. The Mission is to:

 

“ take an Indian Initiative to ICT through their libraries which will promote, facilitate the development of  Indian tangible heritage from printed form to machine readable collections and provide services in order to utilise the resources optimally and provide life long accessibility of information through vast library resources.”

 

In order to achieve the Mission, the Central Secretariat Library ventured into the development of machine readable databases of the following nature:

 

1.      Development of Machine readable catalogue of bibliographical information for document resources;

2.      Creating digital documents of the annual reports and budget documents of the parent Ministry;

3.      Creating digital documents of the Government of India Gazette from inception to date;

4.      Developing  machine readable annotated bibliography of rare book documents available in the library;

5.      Developing authority file of Indian names and making them available on the network

 
Most of the government libraries in India have taken only partial initiatives towards fulfilling the dream of making their libraries a forum for providing the right to information. The reasons are a shortage of professional staff; lack of motivation; lack of use of information technology for wider accessibility to users; lack of technological aptitude; lack of proper infrastructure; and lack of initiative from the decision makers. It became difficult for any promising professional to bring the government libraries to the forefront alongside other library initiatives in the country. Hence, thought was given to utilizing the services of professional and technical agencies to develop the various electronic resources needed to meet the requirements of Information Technology initiatives. In other words, it was felt that outsourcing the activity would provide time-bound results.
 
 OUTSOURCING THE DEVELOPMENT OF FULL TEXT DATABASES BY USING THE DIGITAL TECHNOLOGY
 

One of the most important elements of outsourcing work of such a gigantic nature is to work out a standard tender document that is transparent and does not leave any room for ambiguity. The tender document prepared by CSL has been used by most of the institutions within the Department of Culture and has been very successful. The features of the tender document are:

 

THE STANDARD TENDER DOCUMENT

 

The tender document addressed many of the technical, professional, financial and legal issues concerning the work, which were described in different sections.

Section one contained the invitation for an open bid to undertake the work of digitizing the Government of India Gazettes from inception. In addition, the invitation also covered the creation of full text databases of the Annual Reports, Performance Budgets and Demands for Grants documents of the Department of Culture, GOI. It contemplated that on completion of the proposed work, the library would be in a position to make available all the textual information of the above documents through the Internet, thus fulfilling the long-standing demand of the Indian citizen for the ‘right to information’. The tender document contemplated a machine readable archive of the electronic documents and a strong searching mechanism using Dublin Core metadata elements with UNIMARC data tags in XML format. It followed a two-bid system: the Technical bid and the Financial bid.

Section two contained the background, Scope of Work and criteria to be an eligible bidder. 

Section three contained information for prospective bidders. They were expected to examine all instructions, forms, terms, specifications and other information in the bidding document. It cautioned that failure to furnish all information required by the bidding document, or submission of a bid not substantially responsive to the bidding document in every respect, would be at the bidder’s risk and might result in rejection of the bid. The bidders were asked to furnish technical bids on a specified form with information on the legal status of the firm/institution; works of a similar nature undertaken or performed in the past; copies of the Annual Report, Balance Sheet and audited accounts; Income tax clearance certificates; a certificate from a client whose work had been undertaken by the firm/institution in the recent past; existing availability of hardware, software and human resources; a profile of the proposed project manager; a list of qualified librarians, along with their curricula vitae, who would be associated with the firm/institution in this work; and the price bid in a specified form.

It stated that a Bid Evaluation Committee shall undertake the scrutiny of the technical bids to determine whether the bid is of acceptable quality and is substantially responsive. It defined the substantially responsive bid as one that conforms to all terms and conditions without material deviation, objections, conditionality or reservations and is complete in all respects in terms of the information sought along with the Bid form.  A material deviation, objection, conditionality or reservation is one that affects in any substantial way the scope, quality or performance of the contract or whose rectification would unfairly affect the competitive position of other bidders who are presenting substantially responsive bids.   It clearly stated that the Library’s determination of a bid’s responsiveness is to be based on the contents of the bid itself without recourse to extrinsic evidence. The Bid Evaluation Committee shall follow objective criteria for evaluation of technical bids by assigning marks to assess the prior library experience, financial and logistic capacity and proposed work plan.      

DIGITIZATION OF GOVERNMENT OF INDIA GAZETTE

The Central Secretariat Library has a very rich collection of Indian Official Documents. One of the most widely used categories is the Gazette of India (Central Government). The collection dates back to 1922; however, complete sets are available from 1950. Of the total usage of Indian Official Publications, more than 70% of queries pertain to the Gazette of India. This is the only library where users have free access to this type of resource. Every effort is made by the library to acquire the Gazettes so that the sets are complete.

Gazettes are the official publications of the Government of India in which information is notified by the Government and which the people of India can refer to as the authentic source. The following are the main sub-divisions:

 

  • PART – I SECTION – I: Notification relating to Non-statutory Rules, Regulations, Orders, and Resolutions issued by the Ministries of the Government of India (other than the Ministry of Defence) and by the Supreme Court.

 

  • PART – I SECTION – 2: Notification regarding Appointment, Promotion, Leave etc. of Government Officer issued by the Ministries of the Government of India (other than the Ministry of Defence) and by the Supreme Court.

 

  • PART – I – SECTION 3 : Notification relating to Resolution and Non-Statutory Order issued by Ministry of Defence.

 

  • PART – I SECTION 4: Notification regarding Appointments, Promotions, Leave etc. of Government Officers issued by the Ministry of Defence.

 

  • PART – II – SECTION – I: Acts, Ordinances and Regulations.

 

  • PART – II – SECTION 1A: Authoritative texts in Hindi language of Acts, Ordinances and Regulations.

 

  • PART – II – SECTION – 2: Bills and reports of Select Committees on Bills.

 

  • PART – II – SECTION – 3: Sub-Section I – General statutory rules (including Orders, Bye-laws etc. of general character) issued by the Ministries of the Government of India (other than the Ministry of Defence) and by the Central Authorities (other than the Administration of Union Territories).

 

  • PART - II – SECTION – 3: Sub-Section II – Statutory Orders and Notifications issued by the Ministries of the Government of India (other than the Ministry of Defence) and by the Central authorities (other than the Administration of Union Territories).

 

  • PART – II – SECTION – 3: Sub-Section III – Authoritative texts in Hindi (other than such texts published in Section 3 or Section 4 of the Gazette of India) of General Statutory Rules and Statutory Orders (including Bye-laws of general character) issued by the Ministries of the Government of India (including the Ministry of Defence) and by Central authorities (other than the Administration of Union Territories).

 

  • PART – II – SECTION 4: Statutory rules and Orders issued by the Ministry of Defence.

 

  • PART – III – SECTION – I: Notification issued by the High Courts, the Comptroller and Auditor General, Union Public Service Commission, the Indian Government Railways and by Attached and Subordinate Offices of the Government of India.

 

  • PART III – SECTION – 2: Notification and Notices issued by the Patent Office, relating to Patents and Designs.

 

  • PART – III – SECTION – 3: Notification issued by or under the authority of Chief Commissioners.

 

  • PART – III – SECTION–4: Miscellaneous Notifications including Notifications, Orders, Advertisements and Notices issued by Statutory Bodies.

 

  • PART – IV: Advertisements and Notices issued by Private Individuals and Private Bodies.

 

  • PART V: Supplements showing statistics of births and Deaths etc. both in English and Hindi.

 

The Central Secretariat Library has undertaken a very important, relevant, multi-crore project to develop electronic documents of the Government of India Gazette for the post-independence period, from the 1950s onwards, by digitizing them and developing an IT-based retrieval database. There are about 18 lakh pages of Gazette which need to be converted into electronic form. It is worthwhile to describe in detail the digitization activities, which use best practices and standards for metadata creation based on Dublin Core metadata elements with UNIMARC tags in XML format.

The work has been outsourced to a private agency and the initial job of scanning about 18 lakh pages has been completed. A legally vetted contract has been signed. The total project is likely to be completed within 24 months with effect from February 2003.

It is envisaged that the digitization of the Government of India Gazette, once completed, will give a new fillip to information retrieval in the IT environment, especially through the Internet (subject to approval from the concerned authorities). It will also meet the requirements of the policy planners of the GOI and fulfil the requirement of the right to information, wherein the general public is ensured speedy access to the orders issued on various issues by different governmental organizations.

1.      In order to ensure that the database is developed using the methodology and best practices adopted internationally for digitization work, and against a benchmark (based on the current IT scenario, a futuristic vision, the best methodology adopted by different agencies during the demonstration, and cost effectiveness), the scope of activity was worked out; it included many important and novel features and activities. The total scope of work envisaged for the Government of India Gazette is as follows:

2.      The detailed workflow, setting out each stage of work, has been worked out:

 

1.      Preparation of the document for scanning, including cleaning, numbering, etc.

2.      Opening the binding to create loose leaves for scanning.

3.      Scan the loose leaves/pages to create TIFF images, with appropriate resolution.

4.      Scan separately for charts, diagrams, photos, etc. (where applicable).

5.      Closing the binding to bring the book back to its original state after scanning.

6.      Organise the images (steps 3 and 4) in a tree structure following the YYYY/MM/DD convention.

7.      Archive the resultant files (step 6) on CD/DVD and deliver them.

8.      Convert the TIFF images (step 3) to PDF without text.

9.      Archive the resultant files (step 8) on CD/DVD and deliver them.

10.     OCR the organised TIFF images (step 6) with manual zoning, for accuracy.

11.     Manually correct, proof, and run Quality Control checks on the resultant full text file (step 10), and save the text file in RTF (Rich Text Format).

12.     Archive the resultant RTF file, in appropriate tree structure, along with the images, and deliver to CSL on CD/DVD.

13.     Convert the TIFF image (step 6) into PDF WITH text underlying (after step 11).

14.     Archive the resultant PDF, in appropriate tree structure, and deliver to CSL on CD/DVD.

15.     Completely analyse the content of each gazette document and capture the metadata, as per combined criteria consisting of fields from Dublin Core and UNIMARC.

16.     Convert the RTF file (step 11) into XML format with appropriate metadata (step 15) embedded.

17.     Archive the resultant XML files (step 16) along with the other corresponding files.

18.     Create an ODBC database with all the metadata and corresponding data-file identifiers (for the XML and PDF files). One complete and individual database entry per document, with all the appropriate fields for which information is available in the document. Calibrate, Validate, QC check.

19.     Archive this Database and deliver to CSL on CD/DVD.

20.     Build Index/Indices for search on various set parameters: Full Text Search as well as Retrieval by Metadata and Different Combinations of Metadata.

21.     Integrate the Indices and data with a web-server application for web-enabling the final database.

22.     Installation of the Portal Application Infrastructure (Portal platform, applications, integrator, data agents) on the hardware provided by CSL in a platform specific to the Portal Application Infrastructure. Integrate it with the larger software and hardware platform of the CSL. Calibration, Testing, Final Quality Checks.

23.     Install/Load the data of the Annual Reports database and the Gazettes database on a data server (that is, a storage server which holds the large volume of data consisting of the various actual data files: PDF, XML, etc.). Calibrate, Test, Final Quality Checks.

24.     Integrate the Portal Application (step 22) with the data server/source (step 23). Calibrate, Test, Final Quality Checks.

25.     Design of the start-pages of the portal. After approval of the design, integrate it with the search interface. Quality Checks.

26.     Hand over the Portal to CSL.

27.     Deliver the tools required for managing and updating the portal, along with complete documentation. Also train the CSL personnel.

28.     Manage and maintain the databases for two years from completion. Inform CSL from time to time about usage statistics, load on the server, bandwidth usage, etc.
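
The tree structure called for in step 6 of the workflow can be illustrated with a short sketch. It is written in Python and assumes a hypothetical layout in which the Gazette part and section form sub-directories beneath the YYYY/MM/DD levels; the root name, the part and section labels and the page file naming are illustrative assumptions, not details of the file structure actually adopted by CSL.

```python
from datetime import date
from pathlib import PurePosixPath


def image_path(root: str, issue_date: date, part: str, section: str, page: int) -> PurePosixPath:
    """Build a storage path of the form root/YYYY/MM/DD/part/section/page-NNNN.tif.

    Only the YYYY/MM/DD levels come from the workflow; the part/section
    sub-directories and the page file name are assumptions for illustration.
    """
    return (
        PurePosixPath(root)
        / f"{issue_date.year:04d}"
        / f"{issue_date.month:02d}"
        / f"{issue_date.day:02d}"
        / part
        / section
        / f"page-{page:04d}.tif"
    )


if __name__ == "__main__":
    # Hypothetical page 7 of a Part II, Section 3 issue dated 14 December 2002.
    print(image_path("gazette", date(2002, 12, 14), "part-2", "section-3", 7))
```

Zero-padding the month, day and page number keeps the directories and files in chronological and page order when they are sorted by name.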

 

3.      The following deliverables have been envisaged:

A.      TIFF images to PDF with text ;

B.     PDF on CD-ROM after OCRing in RTF ( Rich Text Format) for full text search;

C.    TIFF on CD-ROM in raw form;

D.    XML based ODBC compliant database archives on CD-ROM;

E.     Installation of Portal Application Infrastructure;

F.     Hand over the Portal to CSL;

G.    Deliver the tools required for managing and updating the Portal;

H.     Any other deliverables as per the total scope of work envisaged by CSL

It may be noted that very elaborate technological solutions have been envisaged using international standards. They are:

1.      Using OCR technology for data proofing and data cleaning;

2.      Using XML format for data formatting and Dublin Core metadata elements for developing the indexes;

3.      Using UNIMARC tags to individualize the information components of each piece of notification;

4.      Using an internationally acclaimed Content Management System, viz. ISYS, to provide access points to the e-documents.

Analyzing the database and capturing the metadata as per Dublin Core, using UNIMARC tags, will give the complete project international recognition. Using a full text XML format is in itself a cost- and time-intensive job, but its advantage is that it brings the database on par with new technology and allows it to be modified easily as and when the technology changes. This will also ensure that the database does not become obsolete when there is a change in the platform. The deliverables conceived include the raw data; PDF on CD-ROM after OCR for full text search through RTF; TIFF on CD-ROM/DVD; XML files on CD-ROM/DVD; and XML and PDF on the Web site so developed with the standard DBMS used. This will ensure that whenever any changes are to be made to the database, CSL does not incur any additional expenditure. The content is captured and the database developed following the same arrangement as in the printed volumes.
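
Steps 18 and 20 of the workflow, one database entry per document and retrieval by combinations of metadata, can be sketched as follows. This is only an illustration: Python's built-in sqlite3 module stands in for the ODBC database, and the table name, column names and sample values are hypothetical rather than the schema used in the project.

```python
import sqlite3

# Columns that may be combined in a search; all names are illustrative assumptions.
SEARCHABLE = {"title", "creator", "issue_date", "gazette_part"}


def create_catalogue(conn: sqlite3.Connection) -> None:
    """Create one table holding the metadata and file identifiers of each notification."""
    conn.execute(
        """CREATE TABLE notification (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               creator TEXT,
               issue_date TEXT,      -- stored as YYYY-MM-DD
               gazette_part TEXT,
               pdf_file TEXT,
               xml_file TEXT)"""
    )


def add_notification(conn, title, creator, issue_date, gazette_part, pdf_file, xml_file):
    """Insert one complete entry for one document and return its row id."""
    cur = conn.execute(
        "INSERT INTO notification (title, creator, issue_date, gazette_part, pdf_file, xml_file) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (title, creator, issue_date, gazette_part, pdf_file, xml_file),
    )
    return cur.lastrowid


def search(conn, **criteria):
    """Retrieve entries matching every given metadata value (any combination of SEARCHABLE)."""
    unknown = set(criteria) - SEARCHABLE
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    # Column names come from the fixed SEARCHABLE set; values are bound as parameters.
    where = " AND ".join(f"{column} = ?" for column in criteria) or "1"
    sql = f"SELECT id, title, pdf_file, xml_file FROM notification WHERE {where} ORDER BY id"
    return conn.execute(sql, tuple(criteria.values())).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_catalogue(conn)
    add_notification(conn, "Sample notification", "India. Ministry of Tourism and Culture",
                     "2000-12-14", "Part II Section 3", "sample.pdf", "sample.xml")
    print(search(conn, creator="India. Ministry of Tourism and Culture", issue_date="2000-12-14"))
```

Full text search (the other half of step 20) is outside this sketch; in the project it is handled by the indexing and content management software.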

MAPPING FROM DUBLIN CORE TO UNIMARC FOR GAZETTE DOCUMENTS

 

Each entry below gives the Dublin Core element, the relevant UNIMARC tags (indicators relevant to the tag need to be provided everywhere) and the use of the information in the Gazette.

Dublin Core element: Title
UNIMARC tags: 200$a, $e : Title & Sub-title; 510$a, $e : Title & Sub-title (Hindi version); 517$a, $e : other variant titles
Use in Gazette:
1.      The specific title of the gazette notification, if available, along with subtitles, is to be placed at tag 200 $a and $e
2.      In case a specific title is not available, it has to be created based on the gist of the notification and placed at tag 200 $a
3.      In case of the Hindi version, tag 510$a and $e may be used
4.      In case the notification has more than one title, tag 517$a and $e should be used to provide information about such titles

Dublin Core element: Creator
UNIMARC tags: 710$a : Main Corporate Body under India; 200$f : Specific Corporate Body under Main Body
Use in Gazette:
1.      The name of the corporate body responsible for the notification is to be given at tag 710$a, e.g. 710$aIndia. Ministry of Tourism and Culture;
2.      The name of the specific body within the main body which has issued the notification should be given at tag 200$f, e.g. 200$f Department of Culture, Ministry of Tourism and Culture, GOI

Dublin Core element: Subject
UNIMARC tags: 610$a : Uncontrolled subject terms
Use in Gazette: We recommend that open key words derived from the content of the notification be incorporated at tag 610$a. Any number of key words may be given.

Dublin Core element: Description
UNIMARC tags: 330$a : Summary
Use in Gazette: A brief description of 10 to 15 words on the content of the notification should be incorporated.

Dublin Core element: Publisher
UNIMARC tags: 210$c : Name of the Publisher
Use in Gazette: By default it should contain the name ‘Controller of Publications, GOI, New Delhi’.

Dublin Core element: Contributor
UNIMARC tags: 711$a : Specific Corporate Body
Use in Gazette: The name of the specific corporate body which acts as a contributor to the thought content of the notification should be given, e.g. 711$aDepartment of Culture, Ministry of Tourism and Culture, GOI

Dublin Core element: Date
UNIMARC tags: 210$d : Date of issue of the order by the contributor
Use in Gazette: The date of issue of the order by the contributor should be provided as YYYY-MM-DD at tag 210$d, e.g. $d2000-12-14

Dublin Core element: Type
UNIMARC tags: 608$a : general nature information providing a clue to the document’s physical location
Use in Gazette: Information regarding the Sections/sub-sections/type (such as ordinary/extraordinary) in which the notification has been published in print by the publisher should be provided at tag 608$a.

Dublin Core element: Format
UNIMARC tags: 336$a
Use in Gazette: The type of computer file needs to be mentioned, e.g. pdf, tiff, xml, doc etc., with the complete address of the file including its links.

Dublin Core element: Identifier
UNIMARC tags: 300$a
Use in Gazette: All the references concerning the location of the notification, like GSR no., SO no. etc., are required to be given, including all the linkages in the form of URLs for formal identification.

Dublin Core element: Source
UNIMARC tags: 324$a
Use in Gazette: The date on which the notification was published by the publisher is the source for locating the content of the notification. Hence this is required to be given as the Source element as YYYY-MM-DD at tag 324$a, and the link should be provided with the date element.

Dublin Core element: Language
UNIMARC tags: 101$a
Use in Gazette: The language of the source item should be given by default, e.g. a Hindi notification should be given as Hindi, and similarly for English.

Dublin Core element: Relation
UNIMARC tags: 300$a
Use in Gazette: All the references which could link the information with formal identification numbers/concepts/words should be incorporated at tag 300$a.

Dublin Core element: Coverage
UNIMARC tags: 300$a
Use in Gazette: The scope of the content should be provided in tag 300$a, e.g. the nature of the information, like appointments, policy decisions etc. The geographical area covered in the content should also be provided.

Dublin Core element: Rights
UNIMARC tags: 300$a
Use in Gazette: There could be three possible pieces of information: 1. The rights of the content, i.e. the contributor; 2. The rights of printing, i.e. Controller of Publications, GOI; and 3. The rights of digital material, i.e. CSL, Department of Culture
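
The mapping above can be sketched in code to show how captured metadata might be turned into an XML record. This is a minimal illustration in Python, not the project's actual schema: the XML element and attribute names (record, field, dc, tag, subfield) and the sample title are assumptions, only the first tag listed for each Dublin Core element is used, and the Format and Identifier subfields are taken to be $a.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Dublin Core element -> (UNIMARC tag, subfield), first tag listed in the mapping.
DC_TO_UNIMARC = {
    "title": ("200", "a"),
    "creator": ("710", "a"),
    "subject": ("610", "a"),
    "description": ("330", "a"),
    "publisher": ("210", "c"),
    "contributor": ("711", "a"),
    "date": ("210", "d"),
    "type": ("608", "a"),
    "format": ("336", "a"),
    "identifier": ("300", "a"),
    "source": ("324", "a"),
    "language": ("101", "a"),
    "relation": ("300", "a"),
    "coverage": ("300", "a"),
    "rights": ("300", "a"),
}

# Elements whose values the mapping requires in YYYY-MM-DD form.
DATE_ELEMENTS = {"date", "source"}


def check_iso_date(value):
    """Return value unchanged if it is a valid YYYY-MM-DD date, else raise ValueError."""
    if date.fromisoformat(value).isoformat() != value:
        raise ValueError(f"not in YYYY-MM-DD form: {value!r}")
    return value


def build_record(metadata):
    """Build a <record> element from a dict of Dublin Core element -> list of values."""
    record = ET.Element("record")  # element and attribute names are hypothetical
    for element, values in metadata.items():
        tag, subfield = DC_TO_UNIMARC[element]
        for value in values:
            if element in DATE_ELEMENTS:
                check_iso_date(value)
            field = ET.SubElement(
                record, "field", {"dc": element, "tag": tag, "subfield": subfield}
            )
            field.text = value
    return record


if __name__ == "__main__":
    sample = {
        "title": ["Sample notification title"],  # hypothetical title
        "creator": ["India. Ministry of Tourism and Culture"],
        "publisher": ["Controller of Publications, GOI, New Delhi"],
        "date": ["2000-12-14"],
        "subject": ["appointments", "culture"],  # hypothetical key words
    }
    print(ET.tostring(build_record(sample), encoding="unicode"))
```

Keeping both the Dublin Core name and the UNIMARC tag on every field lets the same record serve Dublin Core based indexing and UNIMARC based exchange, which is the purpose of the combined mapping.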

 

The work began in February 2003. As a first step towards digitization, the image scanning of the gazette documents began with the latest years' print documents available in CSL. Before the scanning process began, however, a file structure for storing the images, commensurate with the requirements of the gazette documents, was worked out. This file structure was based on the Sections and sub-sections and followed a tree-structure system of data storage, beginning with the year 2002. About 18,00,000 pages were scanned, covering the period 1940-2002, within a period of 60 days. The outsourcing agency had installed a high-capacity scanner which could scan about 80 images per minute, and the scanning resolution was decided based on the quality of the paper used; it varied from 200 to 600 dpi. The first deliverables, in the form of raw TIFF images arranged in the file structure worked out for the purpose, were received by CSL on DVD and contained 90 GB of data.
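
The figures quoted above imply a daily scanning load and an average image size, which the following Python sketch works out. It rests on assumptions not stated in the paper: one image per page, a single scanner working every one of the 60 days, 90 GB taken as 90 x 10^9 bytes, and the 90 GB of delivered data covering all 18,00,000 pages.

```python
def scanning_estimates(pages=1_800_000, days=60, images_per_minute=80,
                       archive_bytes=90 * 10**9):
    """Derive rough throughput and size figures from the reported project numbers.

    Assumes one image per page, one scanner in use on every day, and that the
    whole archive_bytes volume corresponds to all pages (illustrative only).
    """
    pages_per_day = pages / days
    scanner_minutes_per_day = pages_per_day / images_per_minute
    return {
        "pages_per_day": pages_per_day,
        "scanner_minutes_per_day": scanner_minutes_per_day,
        "scanner_hours_per_day": scanner_minutes_per_day / 60,
        "avg_bytes_per_page": archive_bytes / pages,
    }


if __name__ == "__main__":
    for name, value in scanning_estimates().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the work comes to 30,000 pages a day, or about six and a quarter hours of continuous scanning per day at 80 images per minute, with an average of roughly 50 KB per page image.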