How topic modeling helped us restructure the blog Abierto al Público and increase our search visibility
When it comes to knowledge management, and open knowledge in particular, the main challenge is arguably no longer a lack of information. For those with access to it, the Internet can connect us with an abundance of knowledge, increasingly in formats that are free to access. That said, the ongoing curation, navigation, and synthesis of so much information remains a central dilemma: how do we connect the most relevant and actionable resources to the people searching for them? This goes beyond a pursuit that is merely about aesthetics, promotion, or marketing. As we have observed recently around the world, the so-called "infodemic" urgently demands new ways of supporting people in finding the knowledge they are looking for, and content creators must assume an increased responsibility for presenting knowledge and information clearly and comprehensively to their readers.
For these reasons, the IDB is constantly exploring and refining techniques to better connect the Latin American and Caribbean region with quality open knowledge. One particular example, out of many ongoing efforts, is the work our team has done to improve the curation and organization of the content published here at Abierto al Público. In this article, we share what we have learned about using techniques like topic modeling and SEO to approach content management more efficiently, with the guiding motivation of better supporting readers in finding the content and learning resources that are most meaningful and practical to them. We hope that you can use these techniques to better organize and share your knowledge, too.
A big milestone, new emerging topics - and a lot of content
Abierto al Público is first and foremost an IDB resource for sharing learnings about open knowledge. As we recently celebrated more than five years of being online, the blog has published more than 500 articles related to all things open in connection with economic and social development in Latin America and the Caribbean: open knowledge, open data, open government, open innovation, and more recently, open source technology.
But how to navigate and make sense of it all, especially for our first-time visitors? It was an important time for us to reflect on this question, for various reasons. For one, our coverage of the open movement had continued to evolve beyond the blog's original categories. We needed a new method for grouping content in a way that would make sense to readers while also offering the flexibility to incorporate future content as we continue to grow and follow new lines of conversation. Second, the sheer volume of content discouraged too much manual sorting and rearranging, an important consideration because we want to be efficient with our use of time and resources.
With this in mind, we wanted to see how AI and Natural Language Processing could play a role in complementing our strategy and streamlining the otherwise manual task of sorting and categorizing our content in a balanced and consistent manner.
Centering on SEO: Mapping content for the benefit of both people and search engines
Similar to the discussion around good practices for open data, it is also broadly essential for good knowledge and content management that related subject matter can be found and followed by both people and machines.
For this reason, understanding the science behind search engine optimization became an important focal point of our content management strategy. Search engines like Google constantly scan the web, evaluating the sitemaps of different content providers to understand what their content is about, while also making a determination about the quality and relevance of that information to a user's search. Because of this, we learned how important it is to maintain consistent categories and tags, as well as relevant links between related content.
When it comes to categories, each article should belong to only one, like the branch of a tree or a spoke from the hub at the center of a wheel. The categories should be roughly balanced in terms of the amount of content in each, and a clear logic should connect the content to its category while also making it distinct from the other categories.
Learn more here about categorization and topic clusters.
But how many categories would we need to organize so much content? This was our next question. We needed to compare and evaluate our options without too much manual sorting. It is in this context that topic modeling becomes highly relevant.
How we used Topic Modeling to identify and create categories of content
Topic modeling is one of several Natural Language Processing techniques within the wider field of artificial intelligence. It can be applied to automatically identify underlying, hidden, or latent themes, patterns, or groupings within large volumes of text, also known as the "corpus". As we have learned and shared from previous experiences involving Artificial Intelligence, it is key to remember that success depends largely on the quantity and quality of the data that will be used. In the case of topic modeling, that same reminder holds true.
In the case of Abierto al Público, we first gathered the 500+ articles (the corpus) into a single CSV file for analysis. Depending on your access to the original file sources and their formats, this can be achieved using web scraping or other techniques.
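As a rough sketch of this gathering step, the snippet below writes a corpus to a single CSV file using Python's standard library. The two articles shown are illustrative placeholders, not our actual posts; in practice the rows would come from web scraping or a CMS export.

```python
import csv

# Illustrative placeholder articles; in practice these rows would come
# from web scraping or an export of the blog's content.
articles = [
    {"title": "What is open data?", "text": "Open data can be freely used and shared..."},
    {"title": "Open source in government", "text": "Open source software allows..."},
]

# Gather the whole corpus into a single CSV file, one article per row.
with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "text"])
    writer.writeheader()
    writer.writerows(articles)
```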
The next step was to clean the data to maximize the emphasis on the thematic content. For example, we removed punctuation and words that provide little comparative information about the contents of the text, such as prepositions and conjunctions. Programming techniques in Python can help facilitate this process.
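A minimal sketch of this kind of cleaning in Python might look like the following. The stop-word list here is a tiny illustrative sample; in practice you would use a fuller list in the language of your corpus, for example from NLTK or spaCy.

```python
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller
# list in the corpus language (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def clean(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOPWORDS]

print(clean("Open data is the foundation of open government."))
# ['open', 'data', 'foundation', 'open', 'government']
```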
After the dataset was cleaned and prepared, we started the iterative process of training the topic modeling algorithm, which meant running the cleaned corpus through an engine. Each iteration assigned a different arbitrary number of buckets, or topics, in which to classify the terms found in the corpus. The output provided the grouping for each individual article, along with a confidence probability describing how well that content matched the rest of the information in the same grouping.
What tools are available to implement topic modeling? There are multiple tools that can help you run a topic modeling exercise, such as:
- For working in open source: the Gensim library developed for Python, or the topicmodels package for R.
- Even though they are not open, several other services let you perform topic modeling at a reasonable cost, even with limited coding experience. Two examples of these alternatives are the Amazon Comprehend AWS service and the LDA (Latent Dirichlet Allocation) module included in Azure Machine Learning Studio.
Interpreting the results
Analyzing the results of a topic modeling exercise can be a very subjective task, so involving subject matter experts in the process is important: the potential patterns that the machine has identified need to be cross-validated with a more human judgment. We experimented with combinations ranging from 3 topics to 10 topics, and carefully compared the results of each output, until we finally homed in on the balance offered by the results in the 5-topic range, which came to be interpreted as these categories:
Once we reached that point, we repeated the topic modeling process on the content inside each of the categories to identify more specific sub-themes or clusters. This second round helped us build out new content highlighting the articles within each category and their related subtopics. From there, we could also make the final validations and adjustments regarding specific tags, or incorporate specific key phrases in relation to SEO.
Applying and implementing the results into our strategy for improved search visibility
This classification structure has helped us expand our content coverage while also maintaining specific points of focus. It has also helped us with common legacy issues, such as avoiding the duplication of existing content by keeping a clear mapping at hand, and continuing to build constructively on the existing conversations where we have already invested in different topics. This helps Abierto al Público respond to users' interests with content that is better structured and connected. It has also contributed to making the content more visible and attractive to search engines.
As a result of this and a few other editorial changes, Abierto al Público has more than doubled the visibility of its content via organic search over the past year.
And you? How do you think topic modeling can benefit knowledge resources for your work, community or government?